Certified Data Engineer Professional exam, topic 1, question 131 discussion

Actual exam question from Databricks' Certified Data Engineer Professional exam
Question #: 131
Topic #: 1

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

  • A. Iterate through an ordered set of changes to the table, applying each in turn to recreate the current state of the table; record the change type (insert, update, delete), the timestamp of the change, and the values.
  • B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
  • C. Deduplicate records in each batch by pk_id and overwrite the target table.
  • D. Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
Suggested Answer: D

Comments

Huepig
3 weeks, 4 days ago
There is no right answer. The closest is B, but only after the CDC logs are ingested into a bronze table; a MERGE INTO a silver table then follows (see the sketch below). Why not D? Because CDF only works on Delta tables, not on external CDC logs.
upvoted 2 times
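A minimal PySpark sketch of this bronze-then-merge pattern, assuming a hypothetical source path, hypothetical table names (bronze_cdc_audit, silver_current), a change_time column for ordering, and the ambient Databricks spark session:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# 1. Append the raw CDC batch to a bronze table: every change is kept,
#    which satisfies the auditing requirement. Path is hypothetical.
cdc_batch = spark.read.json("s3://bucket/cdc_logs/")
cdc_batch.write.format("delta").mode("append").saveAsTable("bronze_cdc_audit")

# 2. Within the batch, keep only the most recent change per pk_id,
#    since a record may have changed multiple times in the hour.
latest = (
    cdc_batch
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("pk_id").orderBy(F.col("change_time").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# 3. MERGE the deduplicated changes into the silver table, which then
#    holds only the current state: the analytical requirement.
silver = DeltaTable.forName(spark, "silver_current")
(
    silver.alias("t")
    .merge(latest.alias("s"), "t.pk_id = s.pk_id")
    .whenMatchedDelete(condition="s.change_type = 'delete'")
    .whenMatchedUpdateAll(condition="s.change_type = 'update'")
    .whenNotMatchedInsertAll(condition="s.change_type != 'delete'")
    .execute()
)
```

Appending every raw record to bronze preserves the full audit trail; deduplicating before the MERGE matters because Delta's MERGE allows at most one source row to match each target row.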
HelixAbdu
4 months ago
The MERGE INTO statement in Delta Lake is a powerful feature designed to handle change data capture (CDC) data efficiently, and this approach meets both the auditing and analytical requirements. CDF is not enabled by default, so this data was not generated by it and cannot be handled through it (see the SQL sketch below).
upvoted 1 time
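For reference, the same upsert expressed as the explicit MERGE INTO statement this comment names, run from Python; the silver_current table and the cdc_latest staging view (assumed to hold one row per pk_id) are hypothetical:

```python
# MERGE INTO in SQL form; assumes cdc_latest is already deduplicated
# to the most recent change per pk_id.
spark.sql("""
    MERGE INTO silver_current AS t
    USING cdc_latest AS s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED AND s.change_type = 'update' THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")
```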
practicioner
3 months, 1 week ago
I'd agree, but the requirement is "a full record of all values that have ever been valid in the source system." After deleting records we can still use time travel options, but after vacuuming the audit team will be disappointed (see the sketch below).
upvoted 1 time
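A short sketch of that point, with a hypothetical table name, version, and retention window: time travel can still read a pre-delete snapshot, but VACUUM permanently removes data files that fall outside the retention window, after which those older versions can no longer be read:

```python
# Time travel: read the table as of an earlier version, recovering
# rows that have since been deleted. Version number is hypothetical.
old_state = (spark.read.format("delta")
             .option("versionAsOf", 5)
             .table("silver_current"))

# VACUUM permanently deletes data files outside the retention window;
# versions older than that are then unreachable via time travel.
spark.sql("VACUUM silver_current RETAIN 168 HOURS")
```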
Ati1362
5 months ago
Selected Answer: D
agree with D
upvoted 2 times
BrianNguyen95
5 months, 3 weeks ago
Selected Answer: D
Delta Lake provides built-in change data feed functionality. It captures changes (inserts, updates, deletes) and propagates them to dependent tables. By using Delta Lake, you can maintain historical records and propagate changes efficiently.
upvoted 2 times
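For context, a minimal sketch of the change data feed workflow (hypothetical table name and starting version). Note that CDF records changes committed to an existing Delta table; it does not read external CDC files, which is the objection raised earlier in the thread:

```python
# CDF must be enabled per table; it is off by default.
spark.sql("""
    ALTER TABLE silver_current
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read row-level changes recorded since a given table version. The
# source here is a Delta table, not raw CDC files in object storage.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 6)
           .table("silver_current"))
changes.select("pk_id", "_change_type", "_commit_version").show()
```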
Freyr
5 months, 3 weeks ago
Selected Answer: D
Correct Answer: D Delta Lake’s change data feed feature is specifically designed to handle CDC scenarios. It processes data from external systems, tracking all changes (inserts, updates, deletes) and maintaining a detailed history of these changes. This feature allows for keeping a comprehensive log while also ensuring the most recent state is correctly reflected in analytical tables.
upvoted 3 times
Community vote distribution: A (35%, most voted), C (25%), B (20%), Other
