

Exam Certified Data Engineer Professional topic 1 question 132 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 132
Topic #: 1

An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job that processes these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field is a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job assuming there are millions of user accounts and tens of thousands of records processed hourly?

  • A. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
  • B. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger available job to batch update newly detected files into the account_current table.
  • C. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • D. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
Suggested Answer: D
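The workflow in the suggested answer (filter account_history to the most recent hour, deduplicate to one row per user_id, then upsert into account_current) would normally be a Delta Lake MERGE INTO on Databricks. The sketch below simulates that logic in plain Python so the mechanics are visible; the function names and sample data are illustrative, not from the exam.

```python
def latest_per_user(batch):
    """Deduplicate a batch to one record per user_id, keeping the
    row with the greatest last_updated (Type 1 semantics)."""
    latest = {}
    for rec in batch:
        uid = rec["user_id"]
        if uid not in latest or rec["last_updated"] > latest[uid]["last_updated"]:
            latest[uid] = rec
    return latest

def merge_into_current(current, batch, hour_start):
    """Simulate MERGE INTO account_current USING <deduped batch> ON user_id:
    matched rows are updated, unmatched rows are inserted. Only records from
    the most recent processed hour are considered, so the merge source stays
    small (tens of thousands of rows) relative to the target (millions)."""
    hourly = [r for r in batch if r["last_updated"] >= hour_start]
    for uid, rec in latest_per_user(hourly).items():
        current[uid] = rec  # update if present, insert otherwise
    return current

# Example: user 42 is updated twice within the hour; only the newest row survives.
current = {42: {"user_id": 42, "username": "ada", "last_updated": 100}}
batch = [
    {"user_id": 42, "username": "ada2", "last_updated": 205},
    {"user_id": 42, "username": "ada3", "last_updated": 210},
    {"user_id": 7,  "username": "bob",  "last_updated": 207},
]
merge_into_current(current, batch, hour_start=200)
print(current[42]["username"])  # ada3
print(current[7]["username"])   # bob
```

The key efficiency point is that the merge source is pre-filtered and pre-deduplicated, so the expensive operation touches only the small hourly slice rather than rescanning all of account_history.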

Comments

benni_ale
2 weeks, 5 days ago
Selected Answer: D
B or D ... the associate course tells us to use Auto Loader when there are billions of rows...
upvoted 1 times
RyanAck24
2 months ago
Selected Answer: D
D seems like the best option
upvoted 1 times
shaojunni
2 months ago
Selected Answer: B
A and D are both wrong. They only take data from the latest update, which is too narrow: the same user_id can have several updates within an hour, each updating different fields. So using Auto Loader to apply all the updates within the hour is the only correct answer.
upvoted 1 times
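On the objection above: assuming each source record carries the full account state (the shared schema for account_history and the source suggests it does), keeping only the max-last_updated row per user_id gives the same Type 1 result as replaying every intra-hour update in order. A toy check in plain Python (all names and data illustrative):

```python
def replay_all(updates):
    """Apply every update in last_updated order (replaying each change)."""
    state = {}
    for rec in sorted(updates, key=lambda r: r["last_updated"]):
        state[rec["user_id"]] = rec
    return state

def keep_latest(updates):
    """Keep only the max-last_updated row per user_id (option D's dedup)."""
    state = {}
    for rec in updates:
        uid = rec["user_id"]
        if uid not in state or rec["last_updated"] > state[uid]["last_updated"]:
            state[uid] = rec
    return state

# Three full-snapshot updates to one user within the hour: the second flips
# auto_pay, the third changes user_region. The final snapshot carries both.
updates = [
    {"user_id": 1, "auto_pay": False, "user_region": "US", "last_updated": 10},
    {"user_id": 1, "auto_pay": True,  "user_region": "US", "last_updated": 20},
    {"user_id": 1, "auto_pay": True,  "user_region": "EU", "last_updated": 30},
]
assert replay_all(updates) == keep_latest(updates)
print(keep_latest(updates)[1]["user_region"])  # EU
```

The equivalence would break only if the source emitted partial (per-field) change records, which the question's schema does not indicate.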
fe3b2fc
3 months, 1 week ago
Selected Answer: A
Answer is A. You're meeting all the requirements with less overhead. It only updates on the most recent record, so duplicates are handled. Answer D is too much overhead: it does a full table scan for all records, which, as the question states, is millions of records.
upvoted 1 times
Onobhas01
2 months, 3 weeks ago
user_id would be a better column to merge on; username might not be distinct.
upvoted 1 times
Freyr
5 months, 3 weeks ago
Selected Answer: D
Correct Answer: D. Similar to option A, but specifically keyed on user_id, which is the primary key. This approach ensures that account_current is always up to date with the latest information per user based on the primary key, reducing the risk of duplicate information and ensuring the integrity of the data with respect to the unique identifier.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other