
Exam Certified Data Engineer Professional topic 1 question 14 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 14
Topic #: 1

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

  • A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
  • B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
  • D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
  • E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
Suggested Answer: C
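
For concreteness, here is a minimal PySpark sketch of the approach option C describes: restrict account_history to the rows landed in the current hourly batch, keep the newest record per user_id, and merge into account_current. The table and column names come from the question; the epoch-second bounds batch_start/batch_end and the row_number deduplication are assumptions about how the hourly window and tie-breaking would be expressed.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Hypothetical epoch-second bounds for the hour being processed (placeholders).
batch_start, batch_end = 1_700_000_000, 1_700_003_600

# 1. Restrict account_history to rows that arrived in this batch window,
#    so the merge source is tens of thousands of rows, not millions.
batch = (
    spark.table("account_history")  # `spark` is the ambient Databricks session
    .where(F.col("last_updated").between(batch_start, batch_end))
)

# 2. Keep only the most recent record per user_id within the batch.
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
latest = (
    batch.withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# 3. Merge into the Type 1 target: update matched keys, insert new ones.
(
    DeltaTable.forName(spark, "account_current")
    .alias("t")
    .merge(latest.alias("s"), "t.user_id = s.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Because the source is pre-deduplicated to one row per user_id, the merge touches only the files containing changed keys rather than rewriting the whole table.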

Comments

terrku
Highly Voted 8 months, 2 weeks ago
Selected Answer: B
A Type 1 table means the behavior is overwriting existing values.
upvoted 6 times
...
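
For comparison, a hedged sketch of what option B would look like in PySpark. Option B's wording ("grouping by user_id and filtering for the max value of last_updated") is expressed here with a window function, since a plain groupBy/max would need a join back to recover the remaining columns; the table names come from the question. The point several commenters make is visible in the code: every hourly run scans all of account_history and rewrites all of account_current.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Latest record per user_id across ALL of account_history (full scan).
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
full_latest = (
    spark.table("account_history")
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn")
)

# Overwrite replaces the entire table each batch: millions of rows are
# rewritten to apply tens of thousands of changes.
(
    full_latest.write.format("delta")
    .mode("overwrite")
    .saveAsTable("account_current")
)
```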
RafaelCFC
Highly Voted 10 months, 2 weeks ago
Selected Answer: C
My reasoning is thus: the application is based on batch processes, so A is wrong. Overwriting the table would destroy the Type 1 SCD behavior, so B is wrong. Comparing versions of account_history would not be efficient, as the whole dataset would be scanned, so D is wrong. 'username' is not a key column, so we have no guarantee that it is unique; deduplicating by it can yield wrongly grouped sets of rows, so E is not a safe bet with the information we have. C is the best option.
upvoted 6 times
...
Ananth4Sap
Most Recent 1 month ago
C is correct. B is wrong: by filtering for the max value of last_updated and then overwriting, we would miss some valid records. There are two valid approaches: 1. filter for the max value of last_updated and merge (option C); 2. use a window function on last_updated, filter, and then overwrite (not among the options).
upvoted 1 times
...
benni_ale
1 month, 1 week ago
Selected Answer: C
A. No: a batch job is required, so Auto Loader and Structured Streaming are unnecessarily complex solutions. B. No: a full overwrite of the table is not efficient. C. Yes: it filters and merges on the key, using as little data as reasonable in the merge statement, so why not? D. No: a diff operation is very inefficient for this purpose. E. No: username is not a key.
upvoted 1 times
...
Dhusanth
3 months, 3 weeks ago
Selected Answer: C
C is correct
upvoted 1 times
...
faraaz132
3 months, 3 weeks ago
C is correct because a record might have multiple changes in the batch, and we need to select the most recent change that happened to that record. For that we use the max last_login date, rank the rows using a window function, then filter on rank = 1 and use the result for the UPSERT operation.
upvoted 1 times
...
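
The ranked-filter-then-upsert flow this comment describes can also be written as a single Spark SQL MERGE, run here through Python. This is a sketch of the same logic as the PySpark example above; the BETWEEN bounds are placeholder epoch seconds for the hour being processed, and the column list simply restates the schema from the question.

```python
# Deduplicate per user_id with ROW_NUMBER, then upsert in one MERGE.
spark.sql("""
    MERGE INTO account_current AS t
    USING (
        SELECT user_id, username, user_utc, user_region,
               last_login, auto_pay, last_updated
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY user_id
                                      ORDER BY last_updated DESC) AS rn
            FROM account_history
            WHERE last_updated BETWEEN 1700000000 AND 1700003600
        ) ranked
        WHERE rn = 1
    ) AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```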
Karunakaran_R
5 months, 3 weeks ago
I think B; a Type 1 table must overwrite the data.
upvoted 1 times
...
Freyr
6 months ago
C is correct. A Type 1 table performs an "upsert" operation based on the merge condition, without maintaining history: new records are inserted and existing records are updated, so historical values are not retained. Therefore the correct answer is C.
upvoted 3 times
...
PrashantTiwari
9 months, 2 weeks ago
C is correct
upvoted 1 times
...
DAN_H
9 months, 4 weeks ago
Selected Answer: C
answer is C
upvoted 2 times
...
spaceexplorer
10 months ago
Selected Answer: C
answer is C
upvoted 1 times
...
kz_data
10 months, 2 weeks ago
Selected Answer: C
Correct answer is C
upvoted 1 times
...
ATLTennis
10 months, 3 weeks ago
Selected Answer: D
D is the optimal way to identify the changes in the last data refresh.
upvoted 1 times
AndreFR
3 months ago
It doesn't work because it doesn't take the requirement on user_id into account.
upvoted 2 times
...
...
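
For reference, a sketch of what option D would amount to, using Delta time travel from Python; the version numbers are placeholders. Even setting aside the per-user_id requirement AndreFR raises, the EXCEPT has to compare two full snapshots of a table with millions of rows, which is why commenters above call it inefficient.

```python
# Rows added between two consecutive versions of account_history.
# VERSION AS OF is Delta time travel; 41 and 42 are placeholder versions.
new_rows = spark.sql("""
    SELECT * FROM account_history VERSION AS OF 42
    EXCEPT
    SELECT * FROM account_history VERSION AS OF 41
""")
```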
sturcu
1 year, 1 month ago
Selected Answer: E
We need to filter on the last hours and deduplicate the records, then merge. C is not correct; filtering on the max login date makes no sense.
upvoted 2 times
sturcu
1 year, 1 month ago
If last_login is the column that shows the latest version of the record, then answer C is correct.
upvoted 1 times
...
petrv
12 months ago
Deduplication on username does not make sense; username is not the primary key.
upvoted 3 times
...
...
Eertyy
1 year, 2 months ago
The correct answer is C.
upvoted 1 times
...
thxsgod
1 year, 2 months ago
Selected Answer: C
Correct
upvoted 3 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other