Exam Certified Data Engineer Professional topic 1 question 95 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 95
Topic #: 1

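A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

[Code not captured in this copy. Per the comments below, it applies withWatermark("time", "2 hours"), dropDuplicates(["customer_id", "order_id"]), and trigger(once=True).]
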
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.

If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

  • A. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
  • B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
  • C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
  • D. Duplicate records arriving more than 2 hours apart will be dropped, but duplicates that arrive in the same batch may both be written to the orders table.
  • E. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
Suggested Answer: A 🗳️
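
The choice between options A and E turns on two separate effects a 2-hour watermark could have: discarding rows whose time is older than the watermark, and forgetting deduplication state older than the watermark. The following is a toy model in plain Python, not Spark code and not the job from the question. The function name `run_batches` and the flags `drop_late` and `evict_state` are invented for illustration, and the sketch does not settle which combination matches the actual job.

```python
from datetime import datetime, timedelta


def run_batches(batches, delay=timedelta(hours=2), drop_late=True, evict_state=True):
    """Toy model of watermark-based deduplication over micro-batches.

    batches: list of batches, each a list of (customer_id, order_id, time).
    Returns the rows that this model would append to the target table.
    """
    seen = {}         # (customer_id, order_id) -> time of the first row kept
    max_time = None   # largest time value observed so far
    watermark = None  # max_time - delay, refreshed after every batch
    written = []

    for batch in batches:
        for customer_id, order_id, time in batch:
            # Assumption 1: rows older than the watermark are discarded.
            if drop_late and watermark is not None and time < watermark:
                continue
            key = (customer_id, order_id)
            if key in seen:
                continue  # duplicate of a key still held in state
            seen[key] = time
            written.append((customer_id, order_id, time))

        # Advance the watermark using every time value in the batch.
        for _, _, time in batch:
            max_time = time if max_time is None else max(max_time, time)
        if max_time is not None:
            watermark = max_time - delay

        # Assumption 2: keys first seen before the watermark are forgotten.
        if evict_state and watermark is not None:
            seen = {k: t for k, t in seen.items() if t >= watermark}

    return written


day = datetime(2024, 1, 1)
batches = [
    [(1, 100, day + timedelta(hours=9))],               # original order
    [(2, 200, day + timedelta(hours=12))],              # moves watermark to 10:00
    [(1, 100, day + timedelta(hours=12, minutes=30))],  # same order, re-enqueued
]

print(len(run_batches(batches)))                     # 3 rows: (1, 100) written twice
print(len(run_batches(batches, evict_state=False)))  # 2 rows: duplicate skipped
```

With both flags on, the second (1, 100) row is written again because its key was forgotten once the watermark moved to 10:00, which is the situation option A describes. With `evict_state=False` the same row is recognized as a duplicate and skipped.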

Comments

alexvno
Highly Voted 11 months, 1 week ago
Selected Answer: A
Only A seems logical
upvoted 7 times
...
benni_ale
Most Recent 4 days ago
For me, it's the onion.
upvoted 1 times
...
cf56faf
1 week, 6 days ago
Selected Answer: E
E seems to be the correct answer, since time is the time the record was queued in the *source_system*, and withWatermark ignores records that have a "time" more than 2 hours old.
upvoted 1 times
...
Ananth4Sap
4 weeks ago
When watermarking is set to 2 hours, the system will wait for up to 2 hours for late data to arrive. Any data that arrives within this 2-hour window will be considered for processing and deduplication. However, data that arrives later than 2 hours after the event time will be considered too late and will be discarded. This ensures that the state store does not grow indefinitely, but it also means that any records arriving more than 2 hours late will not be included in the orders table.
upvoted 1 times
...
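
In Structured Streaming, a watermark is derived from the largest event time the query has seen so far minus the delay threshold, so "late" is judged against the data's own time column rather than against the wall clock at arrival. A minimal sketch of that comparison in plain Python follows. The helper `split_late` and the timestamps are made up for illustration, and the sketch models only the comparison, not Spark's state store.

```python
from datetime import datetime, timedelta


def split_late(max_time_seen, batch_times, delay=timedelta(hours=2)):
    """Separate a batch of event times into on-time and late ones.

    max_time_seen: largest event time observed in earlier batches.
    """
    watermark = max_time_seen - delay
    on_time = [t for t in batch_times if t >= watermark]
    late = [t for t in batch_times if t < watermark]
    return watermark, on_time, late


day = datetime(2024, 1, 1)
max_seen = day + timedelta(hours=12)  # earlier batches reached 12:00
batch = [
    day + timedelta(hours=9, minutes=30),
    day + timedelta(hours=11),
    day + timedelta(hours=12, minutes=30),
]

watermark, on_time, late = split_late(max_seen, batch)
print(watermark.time(), len(on_time), len(late))  # 10:00:00 2 1
```

Here the 09:30 row is treated as late because earlier data had already advanced the watermark to 10:00, whatever the wall-clock time at which its file landed in the directory.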
m79590530
1 month ago
Selected Answer: E
Because of the .trigger(once=True) option, each stream micro-batch is executed on all of the new data that arrived since the last run. Deduplication is done on the combined key field values, but all records older than 2 hours based on the 'time' field are ignored because of the .withWatermark() function. So the target table will have deduplicated data without the late records that arrive more than 2 hours behind, given the 2-hour watermark set on the readStream.
upvoted 1 times
...
shaojunni
1 month, 1 week ago
Selected Answer: E
Data arriving outside the watermark will be dropped.
upvoted 2 times
...
quaternion
3 months, 1 week ago
Selected Answer: E
withWatermark("time", "2 hours") --> doesn't let records arriving more than 2 hours late be written
dropDuplicates --> removes duplicate records from the records that are read
upvoted 2 times
...
Isio05
5 months, 2 weeks ago
Selected Answer: A
It's A: rows are deduplicated only within the 2-hour window, so the final table may eventually contain duplicates.
upvoted 3 times
...
QuangTrinh
5 months, 3 weeks ago
Selected Answer: E
Should be E.
Watermarking (withWatermark("time", "2 hours")): This sets a 2-hour watermark on the time column. The watermark specifies the event time threshold for data completeness, meaning that data older than 2 hours will be considered late and may be dropped.
Deduplication (dropDuplicates(["customer_id", "order_id"])): This operation removes duplicates based on the composite key (customer_id and order_id). However, it only works within the window defined by the watermark.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other