Exam Certified Data Engineer Professional topic 1 question 145 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 145
Topic #: 1

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
[Code not shown in the transcript. Based on the discussion below, the job is assumed to be a Structured Streaming read of /mnt/raw_orders/ that applies withWatermark("time", "2 hours") and dropDuplicates on customer_id and order_id, then appends to an orders table.]
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.

If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

  • A. Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
  • B. All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
  • C. The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
  • D. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
Suggested Answer: A 🗳️
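Why duplicates enqueued more than 2 hours apart can survive (answer A) can be sketched with a toy model in plain Python. This is an illustration of watermark-based deduplication state eviction, not actual Spark code; the 2-hour delay and the composite key come from the question, while the batch contents and helper names are assumptions:

```python
from datetime import datetime, timedelta

# Toy model (plain Python, not Spark) of streaming dedup with a watermark:
# a key stays in the state store only until the watermark (max event time
# seen minus the 2-hour delay) passes its event time.
DELAY = timedelta(hours=2)

def run_batches(batches):
    state = {}                      # (customer_id, order_id) -> event time
    max_event_time = datetime.min
    output = []                     # rows appended to the orders table
    for batch in batches:
        for customer_id, order_id, t in batch:
            key = (customer_id, order_id)
            if key not in state:    # a row is a duplicate only while its
                state[key] = t      # key is still held in state
                output.append((customer_id, order_id, t))
            max_event_time = max(max_event_time, t)
        # Between micro-batches, evict state older than the watermark.
        watermark = max_event_time - DELAY
        state = {k: v for k, v in state.items() if v >= watermark}
    return output

t0 = datetime(2024, 1, 1, 0, 0)
rows = run_batches([
    [("c1", "o1", t0), ("c2", "o2", t0)],                    # batch 1
    [("c3", "o3", t0 + timedelta(hours=3)),                  # advances watermark
     ("c2", "o2", t0 + timedelta(hours=1))],                 # dup within 2 h: dropped
    [("c1", "o1", t0 + timedelta(hours=3))],                 # dup 3 h apart: kept!
])
```

Once (c1, o1)'s state entry is evicted by the advancing watermark, the late duplicate is no longer recognized, so the appended orders table ends up with two rows for that composite key, which is exactly the situation answer A describes.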

Comments

arekm
1 month ago
Selected Answer: A
A - the same order [customer_id, order_id] might be emitted twice with time values 2 or more hours apart. The watermark will not drop the second record, since its time value is fresh rather than late; but from the composite-key perspective it is a duplicate.
upvoted 1 times
...
UrcoIbz
1 month, 2 weeks ago
Selected Answer: A
dropDuplicates only removes duplicates whose keys are still held in the deduplication state; if records with the same key arrive in batches after that state has been evicted, the final table will contain duplicates. In addition, withWatermark, when not used with a window, takes the MAX(eventTime) seen and uses the threshold to define the retention range. Since time represents when the record was queued in the source system, we can receive records whose time values are more than 2 hours apart. See pyspark.sql.DataFrame.dropDuplicates and pyspark.sql.DataFrame.withWatermark in the PySpark 3.5.3 documentation.
upvoted 2 times
...
vish9
2 months, 3 weeks ago
Selected Answer: D
Orders arriving 2 or more hours late will be dropped. There is a chance they can still be processed, but deduplication will still happen.
upvoted 1 times
...
smashit
2 months, 4 weeks ago
There is a chance that the same record, for example (A1, O1), arrives in batch B1 and again in B2. We need to implement merge logic on the target table, or perform an insert-only merge.
upvoted 1 times
...
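The insert-only merge idea suggested above can be sketched as a toy model in plain Python. This is not the Delta Lake API; on Databricks the real fix would be a MERGE INTO ... WHEN NOT MATCHED THEN INSERT against the target table, typically run from a foreachBatch sink. All names here are illustrative:

```python
# Toy model of an insert-only merge: the target table is keyed by the
# composite key, and a batch row is inserted only WHEN NOT MATCHED, so
# duplicates arriving hours apart in later batches cannot accumulate.
def insert_only_merge(target, batch):
    for customer_id, order_id, t in batch:
        key = (customer_id, order_id)
        if key not in target:       # WHEN NOT MATCHED THEN INSERT
            target[key] = t
    return target

orders = {}
insert_only_merge(orders, [("c1", "o1", "00:00")])
insert_only_merge(orders, [("c1", "o1", "03:00")])  # duplicate, 3 hours later
# orders still holds a single row for (c1, o1), keeping the first time seen.
```

Unlike watermarked stream-side deduplication, this check is made against the full target table, so it is immune to how far apart the duplicates arrive.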
Jugiboss
3 months, 2 weeks ago
Selected Answer: A
Watermark thresholds guarantee that records arriving within the specified threshold are processed according to the semantics of the defined query. Records arriving outside the specified threshold might still be processed, but this is not guaranteed.
upvoted 1 times
...
m79590530
3 months, 2 weeks ago
Selected Answer: D
The default write mode is 'append'. Duplicates will be resolved within each 2-hour window, and .withWatermark() will drop/ignore records delayed by more than 2 hours.
upvoted 1 times
...
csrazdan
5 months ago
Selected Answer: A
The default write mode is append. Duplicates will be resolved only within the 2-hour window, but may still exist because of a previous execution.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted