Exam Certified Data Engineer Professional topic 1 question 106 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 106
Topic #: 1

A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:

user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING

The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.

Which solution minimizes the compute costs to propagate this batch of data?

  • A. Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
  • B. Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
  • C. Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
  • D. Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
  • E. Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
Suggested Answer: A
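The insert-only merge in option A can be sketched conceptually in plain Python. In a real pipeline this would be a Delta Lake `MERGE INTO ... WHEN NOT MATCHED THEN INSERT`; here a dictionary stands in for the deduplicated target table, and the function name and sample data are illustrative, not from the exam:

```python
# Conceptual sketch of an insert-only merge: rows are matched on the natural
# composite key (user_id, review_id, product_id, review_timestamp); only rows
# whose key is absent from the target are inserted, so re-delivered duplicates
# are silently skipped.

def insert_only_merge(target, new_batch):
    """Insert rows whose composite key is absent from target; skip duplicates."""
    for row in new_batch:
        key = (row["user_id"], row["review_id"],
               row["product_id"], row["review_timestamp"])
        if key not in target:
            target[key] = row
    return target

# Illustrative data: the second batch re-delivers review 1 (a duplicate).
target = {}
batch1 = [{"user_id": "u1", "review_id": 1, "product_id": 10,
           "review_timestamp": "2024-01-01T00:00:00", "review_text": "great"}]
batch2 = [{"user_id": "u1", "review_id": 1, "product_id": 10,
           "review_timestamp": "2024-01-01T00:00:00", "review_text": "great"},
          {"user_id": "u2", "review_id": 2, "product_id": 11,
           "review_timestamp": "2024-01-02T00:00:00", "review_text": "ok"}]

insert_only_merge(target, batch1)
insert_only_merge(target, batch2)
print(len(target))  # → 2 (the duplicate row is not inserted twice)
```

This captures why A guarantees a deduplicated target even if the same record arrives in more than one batch, at the cost of scanning the target to evaluate the match condition.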

Comments

alexvno
Highly Voted 9 months, 3 weeks ago
Selected Answer: A
Deduplication is required, so an insert-only merge.
upvoted 5 times
...
bacckom
Highly Voted 11 months, 3 weeks ago
Selected Answer: A
Should we consider deduplication? As for time travel, I don't think it can be used to deduplicate the target table.
upvoted 5 times
...
Hienlv1
Most Recent 1 week, 1 day ago
Selected Answer: C
I think C is the correct answer: use the time travel feature to get the previous version and compare it to the current version to figure out which records need to be inserted, instead of a full scan during the read as in option A. The goal is to minimize compute costs while propagating only new records inserted into the reviews_raw table to the next table in the pipeline.
upvoted 1 times
...
Sriramiyer92
2 weeks, 5 days ago
Selected Answer: A
In the case of D, the 48-hour point is just added to confuse us. A is enough.
upvoted 1 times
...
cales
2 months, 3 weeks ago
Selected Answer: B
"The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched." The deduplication will be performed in the following step. Answer B should fit better with cost minimization
upvoted 1 times
Sriramiyer92
2 weeks, 5 days ago
Keyword: batch. "Propagated" means movement; it does not necessarily mean "stream". Also, with streaming it would become an expensive affair.
upvoted 1 times
...
...
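The trigger-once pattern debated in this thread can be sketched conceptually: Structured Streaming keeps a checkpoint, so each run reads only the records appended to the source since the previous run. A real job would use `spark.readStream` against the Delta table with a once/available-now trigger; the toy simulation below in plain Python (all names are hypothetical) just shows the checkpointing idea:

```python
# Toy simulation of trigger-once style incremental processing: a checkpoint
# records how far into the source we have read, so each "run" processes only
# the rows appended since the previous run, then stops.

def run_once(source_rows, checkpoint, process):
    """Process only rows added after the checkpointed offset; advance the offset."""
    start = checkpoint.get("offset", 0)
    new_rows = source_rows[start:]
    for row in new_rows:
        process(row)
    checkpoint["offset"] = len(source_rows)
    return len(new_rows)

reviews_raw = ["r1", "r2", "r3"]
checkpoint = {}
processed = []

run_once(reviews_raw, checkpoint, processed.append)  # first run: 3 rows
reviews_raw += ["r4"]                                # nightly append arrives
run_once(reviews_raw, checkpoint, processed.append)  # second run: only the 1 new row
print(processed)  # → ['r1', 'r2', 'r3', 'r4']
```

This is why B is cheap on the read side (no full-table scan), but, as other commenters note, it does not by itself deduplicate the target: the write step would still need a merge for that.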
shaojunni
2 months, 3 weeks ago
Selected Answer: A
A batch read loads the full table but guarantees no duplication via the merge. Trigger-once only loads the new data, but you still have to run a merge to guarantee no duplication across the whole target table, and B does not say that.
upvoted 1 times
...
RyanAck24
3 months, 1 week ago
Selected Answer: A
A is Correct
upvoted 1 times
...
shaojunni
3 months, 2 weeks ago
Selected Answer: B
B is correct; trigger once is the Structured Streaming option for batch-style jobs, but much more efficient.
upvoted 1 times
...
spaceexplorer
11 months, 1 week ago
Selected Answer: B
B is correct
upvoted 1 times
...
ranith
11 months, 1 week ago
B should be correct when looking at cost minimization: a batch read would scan the whole reviews_raw table, which is unnecessary since historical data does not change. If a review is delayed awaiting moderator approval, it is still inserted as a new record, so capturing only the new data is sufficient.
upvoted 2 times
...
divingbell17
1 year ago
Selected Answer: B
B should be correct. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
upvoted 4 times
Istiaque
1 year ago
It is a batch process.
upvoted 2 times
...
...
Community vote distribution: A (35%), C (25%), B (20%), Other