Exam Certified Data Engineer Professional topic 1 question 15 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 15
Topic #: 1

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?

A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
C. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
D. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Suggested Answer: E 🗳️

by Eertyy at Aug. 27, 2023, 6:26 p.m.

Comments

Eertyy

Highly Voted 1 year, 10 months ago

E is the right answer.

upvoted 6 times

...

KadELbied

Most Recent 2 months, 1 week ago

Selected Answer: E

Surely E.

upvoted 1 times

...

JoG1221

2 months, 3 weeks ago

Selected Answer: E

Option E aligns with Delta Lake best practices: efficient updates using MERGE, precise tracking with Change Data Feed, and streamlined ML inference on only the updated records.

upvoted 1 times

...

AlHerd

3 months, 2 weeks ago

Selected Answer: E

E. While both D and E look right, D only adds a timestamp and doesn’t track whether the record content actually changed, leading to false positives.

upvoted 1 times

...

Tedet

4 months, 2 weeks ago

Selected Answer: D

Evaluation: Adding a current_timestamp() field to each record during the overwrite allows you to track when each record was written. This makes it easy to identify records that have been updated or inserted recently by filtering on this timestamp field (e.g., filtering for records written in the past 24 hours). This approach simplifies identifying recently changed records because you can easily filter for the most recent data and then run churn predictions only on those records. Conclusion: This is a simple and efficient solution. It allows you to track changes by using a timestamp, making it easy to filter and predict only on changed records without complex logic.

upvoted 1 times

...

arekm

6 months, 2 weeks ago

Selected Answer: E

A, B, and C don't make sense. Adding a timestamp to overwrite logic that rewrites everything does not make sense either: all records would carry a timestamp from last night's run, which would not help in identifying what changed. E is correct. Write only the changes, use CDF to identify them, and apply the model.
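The point about the overwrite timestamp can be illustrated with a small pure-Python simulation. This is only a sketch: plain dicts stand in for the Delta table, and the row layout, the `written_at` field, and the helper names are all hypothetical.

```python
from datetime import datetime, timedelta

def overwrite_with_timestamp(new_rows, now):
    """Option D in spirit: rewrite every row and stamp it with the write time."""
    return {r["customer_id"]: {**r, "written_at": now} for r in new_rows}

def merge_changed_only(table, new_rows, now):
    """Option E in spirit: touch a row only if it is new or its values differ."""
    result = dict(table)
    for r in new_rows:
        old = table.get(r["customer_id"])
        # Compare values only, ignoring the bookkeeping timestamp.
        old_values = None if old is None else {k: v for k, v in old.items() if k != "written_at"}
        if old_values != r:
            result[r["customer_id"]] = {**r, "written_at": now}
    return result

def written_since(table, cutoff):
    """Customer ids whose rows were written at or after the cutoff."""
    return sorted(cid for cid, row in table.items() if row["written_at"] >= cutoff)

day1 = datetime(2024, 1, 1, 2, 0)
day2 = day1 + timedelta(days=1)
cutoff = day2 - timedelta(hours=12)

night1 = [{"customer_id": 1, "plan": "basic"}, {"customer_id": 2, "plan": "pro"}]
night2 = [{"customer_id": 1, "plan": "basic"}, {"customer_id": 2, "plan": "basic"}]

# Overwrite: both rows carry last night's timestamp, changed or not.
overwritten = overwrite_with_timestamp(night2, day2)
flagged_by_overwrite = written_since(overwritten, cutoff)

# Merge: only customer 2, whose plan actually changed, is touched.
merged = merge_changed_only(overwrite_with_timestamp(night1, day1), night2, day2)
flagged_by_merge = written_since(merged, cutoff)
```

In this toy run the overwrite flags both customers, while the merge flags only the one whose values changed, which is the distinction between D and E.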

upvoted 2 times

...

Sriramiyer92

7 months ago

Selected Answer: E

While both E and D are correct, E is more accurate given the scenario.

upvoted 1 times

...

janeZ

7 months, 1 week ago

Selected Answer: D

D is the right answer.

upvoted 1 times

...

Melik3

11 months, 1 week ago

I don't understand why E is correct. With E we update only the data that needs it, but we then run predictions on the whole table, which means we again predict on records that have not changed. That is not efficient.

upvoted 1 times

Tedet

4 months, 2 weeks ago

You are 100% correct, Melik3. The consequences of E are as follows. A merge statement ensures that only the records that have changed are updated, but it doesn’t directly address how to identify which records have changed within the last 24 hours. Using a change data feed can help track changes, but it may not be the most efficient method unless the infrastructure is set up for real-time change tracking. The complexity of managing and using the change data feed for just 24-hour changes might introduce unnecessary overhead. Conclusion: This is a good option, but it could be more complex to implement than simply adding a current_timestamp() field.

upvoted 1 times

...

benni_ale

8 months, 3 weeks ago

"write logic to make predictions on the CHANGED records identified by the change data feed". The only thing partially wrong about E is that it is never stated that the table has the change data feed enabled.
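If the change data feed is not yet enabled, it can be switched on through a table property, and the changes can then be queried from a starting version or timestamp. The sketch below only builds the SQL strings so that it runs without a cluster. The helper names are hypothetical, and it assumes a Databricks or Delta Lake environment where the `delta.enableChangeDataFeed` property and the `table_changes` function are available.

```python
def enable_cdf_sql(table: str) -> str:
    """SQL that turns on the change data feed for an existing Delta table."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"

def changed_rows_sql(table: str, start_timestamp: str) -> str:
    """SQL that reads rows inserted or updated since start_timestamp.

    update_preimage rows (the old values) and delete rows are filtered out,
    leaving only the current values of new or changed records.
    """
    return (
        f"SELECT * FROM table_changes('{table}', '{start_timestamp}') "
        "WHERE _change_type IN ('insert', 'update_postimage')"
    )

enable_stmt = enable_cdf_sql("customer_churn_params")
# The timestamp is an illustrative value, not one taken from the question.
changes_stmt = changed_rows_sql("customer_churn_params", "2024-01-01 00:00:00")
```

On a cluster these strings would be passed to `spark.sql(...)`. Only changes committed after the property is set are recorded, so the feed would need to be enabled before the first merge whose changes should be picked up.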

upvoted 1 times

...

leopedroso1

1 year, 4 months ago

E is the correct one. Replacing the overwrite with a merge results in an upsert that updates only the data that needs it (WHEN MATCHED UPDATE plus WHEN NOT MATCHED INSERT clauses). Then, with the change data feed (CDF), the need to identify the changed records is also satisfied.
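A merge along these lines might look like the following. It is a sketch, not the team's actual job: the source view `churn_params_updates`, the key `customer_id`, and the compared columns `plan` and `monthly_spend` are assumed names. The extra condition on WHEN MATCHED matters, because an unconditional update would typically rewrite every matched row, and those rows would then appear in the change data feed even when nothing changed.

```python
def build_merge_sql(target, source, key, compare_cols):
    """Build a Delta MERGE that updates a matched row only when a value differs."""
    # NOT (a <=> b) is a null-safe inequality: true when the values differ,
    # including the case where exactly one side is NULL.
    changed = " OR ".join(f"NOT (t.{c} <=> s.{c})" for c in compare_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND ({changed}) THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )

merge_sql = build_merge_sql(
    "customer_churn_params",   # target table from the question
    "churn_params_updates",    # hypothetical view holding tonight's values
    "customer_id",             # hypothetical unique customer key
    ["plan", "monthly_spend"], # hypothetical columns to compare
)
```

On a cluster the statement would be run with `spark.sql(merge_sql)` in place of the nightly overwrite.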

upvoted 2 times

...