Exam Certified Data Engineer Professional topic 1 question 72 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 72
Topic #: 1

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Original query:

Proposed query:

.start("/item_agg")

Which step must also be completed to put the proposed query into production?

A. Specify a new checkpointLocation
B. Increase the shuffle partitions to account for additional aggregates
C. Run REFRESH TABLE delta.`/item_agg`
D. Register the data in the "/item_agg" directory to the Hive metastore
E. Remove .option('mergeSchema', 'true') from the streaming write

Suggested Answer: A

by f728f7f at Dec. 21, 2023, 2:08 p.m.

Comments

f728f7f

Highly Voted 1 year ago

This question is broken. Proposed query cannot be identified.

upvoted 24 times

AlejandroU

Highly Voted 6 months, 4 weeks ago

Selected Answer: A

Below is the proposed query:

    df.groupBy("item")
      .agg(count("item").alias("total_count"),
           mean("sale_price").alias("avg_price"),
           count("promo_code = 'NEW MEMBER'").alias("new_member_promo"))
      .writeStream
      .outputMode("complete")
      .option('mergeSchema', 'true')
      .option("checkpointLocation", "/item_agg/checkpoint")
      .start("/item_agg")

Answer A. When updating the schema of a streaming job by adding new fields (like the new_member_promo field), it's important to use a new checkpoint location. The existing checkpoint location is tied to the old schema because it holds the state of the original aggregation, so restarting the changed query from it could lead to schema mismatch issues.
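For anyone who wants to see what answer A looks like in practice, here is a minimal PySpark sketch of the proposed query restarted against a fresh checkpoint directory. It is an illustration under stated assumptions, not the exam's reference solution: the helper names (checkpoint_path, start_item_agg), the versioned directory naming, and the explicit Delta format are all hypothetical. The promo count is also written as a conditional count, since as far as I can tell passing the whole expression to count() as a string would be read as a column name.

```python
def checkpoint_path(base: str, version: int) -> str:
    """Return a versioned checkpoint directory under `base`.

    The `_checkpoint_v<N>` naming is a hypothetical convention for this sketch.
    """
    return f"{base.rstrip('/')}/_checkpoint_v{version}"


def start_item_agg(df, target: str = "/item_agg", checkpoint_version: int = 2):
    """Start the updated aggregation with a new checkpoint location.

    `df` is assumed to be a streaming DataFrame with the columns
    item, sale_price and promo_code.
    """
    # Imported here so checkpoint_path() can be used without Spark installed.
    from pyspark.sql import functions as F

    aggregated = df.groupBy("item").agg(
        F.count("item").alias("total_count"),
        F.mean("sale_price").alias("avg_price"),
        # Conditional count: when() yields NULL for non-matching rows,
        # and count() skips NULLs.
        F.count(F.when(F.col("promo_code") == "NEW MEMBER", 1)).alias(
            "new_member_promo"
        ),
    )

    return (
        aggregated.writeStream
        .format("delta")  # assumption: the sink is a Delta table
        .outputMode("complete")
        .option("mergeSchema", "true")
        # A new location, so no state from the old aggregation is reused.
        .option("checkpointLocation", checkpoint_path(target, checkpoint_version))
        .start(target)
    )
```

Bumping checkpoint_version each time the aggregation changes keeps the old checkpoint around for reference while the new query starts from clean state.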

upvoted 5 times

OnlyPraveen

6 months, 3 weeks ago

Thank you! Also check Question #114 which has the Proposed Query image too.

upvoted 1 times

KadELbied

Most Recent 2 months, 1 week ago

Selected Answer: A

Surely A

upvoted 1 times

kino_1994

7 months ago

Selected Answer: A

Since the new field is a count (an aggregation), it is non-nullable, making the change incompatible with the existing schema. This requires a new checkpointLocation to avoid schema mismatch issues. Additionally, the "mergeSchema=true" option must remain enabled to allow Spark to handle the schema evolution properly. However, if the field were nullable and not an aggregation, it would be a backward-compatible change, allowing the checkpoint to remain unchanged, as happens with schema evolution in Kafka. In this case, the correct answer is A.
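The compatibility question can be made concrete with a small Python sketch. It assumes a simplified rule of thumb: any change to the grouping keys or to the set of aggregate outputs means the old checkpoint state no longer fits. This is not the check Spark itself performs, and AggSpec and needs_new_checkpoint are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggSpec:
    """Hypothetical description of a streaming aggregation's shape."""

    group_keys: tuple
    outputs: tuple  # (column name, type) pairs produced by the aggregation


def needs_new_checkpoint(old: AggSpec, new: AggSpec) -> bool:
    """Simplified rule: any change in keys or aggregates invalidates old state."""
    return old.group_keys != new.group_keys or old.outputs != new.outputs


original = AggSpec(
    group_keys=("item",),
    outputs=(("total_count", "long"), ("avg_price", "double")),
)
proposed = AggSpec(
    group_keys=("item",),
    outputs=(
        ("total_count", "long"),
        ("avg_price", "double"),
        ("new_member_promo", "long"),
    ),
)

if __name__ == "__main__":
    print(needs_new_checkpoint(original, proposed))
```

Under this rule the query from the question needs a new checkpoint because it adds a third aggregate output, regardless of whether that column is nullable.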

upvoted 2 times

Sriramiyer92

7 months, 1 week ago

Selected Answer: A

The given answer is correct. When new columns are added (or existing ones changed), the checkpoint location needs to change as well.

upvoted 1 times
