
Exam Certified Data Engineer Professional topic 1 question 150 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 150
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

[Code screenshot not reproduced here. Per the discussion below, it performs a batch append of Parquet files into a Delta table named bronze.]
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.read.option("readChangeFeed", "true").table("bronze")
  • C. [screenshot not reproduced; per the comments, a batch read filtered on an insertion timestamp]
  • D. [screenshot not reproduced; per the comments, a batch read filtered by comparing the input file name to a directory path]
Suggested Answer: A
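
For reference, option A written out as a complete function. This is a minimal sketch, assuming an active SparkSession and an existing Delta table named bronze (both implied by the question but not shown in it):

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.getOrCreate()

    def new_records() -> DataFrame:
        # A streaming read of a Delta table yields only records appended
        # since the stream's last committed checkpoint, so each micro-batch
        # holds exactly the rows not yet processed to the next table.
        return spark.readStream.table("bronze")

Because the nightly job only appends, a plain streaming read is sufficient here; no change-feed bookkeeping is needed.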

Comments

Freyr
Highly Voted 8 months, 1 week ago
Selected Answer: B
Correct Answer: B. The Change Data Feed (CDF) feature in Delta Lake enables reading only the changes (inserts, updates, and deletes) to a Delta table. This allows the function to focus on data modified since the last run, making it ideal for processing only the records that have not yet been processed. This directly meets the requirement for identifying and manipulating new records efficiently.
upvoted 7 times
arekm
1 month ago
It is missing the starting version: you either use readStream without a version, or read (i.e., batch) with a version. Here in B the version is missing. Since the source table is append-only, it is a perfect fit for streaming, making A the right choice.
upvoted 1 times
...
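
To make arekm's objection concrete: a batch Change Data Feed read must name its starting point, which is exactly what option B omits. A minimal sketch, with a purely illustrative starting version:

    # Batch CDF read: startingVersion (or startingTimestamp) is required.
    # The value 5 is hypothetical.
    changes = (
        spark.read
            .option("readChangeFeed", "true")
            .option("startingVersion", 5)
            .table("bronze")
    )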
practicioner
5 months, 3 weeks ago
We are ingesting Parquet files from a folder into the bronze table. It doesn't make any sense to use the CDF feature on the bronze table.
upvoted 1 times
practicioner
5 months, 3 weeks ago
I've changed my opinion. Yes, B looks like the correct answer.
upvoted 2 times
...
...
...
arekm
Most Recent 1 month ago
Selected Answer: A
A - an append-only table works for streaming. B - missing the starting version. C - the timestamp recorded during insertion will differ from the one observed during the next step. D - that was a "like"-style query (or perhaps a substring check of the file name against the directory); the actual file names only start with the pattern shown but are longer.
upvoted 1 times
...
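
As a counterpart to the batch form above, the streaming variant arekm mentions does not need a version. A minimal sketch; if no starting option is given, the stream returns the table's current snapshot as inserts and then follows subsequent changes:

    # Streaming CDF read: no starting version required.
    stream = (
        spark.readStream
            .option("readChangeFeed", "true")
            .table("bronze")
    )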
UrcoIbz
1 month, 2 weeks ago
Selected Answer: A
Option A seems to be the correct answer. Option B seems not to be the right one, because it is missing the version. Based on the documentation, "Change data feed also supports batch execution, which requires specifying a starting version." https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed#batch Option D, in my opinion, is not correct, as the function definition above does not take any input parameter.
upvoted 2 times
...
m79590530
3 months, 2 weeks ago
Selected Answer: A
Correct answer is A, as we have append-mode writes, which are ideal for simple Structured Streaming as a next step ;)
upvoted 3 times
...
shaojunni
3 months, 3 weeks ago
Selected Answer: A
A streaming read of a Delta table returns only the new records.
upvoted 3 times
...
pk07
4 months, 1 week ago
Selected Answer: B
B. Set the skipChangeCommits flag to true on raw_iot.

Let's break down the requirements and explain why this is the best solution:

Retain manually deleted or updated records in raw_iot: the skipChangeCommits flag, when set to true, tells Delta Live Tables (DLT) to ignore any manual changes (updates or deletes) made to the table outside of the pipeline. This means that even if records are manually deleted or updated in the raw_iot table, these changes won't be reflected in the table when the pipeline runs again.

Recompute downstream bpm_stats table: by default, DLT will recompute downstream tables when their upstream dependencies change. Since bpm_stats is based on raw_iot, it will naturally be recomputed when the pipeline updates, without any special configuration.

Why the other options are not correct: A. Setting pipelines.reset.allowed to false on raw_iot would prevent the table from being reset, but it wouldn't address the requirement to retain manually deleted or updated records.
upvoted 1 times
...
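
The flag this comment describes is an option on a Delta streaming read. A minimal sketch, assuming a table named raw_iot as in the comment:

    # skipChangeCommits makes the stream ignore commits that only update
    # or delete existing rows, so manual changes don't disrupt the stream.
    df = (
        spark.readStream
            .option("skipChangeCommits", "true")
            .table("raw_iot")
    )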
shaojunni
4 months, 2 weeks ago
Selected Answer: D
You have to know the CDF's current version and the last processed version in order to get the not-yet-processed records. B does not provide those versions; it will just return the content of the bronze table with CDF turned on. D is the only possible solution.
upvoted 1 times
...
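
One way to see shaojunni's point: a CDF read exposes metadata columns such as _commit_version, and the caller must track the last processed version externally to resume correctly. A sketch with a hypothetical bookkeeping value:

    # last_processed would be persisted by the pipeline between runs;
    # the value here is hypothetical.
    last_processed = 5
    new_changes = (
        spark.read
            .option("readChangeFeed", "true")
            .option("startingVersion", last_processed + 1)
            .table("bronze")
    )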
HelixAbdu
6 months, 1 week ago
I did not test it, but I think D is wrong, as it filters against a directory path using ==.
upvoted 2 times
...
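
HelixAbdu's objection, illustrated: testing the full input file path for equality against a directory never matches; a prefix test would be needed instead. A sketch using a hypothetical path:

    from pyspark.sql import functions as F

    df = spark.read.table("bronze")
    # Equality against the directory never matches a full file path:
    never_matches = df.filter(F.input_file_name() == "/mnt/daily_batch")  # hypothetical path
    # A prefix match is what such a filter would actually need:
    prefix_match = df.filter(F.input_file_name().startswith("/mnt/daily_batch"))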
MDWPartners
8 months, 1 week ago
Selected Answer: D
Seems D
upvoted 4 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other