
Exam Certified Data Engineer Professional topic 1 question 150 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 150
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

[Code screenshot not reproduced here. Per the discussion below, it performs a batch append of Parquet files into a Delta table named bronze.]
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.read.option("readChangeFeed", "true").table("bronze")
  • C. [screenshot not reproduced; per the comments, a batch read filtered on an insertion timestamp]
  • D. [screenshot not reproduced; per the comments, a batch read filtered by comparing the input file name to a directory path]
Suggested Answer: A
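
For reference, option A written out as a complete function. This is a minimal sketch, assuming an active SparkSession and an existing Delta table named bronze (both implied by the question but not shown in it):

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.getOrCreate()

    def new_records() -> DataFrame:
        # A streaming read of a Delta table yields only records appended
        # since the stream's last committed checkpoint, so each micro-batch
        # holds exactly the rows not yet processed to the next table.
        return spark.readStream.table("bronze")

Because the nightly job only appends, a plain streaming read is sufficient here; no change-feed bookkeeping is needed.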

Comments

Freyr
Highly Voted 8 months, 1 week ago
Selected Answer: B
Correct Answer: B. The Change Data Feed (CDF) feature in Delta Lake enables reading only the changes (inserts, updates, and deletes) to a Delta table. This allows the function to focus on data modified since the last run, making it ideal for processing only the records that have not yet been processed. This directly meets the requirement for identifying and manipulating new records efficiently.
upvoted 7 times
arekm
1 month ago
It is missing the starting version: you either use readStream without a version, or read (i.e., batch) with a version. Here in B the version is missing. Since the source table is append-only, it is a perfect fit for streaming, making A the right choice.
upvoted 1 times
...
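
To make arekm's objection concrete: a batch Change Data Feed read must name its starting point, which is exactly what option B omits. A minimal sketch, with a purely illustrative starting version:

    # Batch CDF read: startingVersion (or startingTimestamp) is required.
    # The value 5 is hypothetical.
    changes = (
        spark.read
            .option("readChangeFeed", "true")
            .option("startingVersion", 5)
            .table("bronze")
    )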
practicioner
5 months, 3 weeks ago
We are ingesting Parquet files from a folder into the bronze table. It doesn't make any sense to use the CDF feature on the bronze table.
upvoted 1 times
practicioner
5 months, 3 weeks ago
I've changed my opinion. Yes, B looks like the correct answer.
upvoted 2 times
...
...
...
arekm
Most Recent 1 month ago
Selected Answer: A
A - an append-only table works for streaming. B - missing the starting version. C - the timestamp recorded during insertion will differ from the one observed during the next step. D - that was a "like"-style query (or perhaps a substring check of the file name against the directory); the actual file names only start with the pattern shown but are longer.
upvoted 1 times
...
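
As a counterpart to the batch form above, the streaming variant arekm mentions does not need a version. A minimal sketch; if no starting option is given, the stream returns the table's current snapshot as inserts and then follows subsequent changes:

    # Streaming CDF read: no starting version required.
    stream = (
        spark.readStream
            .option("readChangeFeed", "true")
            .table("bronze")
    )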
UrcoIbz
1 month, 2 weeks ago
Selected Answer: A
Option A seems to be the correct answer. Option B seems not to be the right one, because it is missing the version. Based on the documentation, "Change data feed also supports batch execution, which requires specifying a starting version." https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed#batch Option D, in my opinion, is not correct, as the function definition above does not take any input parameter.
upvoted 2 times
...
m79590530
3 months, 2 weeks ago
Selected Answer: A
Correct answer is A, as we have append-mode writes, which are ideal for simple Structured Streaming as a next step ;)
upvoted 3 times
...
shaojunni
3 months, 3 weeks ago
Selected Answer: A
A streaming read of a Delta table returns only the new records.
upvoted 3 times
...
pk07
4 months, 1 week ago
Selected Answer: B
B. Set the skipChangeCommits flag to true on raw_iot.

Let's break down the requirements and explain why this is the best solution:

Retain manually deleted or updated records in raw_iot: the skipChangeCommits flag, when set to true, tells Delta Live Tables (DLT) to ignore any manual changes (updates or deletes) made to the table outside of the pipeline. This means that even if records are manually deleted or updated in the raw_iot table, these changes won't be reflected in the table when the pipeline runs again.

Recompute downstream bpm_stats table: by default, DLT will recompute downstream tables when their upstream dependencies change. Since bpm_stats is based on raw_iot, it will naturally be recomputed when the pipeline updates, without any special configuration.

Why the other options are not correct: A. Setting pipelines.reset.allowed to false on raw_iot would prevent the table from being reset, but it wouldn't address the requirement to retain manually deleted or updated records.
upvoted 1 times
...
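
The flag this comment describes is an option on a Delta streaming read. A minimal sketch, assuming a table named raw_iot as in the comment:

    # skipChangeCommits makes the stream ignore commits that only update
    # or delete existing rows, so manual changes don't disrupt the stream.
    df = (
        spark.readStream
            .option("skipChangeCommits", "true")
            .table("raw_iot")
    )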
shaojunni
4 months, 2 weeks ago
Selected Answer: D
You have to know the CDF's current version and the last processed version in order to get the not-yet-processed records. B does not provide those versions; it will just return the content of the bronze table with CDF turned on. D is the only possible solution.
upvoted 1 times
...
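
One way to see shaojunni's point: a CDF read exposes metadata columns such as _commit_version, and the caller must track the last processed version externally to resume correctly. A sketch with a hypothetical bookkeeping value:

    # last_processed would be persisted by the pipeline between runs;
    # the value here is hypothetical.
    last_processed = 5
    new_changes = (
        spark.read
            .option("readChangeFeed", "true")
            .option("startingVersion", last_processed + 1)
            .table("bronze")
    )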
HelixAbdu
6 months, 1 week ago
I did not test it, but I think D is wrong, as it filters against a directory path using ==.
upvoted 2 times
...
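
HelixAbdu's objection, illustrated: testing the full input file path for equality against a directory never matches; a prefix test would be needed instead. A sketch using a hypothetical path:

    from pyspark.sql import functions as F

    df = spark.read.table("bronze")
    # Equality against the directory never matches a full file path:
    never_matches = df.filter(F.input_file_name() == "/mnt/daily_batch")  # hypothetical path
    # A prefix match is what such a filter would actually need:
    prefix_match = df.filter(F.input_file_name().startswith("/mnt/daily_batch"))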
MDWPartners
8 months, 1 week ago
Selected Answer: D
Seems D
upvoted 4 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other