
Exam Certified Data Engineer Professional topic 1 question 30 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 30
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: A
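For context, answer A completes the function as in the following minimal sketch (the bronze table name comes from the question; everything else is standard Structured Streaming):

def new_records():
    # Streaming read of a Delta table: together with a checkpoint on the
    # downstream write, this yields only records not yet processed.
    return spark.readStream.table("bronze")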

Comments

AzureDE2522
Highly Voted 1 year ago
Selected Answer: D
# Not providing a starting version/timestamp will result in the latest snapshot being fetched first:
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("myDeltaTable")
Please refer to: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 9 times
shaojunni
1 month, 2 weeks ago
readChangeFeed is disabled by default.
upvoted 1 times
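For context on this point, change data feed is off by default and must be enabled per table before any readChangeFeed query returns anything; a minimal sketch (property name per the Databricks docs linked above):

# CDF must be enabled on the table before readChangeFeed works.
spark.sql("ALTER TABLE bronze SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")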
t_d_v
3 months, 1 week ago
There is no stream in option D
upvoted 1 times
GHill1982
3 months ago
You can read Delta Lake Change Data Feed without using a stream. You can use batch queries to read the change data feed by setting the readChangeFeed option to true.
upvoted 2 times
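A minimal sketch of such a batch CDF read, assuming change data feed has already been enabled on the table (the startingVersion value is illustrative):

# Batch (non-streaming) read of the change data feed.
changes = (spark.read
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)   # illustrative: changes since version 0
    .table("bronze"))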
Laraujo2022
Highly Voted 1 year ago
In my opinion E is not correct because we do not see parameters passed into the function (year, month, and day); the function signature is def new_records():
upvoted 8 times
cbj
Most Recent 1 month ago
Selected Answer: A
The other options can't ensure the data hasn't already been processed. E.g., if the code doesn't run one day and runs the next day, C or E will mis-process one day's data.
upvoted 1 times
shaojunni
1 month, 2 weeks ago
Selected Answer: A
since "bronze" table is a delta table, readStream() only returns new data.
upvoted 2 times
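The "only new data" behavior comes from the checkpoint on the downstream write, which records how far the stream has read; a sketch, where the checkpoint path and target table are assumptions for illustration:

(spark.readStream.table("bronze")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")  # assumed path; tracks progress across runs
    .trigger(availableNow=True)   # drain all unprocessed records, then stop
    .toTable("silver"))           # assumed target table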
pk07
1 month, 2 weeks ago
Selected Answer: E
If the job runs only once per day, then option E could indeed be a valid and effective solution. Here's why:
- Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
- Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records that were ingested from the current day's batch file.
- Simplicity: this approach is straightforward and doesn't require maintaining additional state (like the last processed version or timestamp).
- Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
upvoted 1 times
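Based on the filter quoted in this and other comments, option E's body would look roughly like the following reconstruction (hypothetical; note that year, month, and day are not parameters of new_records() as defined, which is the objection raised elsewhere in the thread):

from datetime import date
from pyspark.sql.functions import col

# Hypothetical reconstruction of option E from the comments; deriving
# year/month/day from today's date is an assumption, since the function
# as shown in the question takes no parameters.
today = date.today()
year, month, day = today.year, today.month, today.day

def new_records():
    # Without a trailing % wildcard, LIKE matches the folder string
    # exactly, never a full file path -- the empty-result concern
    # raised by AndreFR in this thread.
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}")))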
AndreFR
2 months ago
Selected Answer: A
A is correct by elimination, as stated by Alaverdi in another comment: it reads the Delta table as a stream and processes only newly arrived records.
- B excluded because of incorrect syntax.
- C excluded: it will produce an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
- D excluded because of a syntax error; it should be: spark.read.option("readChangeFeed", "true").option("startingVersion", 1).table("bronze")
- E excluded: it will produce an empty result, because "source_file" gives a file name, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name.
upvoted 5 times
t_d_v
3 months, 1 week ago
Selected Answer: C
Actually it's hard to choose between C and E, as both are a bit incorrect:
- Option E seems like it will produce an empty result, as a file name is compared with a folder name.
- Option C seems like it will produce an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
On the other hand, if the new_records method had an ingestion-time param, then the task would be obvious. Also considering the very first line, which imports current_timestamp, let me say it's C :))
upvoted 1 times
partha1022
3 months, 1 week ago
Selected Answer: D
D is correct
upvoted 1 times
faraaz132
3 months, 3 weeks ago
Selected Answer: E
Correct answer: E, since it selects only those records which were loaded on the specified date and have not been processed yet. This is what we want.
Not A: it reads all records, even the ones previously processed, since the bronze table keeps historic data.
Not D: it is nowhere mentioned that change data feed is enabled, nor is it present in the code snippet. This is where we have to be careful with self-assumption.
upvoted 3 times
aiwithqasim
3 months, 3 weeks ago
Option D: return spark.read.option("readChangeFeed", "true").table("bronze")
The following code snippet is from https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ where the writer explains what happens when "readChangeFeed" is set to "true": it includes all the changes from the specified starting version. In option D no starting version is given, so it will pick the latest records. Please refer to the doc https://docs.databricks.com/en/delta/delta-change-data-feed.html and find "By default, the stream returns the latest snapshot of the table when the stream first starts as an INSERT and future changes as change data."
(spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("people")
    .show(truncate=False))
upvoted 3 times
zhiva
5 months ago
Selected Answer: A
Both E and A could be correct, but the function definition has no input parameters. This means we can't use them correctly in the return statement given only the information in the question. This is why I vote for A.
upvoted 2 times
imatheushenrique
5 months, 3 weeks ago
The E option makes more sense because the whole partition would be filtered. It can't be the options that use CDF, because there's no readChangeFeed option in the DataFrame read.
upvoted 1 times
arik90
8 months ago
Selected Answer: E
Since the ingest_daily_batch function writes to the "bronze" table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.
upvoted 2 times
alexvno
8 months, 2 weeks ago
Selected Answer: E
Probably E, but still, the filename is not specified, only the folder path.
upvoted 1 times
vikram12apr
8 months, 3 weeks ago
Selected Answer: E
Please read the question again: it is asking to get the data from the bronze table to some downstream table. Now, as it's an append-only nightly job, the filter on the file name will give the new data available in the bronze table which has still not flowed down the pipeline.
upvoted 2 times
agreddy
9 months, 1 week ago
D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table. "delta" is the default table format.
upvoted 1 times
ojudz08
9 months, 2 weeks ago
Selected Answer: D
The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether or not the data ingested daily is already there in the silver table, so I think the answer is D. Enabling change data feed allows tracking row-level changes between Delta table versions: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other
