A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: A


Highly Voted 1 year, 2 months ago
Selected Answer: D
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 10 times
1 month ago
Answer D would require specifying the start and (optionally) the end version for reading data from CDF. So D does not seem to be correct.
upvoted 1 times
3 months, 4 weeks ago
readChangeFeed is disabled by default.
upvoted 2 times
5 months, 3 weeks ago
There is no stream in option D
upvoted 3 times
5 months, 2 weeks ago
You can read Delta Lake Change Data Feed without using a stream. You can use batch queries to read the change data feed by setting the readChangeFeed option to true.
upvoted 2 times
1 month ago
CDF without a stream requires a starting version at the minimum.
upvoted 1 times
Highly Voted 1 year, 2 months ago
In my opinion E is not correct, because no parameters (year, month and day) are passed to the function: it is defined as def new_records():
upvoted 8 times
Most Recent 1 month ago
Selected Answer: A
You can read data from the Delta table using Structured Streaming. You have 2 options:
- without CDF: only new rows are processed (no updates or deletes)
- with CDF: all changes to the data are processed, i.e. insert, update, delete
Answer A uses the first option. The question talks about "new records", so using streaming without CDF is OK. Answer A is correct.
upvoted 2 times
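A minimal sketch of the two streaming reads described in the comment above. The `spark` parameter is added here for illustration only (the question's new_records() takes no arguments and would rely on the notebook's global `spark`), and the name `new_changes` is hypothetical. The CDF variant also assumes that change data feed has been enabled on the table, which the question does not state.

```python
def new_records(spark, table_name="bronze"):
    """Streaming read without CDF: a streaming DataFrame over rows appended to the table."""
    return spark.readStream.table(table_name)


def new_changes(spark, table_name="bronze"):
    """Streaming read with CDF: inserts, updates, and deletes as change rows.

    Assumption: delta.enableChangeDataFeed is set to true on the table.
    """
    return (
        spark.readStream
        .option("readChangeFeed", "true")
        .table(table_name)
    )
```

Both functions return a streaming DataFrame; nothing is read until a streaming query is started on it with writeStream.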
1 month ago
At first I thought of answer D. However, after checking the docs I learned that a starting version is required when reading from CDF in batch mode.
upvoted 1 times
1 month, 3 weeks ago
Selected Answer: E
New records will be filtered for D
upvoted 1 times
1 month, 3 weeks ago
Selected Answer: D
New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
upvoted 1 times
1 month, 3 weeks ago
Selected Answer: A
Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
upvoted 1 times
2 months ago
Selected Answer: E
Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
upvoted 1 times
2 months, 1 week ago
Selected Answer: E
I tried them myself, but none of them really works.
upvoted 1 times
3 months, 2 weeks ago
Selected Answer: A
The other options can't guarantee that only unprocessed data is returned. E.g., if the job does not run one day and runs the next day instead, C or E will miss one day's data.
upvoted 2 times
3 months, 4 weeks ago
Selected Answer: A
Since the "bronze" table is a Delta table, readStream only returns new data.
upvoted 4 times
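A hedged sketch of why the streaming read in the comment above ends up handling only new data: the streaming query records its progress in a checkpoint, so each run continues from where the previous one stopped (the very first run with a fresh checkpoint processes the rows already in the table). The function name, the "silver" target table, and the checkpoint path are all hypothetical, and `spark` is passed in only to keep the example self-contained.

```python
def process_new_records(spark, source="bronze", target="silver",
                        checkpoint="/tmp/checkpoints/bronze_to_silver"):
    """Append not-yet-processed rows from source to target, then stop."""
    return (
        spark.readStream.table(source)
        .writeStream
        .option("checkpointLocation", checkpoint)  # progress is tracked here
        .trigger(availableNow=True)                # process the backlog, then stop
        .toTable(target)
    )
```

Run nightly, this picks up whatever was appended since the last successful run, including any days on which the job did not run.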
3 months, 4 weeks ago
Selected Answer: E
If the job runs only once per day, then option E could indeed be a valid and effective solution. Here's why:
- Daily Execution: Since the job runs once per day, all records ingested on that day would be new and unprocessed.
- Source File Filtering: The filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records that were ingested from the current day's batch file.
- Simplicity: This approach is straightforward and doesn't require maintaining additional state (like last processed version or timestamp).
- Reliability: As long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
upvoted 2 times
4 months, 1 week ago
Selected Answer: A
A is correct by elimination, as stated by Alaverdi in another comment: it reads the Delta table as a stream and processes only newly arrived records.
- B is excluded because of incorrect syntax.
- C is excluded: it returns an empty result, because the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
- D is excluded because the batch read is missing a starting version; it should be: spark.read.option("readChangeFeed", "true").option("startingVersion", 1).table("bronze")
- E is excluded: it returns an empty result, because “source_file” gives a file name, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name.
upvoted 7 times
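The corrected form of option D from the comment above could be wrapped as follows. This is only a sketch: `changes_since` and its `starting_version` parameter are hypothetical, `spark` is passed in for self-containment, and change data feed is assumed to be enabled on the table.

```python
def changes_since(spark, starting_version, table_name="bronze"):
    """Batch read of the change data feed from a given table version onward.

    Assumption: delta.enableChangeDataFeed is set to true on the table.
    """
    return (
        spark.read
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
    )
```

The caller has to remember the last version it processed and pass it in on every run, which is state that a zero-argument new_records() has nowhere to get from; a streaming read keeps the equivalent state in its checkpoint.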
5 months, 3 weeks ago
Selected Answer: C
Actually, it's hard to choose between C and E, as both are a bit incorrect:
- Option E seems like it will return an empty result, as a file name is compared with a folder name.
- Option C seems like it will return an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
On the other hand, if the new_records method had an ingestion time param, then the task would be obvious. Also, considering the very first line, which imports current_timestamp, let me say it's C :))
upvoted 1 times
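The "file name compared with folder name" point made about option E can be illustrated without Spark. The helper below is a plain-Python emulation of SQL LIKE semantics (% matches any run of characters, _ matches exactly one), not Spark's own implementation, and the date and file name are made up for the example.

```python
import re


def sql_like(value: str, pattern: str) -> bool:
    """Emulate SQL LIKE: the whole value must match the pattern."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


folder = "/mnt/daily_batch/2024/01/15"        # hypothetical date
source_file = folder + "/part-0000.json"      # hypothetical file name

print(sql_like(source_file, folder))          # False: no wildcard covers the file name
print(sql_like(source_file, folder + "/%"))   # True: trailing % matches the file name
```

Under these assumptions, a LIKE pattern that names only the folder never matches a full file path, so the filter would return no rows unless the pattern ends in a wildcard.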
5 months, 3 weeks ago
Selected Answer: D
D is correct
upvoted 1 times
6 months ago
Selected Answer: E
Correct answer: E, since it selects only those records that were loaded on the specified date, and these records have not been processed yet. This is what we want.
Not A: it reads all records, even the ones previously processed, since the bronze table keeps historic data.
Not D: it is nowhere mentioned that change data feed is enabled, nor is it present in the code snippet. This is where we have to be careful with our own assumptions.
upvoted 3 times
6 months, 1 week ago
Option D. return spark.read.option("readChangeFeed", "true").table("bronze")
The following code snippet is from https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/, where the writer explains what happens when "readChangeFeed" is set to "true": the result includes all the change details from the specified version onward.
(
  spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .table("people")
  .show(truncate=False)
)
In option D the starting version is not specified, so it will pick the latest records. Please refer to the doc https://docs.databricks.com/en/delta/delta-change-data-feed.html and find "By default, the stream returns the latest snapshot of the table when the stream first starts as an INSERT and future changes as change data."
upvoted 3 times
7 months, 2 weeks ago
Selected Answer: A
Both E and A could be correct, but the function definition has no input parameters. This means the return statement cannot use year, month, or day given only the information in the question. This is why I vote for A.
upvoted 2 times
Community vote distribution
A (35%)
C (25%)
B (20%)
