Exam Certified Data Engineer Associate topic 1 question 27 discussion

Actual exam question from Databricks's Certified Data Engineer Associate
Question #: 27
Topic #: 1

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

  • A. Checkpointing and Write-ahead Logs
  • B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
  • C. Replayable Sources and Idempotent Sinks
  • D. Write-ahead Logs and Idempotent Sinks
  • E. Checkpointing and Idempotent Sinks
Suggested Answer: A

Comments

NzmD
1 day, 9 hours ago
Selected Answer: E
Correct answer is E
upvoted 1 times
806e7d2
3 days ago
Selected Answer: E
In Structured Streaming, Spark uses the following two mechanisms to reliably track the progress of the stream and ensure fault tolerance:
  • Checkpointing: Spark maintains metadata about the processing state, including the offset range of the data processed in each trigger. This metadata is stored in a reliable storage system like HDFS, AWS S3, or Azure Data Lake. If a failure occurs, Spark can recover and resume processing from the last recorded state in the checkpoint.
  • Idempotent sinks: idempotent sinks ensure that output operations (e.g., writing data to storage or a database) can be re-executed without causing duplicate data or errors. By combining idempotent sinks with checkpointing, Spark ensures that reprocessing data due to a failure does not compromise data integrity.
upvoted 1 times
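As a concrete illustration of the checkpointing side, here is a minimal PySpark sketch; the "rate" source is Spark's built-in test source, and the paths are illustrative placeholders rather than anything from this question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# A replayable source: the built-in "rate" source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/demo/output")                    # sink path (placeholder)
    .option("checkpointLocation", "/tmp/demo/checkpoint")  # progress recorded here each trigger
    .start()
)
query.awaitTermination(30)  # let it run briefly
query.stop()
```

Restarting the query with the same checkpointLocation makes it resume from the last recorded offset range instead of reprocessing the stream from scratch.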
Colje
1 month, 3 weeks ago
Why the correct answer is E (Checkpointing and Idempotent Sinks):
  • Checkpointing: Spark Structured Streaming uses checkpointing to track the state of the data being processed. Checkpoints allow the system to restart processing from where it left off in case of failure, ensuring reliability.
  • Idempotent sinks: idempotent sinks ensure that reprocessing the same data multiple times (in case of a failure or restart) doesn’t lead to duplicate results. The sink can handle repeated writes of the same data without issues.
Why A (Checkpointing and Write-ahead Logs) is incorrect: Spark Structured Streaming does not use write-ahead logs (WAL) for tracking offsets or ensuring fault tolerance. While WALs are used in some systems for durability, Spark Structured Streaming relies on checkpointing and the concept of idempotent operations to ensure consistency and fault tolerance.
upvoted 3 times
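To see what an idempotent sink can look like, here is a sketch using foreachBatch, reusing the `stream` DataFrame from the sketch above (paths again illustrative). The batch id passed to foreachBatch is deterministic across restarts, so a replayed trigger overwrites its own earlier output rather than duplicating it:

```python
def write_batch(batch_df, batch_id):
    # Same batch_id on replay -> same target path -> overwrite, not duplicate.
    batch_df.write.mode("overwrite").parquet(f"/tmp/demo/idempotent/batch={batch_id}")

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/demo/checkpoint-idem")
    .start()
)
```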
CID2024
2 months, 3 weeks ago
The correct answer is E (Checkpointing and Idempotent Sinks). In Structured Streaming, Spark uses checkpointing to reliably track the progress of the streaming data. Checkpointing saves the state of the streaming computation to a reliable storage system. Idempotent sinks ensure that even if data is reprocessed, the results remain consistent and correct, preventing duplicate data from being written.
upvoted 2 times
80370eb
3 months, 2 weeks ago
Selected Answer: A
  • Checkpointing: Spark saves metadata, including offsets, in a checkpoint directory, allowing it to recover from failures by replaying data starting from the last checkpoint.
  • Write-ahead logs (WAL): Spark writes information about the data being processed to a log before the data is written to the sink. This ensures that even if a failure occurs, Spark can recover and reprocess the data from the log.
upvoted 2 times
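The checkpoint directory makes this concrete: for each trigger, Spark writes a file under offsets/ before processing starts (this is the write-ahead log of the offset range) and a matching file under commits/ once the trigger completes. A small sketch to inspect it, assuming the checkpoint path from the earlier example and a local filesystem:

```python
import os

ckpt = "/tmp/demo/checkpoint"  # checkpointLocation from the earlier sketch
for sub in ("offsets", "commits"):
    path = os.path.join(ckpt, sub)
    if os.path.isdir(path):
        # One numbered file per trigger; a trigger's offsets file exists
        # before its commits file does.
        print(sub, "->", sorted(os.listdir(path)))
```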
3fbc31b
4 months, 2 weeks ago
Selected Answer: A
A is the correct answer.
upvoted 1 times
squidy24
6 months, 1 week ago
Selected Answer: A
The answer is A "Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. ... Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs." - Apache Spark Structured Streaming Programming Guide
upvoted 3 times
bita7
6 months, 3 weeks ago
The answer is Checkpointing and Idempotent Sinks (E). How does Structured Streaming achieve end-to-end fault tolerance?
  • First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
  • Next, the streaming sinks are designed to be idempotent, that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
upvoted 1 times
benni_ale
6 months, 3 weeks ago
Selected Answer: A
1. Checkpointing and write-ahead logs record the offset range of data being processed.
2. Checkpointing and idempotent sinks achieve end-to-end fault tolerance.
upvoted 2 times
SerGrey
10 months, 3 weeks ago
Selected Answer: A
The correct answer is A.
upvoted 1 times
juadaves
1 year, 2 months ago
The answer is Checkpointing and Idempotent Sinks. How does Structured Streaming achieve end-to-end fault tolerance?
  • First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
  • Next, the streaming sinks are designed to be idempotent, that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
upvoted 3 times
vctrhugo
1 year, 2 months ago
Selected Answer: A
A. Checkpointing and Write-ahead Logs. To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of the streaming application to a reliable distributed file system, which can be used for recovery in case of failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the system can recover and reprocess data from the last known offset in the event of a failure.
upvoted 2 times
akk_1289
1 year, 4 months ago
A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. In the link below, search for "The engine uses" and you'll find the answer. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
upvoted 2 times
Atnafu
1 year, 4 months ago
A. Checkpointing and Write-ahead Logs. Checkpointing is a process of periodically saving the state of the streaming computation to a durable storage system. This ensures that if the streaming computation fails, it can be restarted from the last checkpoint and resume processing from where it left off. Write-ahead logs are a type of log that records all changes made to a dataset. This allows Structured Streaming to recover from failures by replaying the write-ahead logs from the last checkpoint.
upvoted 3 times
mimzzz
1 year, 5 months ago
Why I think both A and E are correct: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-streaming-exactly-once#:~:text=Use%20idempotent%20sinks
upvoted 2 times
ZSun
1 year, 5 months ago
Spark handles streaming failure through: 1. tracking the progress/offset (this is option A); 2. fixing the failure (this is option E). But the question asks about the "two approaches ... record the offset range". Therefore, A.
upvoted 5 times
chays
1 year, 5 months ago
Selected Answer: A
Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other