Exam Certified Data Engineer Associate topic 1 question 27 discussion

Actual exam question from Databricks's Certified Data Engineer Associate
Question #: 27
Topic #: 1

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

  • A. Checkpointing and Write-ahead Logs
  • B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
  • C. Replayable Sources and Idempotent Sinks
  • D. Write-ahead Logs and Idempotent Sinks
  • E. Checkpointing and Idempotent Sinks
Suggested Answer: A

Comments

NzmD
1 day, 9 hours ago
Selected Answer: E
Correct answer is E
upvoted 1 times
806e7d2
3 days ago
Selected Answer: E
In Structured Streaming, Spark uses the following two mechanisms to reliably track the progress of the stream and ensure fault tolerance:
  • Checkpointing: Spark maintains metadata about the processing state, including the offset range of the data processed in each trigger. This metadata is stored in a reliable storage system like HDFS, AWS S3, or Azure Data Lake. If a failure occurs, Spark can recover and resume processing from the last recorded state in the checkpoint.
  • Idempotent sinks: idempotent sinks ensure that output operations (e.g., writing data to storage or a database) can be re-executed without causing duplicate data or errors. By combining idempotent sinks with checkpointing, Spark ensures that reprocessing data due to a failure does not compromise data integrity.
upvoted 1 times
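As a concrete illustration of the checkpointing side, here is a minimal PySpark sketch; the "rate" source is Spark's built-in test source, and the paths are illustrative placeholders rather than anything from this question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# A replayable source: the built-in "rate" source emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/demo/output")                    # sink path (placeholder)
    .option("checkpointLocation", "/tmp/demo/checkpoint")  # progress recorded here each trigger
    .start()
)
query.awaitTermination(30)  # let it run briefly
query.stop()
```

Restarting the query with the same checkpointLocation makes it resume from the last recorded offset range instead of reprocessing the stream from scratch.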
Colje
1 month, 3 weeks ago
Why the correct answer is E (Checkpointing and Idempotent Sinks):
  • Checkpointing: Spark Structured Streaming uses checkpointing to track the state of the data being processed. Checkpoints allow the system to restart processing from where it left off in case of failure, ensuring reliability.
  • Idempotent sinks: idempotent sinks ensure that reprocessing the same data multiple times (in case of a failure or restart) doesn’t lead to duplicate results. The sink can handle repeated writes of the same data without issues.
Why A (Checkpointing and Write-ahead Logs) is incorrect: Spark Structured Streaming does not use write-ahead logs (WAL) for tracking offsets or ensuring fault tolerance. While WALs are used in some systems for durability, Spark Structured Streaming relies on checkpointing and the concept of idempotent operations to ensure consistency and fault tolerance.
upvoted 3 times
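To see what an idempotent sink can look like, here is a sketch using foreachBatch, reusing the `stream` DataFrame from the sketch above (paths again illustrative). The batch id passed to foreachBatch is deterministic across restarts, so a replayed trigger overwrites its own earlier output rather than duplicating it:

```python
def write_batch(batch_df, batch_id):
    # Same batch_id on replay -> same target path -> overwrite, not duplicate.
    batch_df.write.mode("overwrite").parquet(f"/tmp/demo/idempotent/batch={batch_id}")

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/demo/checkpoint-idem")
    .start()
)
```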
CID2024
2 months, 3 weeks ago
The correct answer is E (Checkpointing and Idempotent Sinks). In Structured Streaming, Spark uses checkpointing to reliably track the progress of the streaming data. Checkpointing saves the state of the streaming computation to a reliable storage system. Idempotent sinks ensure that even if data is reprocessed, the results remain consistent and correct, preventing duplicate data from being written.
upvoted 2 times
80370eb
3 months, 2 weeks ago
Selected Answer: A
  • Checkpointing: Spark saves metadata, including offsets, in a checkpoint directory, allowing it to recover from failures by replaying data starting from the last checkpoint.
  • Write-ahead logs (WAL): Spark writes information about the data being processed to a log before the data is written to the sink. This ensures that even if a failure occurs, Spark can recover and reprocess the data from the log.
upvoted 2 times
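The checkpoint directory makes this concrete: for each trigger, Spark writes a file under offsets/ before processing starts (this is the write-ahead log of the offset range) and a matching file under commits/ once the trigger completes. A small sketch to inspect it, assuming the checkpoint path from the earlier example and a local filesystem:

```python
import os

ckpt = "/tmp/demo/checkpoint"  # checkpointLocation from the earlier sketch
for sub in ("offsets", "commits"):
    path = os.path.join(ckpt, sub)
    if os.path.isdir(path):
        # One numbered file per trigger; a trigger's offsets file exists
        # before its commits file does.
        print(sub, "->", sorted(os.listdir(path)))
```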
3fbc31b
4 months, 2 weeks ago
Selected Answer: A
A is the correct answer.
upvoted 1 times
squidy24
6 months, 1 week ago
Selected Answer: A
The answer is A "Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. ... Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs." - Apache Spark Structured Streaming Programming Guide
upvoted 3 times
bita7
6 months, 3 weeks ago
The answer is Checkpointing and Idempotent Sinks (E). How does Structured Streaming achieve end-to-end fault tolerance?
  • First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
  • Next, the streaming sinks are designed to be idempotent, that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
upvoted 1 times
benni_ale
6 months, 3 weeks ago
Selected Answer: A
1. Checkpointing and write-ahead logs record the offset range of data being processed.
2. Checkpointing and idempotent sinks achieve end-to-end fault tolerance.
upvoted 2 times
SerGrey
10 months, 3 weeks ago
Selected Answer: A
The correct answer is A.
upvoted 1 times
juadaves
1 year, 2 months ago
The answer is Checkpointing and Idempotent Sinks. How does Structured Streaming achieve end-to-end fault tolerance?
  • First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
  • Next, the streaming sinks are designed to be idempotent, that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
upvoted 3 times
vctrhugo
1 year, 2 months ago
Selected Answer: A
A. Checkpointing and Write-ahead Logs. To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of the streaming application to a reliable distributed file system, which can be used for recovery in case of failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the system can recover and reprocess data from the last known offset in the event of a failure.
upvoted 2 times
akk_1289
1 year, 4 months ago
A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. In the link below, search for "The engine uses" and you'll find the answer. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
upvoted 2 times
Atnafu
1 year, 4 months ago
A. Checkpointing and Write-ahead Logs. Checkpointing is a process of periodically saving the state of the streaming computation to a durable storage system. This ensures that if the streaming computation fails, it can be restarted from the last checkpoint and resume processing from where it left off. Write-ahead logs are a type of log that records all changes made to a dataset. This allows Structured Streaming to recover from failures by replaying the write-ahead logs from the last checkpoint.
upvoted 3 times
mimzzz
1 year, 5 months ago
Why I think both A and E are correct: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-streaming-exactly-once#:~:text=Use%20idempotent%20sinks
upvoted 2 times
ZSun
1 year, 5 months ago
Spark handles streaming failure through: 1. tracking the progress/offset (this is option A); 2. fixing the failure (this is option E). But the question asks about the "two approaches ... record the offset range". Therefore, A.
upvoted 5 times
chays
1 year, 5 months ago
Selected Answer: A
Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other