Certified Data Engineer Professional exam, Topic 1, Question 21 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 21
Topic #: 1

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

  • A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
  • B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  • C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  • D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  • E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Suggested Answer: E
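
For concreteness, here is a minimal PySpark sketch of the change option E describes: the only adjustment is the value passed to trigger(processingTime=...) on the streaming write. The source, sink format, and paths below are placeholders, not the production job's actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source; the production job reads from its real streaming source.
events = spark.readStream.format("rate").load()

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/demo")  # placeholder path
    # Current configuration: a new micro-batch is triggered every 10 seconds.
    # .trigger(processingTime="10 seconds")
    # Option E: trigger smaller micro-batches more frequently.
    .trigger(processingTime="5 seconds")
    .start("/tmp/tables/demo")  # placeholder path
)
```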

Comments

RafaelCFC
Highly Voted 1 year, 1 month ago
Selected Answer: E
I believe this is a case of the least bad option, not exactly the best option possible.
- A is wrong because in streaming you very rarely have any idle executors, as all cores are engaged in processing the window of data.
- B is wrong because triggering every 30s will not meet the 10s target processing interval.
- C is wrong in two ways: increasing shuffle partitions to any number above the number of available cores in the cluster will worsen streaming performance, and the checkpoint folder has no connection with the trigger time.
- D is wrong because, keeping all other things the same as described by the problem, keeping the trigger time at 10s will not change the underlying conditions of the delay (i.e., too much data to be processed in a timely manner).
- E is the only option that might improve processing time.
upvoted 7 times
arekm
1 month ago
One addition to the explanation of A: micro-batches are sequential by design.
upvoted 1 times
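
As a side note on the option C reasoning above, here is a minimal sketch, assuming a stateless query and placeholder values, of the two settings that option conflates: shuffle parallelism is a session configuration usually kept near the cluster's core count, while the trigger interval is set on the writer and is not part of the checkpointed state.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").load()  # placeholder source

# Keep shuffle partitions close to the number of available cores; raising the
# count far above the core count tends to hurt micro-batch latency, as noted above.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", str(cores))

# The trigger interval is a property of the writer, not of the checkpoint,
# so it can be tuned without discarding the existing stream state.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/demo")  # placeholder path
    .trigger(processingTime="5 seconds")
    .start("/tmp/tables/demo")                              # placeholder path
)
```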
arekm
Most Recent 1 month ago
Selected Answer: E
Answer E, see explanation by RafaelCFC.
upvoted 1 times
ASRCA
1 month, 1 week ago
Selected Answer: A
Option A emphasizes utilizing idle executors to begin processing the next batch while longer-running tasks from previous batches finish. This approach can help maintain a steady flow of data processing and reduce the likelihood of bottlenecks.
upvoted 1 times
arekm
1 month ago
Structured Streaming processes micro-batches in sequence; it does so because it guarantees exactly-once processing. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
upvoted 1 times
Thameur01
2 months ago
Selected Answer: B
If micro-batch execution occasionally exceeds 30 seconds, a trigger interval of 5 seconds would cause batches to pile up while the previous batch is still running. This would exacerbate the delays and potentially lead to backpressure and failure, so B is the best option in this case. If we assume execution time must be less than 10s, then a 5s interval makes more sense and E would be the best answer.
upvoted 1 times
wdeleersnyder
6 months ago
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads. https://docs.databricks.com/en/structured-streaming/triggers.html Doesn't seem like E is a valid and recommended option given that it is deprecated.
upvoted 2 times
wdeleersnyder
6 months ago
Ooops, I mean, D.
upvoted 2 times
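
To put the deprecation note above in context, here is a minimal PySpark sketch of the two incremental-batch triggers, assuming Spark 3.3 or later (or an equivalent Databricks Runtime) and placeholder source, schema, and paths; both process the backlog available at start and then stop, rather than running micro-batches continuously on an interval.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder file source; schema and paths are illustrative only.
events = (
    spark.readStream
    .schema("id LONG, ts TIMESTAMP")
    .json("/tmp/landing/demo")
)

# Trigger.Once (deprecated in DBR 11.3 LTS and above): all available data is
# processed as one single micro-batch, ignoring any rate limits.
legacy = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/once")  # placeholder path
    .trigger(once=True)
    .start("/tmp/tables/once")
)

# Trigger.AvailableNow (the recommended replacement): all available data is
# processed, but split across multiple micro-batches that respect rate
# limits, and the query stops once it has caught up.
recommended = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/available_now")
    .trigger(availableNow=True)
    .start("/tmp/tables/available_now")
)
```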
imatheushenrique
8 months, 1 week ago
Considering performance gain, the best option is E: decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
upvoted 2 times
ojudz08
11 months, 4 weeks ago
Selected Answer: E
E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
upvoted 2 times
DAN_H
1 year ago
Selected Answer: E
E is correct; A is wrong because in streaming you very rarely have any idle executors.
upvoted 2 times
kz_data
1 year ago
Selected Answer: E
I think E is correct.
upvoted 1 times
ervinshang
1 year, 1 month ago
Selected Answer: E
The correct answer is E.
upvoted 1 times
ofed
1 year, 3 months ago
Only C. Even if you trigger more frequently, you decrease both the load and the time for that load. E doesn't change anything.
upvoted 1 times
sturcu
1 year, 3 months ago
Selected Answer: E
Changing the trigger to "once" will cause this to run as a "batch" and it will not execute in micro-batches. This will not help at all.
upvoted 4 times
Eertyy
1 year, 4 months ago
The correct answer is E.
upvoted 1 times
azurearch
1 year, 5 months ago
Sorry, the caveat is holding all other variables constant, which means we are not allowed to change the trigger interval. Is C the answer then?
upvoted 1 times
azurearch
1 year, 5 months ago
What if more records arrive within the 5-second trigger interval? That would still increase the time it takes to process, so I doubt E is correct. I will go with answer D; it is not about executing all queries within 10 seconds, it is about executing a trigger-once batch every 10 seconds.
upvoted 1 times
azurearch
1 year, 5 months ago
Option A is also about setting the trigger interval to 5 seconds; just to understand, why is it not the answer?
upvoted 1 times
cotardo2077
1 year, 5 months ago
Selected Answer: E
for sure E
upvoted 2 times
Community vote distribution: A (35%), C (25%), B (20%), Other