
Exam Certified Data Engineer Professional topic 1 question 21 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 21
Topic #: 1

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

  • A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
  • B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  • C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  • D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  • E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Suggested Answer: E
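For context, a minimal PySpark sketch of the adjustment described in option E; the table names and checkpoint path below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a streaming source (hypothetical table name).
events = spark.readStream.table("events_bronze")

# Write with a 5-second processing-time trigger, as in option E.
query = (
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events_silver")  # hypothetical path
    .trigger(processingTime="5 seconds")  # was "10 seconds" in the scenario
    .toTable("events_silver")  # hypothetical sink table
)
```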

Comments

wdeleersnyder
3 months, 3 weeks ago
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads. https://docs.databricks.com/en/structured-streaming/triggers.html Doesn't seem like E is a valid and recommended option given that it is deprecated.
upvoted 2 times
wdeleersnyder
3 months, 3 weeks ago
Ooops, I mean, D.
upvoted 2 times
...
...
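For reference, a minimal PySpark sketch contrasting the deprecated trigger-once mode with the Trigger.AvailableNow mode recommended in the docs linked above (assuming an existing SparkSession named spark; the table names and checkpoint path are hypothetical):

```python
# Hypothetical incremental batch job: process all available data, then stop.
# Trigger.Once, i.e. .trigger(once=True), is deprecated in Databricks Runtime 11.3 LTS+;
# Databricks recommends Trigger.AvailableNow instead.
query = (
    spark.readStream.table("events_bronze")  # hypothetical source table
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/backfill")  # hypothetical path
    .trigger(availableNow=True)  # replaces .trigger(once=True)
    .toTable("events_silver_backfill")  # hypothetical sink table
)
```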
imatheushenrique
5 months, 3 weeks ago
The best option for performance gain is E: decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
upvoted 2 times
...
ojudz08
9 months, 2 weeks ago
Selected Answer: E
E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
upvoted 2 times
...
DAN_H
9 months, 4 weeks ago
Selected Answer: E
E is correct; A is wrong because in streaming you very rarely have any idle executors.
upvoted 1 times
...
kz_data
10 months, 2 weeks ago
Selected Answer: E
I think E is correct.
upvoted 1 times
...
RafaelCFC
10 months, 3 weeks ago
Selected Answer: E
I believe this is a case of the least bad option, not exactly the best option possible.
- A is wrong because in streaming you very rarely have any idle executors, as all cores are engaged in processing the window of data;
- B is wrong because triggering every 30s will not meet the 10s target processing interval;
- C is wrong in two ways: increasing shuffle partitions to any number above the number of available cores in the cluster will worsen streaming performance, and the checkpoint folder has no connection to the trigger interval;
- D is wrong because, keeping all other things the same as described by the problem, keeping the trigger time at 10s will not change the underlying conditions of the delay (i.e., too much data to process in a timely manner).
E is the only option that might improve processing time.
upvoted 3 times
...
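For reference, the shuffle-partition setting mentioned in option C is a plain session configuration, separate from the trigger interval or the checkpoint directory (minimal sketch, assuming an existing SparkSession named spark; the value is illustrative):

```python
# Illustrative only: the shuffle-partition count is a session setting.
# Setting it far above the cluster's available cores can hurt streaming
# performance, as noted in the comment above.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```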
ervinshang
11 months ago
Selected Answer: E
correct answer is E
upvoted 1 times
...
ofed
1 year ago
Only C. Even if you trigger more frequently, you decrease both the load and the time for that load. E doesn't change anything.
upvoted 1 times
...
sturcu
1 year, 1 month ago
Selected Answer: E
Changing the trigger interval to "once" will cause this to run as a single batch rather than in microbatches. This will not help at all.
upvoted 4 times
...
Eertyy
1 year, 2 months ago
correct answer is E
upvoted 1 times
...
azurearch
1 year, 2 months ago
Sorry, the caveat is holding all other variables constant... that means we are not allowed to change the trigger interval. Is C the answer then?
upvoted 1 times
...
azurearch
1 year, 2 months ago
What if there are more records within those 5-second trigger intervals? That would still increase the time it takes to process, so I doubt E is correct. I will go with answer D. It is not about executing all queries within 10 seconds; it is about executing a trigger-once batch every 10 seconds.
upvoted 1 times
...
azurearch
1 year, 2 months ago
Option A is also about setting the trigger interval to 5 seconds; just to understand, why is it not the answer?
upvoted 1 times
...
cotardo2077
1 year, 2 months ago
Selected Answer: E
for sure E
upvoted 2 times
...
Eertyy
1 year, 2 months ago
correct answer is E
upvoted 2 times
...
asmayassineg
1 year, 3 months ago
Correct answer is E. D means a job would need to acquire resources within 10 seconds, which is impossible without serverless compute.
upvoted 4 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other