Exam Certified Data Engineer Professional topic 1 question 73 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 73
Topic #: 1

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

  • A. Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
  • B. Increase the number of shuffle partitions to maximize parallelism, since the trigger interval cannot be modified without modifying the checkpoint directory.
  • C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
  • D. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
  • E. Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
Suggested Answer: C
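
As a rough illustration of what option C describes, a PySpark streaming write can be given a fixed processing-time trigger. This is only a sketch: the function name, the Delta sink, and both path arguments are assumptions made for the example and do not come from the question. `trigger(processingTime=...)` is the standard `DataStreamWriter` call.

```python
def start_ten_minute_stream(stream_df, checkpoint_path, target_path):
    """Start a streaming write that fires one microbatch every 10 minutes.

    stream_df is assumed to be a streaming DataFrame; the Delta sink and
    both paths are hypothetical placeholders.
    """
    return (
        stream_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        # Option C: query the source once per interval instead of continuously.
        .trigger(processingTime="10 minutes")
        .start(target_path)
    )
```

The cluster stays up between microbatches with this approach, which is the trade-off several comments below debate.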

Comments

pk07
1 month, 3 weeks ago
Selected Answer: C
E is wrong. Using trigger once would stop the stream after one execution, which does not meet the requirement of continuous processing.
upvoted 1 time
practicioner
3 months, 1 week ago
Selected Answer: E
E is correct for two reasons: 1) we are using an instance pool, which lets the job start almost instantly; 2) the question is about reducing costs. Triggering once every 10 minutes avoids keeping a VM running (as option C does) and keeps the same SLA (thanks to 1) at a lower cost for compute as well as storage (fewer API calls, which are not free).
upvoted 1 time
Er5
7 months, 3 weeks ago
The requirement is that records "be processed in less than 10 minutes". With C ("set the trigger interval to 10 minutes"), processing time + interval > 10 minutes. E: "trigger once", "execute the query every 10 minutes".
upvoted 3 times
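
One way to read the argument above is as a simple worst-case bound. The sketch below assumes a simplified model (not stated in the question) in which a record arriving just after a trigger fires waits one full interval and then the batch's processing time; the 3-second figure is the question's "less than 3s", taken as an upper bound.

```python
def worst_case_latency_seconds(trigger_interval_s, processing_time_s):
    """Upper bound on the delay between a record arriving and its batch finishing.

    Simplified model: the record lands just after a trigger fires, waits a
    full interval for the next trigger, then waits for that batch to finish.
    """
    return trigger_interval_s + processing_time_s


SLA_SECONDS = 10 * 60      # "processed in less than 10 minutes"
PROCESSING_TIME_S = 3      # per-microbatch processing time from the question

latency = worst_case_latency_seconds(10 * 60, PROCESSING_TIME_S)
print(latency, latency < SLA_SECONDS)  # prints: 603 False
```

Under this model, an interval of exactly 10 minutes slightly exceeds a strict "less than 10 minutes" bound; whether the exam intends such a strict reading is not settled by the thread.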
vikram12apr
8 months, 2 weeks ago
Selected Answer: E
The default trigger interval is 0.5 seconds, so 120 triggers happen per minute. Each trigger takes 3 seconds to complete, so 120 * 3 = 360 seconds = 6 minutes; hence the job completes in 6 minutes. That leaves a buffer of 4 minutes for compute spin-up, and since we are using instance pools, the start-up time decreases further. I think E is the correct option to decrease the cost.
upvoted 2 times
hidelux
8 months, 3 weeks ago
Selected Answer: E
The question indicates that they are using instance pools for fast start-up. Option C would block a VM permanently, which is not intended. E will grab a VM, run the job, and return it to the pool so it is available for the other jobs mentioned in the question.
upvoted 2 times
practicioner
3 months, 1 week ago
You are right. But we need to guarantee the SLA, and for that reason blocking a VM (with autoscaling) is good practice.
upvoted 1 time
spaceexplorer
10 months ago
Selected Answer: C
C is more effective than E, as E will incur start-up time for spinning up a new job cluster.
upvoted 3 times
ranith
10 months ago
The default trigger interval is 500 ms, but the question says the job processes batches with 0 records and the average processing time is 3 s. If the requirement is to process records in under 10 minutes, the best option here is to trigger every 3 s.
upvoted 1 time
divingbell17
10 months, 4 weeks ago
Selected Answer: C
Both C and E meet the requirement to reduce cloud storage cost. E further reduces compute cost; however, reducing compute cost is not a requirement in the question.
upvoted 2 times
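
To put rough numbers on the storage-cost point, the sketch below counts microbatches per day using only the figures in the question. It assumes, as option C states, that each microbatch (empty or not) issues requests against the source storage account; it does not model actual pricing.

```python
MINUTES_PER_DAY = 24 * 60


def batches_per_day_at_rate(batches_per_minute):
    """Microbatches per day when a fixed number run every minute."""
    return batches_per_minute * MINUTES_PER_DAY


def batches_per_day_at_interval(interval_minutes):
    """Microbatches per day when one runs every interval_minutes."""
    return MINUTES_PER_DAY // interval_minutes


# The question reports at least 12 empty microbatches per minute today.
empty_batches_now = batches_per_day_at_rate(12)
# With a 10-minute interval (option C) or a job run every 10 minutes (option E).
batches_at_10_min = batches_per_day_at_interval(10)

print(empty_batches_now, batches_at_10_min)  # prints: 17280 144
```

Either C or E cuts the number of source polls by roughly two orders of magnitude relative to the empty batches alone.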
alexvno
11 months, 1 week ago
Selected Answer: C
For production, records need to be processed in less than 10 minutes, so we need to schedule a batch every 10 minutes.
upvoted 3 times
aragorn_brego
1 year ago
Selected Answer: E
Given that there are frequent microbatches with 0 records being processed, it indicates that the job is polling the source too often. Using the "trigger once" option would allow each microbatch to process all available data and then stop. By scheduling the job to run every 10 minutes, you ensure that the system is not constantly checking for new data when there is none, thus reducing the number of read operations from the source storage and potentially reducing costs associated with those reads.
upvoted 3 times
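
The "trigger once" pattern described above might look like the following in PySpark. This is a sketch under assumptions: the function name, the Delta sink, and the paths are hypothetical, and the 10-minute schedule itself would be configured on the Databricks job rather than in this code. `trigger(once=True)` and `awaitTermination()` are standard Structured Streaming calls.

```python
def run_once_and_stop(stream_df, checkpoint_path, target_path):
    """Process all data available since the last checkpoint, then stop.

    stream_df is assumed to be a streaming DataFrame; the Delta sink and
    both paths are hypothetical placeholders.
    """
    query = (
        stream_df.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        # Option E: run a single batch per job run, then terminate.
        .trigger(once=True)
        .start(target_path)
    )
    # Blocks until the single batch has finished and the query has stopped.
    query.awaitTermination()
    return query
```

Because the checkpoint records progress, each scheduled run picks up where the previous one ended. Newer Spark releases also accept `trigger(availableNow=True)` for a similar run-then-stop behavior.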
Gulenur_GS
11 months, 3 weeks ago
In this case, why not C? Triggering processing every 10 minutes ensures the same thing, I guess.
upvoted 1 time
Community vote distribution
A (35%)
C (25%)
B (20%)
Other