Exam Certified Data Engineer Professional topic 1 question 73 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 73
Topic #: 1

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

A. Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
B. Increase the number of shuffle partitions to maximize parallelism, since the trigger interval cannot be modified without modifying the checkpoint directory.
C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
D. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
E. Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.

Suggested Answer: C
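
The cost argument in option C can be checked with rough arithmetic. The sketch below is only an illustration: it assumes every micro-batch issues at least one billable request against the source storage account (the real number depends on the source and the cloud provider), and the helper name `batches_per_day` is made up for this example.

```python
# Rough count of micro-batches (and therefore source polls) per day.
SECONDS_PER_DAY = 24 * 60 * 60


def batches_per_day(trigger_interval_s: float) -> int:
    """Upper bound on micro-batches per day for a fixed trigger interval."""
    return int(SECONDS_PER_DAY // trigger_interval_s)


# The question states at least 12 empty micro-batches per minute today.
empty_batches_per_day_now = 12 * 60 * 24

# With a 10-minute trigger interval there is at most one batch per 600 s.
batches_with_10_min_trigger = batches_per_day(10 * 60)

print(empty_batches_per_day_now)      # 17280
print(batches_with_10_min_trigger)    # 144
print(empty_batches_per_day_now // batches_with_10_min_trigger)  # 120
```

Under that assumption, the empty batches alone account for at least 120 times as many polls per day as a 10-minute trigger would issue in total.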

by aragorn_brego at Nov. 21, 2023, 8:49 p.m.

Comments

KadELbied

1 month, 3 weeks ago

Selected Answer: C

Surely C.

upvoted 1 times

...

AlejandroU

6 months, 1 week ago

Selected Answer: C

Answer C. Setting the trigger interval to 10 minutes (option C) directly aligns with the requirement to process records within a 10-minute window. It achieves the same reduction in processing frequency as option E but without the added complexity of job scheduling or reliance on trigger once. Using the trigger once option requires external orchestration (e.g., a scheduled Databricks job) to execute every 10 minutes. This adds operational overhead and potential delays due to job scheduling or startup times, especially in a shared workspace using instance pools.

upvoted 1 times
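
For readers comparing options C and E, here is a minimal PySpark-style sketch of the two trigger settings. It assumes `df` is a streaming DataFrame and that the sink is a Delta path; the function name `start_stream`, the `mode` argument, and both paths are hypothetical and only serve the illustration.

```python
def start_stream(df, checkpoint_path, target_path, mode="interval"):
    """Start a streaming write with one of the two triggers discussed.

    df is assumed to be a streaming DataFrame; both paths are hypothetical.
    """
    writer = (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
    )
    if mode == "interval":
        # Option C: long-running query, one micro-batch every 10 minutes.
        writer = writer.trigger(processingTime="10 minutes")
    elif mode == "once":
        # Option E: process the available data, then stop. An external
        # scheduled job has to start the query again every 10 minutes.
        # Newer Spark versions also offer trigger(availableNow=True).
        writer = writer.trigger(once=True)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return writer.start(target_path)
```

With `mode="interval"` the cluster stays up between batches; with `mode="once"` the query ends after a single run, so the compute can be released until the next scheduled execution.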

...

UrcoIbz

6 months, 2 weeks ago

Selected Answer: C

In my opinion, both C and E meet the requirements. But the question says 'Holding all other variables constant'. To me, this indicates that E cannot be the solution, since it introduces new variables.

upvoted 2 times

...

benni_ale

6 months, 4 weeks ago

Selected Answer: E

The fact that the question mentions provisioned instance pools makes me think we should go for the trigger once option; otherwise the instance pools are useless.

upvoted 1 times

...

pk07

9 months ago

Selected Answer: C

E is wrong. Using trigger once would stop the stream after one execution, which does not meet the requirement of continuous processing.

upvoted 2 times

...

practicioner

10 months, 2 weeks ago

Selected Answer: E

E is correct for two reasons: 1) we are using an instance pool, which allows the job to start almost instantly; 2) the question is about reducing costs. Triggering once every 10 minutes avoids keeping a VM running (as in option C) and keeps the same SLA (because of 1) at a lower cost for compute as well as for storage (fewer API calls, which are not free).

upvoted 1 times

...

Er5

1 year, 2 months ago

The requirement is for records "to be processed in less than 10 minutes". C: "set the trigger interval to 10 minutes" means processing time + interval > 10 minutes. E: "trigger once" combined with "execute the query every 10 minutes".

upvoted 3 times
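
The latency concern raised here can be written out as a small calculation. This is a simplified model, not a statement about how Spark schedules batches: it assumes a record that arrives just after a batch starts waits one full trigger interval and is then processed in about 3 s, the batch duration given in the question. The function name `worst_case_latency_s` is made up for the example.

```python
def worst_case_latency_s(trigger_interval_s: float, batch_duration_s: float) -> float:
    """Worst-case delay for a record that just missed the previous batch."""
    # The record waits for the next trigger, then for that batch to finish.
    return trigger_interval_s + batch_duration_s


SLA_S = 10 * 60  # "processed in less than 10 minutes"

latency = worst_case_latency_s(trigger_interval_s=600, batch_duration_s=3)
print(latency)          # 603
print(latency < SLA_S)  # False
```

Under this model a strict reading of the 10-minute limit is slightly exceeded in the worst case. The same reasoning applies to a job scheduled every 10 minutes, which additionally pays any cluster start-up time.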

...

vikram12apr

1 year, 3 months ago

Selected Answer: E

The default trigger time is 0.5 seconds, so there are 120 triggers per minute. Each trigger takes 3 seconds to complete, so 120 * 3 = 360 seconds = 6 minutes; hence the job completes in 6 minutes. That leaves a buffer of 4 minutes, which can be used for compute spin-up, and since we are using instance pools, which further decrease the start-up time, I think E is the correct option to decrease the cost.

upvoted 2 times

...

hidelux

1 year, 3 months ago

Selected Answer: E

The question indicates that they are using instance pools for fast start-up time. Option C would block a VM permanently, which is not intended. E will grab a VM, run the job, and return the VM to the pool so that it is available for the other jobs mentioned in the question.

upvoted 3 times

practicioner

10 months, 2 weeks ago

You are right. But we need to guarantee the SLA, and for that reason blocking a VM (with autoscaling) is good practice.

upvoted 1 times

...

spaceexplorer

1 year, 5 months ago

Selected Answer: C

C is more effective than E, as E will incur start-up time for spinning up a new job cluster.

upvoted 3 times

...

ranith

1 year, 5 months ago

The default trigger interval is 500ms, but the question says it processes batches with 0 records and the avg time to process is 3s. If the requirement is to process under 10 minutes the best option here is to trigger every 3s.

upvoted 1 times

...

divingbell17

1 year, 6 months ago

Selected Answer: C

Both C and E meet the requirement to reduce cloud storage cost. E further reduces compute cost; however, reducing compute cost is not a requirement in the question.

upvoted 2 times

...

alexvno

1 year, 6 months ago

Selected Answer: C

For production, records need to be processed in less than 10 minutes, so we need to trigger every 10 minutes.

upvoted 3 times

...

aragorn_brego

1 year, 7 months ago

Selected Answer: E

Given that there are frequent microbatches with 0 records being processed, it indicates that the job is polling the source too often. Using the "trigger once" option would allow each microbatch to process all available data and then stop. By scheduling the job to run every 10 minutes, you ensure that the system is not constantly checking for new data when there is none, thus reducing the number of read operations from the source storage and potentially reducing costs associated with those reads.

upvoted 4 times

Gulenur_GS

1 year, 6 months ago

In this case, why not C? Triggering the processing every 10 minutes ensures the same thing, I guess.

upvoted 1 times

...