

Exam Certified Data Engineer Professional topic 1 question 49 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 49
Topic #: 1

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

  • A. Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.
  • B. The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
  • C. Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
  • D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
  • E. The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
Suggested Answer: B

Comments

benni_ale
2 days, 4 hours ago
Selected Answer: D
Answer: D.
Explanation:
Lazy Evaluation: Spark employs lazy evaluation, meaning transformations are not executed until an action (e.g., display(), count(), collect()) is called. Using display() triggers the execution of the transformations up to that point.
Caching Effects: Repeatedly executing the same cell can lead to caching, where Spark stores intermediate results. This caching can cause subsequent executions to be faster, not reflecting the true performance of the code.
Why not B: Production-Sized Data and Clusters: While using production-sized data and clusters (as mentioned in option B) can provide insights into performance, it's not the only way to troubleshoot execution times. Proper testing can often be conducted on smaller datasets and clusters, especially during the development phase.
upvoted 1 times
...
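The lazy-evaluation and caching points above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: LazyFrame, slow_double, and timed are hypothetical names, not Spark or Databricks APIs, and the toy memoizes results purely to mimic the effect of warm caches on a repeated run. Real Spark does not memoize action results this way.

```python
import time


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, rows, plan=(), cache=None):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet executed
        self._cache = cache if cache is not None else {}

    def map(self, fn):
        # Transformation: only extends the plan; no rows are processed here.
        return LazyFrame(self._rows, self._plan + (fn,), self._cache)

    def collect(self):
        # Action: runs the whole plan unless an identical plan already ran.
        if self._plan in self._cache:
            return self._cache[self._plan]
        result = list(self._rows)
        for fn in self._plan:
            result = [fn(row) for row in result]
        self._cache[self._plan] = result
        return result


calls = {"n": 0}


def slow_double(x):
    # Counts invocations so we can see when work really happens.
    calls["n"] += 1
    time.sleep(0.001)
    return x * 2


def timed(fn):
    # Returns the function's result together with its wall-clock duration.
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start


frame = LazyFrame(range(50)).map(slow_double)
calls_after_transform = calls["n"]  # the transformation alone did no work

first, t_first = timed(frame.collect)    # cold run: executes the plan
second, t_second = timed(frame.collect)  # repeat run: served from the toy cache

print(f"work done by map(): {calls_after_transform} calls")
print(f"first collect:  {t_first:.4f}s")
print(f"second collect: {t_second:.4f}s, total calls: {calls['n']}")
```

The repeat run does no new work, so averaging its time with the first run would understate how long a cold execution takes.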
practicioner
3 months, 1 week ago
Selected Answer: B
B and D are correct. The question says "which statements", which suggests that this is a question with multiple correct choices.
upvoted 2 times
...
HelixAbdu
4 months ago
Both D and B are correct. But in real life, clients sometimes do not agree to give you their production data to test with. Also, B says it is “the only way”, and this is not true for me. So I will go with D.
upvoted 4 times
RyanAck24
1 month, 3 weeks ago
I would add to this and say that this *could* be a multi-choice question (possibly) as practicioner mentions above. But if it isn't, I would go with D as well.
upvoted 1 times
...
...
ffsdfdsfdsfdsfdsf
8 months, 2 weeks ago
Selected Answer: B
These people voting D have no reading comprehension.
upvoted 4 times
...
alexvno
8 months, 2 weeks ago
Selected Answer: B
Keep the environment size and data volumes as close to production as possible so the results make sense.
upvoted 2 times
...
halleysg
8 months, 3 weeks ago
Selected Answer: D
D is correct
upvoted 3 times
...
Curious76
9 months ago
Selected Answer: D
I will go with D
upvoted 1 times
...
agreddy
9 months ago
D is the correct answer.
A. Scala is the only language accurately tested using notebooks: Not true. Spark SQL and PySpark can be accurately tested in notebooks, and production performance doesn't solely depend on language choice.
B. Production-sized data and clusters are necessary: While ideal, it's not always feasible for development. Smaller datasets and clusters can provide indicative insights.
C. IDE and local Spark/Delta Lake: Local environments won't replicate production's scale and configuration fully.
E. Jobs UI and Photon: True that Photon benefits scheduled jobs, but Jobs UI can track execution times regardless of Photon usage. However, Jobs UI runs might involve additional overhead compared to notebook cells.
Option D addresses the specific limitations of using display() for performance measurement.
upvoted 4 times
...
guillesd
9 months, 3 weeks ago
Selected Answer: B
Both B and D are correct statements. However, D is not an adjustment (see the question); it is just an affirmation that happens to be correct. B, however, is an adjustment, and it will definitely help with profiling.
upvoted 4 times
...
DAN_H
9 months, 3 weeks ago
Selected Answer: D
B does not address how to deal with the display() function. To test performance for the whole notebook, we need to avoid display(), since it is meant for checking the code and displaying the data.
upvoted 3 times
...
zzzzx
9 months, 4 weeks ago
B is correct
upvoted 1 times
...
spaceexplorer
10 months ago
Selected Answer: D
D is correct
upvoted 1 times
...
divingbell17
10 months, 4 weeks ago
Selected Answer: B
"Calling display() forces a job to trigger" doesn't make sense. display() is used to display a df/table in tabular format; it has nothing to do with a job trigger.
upvoted 2 times
guillesd
9 months, 3 weeks ago
Actually, they mean a Spark job. This is true: whenever you call display(), Spark needs to execute the transformations up to that point to be able to collect the results.
upvoted 2 times
...
...
ervinshang
11 months, 1 week ago
D is correct
upvoted 1 times
...
rok21
11 months, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times
...
sturcu
1 year, 1 month ago
Selected Answer: B
Yes, D is a true statement, but it does not answer the question. The ask is for "which adjustments will get a more accurate measure of how code is likely to perform in production". Answer D just describes why the chosen approach is not correct. It does not provide a solution.
upvoted 1 times
sturcu
1 year, 1 month ago
D would be the answer if it was preceded by: We should avoid calling display() too often or clear the cache before running each cell.
upvoted 2 times
nedlo
4 weeks, 1 day ago
But there is no caching mentioned in the question.
upvoted 1 times
...
...
...
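The idea raised in this thread of clearing the cache before each timed run could be sketched as a small timing helper. This is a minimal sketch, not an established benchmarking method: time_runs, fake_action, and the dictionary-based cache are hypothetical stand-ins so the code runs without a cluster. The commented lines show how the hooks might look in a notebook, assuming the usual `spark` session and a DataFrame named `df`.

```python
import statistics
import time


def time_runs(action, reset=None, repeats=3):
    """Time `action` several times, calling `reset` (if given) before each run."""
    durations = []
    for _ in range(repeats):
        if reset is not None:
            reset()  # clear cached state so the run starts cold
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical stand-ins for cached state and an expensive computation.
_cache = {}
computes = {"n": 0}


def fake_action():
    # Recomputes only when the cache is empty, like a run that hits warm caches.
    if "result" not in _cache:
        computes["n"] += 1
        _cache["result"] = sum(i * i for i in range(200_000))
    return _cache["result"]


cold = time_runs(fake_action, reset=_cache.clear)  # every run recomputes
cold_computes = computes["n"]
warm = time_runs(fake_action)                      # every run reuses the cache
total_computes = computes["n"]

print(f"cold mean: {statistics.mean(cold):.5f}s over {len(cold)} runs")
print(f"warm mean: {statistics.mean(warm):.5f}s over {len(warm)} runs")

# In a notebook, under the assumptions above, the hooks might be:
#   reset  = spark.catalog.clearCache
#   action = lambda: df.write.format("noop").mode("overwrite").save()
# The noop sink runs the full plan without display() or a real write.
```

Note that spark.catalog.clearCache() only drops Spark's in-memory cached tables; other cache layers may still be warm, so cold-run numbers from a notebook remain an approximation.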
tkg13
1 year, 3 months ago
Is it not B?
upvoted 1 times
BrianNguyen95
1 year, 2 months ago
Option B is only one possibility. Option D is the complete explanation.
upvoted 3 times
...
...