A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

  • A. Cmd 2
  • B. Cmd 3
  • C. Cmd 4
  • D. Cmd 5
  • E. Cmd 6
Suggested Answer: E


Highly Voted 1 year, 2 months ago
Selected Answer: E
When scheduling a Databricks notebook as a job, it's generally recommended to remove or modify commands that involve displaying output, such as using the display() function. Displaying data using display() is an interactive feature designed for exploration and visualization within the notebook interface and may not work well in a production job context. The finalDF.explain() command, which provides the execution plan of the DataFrame transformations and actions, is often useful for debugging and optimizing queries. While it doesn't display interactive visualizations like display(), it can still be informative for understanding how Spark is executing the operations on your DataFrame.
upvoted 8 times
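
One way to "modify" rather than remove a display call, as the comment above suggests, is to guard it behind a flag. This is only a minimal sketch in plain Python: the helper name show_if_interactive, the interactive flag, and the display_fn parameter are all hypothetical and not part of any Databricks API. How the flag would be set in a real job (for example, from a job parameter) is left as an assumption.

```python
def show_if_interactive(df, interactive, display_fn=print):
    """Call display_fn(df) only for interactive runs; return whether it ran.

    Hypothetical helper: in a notebook, display_fn would be the notebook's
    own display function; here it defaults to print so the sketch runs anywhere.
    """
    if interactive:
        display_fn(df)
        return True
    return False


# Record calls instead of printing, so the behaviour is easy to inspect.
calls = []
ran_in_notebook = show_if_interactive("finalDF", interactive=True, display_fn=calls.append)
ran_in_job = show_if_interactive("finalDF", interactive=False, display_fn=calls.append)
```

With the flag off, the display function is never called, so a scheduled run would neither compute nor log the rows.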
Most Recent 3 months, 1 week ago
Selected Answer: D
Cmd 5 (finalDF.explain()) is used for debugging and understanding the logical and physical plans of a DataFrame. It provides insights into how Spark plans to execute the query but does not produce output that is necessary for the scheduled job. Including this command in a scheduled job is unnecessary and could clutter the job logs without adding value to the final output.
upvoted 1 times
1 month ago
display() is a more costly operation than finalDF.explain(). The DataFrame might contain millions of rows, which you would be trying to print out each time the notebook runs.
upvoted 1 times
3 months, 2 weeks ago
Selected Answer: E
If multiple answers were allowed, I would also have picked the .explain() method and the print schema command, since neither contributes to any ETL operation. But as a rule of thumb, display() should always be removed first, so E.
upvoted 1 times
5 months, 3 weeks ago
Selected Answer: E
I agree with petrv and KhoaLe, but I will add that not displaying finalDF would be wise, as it could display and log PII data. That, to me, is why I chose E. As hal2401 said, commands 2, 5, and 6 can be removed because they don't manipulate the data.
upvoted 1 times
11 months, 2 weeks ago
Selected Answer: E
Perhaps it's a multi-select question in the exam. In that case I'd select E and D; if it's single choice, then E.
upvoted 1 times
12 months ago
Selected Answer: E
Looking through all the steps, Cmd 2, 5, and 6 can be removed without affecting the overall process. In terms of runtime cost, however, Cmd 2 and 5 have little impact, as they only show the current logical query plan. In contrast, display() in Cmd 6 actually triggers an action, which takes much longer to run.
upvoted 2 times
1 year, 1 month ago
Selected Answer: E
No display()
upvoted 3 times
1 year, 2 months ago
Selected Answer: D
No actions on production scripts. D is best
upvoted 1 times
1 year, 2 months ago
In order to display a DataFrame, you also need to compute it, so display() also acts as an action.
upvoted 1 times
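
The point in the reply above can be illustrated with a toy model of lazy evaluation. This is a plain-Python sketch, not Spark's API: the LazyFrame class, its method names, and the rows_scanned counter are invented for illustration, under the assumption that a plan-printing call only reads recorded steps while a display call has to evaluate them.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = steps      # (label, predicate) pairs, recorded but not applied
        self.rows_scanned = 0    # how many rows have actually been read

    def filter(self, label, predicate):
        # Transformation: only extends the recorded plan; no data is read.
        return LazyFrame(self._rows, self._steps + ((label, predicate),))

    def explain(self):
        # Describes the plan without reading any rows.
        return " -> ".join(["scan"] + [f"filter({label})" for label, _ in self._steps])

    def display(self):
        # Action: every row must be read and filtered before it can be shown.
        result = []
        for row in self._rows:
            self.rows_scanned += 1
            if all(predicate(row) for _, predicate in self._steps):
                result.append(row)
        return result


final_df = LazyFrame(range(100_000)).filter("even", lambda x: x % 2 == 0)

plan = final_df.explain()
scanned_after_explain = final_df.rows_scanned

shown = final_df.display()
scanned_after_display = final_df.rows_scanned
```

In this toy model, explaining the plan reads no rows at all, while displaying the result scans the whole input, which matches the cost argument made for choosing E.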
1 year, 3 months ago
Why not D?
upvoted 2 times
