Welcome to ExamTopics

Exam Certified Data Engineer Professional topic 1 question 66 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 66
Topic #: 1

The following code has been migrated to a Databricks notebook from a legacy workload:

[Code shown as an image in the original post; per the discussion comments, it uses %sh to git clone a repository, run a Python script (run.py), and mv the output files.]
The code executes successfully and produces logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

  • A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
  • B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
  • C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
  • D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
  • E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
Suggested Answer: E

Comments

aragorn_brego
Highly Voted 1 year ago
Selected Answer: E
When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.
upvoted 8 times
robodog
Most Recent 3 months ago
Selected Answer: E
Option E is correct.
upvoted 1 times
Freyr
6 months ago
Selected Answer: E
Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only, so the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move are all performed on a single node (the driver), which explains why it takes so long to process and move 1 GB of data. This approach uses neither the parallel processing capabilities of the worker nodes nor the optimization features of Databricks Spark.

Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just the file move. Using %fs for file operations is a best practice, yet it does not resolve the inefficiency of running all commands on the driver node.
upvoted 1 times
Dileepvikram
1 year ago
E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.
upvoted 2 times
sturcu
1 year ago
Selected Answer: E
%sh runs Bash commands on the driver node of the cluster. https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
upvoted 3 times
sturcu
1 year, 1 month ago
You can use mv with %sh, but the syntax is not correct; it is missing the destination operand.
upvoted 1 times
sturcu
1 year ago
I just noticed there is a space between the paths, so the syntax is correct.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other