Welcome to ExamTopics

Exam Certified Data Engineer Professional topic 1 question 66 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 66
Topic #: 1

The following code has been migrated to a Databricks notebook from a legacy workload:

[Code shown as an image in the original post; per the discussion comments, it uses %sh to git clone a repository, run a Python script (run.py), and mv the output files.]
The code executes successfully and produces logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

  • A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
  • B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
  • C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
  • D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
  • E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
Suggested Answer: E

Comments

aragorn_brego
Highly Voted 1 year ago
Selected Answer: E
When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.
upvoted 8 times
robodog
Most Recent 3 months ago
Selected Answer: E
Option E is correct.
upvoted 1 times
Freyr
6 months ago
Selected Answer: E
Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only, so the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move are all performed on a single node (the driver), which explains why it takes so long to process and move 1 GB of data. This approach uses neither the parallel processing capabilities of the worker nodes nor the optimization features of Databricks Spark.

Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just the file move. Using %fs for file operations is a best practice, yet it does not resolve the inefficiency of running all commands on the driver node.
upvoted 1 times
Dileepvikram
1 year ago
E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.
upvoted 2 times
sturcu
1 year ago
Selected Answer: E
%sh runs Bash commands on the driver node of the cluster. https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
upvoted 3 times
sturcu
1 year, 1 month ago
You can use mv with %sh, but the syntax is not correct; it is missing the destination operand.
upvoted 1 times
sturcu
1 year ago
I just noticed there is a space between the paths, so the syntax is correct.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other