Certified Data Engineer Professional Exam, Topic 1, Question #74 Discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 74
Topic #: 1

Which statement describes the correct use of pyspark.sql.functions.broadcast?

  • A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
  • B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
  • C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
  • D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
  • E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
Suggested Answer: D
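
A minimal sketch of the intended usage, assuming a locally created SparkSession and two made-up DataFrames (orders and countries) purely for illustration: broadcast is applied to a DataFrame, not a column, and hints that the marked side is small enough to be copied into memory on every executor so the join avoids shuffling the larger side.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical data: a larger fact table and a small lookup table.
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# Mark the small DataFrame for broadcasting; each executor receives a full
# copy in memory, so the orders side is joined in place without a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()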

Comments

Freyr
6 months ago
Selected Answer: D
Correct Answer: D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join. Reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
upvoted 2 times
aragorn_brego
1 year ago
Selected Answer: D
The broadcast function in PySpark is used in the context of joins. When you mark a DataFrame with broadcast, Spark tries to send this DataFrame to all worker nodes so that it can be joined with another DataFrame without shuffling the larger DataFrame across the nodes. This is particularly beneficial when the DataFrame is small enough to fit into the memory of each node. It helps to optimize the join process by reducing the amount of data that needs to be shuffled across the cluster, which can be a very expensive operation in terms of computation and time.
upvoted 3 times
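
A small follow-up sketch of the behaviour described in the comment above, using made-up spark.range DataFrames: with the hint, the physical plan should show a broadcast join; with automatic broadcasting disabled and no hint, the same join falls back to a shuffle-based sort-merge join.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(1_000_000).withColumnRenamed("id", "k")
small_df = spark.range(100).withColumnRenamed("id", "k")

# Hinted join: the plan should contain BroadcastHashJoin / BroadcastExchange,
# meaning the small side is shipped to every executor and the large side is
# joined without a shuffle.
large_df.join(broadcast(small_df), "k").explain()

# Disable automatic broadcasting for contrast: the same join without the
# hint now plans a shuffle-based SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large_df.join(small_df, "k").explain()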
Dileepvikram
1 year ago
Answer is D
upvoted 1 times
PearApple
1 year ago
The answer is D
upvoted 1 times
hm358
1 year ago
Selected Answer: D
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.broadcast.html
upvoted 2 times
sturcu
1 year, 1 month ago
Selected Answer: D
Marks a DataFrame as small enough for use in broadcast joins.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other