Certified Data Engineer Professional Exam, Topic 1, Question #74 Discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 74
Topic #: 1

Which statement describes the correct use of pyspark.sql.functions.broadcast?

  • A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
  • B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
  • C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
  • D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
  • E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
Suggested Answer: D
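
A minimal sketch of the intended usage, assuming a locally created SparkSession and two made-up DataFrames (orders and countries) purely for illustration: broadcast is applied to a DataFrame, not a column, and hints that the marked side is small enough to be copied into memory on every executor so the join avoids shuffling the larger side.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical data: a larger fact table and a small lookup table.
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 250.0), (3, "US", 75.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# Mark the small DataFrame for broadcasting; each executor receives a full
# copy in memory, so the orders side is joined in place without a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()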

Comments

Freyr
6 months ago
Selected Answer: D
Correct Answer: D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join. Reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
upvoted 2 times
aragorn_brego
1 year ago
Selected Answer: D
The broadcast function in PySpark is used in the context of joins. When you mark a DataFrame with broadcast, Spark tries to send this DataFrame to all worker nodes so that it can be joined with another DataFrame without shuffling the larger DataFrame across the nodes. This is particularly beneficial when the DataFrame is small enough to fit into the memory of each node. It helps to optimize the join process by reducing the amount of data that needs to be shuffled across the cluster, which can be a very expensive operation in terms of computation and time.
upvoted 3 times
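
A small follow-up sketch of the behaviour described in the comment above, using made-up spark.range DataFrames: with the hint, the physical plan should show a broadcast join; with automatic broadcasting disabled and no hint, the same join falls back to a shuffle-based sort-merge join.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(1_000_000).withColumnRenamed("id", "k")
small_df = spark.range(100).withColumnRenamed("id", "k")

# Hinted join: the plan should contain BroadcastHashJoin / BroadcastExchange,
# meaning the small side is shipped to every executor and the large side is
# joined without a shuffle.
large_df.join(broadcast(small_df), "k").explain()

# Disable automatic broadcasting for contrast: the same join without the
# hint now plans a shuffle-based SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large_df.join(small_df, "k").explain()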
Dileepvikram
1 year ago
Answer is D
upvoted 1 times
PearApple
1 year ago
The answer is D
upvoted 1 times
hm358
1 year ago
Selected Answer: D
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.broadcast.html
upvoted 2 times
sturcu
1 year, 1 month ago
Selected Answer: D
Marks a DataFrame as small enough for use in broadcast joins.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other