Exam Certified Associate Developer for Apache Spark topic 1 question 15 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 15
Topic #: 1


A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?

A. Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
C. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
E. DataFrame A should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.


Suggested Answer: D

by Indiee at April 25, 2023, 7:01 p.m.

Comments


te1

4 months, 2 weeks ago

Selected Answer: B

DataFrame B is the smaller DataFrame. It will be broadcasted to all worker nodes, so there is no shuffle for DataFrame B.

upvoted 2 times


monibun

8 months, 3 weeks ago

Should be B. During the join, the purpose of the shuffle is to bring the same keys from both DataFrames into the same partition. Ideally, this would require both of them to be shuffled. However, if the smaller one is broadcasted, the entire smaller DataFrame is sent to each partition, whereas the bigger one would still undergo a shuffle to bring its matching keys into each partition. Hence, only the re-shuffle of the smaller one is avoided.

upvoted 2 times
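
The shuffle step described in the comment above (bringing equal keys from both sides into the same partition) can be sketched in plain Python. This is a toy model rather than Spark's implementation: the function name `shuffle_by_key`, the sample rows, and the partition count of 2 are all hypothetical, and the comment's claims about which side is shuffled are not evaluated here.

```python
def shuffle_by_key(rows, num_partitions):
    """Redistribute (key, value) rows so that equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # Integer keys hash to themselves in CPython, so this is deterministic here.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Hypothetical sample rows for the two sides of a join.
a_rows = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]
b_rows = [(2, "b2"), (3, "b3")]

a_shuffled = shuffle_by_key(a_rows, 2)
b_shuffled = shuffle_by_key(b_rows, 2)

print(a_shuffled)  # [[(2, 'a2'), (4, 'a4')], [(1, 'a1'), (3, 'a3')]]
print(b_shuffled)  # [[(2, 'b2')], [(3, 'b3')]]
```

After both sides are partitioned with the same function, partition i of one side only needs to be compared with partition i of the other.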


65bd33e

10 months, 3 weeks ago

Selected Answer: D

The correct answer is: D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A. Explanation: In a broadcast join, the smaller DataFrame (in this case, DataFrame B, which is 1 GB) is broadcasted to all worker nodes. This allows the larger DataFrame (DataFrame A, which is 128 GB) to be joined without shuffling its data across the cluster, which would be computationally expensive. Broadcasting the smaller DataFrame reduces the amount of data that needs to be shuffled, improving the efficiency of the join operation.

upvoted 1 times
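
The mechanism described in the comment above can be illustrated with a small pure-Python model. It is only a sketch under stated assumptions: `broadcast_hash_join`, the sample rows, and the two-partition layout are hypothetical, and real Spark builds and distributes the hash table differently. The point it shows is that the partitions of the large side are processed in place, while only the small side is copied.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join every partition of the large side against a full copy of the small side."""
    # Build a hash table from the small side; conceptually each executor gets a copy.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:  # each partition stays where it already is
        for key, value in partition:
            for small_value in lookup.get(key, []):
                joined.append((key, value, small_value))
    return joined

# Hypothetical data: A is the large side (already partitioned), B is the small side.
a_partitions = [[(1, "a1"), (2, "a2")], [(1, "a3"), (3, "a4")]]
b_rows = [(1, "b1"), (2, "b2")]

result = broadcast_hash_join(a_partitions, b_rows)
print(result)  # [(1, 'a1', 'b1'), (2, 'a2', 'b2'), (1, 'a3', 'b1')]
```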


atulrao

11 months, 3 weeks ago

The correct answer is: B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself. Explanation: In Spark, a broadcast join is a specific type of join where one DataFrame is sent to every node in the cluster to avoid the costly network shuffle that can occur with large datasets in regular joins. Generally, the smaller DataFrame should be broadcasted to optimize performance. This is because broadcasting a smaller DataFrame requires less network bandwidth and memory usage across the cluster. Broadcasting DataFrame B (the smaller DataFrame at 1 GB) means that each node will have a local copy of DataFrame B, allowing them to perform the join operation locally with their respective partitions of DataFrame A without needing to shuffle DataFrame B across the network. This approach significantly reduces the amount of data that needs to be shuffled (since only DataFrame A is partitioned across the nodes), thereby improving the performance of the join operation.

upvoted 3 times
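
The bandwidth argument in the comment above can be made concrete with rough arithmetic. This is a back-of-the-envelope sketch, not a measurement: the executor count of 16 is a hypothetical assumption, the broadcast figure assumes each executor receives one full copy, and the shuffle figure is an upper bound in which every row of both sides moves across the network.

```python
def broadcast_traffic_gb(broadcast_gb, num_executors):
    # Assumption: each executor receives one full copy of the broadcast side.
    return broadcast_gb * num_executors

def shuffle_traffic_gb(left_gb, right_gb):
    # Upper bound: every row of both sides may be sent to another node.
    return left_gb + right_gb

num_executors = 16  # hypothetical cluster size

print(broadcast_traffic_gb(1, num_executors))    # 16   (broadcast B, 1 GB)
print(broadcast_traffic_gb(128, num_executors))  # 2048 (broadcast A, 128 GB)
print(shuffle_traffic_gb(128, 1))                # 129  (shuffle both sides)
```

Under these assumptions, broadcasting the 1 GB side moves far less data than either alternative, and broadcasting the 128 GB side is the most expensive option.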


azurearch

1 year, 3 months ago

Correct answer is B; D is wrong. Being the larger dataset, DataFrame A (128 GB) will get shuffled. DataFrame B (1 GB), if a hint is specified in the join, will be broadcasted and hence would not get shuffled.

upvoted 1 times


Ahlo

1 year, 4 months ago

Answer D. With a broadcast join, Spark broadcasts the smaller DataFrame to all executors, and each executor keeps this DataFrame in memory. The larger DataFrame is split and distributed across all executors, so Spark can perform the join without shuffling any data from the larger DataFrame, because the data required for the join is colocated on every executor.

upvoted 1 times


mehroosali

1 year, 7 months ago

Selected Answer: B

It should really be B.

upvoted 3 times


thanab

1 year, 10 months ago

The correct answer is B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself. A broadcast join is an optimization technique in the Spark SQL engine that is used to join two DataFrames. This technique is ideal for joining a large DataFrame with a smaller one. With a broadcast join, Spark broadcasts the smaller DataFrame to all executors, and each executor keeps this DataFrame in memory. The larger DataFrame is split and distributed across all executors, so Spark can perform the join without shuffling any data from the larger DataFrame, because the data required for the join is colocated on every executor.

upvoted 1 times


thanab

1 year, 10 months ago

Option D is incorrect because it states that DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A. However, broadcasting DataFrame B will not eliminate the need for shuffling DataFrame A. Instead, broadcasting DataFrame B will eliminate the need for shuffling itself. In a broadcast join, the smaller DataFrame is broadcasted to all executors and kept in memory. The larger DataFrame is split and distributed across all executors, so Spark can perform the join without shuffling any data from the larger DataFrame, because the data required for the join is colocated on every executor.

upvoted 2 times


eendee

1 year, 10 months ago

Selected Answer: D

https://sparkbyexamples.com/spark/broadcast-join-in-spark/ Spark Broadcast Join is an important part of the Spark SQL execution engine. With a broadcast join, Spark broadcasts the smaller DataFrame to all executors, and each executor keeps this DataFrame in memory. The larger DataFrame is split and distributed across all executors, so Spark can perform the join without shuffling any data from the larger DataFrame, because the data required for the join is colocated on every executor.

upvoted 3 times


Diws

1 year, 11 months ago

D should be correct. A broadcast join broadcasts the smaller DataFrame to prevent the shuffling of the larger DataFrame.

upvoted 2 times


TmData

2 years ago

Selected Answer: D

Option A is incorrect because the two DataFrames should not both be broadcasted; only one of them should be broadcasted to minimize shuffling. Option B is correct because DataFrame B is smaller and broadcasting it will eliminate the shuffling of DataFrame B, improving the join operation's efficiency. Option C is incorrect because DataFrame A is larger and shuffling DataFrame B is not a concern in this scenario. Option E is incorrect because DataFrame A is larger, and broadcasting it would not eliminate the shuffling of itself. The larger DataFrame typically undergoes shuffling in a broadcast join. Therefore, the correct option is D.

upvoted 3 times


ZSun

2 years ago

Selected Answer: B

B is correct.

upvoted 2 times


4be8126

2 years, 2 months ago

Selected Answer: D

The correct answer is D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A. A broadcast join is a technique where the smaller DataFrame is broadcast to all the worker nodes in the cluster, so that it can be joined with the larger DataFrame without requiring any shuffling of the larger DataFrame. This is generally more efficient than a shuffle join, which requires data to be shuffled across the network. In this scenario, DataFrame B is much smaller than DataFrame A, so it is more efficient to broadcast DataFrame B to all worker nodes in the cluster. This will eliminate the need for shuffling of DataFrame A, making the join more efficient.

upvoted 1 times


Indiee

2 years, 2 months ago

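All of the answers are incorrect. The DAG will perform a sort-merge join instead of a broadcast join (BCJ). By default, a DataFrame must be at most 10 MB to be broadcasted automatically (the default of spark.sql.autoBroadcastJoinThreshold); otherwise broadcasting it can overload the network.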

upvoted 2 times
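
The size check mentioned in the comment above can be sketched as a simple comparison. This is a simplified model and not Spark's planner code: `would_auto_broadcast` is a hypothetical helper, the 10 MB value is the documented default of `spark.sql.autoBroadcastJoinThreshold`, and an explicit broadcast hint in the query is a separate mechanism that is not modeled here.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * MB

def would_auto_broadcast(size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if a table of this estimated size falls under the threshold."""
    # A negative threshold (for example -1) disables automatic broadcasting.
    return 0 <= size_bytes <= threshold_bytes

print(would_auto_broadcast(5 * MB))     # True
print(would_auto_broadcast(1 * GB))     # False (DataFrame B in the question)
print(would_auto_broadcast(128 * GB))   # False (DataFrame A in the question)
print(would_auto_broadcast(5 * MB, threshold_bytes=-1))  # False
```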
