

Exam Certified Associate Developer for Apache Spark topic 1 question 54 discussion

The below code block contains a logical error resulting in inefficiency. The code block is intended to efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId. Identify the logical error.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")

  • A. The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
  • B. There is never a need to call the broadcast() operation in Apache Spark 3.
  • C. The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
  • D. The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
  • E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.
Suggested Answer: A

Comments

juliom6
1 year ago
Selected Answer: A
A is correct:
# https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
from pyspark.sql import types
from pyspark.sql.functions import broadcast
df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
df_small = spark.range(3)
df.join(broadcast(df_small), df.value == df_small.id).show()
upvoted 1 time
4be8126
1 year, 6 months ago
Selected Answer: A
The answer is A. The logical error in the code block is that the larger DataFrame, employeesDF, is being broadcast instead of the smaller DataFrame, storesDF. This defeats the purpose of a broadcast join, which is to send a full copy of the smaller DataFrame to every worker node so that the larger DataFrame does not have to be shuffled over the network. Broadcasting the larger DataFrame instead copies the bulk of the data to every node. To perform the broadcast join efficiently, broadcast the smaller DataFrame, which in this case is storesDF. The corrected code should be:
broadcast(storesDF).join(employeesDF, "storeId")
upvoted 2 times
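To make the reasoning behind answer A concrete, here is a rough plain-Python sketch of the idea behind a broadcast hash join. It is only an illustration, not Spark's actual implementation: the functions `broadcast_hash_join` and `rows_copied`, the sample rows, and the executor count are all hypothetical. The small side is turned into a lookup table that would be copied to every executor, while the partitions of the large side are joined in place.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Conceptual model only: join every partition of the large side
    against one hash table built from the small (broadcast) side."""
    # Build the lookup table once from the small side. On a cluster,
    # a full copy of this table would be shipped to each executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:
        # Rows of the large side stay in their partition; no shuffle.
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


def rows_copied(broadcast_side_rows, num_executors):
    """Rows sent over the network when one side is broadcast
    (hypothetical cost model: one full copy per executor)."""
    return broadcast_side_rows * num_executors


# Hypothetical sample data standing in for storesDF and employeesDF.
stores = [
    {"storeId": 1, "city": "Austin"},
    {"storeId": 2, "city": "Boston"},
]
employee_partitions = [
    [{"storeId": 1, "employee": "Ana"}, {"storeId": 2, "employee": "Bo"}],
    [{"storeId": 1, "employee": "Cy"}, {"storeId": 3, "employee": "Di"}],
]

result = broadcast_hash_join(stores, employee_partitions, "storeId")
for row in result:
    print(row)

# Compare the copy cost of broadcasting each side to 4 executors.
num_employees = sum(len(p) for p in employee_partitions)
print(rows_copied(len(stores), 4), "rows copied if stores is broadcast")
print(rows_copied(num_employees, 4), "rows copied if employees is broadcast")
```

Under this simplified cost model, the number of rows copied grows with the size of the broadcast side times the number of executors, which is why the broadcast() hint belongs on storesDF rather than employeesDF.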