Exam Certified Associate Developer for Apache Spark topic 1 question 54 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 54
Topic #: 1

The below code block contains a logical error resulting in inefficiency. The code block is intended to efficiently perform a broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF using key column storeId. Identify the logical error.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")

A. The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
B. There is never a need to call the broadcast() operation in Apache Spark 3.
C. The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
D. The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.

Suggested Answer: A

by 4be8126 at May 3, 2023, 11:48 a.m.

Comments

juliom6

5 months ago

Selected Answer: A

A is correct:

# https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html
# Assumes an active SparkSession named spark (as in the pyspark shell or a notebook).
from pyspark.sql import types
from pyspark.sql.functions import broadcast

df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
df_small = spark.range(3)
# The small DataFrame is the one wrapped in broadcast().
df.join(broadcast(df_small), df.value == df_small.id).show()

upvoted 1 time

4be8126

11 months, 3 weeks ago

Selected Answer: A

The answer is A. The logical error is that the larger DataFrame, employeesDF, is being broadcast instead of the smaller DataFrame, storesDF. This defeats the purpose of a broadcast join, which is to send a full copy of the smaller DataFrame to every worker node so that the larger DataFrame does not have to be shuffled over the network. To perform the join efficiently, broadcast the smaller DataFrame, which in this case is storesDF. The corrected code should be:

broadcast(storesDF).join(employeesDF, "storeId")
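To make the cost argument concrete, here is a rough plain-Python simulation of a broadcast hash join. It is only a sketch and not how Spark implements it: the function `broadcast_hash_join`, the helper `partition`, the sample rows, and the choice of three partitions are all made up for illustration. It counts how many rows get copied when each side is the one being broadcast.

```python
from collections import defaultdict


def partition(rows, n):
    """Split rows round-robin into n partitions (hypothetical stand-in for executors)."""
    return [rows[i::n] for i in range(n)]


def broadcast_hash_join(broadcast_rows, partitions, key):
    """Inner-join each partition against a full copy of broadcast_rows.

    Returns the joined rows and the number of rows copied to the partitions.
    """
    # Build the hash table once from the broadcast side.
    lookup = defaultdict(list)
    for row in broadcast_rows:
        lookup[row[key]].append(row)

    joined = []
    rows_copied = 0
    for part in partitions:
        # Every partition receives its own copy of the broadcast side.
        rows_copied += len(broadcast_rows)
        for row in part:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined, rows_copied


# Made-up sample data: a small stores table and a larger employees table.
stores = [
    {"storeId": 1, "city": "Austin"},
    {"storeId": 2, "city": "Boston"},
]
employees = [
    {"storeId": 1, "name": "Ana"},
    {"storeId": 1, "name": "Bo"},
    {"storeId": 2, "name": "Cy"},
    {"storeId": 2, "name": "Di"},
    {"storeId": 1, "name": "Ed"},
    {"storeId": 3, "name": "Flo"},  # no matching store, dropped by the inner join
]

# Broadcast the small side (what answer A recommends).
good, copied_good = broadcast_hash_join(stores, partition(employees, 3), "storeId")
# Broadcast the large side (the mistake in the question).
bad, copied_bad = broadcast_hash_join(employees, partition(stores, 3), "storeId")

print(copied_good, copied_bad)  # 6 18
print(len(good), len(bad))      # 5 5
```

Both calls return the same five joined rows, but broadcasting the larger table copies three times as many rows in this toy setup. With a real employeesDF the gap grows with the size difference between the two DataFrames.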

upvoted 2 times
