Exam Certified Associate Developer for Apache Spark topic 1 question 55 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 55
Topic #: 1

The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of a cross join between DataFrame storesDF and DataFrame employeesDF. Identify the error.
Code block:
storesDF.join(employeesDF, "cross")

A. A cross join is not implemented by the DataFrame.join() operations – the standalone CrossJoin() operation should be used instead.
B. There is no direct cross join in Spark, but it can be implemented by performing an outer join on all columns of both DataFrames.
C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.
D. There is no key column specified – the key column "storeId" should be the second argument.
E. A cross join is not implemented by the DataFrame.join() operations – the standalone join() operation should be used instead.

Suggested Answer: C 🗳️

by ronfun at April 9, 2023, 2:51 p.m.

Comments

mineoolee

6 months, 3 weeks ago

Selected Answer: D

It is working:
data = [(0, 2, 1100746394), (1, 2, 1474410343)]
df = spark.createDataFrame(data, ['storeId', 'a', 'openDate'])
_data = [('a', 2, 4444444444), ('c', 2, None), ('b', None, 2222222222)]
_df = spark.createDataFrame(_data, ['storeId', 'a', 'openDate'])
df.join(_df, 'a', "cross").show()

upvoted 1 times

mineoolee

6 months, 3 weeks ago

also, df.join(_df, '"cross").show() is working

upvoted 1 times

Kalipe

6 months ago

That's wrong. It doesn't work; you obviously haven't tried it.

upvoted 1 times

...

oussa_ama

10 months, 3 weeks ago

Selected Answer: C

Cross Join in PySpark: A cross join (also known as a Cartesian product) returns the Cartesian product of the two DataFrames, meaning every row from the first DataFrame is paired with every row from the second DataFrame. In PySpark, the crossJoin() method is used specifically for this type of join.

upvoted 2 times
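
To make the "every row paired with every row" description concrete, here is a minimal plain-Python sketch of a Cartesian product. It does not use Spark at all: the lists of dicts are stand-ins for DataFrames, and the column name employeeId is a hypothetical added only for illustration.

```python
from itertools import product

# Plain-Python stand-ins for two tiny DataFrames (not Spark objects).
stores = [{"storeId": 0}, {"storeId": 1}]
# "employeeId" is a hypothetical column name used only for this sketch.
employees = [{"employeeId": "a"}, {"employeeId": "b"}, {"employeeId": "c"}]

# Cartesian product: pair every store row with every employee row.
cross = [{**s, **e} for s, e in product(stores, employees)]

for row in cross:
    print(row)
```

With 2 rows on the left and 3 on the right, the result has 2 * 3 = 6 rows, and no key column is involved in forming the pairs.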

...

65bd33e

10 months, 4 weeks ago

Selected Answer: C

The correct identification of the error is: C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead. Explanation: In Spark, to perform a cross join between two DataFrames, you should use the crossJoin() method, not the join() method with the "cross" argument.

upvoted 1 times

...

Ahlo

1 year, 4 months ago

Correct answer C
from pyspark.sql import Row
df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
df.crossJoin(df2.select("height")).select("age", "name", "height").show()
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crossJoin.html

upvoted 2 times

...

azure_bimonster

1 year, 5 months ago

Selected Answer: D

D is the answer here as key is missing. As per syntax, key is needed.

upvoted 2 times

...

juliom6

1 year, 7 months ago

Selected Answer: C

C is correct.
# https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.crossJoin.html
a = spark.createDataFrame([(1, 2), (3, 4)], ['column1', 'column2'])
b = spark.createDataFrame([(5, 6), (7, 8)], ['column3', 'column4'])
df = a.crossJoin(b)
display(df)

upvoted 3 times

...

newusername

1 year, 8 months ago

Selected Answer: D

I know it looks confusing to have a key column for a cross join, but that is the join method's syntax: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.join.html
See the example below:
from pyspark.sql import Row
# Sample data for DataFrame 'a'
dataA = [Row(column1=1, column2=2), Row(column1=2, column2=4), Row(column1=3, column2=6)]
dfA = spark.createDataFrame(dataA)
# Sample data for DataFrame 'b'
dataB = [Row(column1=1, column2=2), Row(column1=2, column2=5), Row(column1=3, column2=4)]
dfB = spark.createDataFrame(dataB)
joinedDF = dfA.join(dfB, on=None, how="cross")
joinedDF.show()
It is also possible to do a cross join with DataFrame.crossJoin(), but answer C states that df.join() doesn't do a cross join, which is wrong.

upvoted 4 times

tmz1

5 months, 1 week ago

Totally agree. The statement in answer C, "A cross join is not implemented by the DataFrame.join() operation", is incorrect. It is implemented and I have tested it. Results below:
products_df = spark.table('products')
orders_df = spark.table('orders')
print(products_df.count())  # -> 200
print(orders_df.count())  # -> 2140
cross_joined_df = products_df.join(orders_df, None, "cross")
print(cross_joined_df.count())  # -> 428000

upvoted 1 times
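
The counts reported above are what a Cartesian product predicts: the output row count is the product of the two input row counts. A quick plain-Python check of that arithmetic, where the two ranges are only stand-ins for the 200 and 2140 rows mentioned in the comment (no Spark involved):

```python
from itertools import product

left_rows = 200    # row count reported for products_df
right_rows = 2140  # row count reported for orders_df

# Expected size of a Cartesian product: rows multiply.
expected = left_rows * right_rows

# Count the pairs explicitly instead of trusting the formula.
counted = sum(1 for _ in product(range(left_rows), range(right_rows)))

print(expected, counted)
```

Both numbers come out to 428000, which matches the count reported for cross_joined_df.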

...

4be8126

2 years, 2 months ago

Selected Answer: C

C. A cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

upvoted 2 times

...

peekaboo15

2 years, 2 months ago

cross join doesn't need a key. Answer is C

upvoted 2 times

4be8126

2 years, 2 months ago

No, the issue is not that the key column is missing. In a cross join, there is no key column to join on. The correct answer is C: a cross join is not implemented by the DataFrame.join() operation – the DataFrame.crossJoin() operation should be used instead.

upvoted 1 times

...

ronfun

2 years, 3 months ago

Key is missing. Answer is D.

upvoted 4 times

4be8126

2 years, 2 months ago

upvoted 1 times

ZSun

2 years, 1 month ago

Completely wrong.
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

upvoted 2 times

ZSun

2 years, 1 month ago

You can specify cross with DataFrame.join(how='cross'). The code block doesn't work because the second positional parameter is on, not how. You need to specify the key column and then pass how='cross'; otherwise the function treats 'cross' as the value of on instead of how.

upvoted 2 times
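
The positional-argument point can be illustrated without Spark. The function below is a hypothetical stub, not PySpark's implementation: it only copies the parameter order from the quoted signature, join(other, on=None, how=None), and reports which parameter each argument binds to. The strings stand in for DataFrames.

```python
def join(other, on=None, how=None):
    """Hypothetical stub: mirrors only the documented parameter order."""
    return {"other": other, "on": on, "how": how}

# Positional call shaped like the exam's code block:
# the second positional argument lands in `on`, and `how` stays None.
positional = join("employeesDF", "cross")

# Keyword call: "cross" reaches `how`, and `on` stays None.
keyword = join("employeesDF", how="cross")

print(positional)
print(keyword)
```

This shows only how Python binds the arguments. What Spark then does with on="cross" depends on its own validation and is not modeled here.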

newusername

1 year, 8 months ago

ZSun is, as always, right. 4be8126: it is not a problem to use GPT, but check its answers. Otherwise, do not post them anywhere.

upvoted 1 times

...