Exam Certified Associate Developer for Apache Spark All Questions

View all questions & answers for the Certified Associate Developer for Apache Spark exam

Exam Certified Associate Developer for Apache Spark topic 1 question 46 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 46
Topic #: 1

Which of the following operations can be used to return a new DataFrame from DataFrame storesDF without inducing a shuffle?

A. storesDF.intersect()
B. storesDF.repartition(1)
C. storesDF.union()
D. storesDF.coalesce(1)
E. storesDF.rdd.getNumPartitions()

Suggested Answer: D

by 4be8126 at May 3, 2023, 10:53 a.m.

Comments

mineoolee

7 months, 1 week ago

Selected Answer: D

C is not an operation.

upvoted 1 times

...

azurearch

Though union does not cause a shuffle, you need another DataFrame to do a union, and this question is limited to storesDF. coalesce(1) is the correct answer: it does not cause a shuffle but rather combines multiple partitions into one, i.e. reducing partitions = no shuffle. Execute storesDF.coalesce(1) and check the DAG.

upvoted 3 times

...

Ahlo

10 months, 4 weeks ago

Answer C: union. In a narrow transformation, all transformation logic is performed within one partition. A wide transformation needs a shuffle/exchange, i.e. data is redistributed to other partitions. Union is a narrow transformation.

upvoted 1 times
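
To make the narrow vs. wide distinction above concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark: partitions are plain lists, and the helpers `union`, `coalesce`, `repartition`, and `is_narrow` are hypothetical names written for this illustration. It assumes a transformation counts as narrow when every input partition feeds at most one output partition.

```python
# Toy model only: partitions are lists of unique records, not Spark partitions.

def union(left, right):
    """Concatenate the partition lists; no record changes partition."""
    return [list(p) for p in left] + [list(p) for p in right]

def coalesce(partitions, n):
    """Merge whole input partitions into n groups; no partition is split."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out

def repartition(partitions, n):
    """Deal records round-robin over n outputs, splitting input partitions."""
    out = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for rec in part:
            out[k % n].append(rec)
            k += 1
    return out

def is_narrow(parents, children):
    """True if every parent partition feeds at most one child partition."""
    home = {rec: j for j, child in enumerate(children) for rec in child}
    return all(len({home[rec] for rec in part}) <= 1 for part in parents)

stores = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 partitions
other = [[13, 14], [15, 16]]                              # 2 partitions

print(is_narrow(stores, coalesce(stores, 2)))            # True
print(is_narrow(stores, repartition(stores, 2)))         # False
print(is_narrow(stores + other, union(stores, other)))   # True
```

In this model, union and coalesce keep each input partition intact, while the round-robin repartition spreads one input partition over several outputs, which is what forces data to move between partitions.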

...

newusername

1 year, 2 months ago

Selected Answer: C

union is the only operation of those mentioned here that won't shuffle. And as @ZSun mentioned, do not follow any of 4be8126's answers; they are all copied blindly from GPT.

upvoted 2 times

...

issibra

1 year, 4 months ago

C is correct; coalesce may induce a partial shuffle.

upvoted 1 times

...

ZSun

1 year, 7 months ago

I think this question contains an error: it should ask which operation causes a shuffle, not which one avoids it. union is a narrow transformation and does not cause a shuffle. coalesce simply combines partitions into one without shuffling them. rdd.getNumPartitions just returns the number of partitions of a DataFrame, with no shuffle. Even repartition(1), since there is only one partition in the end, does not cause a shuffle; it simply combines all partitions together. Therefore the answer should be A, the only one inducing a shuffle; otherwise B, C, D, and E all avoid a shuffle.

upvoted 3 times

...

ZSun

1 year, 7 months ago

The answer is C, union rather than coalesce. Union is a narrow transformation; unlike a wide transformation, a narrow transformation does not require a shuffle. Coalesce is a wide transformation that combines multiple partitions into a smaller number of partitions. Doesn't this process require shuffling partitions together? If you ask ChatGPT, it will tell you what 4be8126 commented.

upvoted 2 times

wlademaro

11 months, 1 week ago

The problem with the union answer is that it raises an error if we call it without an argument.

upvoted 2 times

...

ZSun

1 year, 7 months ago

This is an incorrect explanation; delete it.

upvoted 6 times

...

4be8126

1 year, 8 months ago

Selected Answer: D

The correct answer is D. coalesce() can be used to return a new DataFrame with a reduced number of partitions, without inducing a shuffle. A shuffle is an expensive operation that involves the redistribution of data across a cluster, so it's important to minimize its use whenever possible. In this case, repartition() and union() both involve shuffles, while intersect() returns only the common rows between two DataFrames, and rdd.getNumPartitions() returns the number of partitions in the RDD underlying the DataFrame.

upvoted 3 times

...
