Exam Certified Associate Developer for Apache Spark topic 1 question 46 discussion

Which of the following operations can be used to return a new DataFrame from DataFrame storesDF without inducing a shuffle?

  • A. storesDF.intersect()
  • B. storesDF.repartition(1)
  • C. storesDF.union()
  • D. storesDF.coalesce(1)
  • E. storesDF.rdd.getNumPartitions()
Suggested Answer: D

Comments

azurearch
8 months, 3 weeks ago
Although union does not cause a shuffle, it needs a second DataFrame to union with, and this question is limited to storesDF. coalesce(1) is the correct answer: it does not cause a shuffle but merges the existing partitions into one, i.e. reducing the number of partitions requires no shuffle. Run storesDF.coalesce(1) and check the DAG.
upvoted 3 times
...
Ahlo
9 months, 1 week ago
Answer C: union. Narrow transformation: all transformation logic is performed within one partition. Wide transformation: a transformation that requires a shuffle/exchange, i.e. redistribution of data to other partitions. Union is a narrow transformation.
upvoted 1 times
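To make the narrow/wide distinction from the comments above concrete, here is a toy model in plain Python. It is not Spark code and not how Spark is implemented: the functions `coalesce`, `repartition` and `fan_out` are hypothetical stand-ins that only track which partition each row ends up in. A parent partition that feeds exactly one child models a narrow dependency; a parent whose rows are spread over several children models a wide one.

```python
from typing import List

Partition = List[int]


def coalesce(parents: List[Partition], n: int) -> List[Partition]:
    """Toy coalesce: merge whole parent partitions into at most n children."""
    n = min(n, len(parents))
    children: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(parents):
        # Each parent is moved as a unit into exactly one child.
        children[i * n // len(parents)].extend(part)
    return children


def repartition(parents: List[Partition], n: int) -> List[Partition]:
    """Toy repartition: deal rows round-robin over n children (stand-in for a shuffle)."""
    children: List[Partition] = [[] for _ in range(n)]
    k = 0
    for part in parents:
        for row in part:
            children[k % n].append(row)
            k += 1
    return children


def fan_out(parents: List[Partition], children: List[Partition]) -> List[int]:
    """For each parent, count the children that received at least one of its rows.

    Assumes row values are unique across all partitions.
    """
    return [sum(1 for c in children if set(p) & set(c)) for p in parents]


# Four parent partitions with unique row values (illustrative data only).
parents = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

print(fan_out(parents, coalesce(parents, 2)))     # every parent feeds one child
print(fan_out(parents, repartition(parents, 2)))  # every parent feeds both children
```

The model only describes where rows land. It says nothing about the physical plan Spark builds, so checking the plan or DAG of storesDF.coalesce(1) and storesDF.repartition(1), as suggested earlier in the thread, is still the way to confirm whether an exchange is present.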
...
newusername
1 year ago
Selected Answer: C
union is the only operation of those mentioned here that won't shuffle. And as @ZSun mentioned, do not follow any of the 4be8126 answers; they are all copied blindly from GPT.
upvoted 2 times
...
issibra
1 year, 2 months ago
C is correct; coalesce may induce a partial shuffle.
upvoted 1 times
...
ZSun
1 year, 5 months ago
I think this question contains an error: it should ask which operation causes a shuffle, not which one avoids it. union is a narrow transformation and does not cause a shuffle. coalesce simply combines partitions into one and does not shuffle them. rdd.getNumPartitions just returns the number of partitions of a DataFrame, with no shuffle. Even repartition(1) does not cause a shuffle: since there is only one partition in the end, it simply combines all partitions together. Therefore the answer should be A, the only one inducing a shuffle; or B, C, D and E if the question really means without inducing a shuffle.
upvoted 2 times
...
ZSun
1 year, 5 months ago
The answer is C: union rather than coalesce. Union is a narrow transformation. Unlike a wide transformation, a narrow transformation does not require a shuffle. Coalesce is a wide transformation that combines multiple partitions into a smaller number of partitions. Doesn't this process require shuffling partitions together? If you ask ChatGPT, it will tell you what 4be8126 commented.
upvoted 2 times
ZSun
1 year, 5 months ago
This explanation is incorrect; please delete it.
upvoted 5 times
...
wlademaro
9 months, 2 weeks ago
The problem with the union answer is that it raises an error if we run it without an argument.
upvoted 2 times
...
...
4be8126
1 year, 6 months ago
Selected Answer: D
The correct answer is D. coalesce() can be used to return a new DataFrame with a reduced number of partitions, without inducing a shuffle. A shuffle is an expensive operation that involves the redistribution of data across a cluster, so it's important to minimize its use whenever possible. In this case, repartition() and union() both involve shuffles, while intersect() returns only the common rows between two DataFrames, and rdd.getNumPartitions() returns the number of partitions in the RDD underlying the DataFrame.
upvoted 3 times
...