Although union does not cause a shuffle, it needs a second DataFrame, and this question is limited to storesDF. coalesce(1) is the correct answer: it does not shuffle, it merges the existing partitions into one, i.e. reducing the partition count this way requires no shuffle.
Run storesDF.coalesce(1) and check the DAG.
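One way to run the check suggested above is to look for an Exchange node in the physical plan rather than opening the Spark UI. This is only a rough sketch: the helper names `plan_text`, `has_exchange` and `demo` are made up for illustration, the stand-in `storesDF` is built with `spark.range` because the real one is not shown in the question, and it assumes PySpark is installed and that `explain()` prints the plan to standard output.

```python
import contextlib
import io


def plan_text(df) -> str:
    """Capture whatever df.explain() prints (assumed to be the physical plan)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def has_exchange(plan: str) -> bool:
    """Treat an 'Exchange' node in the plan text as the sign of a shuffle."""
    return "Exchange" in plan


def demo() -> None:
    # Imported here so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("shuffle-check")
        .getOrCreate()
    )
    # Hypothetical stand-in for storesDF: 100 rows in 4 partitions.
    storesDF = spark.range(0, 100, 1, 4)

    candidates = {
        "coalesce(1)": storesDF.coalesce(1),
        "repartition(1)": storesDF.repartition(1),
        "union(storesDF)": storesDF.union(storesDF),
    }
    for name, df in candidates.items():
        print(name, "-> Exchange in plan:", has_exchange(plan_text(df)))

    spark.stop()
```

Calling `demo()` in an environment with PySpark prints one line per operation. A plain substring check is crude, so reading the full plan is still worth it if the result looks surprising.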
Answer C: union
Narrow transformation: all of the transformation logic is performed within a single partition.
Wide transformation: a transformation that needs a shuffle/exchange, i.e. data is redistributed to other partitions.
Union is a narrow transformation.
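To make the narrow vs. wide distinction above concrete, here is a toy model in plain Python where a "DataFrame" is just a list of partitions. It is not how Spark implements these operations (the grouping in `coalesce` and the round-robin in `repartition` are simplifying assumptions, and all function names are hypothetical); it only shows whether rows from one input partition stay together or get scattered.

```python
from typing import List

Partitions = List[List[int]]


def union(a: Partitions, b: Partitions) -> Partitions:
    """Narrow: the partition lists are concatenated, no row changes partition."""
    return [list(p) for p in a] + [list(p) for p in b]


def coalesce(parts: Partitions, n: int) -> Partitions:
    """Narrow: whole input partitions are merged into fewer output partitions."""
    n = max(1, min(n, len(parts)))
    out: Partitions = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i * n // len(parts)].extend(p)  # partition i moves as one unit
    return out


def repartition(parts: Partitions, n: int) -> Partitions:
    """Wide: rows are dealt out one by one, so each input partition is split up."""
    out: Partitions = [[] for _ in range(n)]
    k = 0
    for p in parts:
        for row in p:
            out[k % n].append(row)
            k += 1
    return out


def scatters(before: Partitions, after: Partitions) -> bool:
    """True if some input partition's rows land in more than one output partition.

    Assumes row values are unique so each row can be traced.
    """
    home = {row: j for j, p in enumerate(after) for row in p}
    return any(len({home[row] for row in p}) > 1 for p in before)


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(scatters(parts, coalesce(parts, 2)))     # rows stay with their partition
print(scatters(parts, repartition(parts, 2)))  # rows are spread across outputs
```

In this model only `repartition` spreads a partition's rows over several outputs, which is the data movement a shuffle/exchange stands for.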
union is the only operation of those listed here that won't shuffle. And as @ZSun mentioned, do not follow any of 4be8126's answers; they are all copied blindly from GPT.
I think this question contains an error: it should not ask which operation avoids a shuffle, it should ask which one causes a shuffle.
union is a narrow transformation and does not cause a shuffle.
coalesce simply combines partitions into one; it does not shuffle them.
rdd.getNumPartitions just returns the number of partitions of a DataFrame, so no shuffle.
Even repartition(1) does not cause a shuffle: since there is only one partition in the end, it simply combines all partitions together.
Therefore the answer should be A, the only one that induces a shuffle.
Or, put the other way, B, C, D and E do not induce a shuffle.
The answer is C: union rather than coalesce.
Union is a narrow transformation. Unlike a wide transformation, a narrow transformation does not require a shuffle.
Coalesce is a wide transformation: it combines multiple partitions into a smaller number of partitions. Doesn't this process require shuffling partitions together?
If you ask ChatGPT, it will tell you exactly what 4be8126's comment says.
The correct answer is D. coalesce() can be used to return a new DataFrame with a reduced number of partitions, without inducing a shuffle.
A shuffle is an expensive operation that involves the redistribution of data across a cluster, so it's important to minimize its use whenever possible. In this case, repartition() and union() both involve shuffles, while intersect() returns only the common rows between two DataFrames, and rdd.getNumPartitions() returns the number of partitions in the RDD underlying the DataFrame.
Community vote distribution: A (35%), C (25%), B (20%), Other
azurearch, 8 months, 3 weeks ago
Ahlo, 9 months, 1 week ago
newusername, 1 year ago
issibra, 1 year, 2 months ago
ZSun, 1 year, 5 months ago
ZSun, 1 year, 5 months ago
ZSun, 1 year, 5 months ago
wlademaro, 9 months, 2 weeks ago
4be8126, 1 year, 6 months ago