Which of the following cluster configurations is most likely to experience an out-of-memory error in response to data skew in a single partition? Note: each configuration has roughly the same compute power using 100 GB of RAM and 200 cores.
A. Scenario #4
B. Scenario #5
C. Scenario #6
D. More information is needed to determine an answer.
The most likely scenario to experience an out-of-memory error in response to data skew in a single partition is:
C. Scenario #6: 12.5 GB Worker Node, 12.5 GB Executor. 1 Driver & 8 Executors.
Explanation:
Data skew refers to an uneven distribution of data across partitions. When there is significant skew in a single partition, it can lead to increased memory usage for that specific partition, potentially causing out-of-memory errors. The smaller the available memory per executor, the higher the likelihood of encountering such issues.
In this case, Scenario #6 has the smallest worker node and executor configuration, with only 12.5 GB of RAM available for each executor. With 8 executors, the total available memory is still 100 GB (similar to other scenarios), but the reduced memory per executor increases the risk of encountering out-of-memory errors when handling skewed data in a single partition.
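The arithmetic behind this comment can be sketched in a few lines of Python. This is only an illustration: the helper name `executor_profile` is hypothetical, the 100 GB / 200 cores / 8 executors figures come from the question and the Scenario #6 description above, and the even split of cores across executors is an assumption. It also ignores Spark's memory overhead and memory fractions, so the real usable memory per executor would be lower.

```python
def executor_profile(total_ram_gb, total_cores, num_executors):
    """Split cluster resources evenly across executors (an assumption)."""
    ram_per_executor = total_ram_gb / num_executors
    cores_per_executor = total_cores / num_executors
    return {
        "ram_per_executor_gb": ram_per_executor,
        "cores_per_executor": cores_per_executor,
        # Average share per concurrently running task; a skewed partition
        # can need far more than this average.
        "ram_per_core_gb": ram_per_executor / cores_per_executor,
    }


# Scenario #6 as described above: 100 GB total, 200 cores, 8 executors.
scenario_6 = executor_profile(total_ram_gb=100, total_cores=200, num_executors=8)
print(scenario_6)
```

Under these assumptions each executor in Scenario #6 holds 12.5 GB shared by 25 cores. A single skewed partition is handled by one task inside one executor, so the executor's memory, not the cluster total, is the relevant ceiling.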
D is correct. Even though you have less executor memory in Scenario #6, Spark will still complete the process; it might just take more time to do the shuffle.
Option A, Scenario #4, has larger worker nodes and executors compared to Scenario #6, reducing the likelihood of encountering out-of-memory errors due to data skew.
Option B, Scenario #5, also has larger worker nodes and executors compared to Scenario #6, providing more memory per executor and reducing the risk of out-of-memory errors.
Option D states that more information is needed to determine an answer, but based on the available information, Scenario #6 is the most likely to experience out-of-memory errors due to data skew in a single partition.
Option E, Scenario #1, has larger worker nodes and executors compared to Scenario #6, reducing the likelihood of out-of-memory errors due to data skew.
Data skew is when a few partitions are oversized. Because of the initial partitioning, each of these large partitions has to be processed by a single thread, which can cause an OOM error.
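One rough way to spot the kind of skew described here is to compare the largest partition with a typical one. The sketch below is plain Python and not a Spark API: the function name `skew_ratio` and the partition sizes are hypothetical, and in practice the sizes would have to be collected from the job itself, for example from the Spark UI.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    typical = median(partition_sizes)
    if typical == 0:
        raise ValueError("median partition size is zero")
    return max(partition_sizes) / typical


# Hypothetical partition sizes in MB; one partition is far larger than the rest.
sizes_mb = [128, 130, 125, 127, 4096, 129, 126, 131]
ratio = skew_ratio(sizes_mb)
print(f"largest partition is {ratio:.1f}x the median")
```

A ratio close to 1 suggests evenly sized partitions, while a large ratio means one task has to process far more data than its peers, which is the situation the comment above describes.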
TmData (Highly Voted, 1 year, 2 months ago)
azurearch (Most Recent, 6 months ago)
Mohitsain (1 year, 2 months ago)
TmData (1 year, 2 months ago)
Indiee (1 year, 4 months ago)
Dhruv_Ajmeri (1 year, 5 months ago)