Exam Certified Associate Developer for Apache Spark All Questions

View all questions & answers for the Certified Associate Developer for Apache Spark exam

Exam Certified Associate Developer for Apache Spark topic 1 question 14 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 14
Topic #: 1


Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?

A. When all of the computed data in DataFrame df can fit into memory.
B. When the memory is full and it’s faster to recompute all the data in DataFrame df rather than read it from disk.
C. When it’s faster to recompute all the data in DataFrame df that cannot fit into memory based on its logical plan rather than read it from disk.
D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.
E. The storage level MEMORY_ONLY will always be more advantageous because it’s faster to read data from memory than it is to read data from disk.


Suggested Answer: D 🗳️

by sousouka at March 29, 2023, 10:59 p.m.

Comments


sousouka

Highly Voted 1 year, 3 months ago

D. When it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan.

upvoted 9 times

...

All the other explanations are either wrong or misleading. To understand the question, you need to understand the difference between MEMORY_ONLY and MEMORY_AND_DISK.

1. MEMORY_AND_DISK is the default storage level when you cache or persist a DataFrame. If the data is larger than the available memory, the extra data is stored on disk. The next time we need the data, we read it from memory first and then from disk.

2. MEMORY_ONLY means that if the data is larger than the available memory, the extra data is not stored. The next time we read the data, we read from memory first and then recompute the extra data that could not be stored in memory.

PS. Mr. 4be8126 is wrong about an error being raised when memory runs out.

Therefore, the difference between MEMORY_ONLY and MEMORY_AND_DISK lies in how they handle the data that does not fit in memory. That is option D: if it is faster to read that data from disk than to recompute it, use MEMORY_AND_DISK.
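The behaviour described in this comment can be sketched as a toy model in plain Python. This is not Spark code: `read_cached`, `ReadCost`, and the idea of fixed "memory slots" are hypothetical names invented for illustration, and the model ignores how Spark actually evicts cached blocks. It only counts where each partition would come from on a second read.

```python
from dataclasses import dataclass


@dataclass
class ReadCost:
    """Where the partitions came from on a re-read (toy model, not a Spark API)."""
    from_memory: int = 0
    from_disk: int = 0
    recomputed: int = 0


def read_cached(num_partitions: int, memory_slots: int, level: str) -> ReadCost:
    """Count partition sources for a second read of a persisted dataset.

    Assumption: the first `memory_slots` partitions fit in memory and the
    rest do not. Real Spark eviction is more involved than this.
    """
    if level not in ("MEMORY_ONLY", "MEMORY_AND_DISK"):
        raise ValueError(f"unsupported level in this sketch: {level}")
    cost = ReadCost()
    for partition in range(num_partitions):
        if partition < memory_slots:
            cost.from_memory += 1      # cached in memory under both levels
        elif level == "MEMORY_AND_DISK":
            cost.from_disk += 1        # overflow was spilled to disk
        else:
            cost.recomputed += 1       # overflow was dropped, so recompute it
    return cost


memory_only = read_cached(10, 6, "MEMORY_ONLY")
memory_and_disk = read_cached(10, 6, "MEMORY_AND_DISK")
print(memory_only)
print(memory_and_disk)
```

In this model the two levels differ only in the four partitions that do not fit: one level recomputes them, the other reads them from disk.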

upvoted 7 times

...

newusername

Most Recent 8 months ago

Selected Answer: D

D is correct

upvoted 1 times

...

astone42

11 months ago

Selected Answer: D

D is correct

upvoted 1 times

...

singh100

11 months, 1 week ago

D. It is faster to read the computed data from disk instead of recomputing it based on its logical plan when the recomputation is costly and time-consuming.

upvoted 1 times

...

SonicBoom10C9

1 year, 1 month ago

Selected Answer: D

If it's faster to read from memory and the data can fit in, then there is no reason to use MEMORY_AND_DISK; MEMORY_ONLY is sufficient. Also, if it's faster to recompute than to read from disk, that's what you would do. The only option left is when the data is too big to fit in memory and too expensive to recompute, so reading from disk (or rather caching from disk into memory on the fly) is faster.
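The three cases in this comment can be written down as a small decision rule. This is only a sketch of the reasoning: `choose_storage_level` and its parameters are hypothetical names, and the two timings are assumed to be estimates you already have for the data that does not fit in memory.

```python
def choose_storage_level(
    fits_in_memory: bool,
    disk_read_seconds: float,
    recompute_seconds: float,
) -> str:
    """Return the storage level name the comment's reasoning would pick.

    All inputs are assumptions supplied by the caller; Spark does not
    measure or expose these estimates for you.
    """
    if fits_in_memory:
        # Nothing overflows, so the disk tier would never be used.
        return "MEMORY_ONLY"
    if disk_read_seconds < recompute_seconds:
        # Overflow is cheaper to read back than to rebuild: option D.
        return "MEMORY_AND_DISK"
    # Overflow is cheaper to rebuild from the logical plan.
    return "MEMORY_ONLY"


print(choose_storage_level(False, disk_read_seconds=2.0, recompute_seconds=30.0))
```

Only the middle branch favours MEMORY_AND_DISK, which is the situation option D describes.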

upvoted 1 times

...

4be8126

1 year, 2 months ago

Selected Answer: D

The most advantageous situation to store a DataFrame at the MEMORY_AND_DISK storage level instead of the MEMORY_ONLY storage level is option D - when it’s faster to read all the computed data in DataFrame df that cannot fit into memory from disk rather than recompute it based on its logical plan. This is because the MEMORY_ONLY storage level only stores data in memory, which can result in an out-of-memory error if the data exceeds the available memory. On the other hand, the MEMORY_AND_DISK storage level will spill data to disk if there is not enough memory available, allowing more data to be processed without errors. In situations where the computed data can fit entirely into memory, it is best to use the MEMORY_ONLY storage level as it will be faster than reading from disk. However, when there is not enough memory to store all the computed data, it may be necessary to use the MEMORY_AND_DISK storage level.

upvoted 1 times

...

sly75

1 year, 2 months ago

Yes, but what about the link with the question? I would say B too :)

upvoted 1 times

...

Indiee

1 year, 2 months ago

Answer is D. This is the whole idea behind caching

upvoted 2 times

...

Comments

sousouka

ZSun

newusername

astone42

singh100

SonicBoom10C9

4be8126

sly75

Indiee