Welcome to ExamTopics


Exam Certified Associate Developer for Apache Spark topic 1 question 45 discussion

The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()

  • A. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
  • B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
  • C. The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
  • D. DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
  • E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
Suggested Answer: B

Comments

sofiess
1 month, 1 week ago
A, because the default behavior of cache() is MEMORY_AND_DISK; if you want MEMORY_ONLY, you must specify it explicitly.
upvoted 1 times
...
azurearch
8 months, 3 weeks ago
E is wrong. "The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default)": note the use of "only" here; cache() can also spill to disk if required. B is also wrong: there is no requirement to set the storage level prior to calling cache(). The correct answer is A.
upvoted 1 times
...
juliom6
1 year ago
Selected Answer: E
E is correct!

from pyspark.sql.types import IntegerType
from pyspark import StorageLevel

storesDF = spark.createDataFrame([2023, 2024], IntegerType())
print(storesDF.persist(StorageLevel.MEMORY_ONLY).storageLevel)
upvoted 1 times
...
juadaves
1 year, 1 month ago
E. From help(DataFrame.cache):

    cache() -> 'DataFrame'
        Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).

        .. versionadded:: 1.3.0
        .. versionchanged:: 3.4.0
            Supports Spark Connect.

        Notes
        -----
        The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
upvoted 1 times
...
singh100
1 year, 3 months ago
E is correct. You cannot set the storage level to MEMORY_ONLY with cache(): if memory is available it keeps everything in memory, otherwise it spills to disk. To keep everything in memory only, you need to use persist() with storage level MEMORY_ONLY.
upvoted 4 times
...
ItsAB
1 year, 4 months ago
There are two options in play here: B and E. For those who chose B: you can't explicitly set the storage level, since storageLevel is a read-only property, so the correct answer is E.
upvoted 1 times
...
Jtic
1 year, 6 months ago
Selected Answer: E
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache(). This option is incorrect: the storage level does not need to be set via storesDF.storageLevel prior to calling cache(); cache() can be used directly on the DataFrame without explicitly setting the storage level.

E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead. This option is the correct answer. The error in the code block is that cache() is used instead of persist(): while cache() caches DataFrames at the default MEMORY_AND_DISK level, persist() provides more flexibility by allowing a different storage level to be specified, such as MEMORY_ONLY for caching only in memory. Therefore, persist() should be used instead of cache() to achieve the desired caching behavior.
upvoted 1 times
...
4be8126
1 year, 6 months ago
Selected Answer: B
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().

The storage level of a DataFrame cache can be specified as an argument to the cache() operation, but if the storage level has not been specified, the default MEMORY_AND_DISK level is used. Therefore, option A is incorrect.

Option C is incorrect because caching and checkpointing are different operations in Spark: caching stores a DataFrame in memory or on disk, while checkpointing saves a DataFrame to a reliable storage system like HDFS, which is necessary for iterative computations.

Option D is incorrect because DataFrames can be cached in memory or on disk using the cache() operation.

Option E is incorrect because cache() is the recommended method for caching DataFrames in Spark, and it supports caching at all storage levels, including MEMORY_ONLY. The persist() operation can be used to specify a storage level, but cache() is simpler and more commonly used.
upvoted 1 times
ZSun
1 year, 5 months ago
Wrong explanation. You can call cache() or persist() without setting a storage level; it will use the default MEMORY_AND_DISK. You have clearly misunderstood the question itself: storesDF.cache().count() is workable code, but it fails the requirement, and that is the issue. The question asks for caching "only in memory", meaning that if the data does not fit in memory, it should be recomputed rather than stored on disk. Therefore, you need to explicitly set the storage level to MEMORY_ONLY. A is correct.
upvoted 3 times
...
...
peekaboo15
1 year, 7 months ago
The answer should be E. See this post for reference https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist
upvoted 2 times
4be8126
1 year, 6 months ago
No, option E is incorrect. The cache() method is the appropriate method to cache a DataFrame in Spark's memory, and it can cache DataFrames at the MEMORY_ONLY level if that's what is desired. The persist() method is a more general-purpose method that allows the user to specify other storage levels (such as MEMORY_AND_DISK), but it is not required for this task.
upvoted 1 times
juadaves
1 year, 1 month ago
You should use storesDF.persist(StorageLevel.MEMORY_ONLY).count()
upvoted 1 times
...
...
...
Community vote distribution: A (35%), C (25%), B (20%), Other