The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark's memory and then return the number of rows in the cached DataFrame. Identify the error.

Code block:

storesDF.cache().count()
A.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
B.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
C.
The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
D.
DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
E.
The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
E is wrong. E claims the cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – note the use of 'only' here; cache can also store on disk if required.
B is also wrong: there is no requirement to set the storage level prior to calling cache().
The correct answer is A.
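One way to check what cache() actually accepts, in a PySpark shell (a minimal sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# DataFrame.cache() takes no arguments in PySpark; passing a
# storage level raises a TypeError rather than setting the level.
try:
    df.cache(StorageLevel.MEMORY_ONLY)
except TypeError as e:
    print(e)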
E is correct!
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([2023, 2024], IntegerType())
# persist() accepts an explicit storage level; cache() does not
print(storesDF.persist(StorageLevel.MEMORY_ONLY).storageLevel)
E
cache() -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).
.. versionadded:: 1.3.0
.. versionchanged:: 3.4.0
Supports Spark Connect.
Notes
-----
The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
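The default is easy to verify directly (a small sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

df.cache()  # no storage-level argument is possible here
print(df.storageLevel)  # typically: Disk Memory Deserialized 1x Replicated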
E is correct. You cannot set StorageLevel MEMORY_ONLY with cache(); if memory is available it keeps everything in memory, otherwise it spills to disk. To keep everything in memory you need to use persist() with storage level MEMORY_ONLY.
E
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
This option is incorrect. The storage level does not need to be – and cannot be – set via storesDF.storageLevel prior to calling cache(); storageLevel is a read-only property. The cache() operation can be used directly on the DataFrame without explicitly setting a storage level.
E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
This option is the correct answer. The error in the code block is that cache() is used where persist() is needed. cache() always caches DataFrames at the default MEMORY_AND_DISK level, while persist() lets a different storage level be specified, such as MEMORY_ONLY for caching only in memory. Therefore persist() should be used instead of cache() to achieve the desired caching behavior.
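A corrected version of the code block would then look like this (sketch; assumes storesDF already exists):

from pyspark import StorageLevel

# Cache only in memory, then materialize the cache and count the rows
storesDF.persist(StorageLevel.MEMORY_ONLY).count()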
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
The storage level of a DataFrame cache can be specified as an argument to the cache() operation, but if the storage level has not been specified, the default MEMORY_AND_DISK level is used. Therefore, option A is incorrect.
Option C is incorrect because caching and checkpointing are different operations in Spark. Caching stores a DataFrame in memory or on disk, while checkpointing saves a DataFrame to a reliable storage system like HDFS, which is necessary for iterative computations.
Option D is incorrect because DataFrames can be cached in memory or on disk using the cache() operation.
Option E is incorrect because cache() is the recommended method for caching DataFrames in Spark, and it supports caching at all storage levels, including MEMORY_ONLY. The persist() operation can be used to specify a storage level, but cache() is simpler and more commonly used.
Wrong explanation. You can call cache() or persist() without setting a storage level; it will use the default MEMORY_AND_DISK.
You clearly misunderstand the question itself. storesDF.cache().count() is workable code, but it fails the requirement – that is the issue.
The question asked for "only in memory": if the data does not fit in memory, it should not be stored on disk but rather recomputed. Therefore, you need to set the storage level specifically to MEMORY_ONLY.
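The difference is visible in the storage-level flags themselves (quick sketch):

from pyspark import StorageLevel

# MEMORY_ONLY never uses disk; partitions that don't fit are recomputed
print(StorageLevel.MEMORY_ONLY.useDisk)      # False
# MEMORY_AND_DISK spills partitions that don't fit in memory to disk
print(StorageLevel.MEMORY_AND_DISK.useDisk)  # True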
A is correct
No, option E is incorrect. The cache() method is the appropriate method to cache a DataFrame in Spark's memory, and it can cache DataFrames at the MEMORY_ONLY level if that's what is desired. The persist() method is a more general-purpose method that allows the user to specify other storage levels (such as MEMORY_AND_DISK), but it is not required for this task.
You should use storesDF.persist(StorageLevel.MEMORY_ONLY).count()
Community vote distribution: A (35%), C (25%), B (20%), Other