The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark's memory and then return the number of rows in the cached DataFrame. Identify the error.

Code block:

storesDF.cache().count()
A.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be specified to MEMORY_ONLY as an argument to cache().
B.
The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
C.
The storesDF DataFrame has not been checkpointed – it must have a checkpoint in order to be cached.
D.
DataFrames themselves cannot be cached – DataFrame storesDF must be cached as a table.
E.
The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
E is wrong. E claims the cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – note the use of 'only' here; cache can also store on disk if required.
B is also wrong: there is no requirement to set the storage level prior to calling cache().
The correct answer is A.
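One way to check what cache() actually accepts, in a PySpark shell (a minimal sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# DataFrame.cache() takes no arguments in PySpark; passing a
# storage level raises a TypeError rather than setting the level.
try:
    df.cache(StorageLevel.MEMORY_ONLY)
except TypeError as e:
    print(e)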
E is correct!
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
storesDF = spark.createDataFrame([2023, 2024], IntegerType())
# persist() accepts an explicit storage level; cache() does not
print(storesDF.persist(StorageLevel.MEMORY_ONLY).storageLevel)
E
cache() -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).
.. versionadded:: 1.3.0
.. versionchanged:: 3.4.0
Supports Spark Connect.
Notes
-----
The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
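The default is easy to verify directly (a small sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

df.cache()  # no storage-level argument is possible here
print(df.storageLevel)  # typically: Disk Memory Deserialized 1x Replicated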
E is correct. You cannot set StorageLevel MEMORY_ONLY with cache(); if memory is available it keeps everything in memory, otherwise it spills to disk. To keep everything in memory you need to use persist() with storage level MEMORY_ONLY.
E
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
This option is incorrect. The storage level does not need to be – and cannot be – set via storesDF.storageLevel prior to calling cache(); storageLevel is a read-only property. The cache() operation can be used directly on the DataFrame without explicitly setting a storage level.
E. The cache() operation can only cache DataFrames at the MEMORY_AND_DISK level (the default) – persist() should be used instead.
This option is the correct answer. The error in the code block is that cache() is used where persist() is needed. cache() always caches DataFrames at the default MEMORY_AND_DISK level, while persist() lets a different storage level be specified, such as MEMORY_ONLY for caching only in memory. Therefore persist() should be used instead of cache() to achieve the desired caching behavior.
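A corrected version of the code block would then look like this (sketch; assumes storesDF already exists):

from pyspark import StorageLevel

# Cache only in memory, then materialize the cache and count the rows
storesDF.persist(StorageLevel.MEMORY_ONLY).count()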
B. The cache() operation caches DataFrames at the MEMORY_AND_DISK level by default – the storage level must be set via storesDF.storageLevel prior to calling cache().
The storage level of a DataFrame cache can be specified as an argument to the cache() operation, but if the storage level has not been specified, the default MEMORY_AND_DISK level is used. Therefore, option A is incorrect.
Option C is incorrect because caching and checkpointing are different operations in Spark. Caching stores a DataFrame in memory or on disk, while checkpointing saves a DataFrame to a reliable storage system like HDFS, which is necessary for iterative computations.
Option D is incorrect because DataFrames can be cached in memory or on disk using the cache() operation.
Option E is incorrect because cache() is the recommended method for caching DataFrames in Spark, and it supports caching at all storage levels, including MEMORY_ONLY. The persist() operation can be used to specify a storage level, but cache() is simpler and more commonly used.
Wrong explanation. You can call cache() or persist() without setting a storage level; it will use the default MEMORY_AND_DISK.
You clearly misunderstand the question itself. storesDF.cache().count() is workable code, but it fails the requirement – that is the issue.
The question asked for "only in memory": if the data does not fit in memory, it should not be stored on disk but rather recomputed. Therefore, you need to set the storage level specifically to MEMORY_ONLY.
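The difference is visible in the storage-level flags themselves (quick sketch):

from pyspark import StorageLevel

# MEMORY_ONLY never uses disk; partitions that don't fit are recomputed
print(StorageLevel.MEMORY_ONLY.useDisk)      # False
# MEMORY_AND_DISK spills partitions that don't fit in memory to disk
print(StorageLevel.MEMORY_AND_DISK.useDisk)  # True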
A is correct
No, option E is incorrect. The cache() method is the appropriate method to cache a DataFrame in Spark's memory, and it can cache DataFrames at the MEMORY_ONLY level if that's what is desired. The persist() method is a more general-purpose method that allows the user to specify other storage levels (such as MEMORY_AND_DISK), but it is not required for this task.
You should use storesDF.persist(StorageLevel.MEMORY_ONLY).count()
Community vote distribution: A (35%), C (25%), B (20%), Other