Exam Certified Machine Learning Associate topic 1 question 29 discussion

Actual exam question from Databricks's Certified Machine Learning Associate

Question #: 29
Topic #: 1

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

A. import pyspark.pandas as ps
df = ps.DataFrame(spark_df)
B. import pyspark.pandas as ps
df = ps.to_pandas(spark_df)
C. spark_df.to_sql()
D. import pandas as pd
df = pd.DataFrame(spark_df)
E. spark_df.to_pandas()

Suggested Answer: A

by 68c6a4b at June 19, 2024, 4:51 a.m.

Comments

Shuttle

3 weeks ago

Selected Answer: A

A converts to a pandas-like DataFrame that supports pandas functions while still running on the distributed Spark framework. E converts to a real pandas DataFrame, where you can no longer use the distributed Spark features.
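
To make the distinction concrete, here is a minimal sketch, assuming PySpark 3.2 or later (where pyspark.pandas is available). The helper names to_pandas_on_spark and to_plain_pandas are hypothetical and introduced only for illustration. Note that on a pyspark.sql.DataFrame the collecting method is spelled toPandas(); to_pandas() is the method on a pandas-on-Spark DataFrame.

```python
def to_pandas_on_spark(spark_df):
    """Option A: wrap the Spark DataFrame in a pandas-on-Spark DataFrame.

    The result exposes a pandas-style API while the data stays distributed.
    """
    # Imported inside the function so this sketch can be loaded
    # even on a machine without PySpark installed.
    import pyspark.pandas as ps
    return ps.DataFrame(spark_df)


def to_plain_pandas(spark_df):
    """What option E is aiming at: collect everything into plain pandas.

    On a pyspark.sql.DataFrame the method is toPandas(); all rows are
    pulled into the driver's memory.
    """
    return spark_df.toPandas()
```

With a live Spark session, to_pandas_on_spark(spark_df) returns an object you can keep treating like a pandas DataFrame without moving the data off the cluster, while to_plain_pandas(spark_df) gives a local pandas DataFrame.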

upvoted 2 times

jackttt

1 month ago

Selected Answer: A

A uses the pandas API on Spark; E uses regular pandas, which doesn't support distributed processing.

upvoted 2 times

smonov

8 months ago

Selected Answer: E

It's E

upvoted 1 times

ricorosol

9 months, 2 weeks ago

E is the closest answer, but the correct method name is toPandas(). From the docs: pyspark.sql.DataFrame.toPandas, with the signature DataFrame.toPandas() → PandasDataFrameLike

upvoted 2 times

rajneesharora

1 year ago

A is correct

upvoted 1 times

68c6a4b

1 year ago

It's not A; it's E, spark_df.to_pandas(). Here's why: the to_pandas() method is a built-in method of the PySpark DataFrame API. It converts a Spark DataFrame to a pandas DataFrame. By calling spark_df.to_pandas(), the data scientist can convert the Spark DataFrame spark_df to a pandas DataFrame and use the familiar pandas API for further feature engineering. The resulting pandas DataFrame is held in memory on the driver node, so this approach is suitable only when the data is small enough to fit in the driver's memory.
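
The driver-memory caveat can be handled explicitly by capping the row count before collecting. The following is a sketch, not a recommended recipe: the helper name collect_sample and the max_rows default are assumptions, and it uses toPandas(), the actual method name on pyspark.sql.DataFrame.

```python
def collect_sample(spark_df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame into plain pandas.

    limit() is applied on the Spark side, so only the capped subset is
    transferred to the driver by toPandas().
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return spark_df.limit(max_rows).toPandas()
```

The result is an ordinary in-memory pandas DataFrame, so none of the subsequent operations on it are distributed.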

upvoted 3 times

rajneesharora

1 year ago

E is not correct: to_pandas would convert the data into a plain pandas DataFrame, whereas what is given is a Spark DataFrame.

upvoted 2 times
