Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF? A sample of storesDF is displayed below:
A.
storesDF.withColumn("productCategories", explode(col("productCategories")))
B.
storesDF.withColumn("productCategories", split(col("productCategories")))
C.
storesDF.withColumn("productCategories", col("productCategories").explode())
D.
storesDF.withColumn("productCategories", col("productCategories").split())
E.
storesDF.withColumn("productCategories", explode("productCategories"))
Both options A and E work with Spark 3.5.1.
However, A is the safer choice for backward compatibility.
See the code example below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Each store has an array of three product categories.
data = [
    (0, ["value 1", "value 2", "value 3"]),
    (1, ["value 1", "value 2", "value 3"]),
    (2, ["value 1", "value 2", "value 3"]),
]
storesDF = spark.createDataFrame(data, ["storeID", "productCategories"])

# Both calls produce one row per array element.
storesDF.withColumn("productCategories", explode(col("productCategories"))).show()  # A.
storesDF.withColumn("productCategories", explode("productCategories")).show()  # E.
The explode function accepts either a str or a Column as input. The second parameter of withColumn() must be a Column object, but explode() returns a Column in either case, so the col() wrapper is not strictly required here; it only makes the column reference explicit.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn
Option A is correct: storesDF.withColumn("productCategories", explode(col("productCategories"))).
Explanation:
The explode function is used to transform a column of arrays or maps into multiple rows, one for each element in the array or map. In this case, productCategories is a column with arrays of strings.
The withColumn function is used to add a new column or update an existing column. The first argument is the name of the new or existing column, and the second argument is the expression that defines the values for the column.
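To make the row multiplication concrete, here is a rough plain-Python model of what explode does when used inside withColumn. It does not use Spark at all: the helper name explode_rows and the sample rows are hypothetical, made up for illustration only. It also assumes Spark's default explode behavior, where rows whose array is null or empty produce no output rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def explode_rows(rows: List[Row], column: str) -> List[Row]:
    """Plain-Python sketch of df.withColumn(column, explode(col(column)))."""
    out: List[Row] = []
    for row in rows:
        values = row[column]
        # A null (None) or empty array yields no output rows at all.
        for value in values or []:
            new_row = dict(row)      # copy the other columns unchanged
            new_row[column] = value  # replace the array with a single element
            out.append(new_row)
    return out


# Hypothetical sample data, not the real storesDF.
stores = [
    {"storeID": 0, "productCategories": ["value 1", "value 2", "value 3"]},
    {"storeID": 1, "productCategories": ["value 1"]},
    {"storeID": 2, "productCategories": []},
]

exploded = explode_rows(stores, "productCategories")
print(len(stores), len(exploded))  # 3 4
for row in exploded:
    print(row)
```

Store 0 contributes three rows, store 1 contributes one, and store 2 disappears because its array is empty. This is why the exploded DataFrame usually has many more rows than the original.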