Exam Certified Associate Developer for Apache Spark topic 1 question 27 discussion

Actual exam question from Databricks's Certified Associate Developer for Apache Spark

Question #: 27
Topic #: 1

Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of column storeDescription in DataFrame storesDF?
A sample of DataFrame storesDF is below:

A. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: "))
B. storesDF.withColumn("storeDescription", col("storeDescription").regexp_replace("^Description: ", ""))
C. storesDF.withColumn("storeDescription", regexp_extract(col("storeDescription"), "^Description: ", ""))
D. storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", ""))
E. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))

Suggested Answer: E

by TC007 at April 3, 2023, 5:06 p.m.

Comments

jds0

11 months, 4 weeks ago

Selected Answer: E

Both D and E work with Spark 3.5.1, but E is better for backward compatibility. See the code below:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_replace

    spark = SparkSession.builder.appName("MyApp").getOrCreate()

    data = [
        (0, "Description: Store 0"),
        (1, "Description: Store 1"),
        (2, "Description: Store 2"),
    ]
    storesDF = spark.createDataFrame(data, ["storeID", "StoreDescription"])

    # Column passed as a Column object (as in option E)
    storesDF.withColumn("storeDescription", regexp_replace(col("StoreDescription"), "Description: ", "")).show()
    # Column passed by name as a string (as in option D)
    storesDF.withColumn("storeDescription", regexp_replace("StoreDescription", "Description: ", "")).show()

upvoted 1 times

...

arturffsi

1 year, 4 months ago

Both D and E are correct according to the new version

upvoted 1 times

...

azure_bimonster

1 year, 5 months ago

Selected Answer: E

E is most likely correct in this scenario

upvoted 1 times

...

newusername

1 year, 10 months ago

Both work:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace, col

    spark = SparkSession.builder.appName("test").getOrCreate()

    data = [
        (1, "Description: This is a tech store. Description: This"),
        (2, "Description: This is a grocery store."),
        (3, "Description: This is a book store."),
    ]
    storesDF = spark.createDataFrame(data, ["storeID", "storeDescription"])
    storesDF.show(truncate=False)

    # storesDF is not reassigned, so each case runs against the original data.

    # Case D: column passed by name as a string
    print("Case D")
    storesDF.withColumn("storeDescription", regexp_replace("storeDescription", "^Description: ", "")).show(truncate=False)

    # Case E: column passed as a Column object
    print("Case E")
    storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", "")).show(truncate=False)

upvoted 3 times
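
The first sample row above contains "Description: " twice, which makes it a handy check of what the ^ anchor does. As a rough illustration only, the sketch below uses Python's built-in re module as a stand-in for Spark's regex engine. Spark uses Java regular expressions, so this is an analogy that should hold for a pattern this simple, not an actual Spark run. The helper name strip_prefix is made up for the example.

```python
import re

# Hypothetical helper for illustration only: it mimics, for one plain string,
# what regexp_replace(col, "^Description: ", "") is expected to do per row.
def strip_prefix(value: str) -> str:
    return re.sub(r"^Description: ", "", value)

row = "Description: This is a tech store. Description: This"

# Anchored pattern: only the occurrence at the start of the string is removed.
anchored = strip_prefix(row)

# Unanchored pattern: every occurrence is removed.
unanchored = re.sub(r"Description: ", "", row)

print(anchored)
print(unanchored)
```

With the anchor, the second "Description: " in the middle of the string is left alone, which is what the question asks for ("removed from the beginning"). Without it, both occurrences disappear.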

...

Dgohel

1 year, 11 months ago

The Databricks documentation gives the signature as regexp_replace(str, regexp, rep [, position]). You can debate between D and E, but the question clearly says to remove the pattern from the beginning of the string. If you take answer D, it treats "storeDescription" as a single constant string to match the pattern against, and it will return an empty string after Description for each row. So if the debate is between D and E, then E is the correct answer.

upvoted 2 times

...

zozoshanky

1 year, 11 months ago

E is the answer tested

upvoted 2 times

...

NickWerbung

2 years ago

Both D and E are correct.

upvoted 1 times

...

SonicBoom10C9

2 years, 2 months ago

Selected Answer: E

It's between D and E, and D is wrong as there is no replacement string expression (which is a required argument/parameter). Thus, E wins as the correct option.

upvoted 1 times

ZSun

2 years, 1 month ago

This explanation is completely wrong. Both D and E have a replacement expression; the only difference is how they reference the column being replaced. Both D and E are correct, but D works for PySpark 2.0. D and E both work in PySpark 3.0+. Period!

upvoted 7 times

...

ZSun

2 years, 1 month ago

I think your point, "there is no replacement string expression", really applies to option A. The only difference between A and E is whether the replacement string expression is supplied.

upvoted 1 times

...

sly75

2 years, 2 months ago

Selected Answer: E

The correct answer is indeed E. According to the PySpark docs, the syntax is regexp_replace(str, pattern, replacement), which means it is not a method of the Column object. storeDescription is a string field. See https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace

upvoted 2 times

...

pierre_grns

2 years, 2 months ago

Selected Answer: D

The correct answer is D. First, regexp_replace/regexp_extract come from sql.functions. They cannot be called directly on a Column object => B is incorrect. Second, regexp_replace/regexp_extract accept a STRING object as the first argument to specify the column. Check the documentation here: https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions => A, C, E are incorrect.

upvoted 2 times

sly75

2 years, 2 months ago

Almost right but it's not about "String object" but "String value". So the correct answer is indeed the answer E ;)

upvoted 2 times

...

4be8126

2 years, 2 months ago

Selected Answer: E

The correct answer is option E: storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", "")). This code block uses the withColumn() function to create a new column called storeDescription. It uses the regexp_replace() function to replace the pattern "^Description: " at the beginning of the string in the storeDescription column with an empty string. This effectively removes the pattern from the beginning of the string in each row of the column.

upvoted 4 times

4be8126

2 years, 2 months ago

The correct code block that returns a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of column storeDescription in DataFrame storesDF is:

A. storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: "))

This code uses the regexp_replace function to replace the pattern "^Description: " (which matches the string "Description: " at the beginning of the string) with an empty string in the column storeDescription. The resulting DataFrame will have the modified storeDescription column.

Option B has a syntax error because the regexp_replace function should be called on the column using the dot notation instead of passing it as the second argument.

Option C uses the regexp_extract function, which extracts a substring matching a regular expression pattern. It doesn't remove the pattern from the string.

Option D has a syntax error because the column name is not wrapped in the col function.

Option E is the same as option A, except that it uses the col function unnecessarily.

upvoted 1 times
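
On the point about option C: extracting and replacing are opposite operations, which a small sketch can make concrete. This is only an analogy using Python's built-in re module on a single string, not PySpark itself. In PySpark, regexp_extract(str, pattern, idx) also expects an integer group index as its third argument, which the "" in option C is not.

```python
import re

value = "Description: Store 1"
pattern = r"^Description: "

# "Extract" keeps only the matched text (group 0 is the whole match).
# An empty string on no match mirrors what Spark's regexp_extract returns.
match = re.search(pattern, value)
extracted = match.group(0) if match else ""

# "Replace" keeps everything except the matched text.
replaced = re.sub(pattern, "", value)

print(repr(extracted))
print(repr(replaced))
```

The extract-style call yields the prefix itself, while the replace-style call yields the description with the prefix removed, which is what the question is after.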

...

4be8126

2 years, 2 months ago

Selected Answer: A

Option A is correct: storesDF.withColumn("productCategories", explode(col("productCategories"))). Explanation: The explode function is used to transform a column of arrays or maps into multiple rows, one for each element in the array or map. In this case, productCategories is a column with arrays of strings. The withColumn function is used to add a new column or update an existing column. The first argument is the name of the new or existing column, and the second argument is the expression that defines the values for the column.

upvoted 1 times

sly75

2 years, 2 months ago

You got the wrong question :°

upvoted 2 times

...

ronfun

2 years, 3 months ago

Both D and E are correct answers.

upvoted 2 times

...

TC007

2 years, 3 months ago

Selected Answer: D

This should actually be D; sorry for the wrong answer. Refer to this: https://sparkbyexamples.com/pyspark/pyspark-replace-column-values/

upvoted 3 times

...

TC007

2 years, 3 months ago

Selected Answer: A

The regexp_replace function is used to remove the pattern "Description: " from the beginning of the column storeDescription. The ^ symbol indicates the beginning of the string, and the pattern "Description: " is replaced with an empty string. This results in a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of each cell in that column.

upvoted 1 times

4be8126

2 years, 2 months ago

Option A is incorrect because the regexp_replace function requires three arguments: the column to be transformed, the regular expression pattern to be replaced, and the replacement string. In the given code block, the column and the pattern are provided, but the replacement string is not. The correct syntax to use regexp_replace on a DataFrame column is regexp_replace(col(column_name), pattern, replacement), where col(column_name) specifies the DataFrame column to be transformed, pattern specifies the regular expression pattern to be replaced, and replacement specifies the new string to replace the matched pattern. Therefore, the correct code block to remove the pattern "Description: " from the beginning of the storeDescription column in DataFrame storesDF is: storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))

upvoted 2 times

...