
Certified Associate Developer for Apache Spark: Topic 1, Question 30 discussion

Which of the following operations fails to return a DataFrame with no duplicate rows?

  • A. DataFrame.dropDuplicates()
  • B. DataFrame.distinct()
  • C. DataFrame.drop_duplicates()
  • D. DataFrame.drop_duplicates(subset = None)
  • E. DataFrame.drop_duplicates(subset = "all")
Suggested Answer: E

Comments

4be8126
Highly Voted 1 year, 7 months ago
Selected Answer: E
A. DataFrame.dropDuplicates(): returns a new DataFrame with distinct rows based on all columns, so it returns a DataFrame with no duplicate rows.
B. DataFrame.distinct(): also returns a new DataFrame with distinct rows based on all columns.
C. DataFrame.drop_duplicates(): an alias for DataFrame.dropDuplicates(), so it also returns a DataFrame with no duplicate rows.
D. DataFrame.drop_duplicates(subset=None): equivalent to the default call; deduplicates based on all columns.
E. DataFrame.drop_duplicates(subset="all"): attempts to drop duplicates but raises an error, because "all" is not a valid argument for the subset parameter. This is the operation that fails to return a DataFrame with no duplicate rows.
Therefore, the correct answer is E.
upvoted 10 times
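A minimal sketch of this breakdown (the data and app name are illustrative; assumes a local PySpark session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()
# Tiny DataFrame with one duplicated row.
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

# A, B, C, and D all deduplicate on every column and agree with each other.
assert df.dropDuplicates().count() == 2              # A
assert df.distinct().count() == 2                    # B
assert df.drop_duplicates().count() == 2             # C (alias of A)
assert df.drop_duplicates(subset=None).count() == 2  # D (same as the default)

# E raises an error in PySpark, because subset must be a list or tuple
# of column names, not a plain string:
# df.drop_duplicates(subset="all")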
TC007
Highly Voted 1 year, 8 months ago
Selected Answer: E
Option E fails because "all" is not a valid value for the subset parameter of drop_duplicates(): in PySpark, subset must be a list (or tuple) of column names identifying the columns used to detect duplicates. All other options (A, B, C, and D) return a DataFrame with no duplicate rows. The dropDuplicates(), distinct(), and drop_duplicates() methods are equivalent and return a new DataFrame with distinct rows. drop_duplicates() also accepts a subset parameter to specify the columns used for identifying duplicates; when subset is not given, all columns are used. So options A and C are valid, and option D is valid as well, since it is equivalent to drop_duplicates() with no subset argument.
upvoted 10 times
smd_
3 months, 3 weeks ago
Bro, the question asks which operation fails to return a DataFrame with no duplicate rows, so the correct answer is the one that does not produce a deduplicated result, and (E) is exactly that. Focus on the question.
upvoted 1 times
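To illustrate TC007's point about subset, a hedged sketch of valid usage (the data and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subset-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "val"])

# subset takes a list (or tuple) of column names, not a bare string.
df.dropDuplicates(["id"]).count()                 # 2: one arbitrary row kept per id
df.drop_duplicates(subset=["id", "val"]).count()  # 3: no full-row duplicates here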
jds0
Most Recent 4 months ago
Selected Answer: E
It's E, see the code below:

# Drop duplicates in a DataFrame
from pyspark.sql import SparkSession
from pyspark.errors import PySparkTypeError

spark = SparkSession.builder.appName("MyApp").getOrCreate()

data = [
    (0, 43161),
    (0, 43161),
    (1, 51200),
    (2, None),
    (2, None),
    (3, 78367),
    (4, None),
]
storesDF = spark.createDataFrame(data, ["storeID", "sqft"])

try:
    storesDF.dropDuplicates().show()
except PySparkTypeError as e:
    print(e)

try:
    storesDF.distinct().show()
except PySparkTypeError as e:
    print(e)

try:
    storesDF.drop_duplicates().show()
except PySparkTypeError as e:
    print(e)

try:
    storesDF.drop_duplicates(subset=None).show()
except PySparkTypeError as e:
    print(e)

try:
    storesDF.drop_duplicates(subset="all").show()
except PySparkTypeError as e:
    print(e)
upvoted 2 times
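For reference: on PySpark 3.4+ (where pyspark.errors.PySparkTypeError exists), the first four calls should each print the same five-row deduplicated result, and only the last call should raise the [NOT_LIST_OR_TUPLE] error reported in the comments below.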
dbdantas
7 months, 2 weeks ago
Selected Answer: E
The answer is E.
upvoted 1 times
dbdantas
7 months, 2 weeks ago
Selected Answer: E
E. It raises PySparkTypeError: [NOT_LIST_OR_TUPLE] Argument `subset` should be a list or tuple, got str.
upvoted 1 times
azurearch
8 months, 3 weeks ago
DataFrame.drop_duplicates(subset = "all") looks like pandas usage: in pandas, subset can be a single column label, but PySpark requires a list or tuple of column names.
upvoted 1 times
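For contrast, a pandas sketch (even in pandas, subset="all" only works if a column literally named "all" exists; the data here is illustrative):

import pandas as pd

pdf = pd.DataFrame({"storeID": [0, 0, 1], "sqft": [43161, 43161, 51200]})
pdf.drop_duplicates()                  # dedup on all columns, as in Spark
pdf.drop_duplicates(subset="storeID")  # a single label is allowed in pandas
# pdf.drop_duplicates(subset="all")    # KeyError here too: no column "all"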
azurearch
8 months, 3 weeks ago
Option E. df.drop_duplicates(subset = "all") raises PySparkTypeError: [NOT_LIST_OR_TUPLE] Argument `subset` should be a list or tuple, got str.
upvoted 2 times
cookiemonster42
1 year, 3 months ago
Selected Answer: B
B is the right one; as TC007 said, the argument for drop_duplicates is a subset of columns. From the PySpark docs:

DataFrame.dropDuplicates(subset: Optional[List[str]] = None) → pyspark.sql.dataframe.DataFrame

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. For a static batch DataFrame, it just drops duplicate rows. For a streaming DataFrame, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark() to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates. drop_duplicates() is an alias for dropDuplicates().

Parameters: subset : list of column names, optional. List of columns to use for duplicate comparison (default: all columns).
upvoted 1 times
cookiemonster42
1 year, 3 months ago
OMG, I got it all wrong, the answer is E :)
upvoted 3 times
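On the streaming case that docstring mentions, a hedged sketch of bounded-state deduplication (the built-in "rate" source and the 10-minute watermark are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-dedup-sketch").getOrCreate()

# The "rate" source emits two columns: timestamp, value.
events = spark.readStream.format("rate").load()

# With a watermark, Spark can drop dedup state older than the threshold
# instead of keeping all rows seen so far.
deduped = (events
           .withWatermark("timestamp", "10 minutes")
           .dropDuplicates(["value", "timestamp"]))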
ItsAB
1 year, 4 months ago
The correct answer is E.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other.