An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

  • A. Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
  • B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
  • C. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
  • D. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
  • E. Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
1 year, 4 months ago
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
Explanation: The .dropDuplicates(["customer_id","order_id"]) call runs on the data loaded from the Parquet files, so only one record per composite key (customer_id, order_id) is retained in the DataFrame before it is written to the orders table. It does not look at records that already exist in the orders table; it only removes duplicates within the current batch. If a record with the same key was written by a previous batch, it stays in the table. The newly written records are therefore unique within their own batch, but they may duplicate records already present in the target table.
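The batch-only deduplication described above can be illustrated with a small pure-Python model. This is a sketch, not Spark code: the list `orders` stands in for the target table, and `drop_duplicates` and `append_batch` are hypothetical helpers that mimic deduplicating a batch on the composite key and then appending it. The model keeps the first row per key; Spark does not promise which row survives.

```python
# Pure-Python model of "dedup within the batch, then append"; not Spark code.
# Rows are dicts and the target table is a plain list (hypothetical stand-in).

def drop_duplicates(rows, keys):
    """Keep the first row seen for each key tuple within one batch."""
    seen = set()
    unique = []
    for row in rows:
        k = tuple(row[c] for c in keys)
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

def append_batch(table, batch, keys):
    """Deduplicate the batch only, then append it without checking the table."""
    table.extend(drop_duplicates(batch, keys))

KEYS = ["customer_id", "order_id"]
orders = []

# Day 1: order (1, 100) arrives twice within the same day.
day1 = [
    {"customer_id": 1, "order_id": 100},
    {"customer_id": 1, "order_id": 100},
    {"customer_id": 2, "order_id": 200},
]
# Day 2: the upstream system re-sends order (1, 100) after the date rolls over.
day2 = [
    {"customer_id": 1, "order_id": 100},
    {"customer_id": 3, "order_id": 300},
]

append_batch(orders, day1, KEYS)
append_batch(orders, day2, KEYS)

# Count how many times each composite key ended up in the table.
key_counts = {}
for row in orders:
    k = (row["customer_id"], row["order_id"])
    key_counts[k] = key_counts.get(k, 0) + 1
print(key_counts)
```

Each batch is unique on its own, yet key (1, 100) ends up in the table twice because the second append never looks at what is already stored.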
1 week, 5 days ago
The question doesn't say that the orders table already exists. Arekm's answer is more correct.
1 month ago
Selected Answer: B
There are no duplicates in the current batch; that is obvious. Duplicates can still occur because the source occasionally produces duplicates hours apart. One record can be generated by the source and processed on day 1, and its duplicate on day 2. Since there is no logic checking whether the corresponding record already exists in the target, you get duplicates there, given that append mode is used.
1 month, 4 weeks ago
Selected Answer: B
Yes, B is the correct answer, because the current code only looks for duplicates within the current DataFrame based on the composite key, not for duplicates that are already in the target table. To insert only the rows that are not yet in the target table, we can use the Databricks MERGE INTO statement.
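The insert-only merge idea mentioned here can be sketched in plain Python as well. This is a model of the semantics, not Databricks code: `merge_insert_only` is a hypothetical helper that plays the role of a MERGE INTO with only a "when not matched, insert" clause, matching on the composite key. As a simplifying assumption, the model also skips keys repeated inside the batch; with a real merge it is safer to assume the batch still needs its own deduplication first.

```python
# Pure-Python model of an insert-only merge on a composite key; not Spark code.

def merge_insert_only(table, batch, keys):
    """Append only those batch rows whose key is not already in the table."""
    existing = {tuple(r[c] for c in keys) for r in table}
    inserted = 0
    for row in batch:
        k = tuple(row[c] for c in keys)
        if k not in existing:
            existing.add(k)  # assumption: repeated keys in the batch are skipped too
            table.append(row)
            inserted += 1
    return inserted

KEYS = ["customer_id", "order_id"]

# Target table as left by an earlier run (hypothetical contents).
target = [
    {"customer_id": 1, "order_id": 100},
    {"customer_id": 2, "order_id": 200},
]
# New batch: (1, 100) is a late duplicate, (3, 300) is genuinely new.
new_batch = [
    {"customer_id": 1, "order_id": 100},
    {"customer_id": 3, "order_id": 300},
]

rows_inserted = merge_insert_only(target, new_batch, KEYS)
print(rows_inserted, len(target))
```

Unlike a blind append, the late duplicate of (1, 100) is dropped because its key already exists in the target.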
4 months, 2 weeks ago
Selected Answer: B
Append mode does not take any key in the target table into consideration; it simply adds all rows of the input table to the target table.
7 months, 2 weeks ago
Yes it should be B
8 months ago
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table. Using merge, this problem would not happen.
11 months ago
Selected Answer: B
B is the right answer. The above code only removes duplicates from the batch being processed; no logic is applied to records that are already saved.
1 year ago
Selected Answer: B
B is correct
1 year, 1 month ago
Selected Answer: B
Answer B
1 year, 1 month ago
Selected Answer: B
B is correct
1 year, 2 months ago
correct B
1 year, 3 months ago
Selected Answer: B
1 year, 3 months ago
Correct. B
1 year, 5 months ago
Selected Answer: B
