Exam Certified Data Engineer Professional topic 1 question 17 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 17
Topic #: 1

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. A recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

  • A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
  • B. Z-order indices calculated on the table are preventing file compaction
  • C. Bloom filter indices calculated on the table are preventing file compaction
  • D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
  • E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Suggested Answer: A
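
For context, the Auto Optimize and Auto Compaction features referenced in the scenario are controlled through Delta table properties. Below is a minimal, illustrative sketch of how they are typically enabled, assuming a hypothetical table name (this is editorial context, not part of the exam question):

```python
# Illustrative sketch: enable optimized writes and auto compaction on a Delta
# table, then run the one-off OPTIMIZE the question says was done at migration.
# `spark` is the ambient SparkSession in a Databricks notebook; the table name
# `prod.cdc_target` is hypothetical.
spark.sql("""
    ALTER TABLE prod.cdc_target SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

spark.sql("OPTIMIZE prod.cdc_target")
```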

Comments

cotardo2077
Highly Voted 1 year, 7 months ago
Selected Answer: A
https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table 'Autotune file size based on workload'
upvoted 13 times
meatpoof
3 months ago
Your source doesn't support your answer. It doesn't mention anything about autotuning to increase the speed of merges
upvoted 1 times
...
...
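The docs section cotardo2077 links ("Autotune file size based on workload") is tied to the delta.tuneFileSizesForRewrites table property, which targets smaller rewritten files for tables that are frequently hit by MERGE. A hedged sketch of setting and inspecting it (table name hypothetical):

```python
# Illustrative only: opt a table into workload-based file-size autotuning,
# which favors smaller files for MERGE-heavy tables. Hypothetical table name.
spark.sql("""
    ALTER TABLE prod.cdc_target SET TBLPROPERTIES (
        'delta.tuneFileSizesForRewrites' = 'true'
    )
""")

# Inspect the table's current properties to see which tuning knobs are set.
spark.sql("SHOW TBLPROPERTIES prod.cdc_target").show(truncate=False)
```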
JoG1221
Most Recent 1 week, 4 days ago
Selected Answer: E
Databricks has autotuned to a smaller target file size based on the amount of data in each partition
upvoted 1 times
...
JoG1221
1 week, 4 days ago
Selected Answer: D
File size tuning is not based on total table size or merge ops, but on partition-level dynamics.
upvoted 1 times
JoG1221
1 week, 4 days ago
Answer is E
upvoted 1 times
...
...
kishanu
3 weeks, 2 days ago
Selected Answer: E
Databricks Auto Optimize and Auto Compaction are designed to optimize file sizes dynamically for better performance and efficiency in Delta Lake. These features do not use a fixed target file size like 1 GB; instead they autotune file sizes based on partition-level characteristics. In this case, each partition has at least 1 GB of data and the overall table is large (10+ TB), yet you see many small files under 64 MB, which seems suboptimal at first. However, Databricks may intentionally use smaller file sizes within partitions when:
  • The data change rate is high (as in a streaming CDC feed); smaller files help with faster read times, reduced shuffle, and quicker MERGE operations during structured streaming.
  • The amount of new data added per batch or micro-batch is small, leading to many smaller files, especially when auto compaction determines this improves job performance at runtime.
This makes option E the most accurate description of what's happening.
upvoted 2 times
...
AlHerd
1 month ago
Selected Answer: A
An always-on Structured Streaming job that applies updates from a Change Data Capture (CDC) feed uses frequent MERGE operations to apply changes (inserts, updates, deletes) to the Delta table. Because these MERGE operations are constant and high-frequency, Databricks may autotune to a smaller target file size to reduce the duration and overhead of each merge. This behaviour is described explicitly in the documentation. So, with this in view, the correct answer is A
upvoted 1 times
...
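To ground the "frequent MERGE from a CDC feed" point above, here is a minimal sketch of the pattern the question describes: an always-on Structured Streaming job that applies each micro-batch of change events to the Delta table with MERGE. Source, table, and column names are hypothetical.

```python
# Hedged sketch of a streaming CDC apply: upsert every micro-batch into the
# target Delta table via MERGE. The names (prod.cdc_feed, prod.cdc_target,
# id, op) are hypothetical and only illustrate the pattern discussed here.
from delta.tables import DeltaTable

def apply_cdc_batch(batch_df, batch_id):
    target = DeltaTable.forName(spark, "prod.cdc_target")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdateAll(condition="s.op = 'UPDATE'")
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("prod.cdc_feed")
    .writeStream
    .foreachBatch(apply_cdc_batch)
    .option("checkpointLocation", "/checkpoints/cdc_target")  # path hypothetical
    .start())
```

Because every micro-batch rewrites files through MERGE, this is exactly the kind of workload for which Databricks may autotune to a smaller target file size.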
EZZALDIN
1 month ago
Selected Answer: E
The primary goal of Auto Optimize and Auto Compaction in a streaming job isn’t specifically to reduce MERGE duration. Instead, these features adjust file sizes based on the incremental volume of data being ingested in each micro‐batch within a partition. Even though each partition contains around 1 GB of data (from the original OPTIMIZE), the streaming job writes small batches that are compacted into smaller files (often under 64 MB) because that’s the amount of new data per batch. So, Option E is more accurate: Databricks auto-tunes the target file size based on the amount of data in each partition (from each micro-batch), not specifically to speed up MERGE operations.
upvoted 1 times
...
Tedet
2 months ago
Selected Answer: E
Option E is more accurate because Delta Lake’s Auto Optimize and Auto Compaction are designed to adjust file sizes based on the streaming data partitioning, which inherently leads to smaller files over time. The system auto-tunes file sizes as new, incremental data is ingested and partitioned. Option A is plausible, but optimizing file sizes for MERGE operations is not the core focus of Auto Optimize in this case. The system’s auto-tuning mechanism is more about managing file sizes based on the streaming data's partition size and maintaining efficient reads/writes, rather than directly optimizing for MERGE performance.
upvoted 3 times
...
Tedet
2 months ago
Selected Answer: A
Options and their behavior:
  • auto (recommended): Tunes target file size while respecting other autotuning functionality. Requires Databricks Runtime 10.4 LTS or above.
  • legacy: Alias for true. Requires Databricks Runtime 10.4 LTS or above.
  • true: Use 128 MB as the target file size. No dynamic sizing.
  • false: Turns off auto compaction. Can be set at the session level to override auto compaction for all Delta tables modified in the workload.
upvoted 1 times
...
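If the values quoted above are applied at the session level rather than as a table property, the relevant config key is spark.databricks.delta.autoCompact.enabled; a hedged sketch, assuming the auto/legacy/true/false values listed above apply to it:

```python
# Illustrative only: choose auto compaction behavior for all Delta tables
# modified in this session. 'auto' lets Databricks tune the target file size;
# 'false' turns auto compaction off for the session.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "auto")
```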
rollno1
2 months, 1 week ago
Selected Answer: E
MERGE operations are not the main update mechanism in this scenario; it's an incremental stream update, not a batch MERGE. Larger partitions often result in smaller file sizes because:
  • Frequent incremental writes cause small batch updates.
  • Compaction happens at the partition level, not globally.
upvoted 1 times
...
Melik3
8 months, 3 weeks ago
Selected Answer: A
It is important here to understand the difference between the partition size and the data files. The partition size is 1 GB, which is caused by OPTIMIZE and is also expected. Each partition contains data files. Databricks autotuned these data files and resized them to a smaller size to be able to run MERGE statements efficiently; that's why A is the correct answer.
upvoted 4 times
...
imatheushenrique
11 months ago
One of the purposes of an OPTIMIZE execution is the performance gain in MERGE operations, so: A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations.
upvoted 1 times
...
RiktRikt007
1 year, 2 months ago
How is A correct? While Databricks does have autotuning capabilities, it primarily considers the table size. In this case, the table is over 10 TB, which would typically lead to a target file size of 1 GB, not under 64 MB.
upvoted 2 times
...
PrashantTiwari
1 year, 2 months ago
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB. Correct answer is A
upvoted 2 times
...
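As a quick sanity check on the numbers PrashantTiwari quotes, the size-based curve can be written out. This is only a rough illustration of the documented thresholds (256 MB up to 2.56 TB, growing linearly to 1 GB at 10 TB), not Databricks' internal code:

```python
# Rough illustration of table-size-based autotuning: 256 MB below 2.56 TB,
# linear growth to 1 GB (1024 MB) at 10 TB, capped at 1 GB above that.
def autotuned_target_file_size_mb(table_size_tb: float) -> float:
    low_tb, high_tb = 2.56, 10.0
    low_mb, high_mb = 256.0, 1024.0
    if table_size_tb <= low_tb:
        return low_mb
    if table_size_tb >= high_tb:
        return high_mb
    frac = (table_size_tb - low_tb) / (high_tb - low_tb)
    return low_mb + frac * (high_mb - low_mb)

print(autotuned_target_file_size_mb(10.5))  # 1024.0 MB, i.e. 1 GB
```

For the 10+ TB table in the question, size-based tuning alone would therefore predict roughly 1 GB files; the sub-64 MB files are what points the A voters toward MERGE-workload autotuning instead.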
AziLa
1 year, 3 months ago
Correct answer is A.
upvoted 1 times
...
Jay_98_11
1 year, 3 months ago
Selected Answer: A
A is correct
upvoted 2 times
...
kz_data
1 year, 3 months ago
Selected Answer: A
correct answer is A
upvoted 1 times
...
BIKRAM063
1 year, 5 months ago
Selected Answer: A
Auto Optimize reduces file sizes to less than 128 MB to facilitate quick MERGE operations.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
