Exam Certified Data Engineer Professional topic 1 question 17 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 17
Topic #: 1

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. A recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?

  • A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
  • B. Z-order indices calculated on the table are preventing file compaction
  • C. Bloom filter indices calculated on the table are preventing file compaction
  • D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
  • E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Suggested Answer: A
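
For context, the Auto Optimize and Auto Compaction features referenced in the scenario are controlled through Delta table properties. Below is a minimal, illustrative sketch of how they are typically enabled, assuming a hypothetical table name (this is editorial context, not part of the exam question):

```python
# Illustrative sketch: enable optimized writes and auto compaction on a Delta
# table, then run the one-off OPTIMIZE the question says was done at migration.
# `spark` is the ambient SparkSession in a Databricks notebook; the table name
# `prod.cdc_target` is hypothetical.
spark.sql("""
    ALTER TABLE prod.cdc_target SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

spark.sql("OPTIMIZE prod.cdc_target")
```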

Comments

cotardo2077
Highly Voted 1 year, 7 months ago
Selected Answer: A
https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table 'Autotune file size based on workload'
upvoted 13 times
meatpoof
3 months ago
Your source doesn't support your answer. It doesn't mention anything about autotuning to increase the speed of merges
upvoted 1 times
...
...
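The docs section cotardo2077 links ("Autotune file size based on workload") is tied to the delta.tuneFileSizesForRewrites table property, which targets smaller rewritten files for tables that are frequently hit by MERGE. A hedged sketch of setting and inspecting it (table name hypothetical):

```python
# Illustrative only: opt a table into workload-based file-size autotuning,
# which favors smaller files for MERGE-heavy tables. Hypothetical table name.
spark.sql("""
    ALTER TABLE prod.cdc_target SET TBLPROPERTIES (
        'delta.tuneFileSizesForRewrites' = 'true'
    )
""")

# Inspect the table's current properties to see which tuning knobs are set.
spark.sql("SHOW TBLPROPERTIES prod.cdc_target").show(truncate=False)
```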
JoG1221
Most Recent 1 week, 4 days ago
Selected Answer: E
Databricks has autotuned to a smaller target file size based on the amount of data in each partition
upvoted 1 times
...
JoG1221
1 week, 4 days ago
Selected Answer: D
File size tuning is not based on total table size or merge ops, but on partition-level dynamics.
upvoted 1 times
JoG1221
1 week, 4 days ago
Answer is E
upvoted 1 times
...
...
kishanu
3 weeks, 2 days ago
Selected Answer: E
Databricks Auto Optimize and Auto Compaction are designed to optimize file sizes dynamically for better performance and efficiency in Delta Lake. These features do not use a fixed target file size like 1 GB; instead they autotune file sizes based on partition-level characteristics. In this case, each partition has at least 1 GB of data and the overall table is large (10+ TB), yet you see many small files under 64 MB, which seems suboptimal at first. However, Databricks may intentionally use smaller file sizes within partitions when:
  • The data change rate is high (as in a streaming CDC feed); smaller files help with faster read times, reduced shuffle, and quicker MERGE operations during structured streaming.
  • The amount of new data added per batch or micro-batch is small, leading to many smaller files, especially when auto compaction determines this improves job performance at runtime.
This makes option E the most accurate description of what's happening.
upvoted 2 times
...
AlHerd
1 month ago
Selected Answer: A
An always-on Structured Streaming job that applies updates from a Change Data Capture (CDC) feed uses frequent MERGE operations to apply changes (inserts, updates, deletes) to the Delta table. Because these MERGE operations are constant and high-frequency, Databricks may autotune to a smaller target file size to reduce the duration and overhead of each merge. This behaviour is described explicitly in the documentation. So, with this in view, the correct answer is A
upvoted 1 times
...
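To ground the "frequent MERGE from a CDC feed" point above, here is a minimal sketch of the pattern the question describes: an always-on Structured Streaming job that applies each micro-batch of change events to the Delta table with MERGE. Source, table, and column names are hypothetical.

```python
# Hedged sketch of a streaming CDC apply: upsert every micro-batch into the
# target Delta table via MERGE. The names (prod.cdc_feed, prod.cdc_target,
# id, op) are hypothetical and only illustrate the pattern discussed here.
from delta.tables import DeltaTable

def apply_cdc_batch(batch_df, batch_id):
    target = DeltaTable.forName(spark, "prod.cdc_target")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdateAll(condition="s.op = 'UPDATE'")
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("prod.cdc_feed")
    .writeStream
    .foreachBatch(apply_cdc_batch)
    .option("checkpointLocation", "/checkpoints/cdc_target")  # path hypothetical
    .start())
```

Because every micro-batch rewrites files through MERGE, this is exactly the kind of workload for which Databricks may autotune to a smaller target file size.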
EZZALDIN
1 month ago
Selected Answer: E
The primary goal of Auto Optimize and Auto Compaction in a streaming job isn’t specifically to reduce MERGE duration. Instead, these features adjust file sizes based on the incremental volume of data being ingested in each micro‐batch within a partition. Even though each partition contains around 1 GB of data (from the original OPTIMIZE), the streaming job writes small batches that are compacted into smaller files (often under 64 MB) because that’s the amount of new data per batch. So, Option E is more accurate: Databricks auto-tunes the target file size based on the amount of data in each partition (from each micro-batch), not specifically to speed up MERGE operations.
upvoted 1 times
...
Tedet
2 months ago
Selected Answer: E
Option E is more accurate because Delta Lake’s Auto Optimize and Auto Compaction are designed to adjust file sizes based on the streaming data partitioning, which inherently leads to smaller files over time. The system auto-tunes file sizes as new, incremental data is ingested and partitioned. Option A is plausible, but optimizing file sizes for MERGE operations is not the core focus of Auto Optimize in this case. The system’s auto-tuning mechanism is more about managing file sizes based on the streaming data's partition size and maintaining efficient reads/writes, rather than directly optimizing for MERGE performance.
upvoted 3 times
...
Tedet
2 months ago
Selected Answer: A
Options and their behavior:
  • auto (recommended): Tunes target file size while respecting other autotuning functionality. Requires Databricks Runtime 10.4 LTS or above.
  • legacy: Alias for true. Requires Databricks Runtime 10.4 LTS or above.
  • true: Use 128 MB as the target file size. No dynamic sizing.
  • false: Turns off auto compaction. Can be set at the session level to override auto compaction for all Delta tables modified in the workload.
upvoted 1 times
...
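If the values quoted above are applied at the session level rather than as a table property, the relevant config key is spark.databricks.delta.autoCompact.enabled; a hedged sketch, assuming the auto/legacy/true/false values listed above apply to it:

```python
# Illustrative only: choose auto compaction behavior for all Delta tables
# modified in this session. 'auto' lets Databricks tune the target file size;
# 'false' turns auto compaction off for the session.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "auto")
```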
rollno1
2 months, 1 week ago
Selected Answer: E
MERGE operations are not the main update mechanism in this scenario; it's an incremental stream update, not a batch MERGE. Larger partitions often result in smaller file sizes because:
  • Frequent incremental writes cause small batch updates.
  • Compaction happens at the partition level, not globally.
upvoted 1 times
...
Melik3
8 months, 3 weeks ago
Selected Answer: A
It is important here to understand the difference between the partition size and the data files. The partition size is 1 GB, which is caused by OPTIMIZE and is also expected. Each partition contains data files. Databricks autotuned these data files and resized them to a smaller size to be able to run MERGE statements efficiently; that's why A is the correct answer.
upvoted 4 times
...
imatheushenrique
11 months ago
One of the purposes of an OPTIMIZE execution is the performance gain in MERGE operations, so: A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations.
upvoted 1 times
...
RiktRikt007
1 year, 2 months ago
How is A correct? While Databricks does have autotuning capabilities, it primarily considers the table size. In this case, the table is over 10 TB, which would typically lead to a target file size of 1 GB, not under 64 MB.
upvoted 2 times
...
PrashantTiwari
1 year, 2 months ago
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB. Correct answer is A
upvoted 2 times
...
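As a quick sanity check on the numbers PrashantTiwari quotes, the size-based curve can be written out. This is only a rough illustration of the documented thresholds (256 MB up to 2.56 TB, growing linearly to 1 GB at 10 TB), not Databricks' internal code:

```python
# Rough illustration of table-size-based autotuning: 256 MB below 2.56 TB,
# linear growth to 1 GB (1024 MB) at 10 TB, capped at 1 GB above that.
def autotuned_target_file_size_mb(table_size_tb: float) -> float:
    low_tb, high_tb = 2.56, 10.0
    low_mb, high_mb = 256.0, 1024.0
    if table_size_tb <= low_tb:
        return low_mb
    if table_size_tb >= high_tb:
        return high_mb
    frac = (table_size_tb - low_tb) / (high_tb - low_tb)
    return low_mb + frac * (high_mb - low_mb)

print(autotuned_target_file_size_mb(10.5))  # 1024.0 MB, i.e. 1 GB
```

For the 10+ TB table in the question, size-based tuning alone would therefore predict roughly 1 GB files; the sub-64 MB files are what points the A voters toward MERGE-workload autotuning instead.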
AziLa
1 year, 3 months ago
Correct answer is A.
upvoted 1 times
...
Jay_98_11
1 year, 3 months ago
Selected Answer: A
A is correct
upvoted 2 times
...
kz_data
1 year, 3 months ago
Selected Answer: A
correct answer is A
upvoted 1 times
...
BIKRAM063
1 year, 5 months ago
Selected Answer: A
Auto Optimize reduces file sizes to less than 128 MB to facilitate quick MERGE operations.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
