Exam Certified Data Engineer Professional topic 1 question 36 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 36
Topic #: 1

A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20
Which statement describes how data will be filtered?

  • A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
  • B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
  • C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
  • D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
  • E. The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
Suggested Answer: D
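
To make the scenario concrete, here is a minimal PySpark sketch of the table and query from the question (the table name posts and the session setup are assumptions, not given in the question). Because the filter is on longitude rather than the partition column date, partition pruning alone cannot narrow the scan; skipping relies on the per-file column statistics recorded in the Delta log, which is why D is the suggested answer.

```python
# Minimal sketch of the question's scenario; table name "posts" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS posts (
        user_id   LONG,
        post_text STRING,
        post_id   STRING,
        longitude FLOAT,
        latitude  FLOAT,
        post_time TIMESTAMP,
        date      DATE
    )
    USING DELTA
    PARTITIONED BY (date)
""")

# The filter targets longitude, not the partition column "date", so the
# engine cannot prune partitions. Instead it consults the per-file
# min/max statistics in the Delta transaction log and skips any data
# file whose longitude range cannot overlap (-20, 20).
df = spark.sql("SELECT * FROM posts WHERE longitude > -20 AND longitude < 20")
```
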

Comments

Enduresoul
Highly Voted 1 year ago
Selected Answer: D
D is correct. A partition can include multiple files. And the statistics are collected for each file.
upvoted 7 times
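
A quick way to see what Enduresoul describes (a partition holding multiple files, with statistics kept per file) is to read a commit file under the table's _delta_log directory: each add action records one data file along with a stats JSON string containing per-column minValues and maxValues. A minimal sketch, with a hypothetical table path:

```python
# Sketch: print per-file longitude min/max from a Delta commit file.
# The log path below is hypothetical; adjust it to your table location.
import json

log_file = "/data/posts/_delta_log/00000000000000000000.json"

with open(log_file) as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:  # one "add" action per data file written
            add = action["add"]
            stats = json.loads(add["stats"])  # stats are stored as a JSON string
            print(add["path"],
                  stats["minValues"]["longitude"],
                  stats["maxValues"]["longitude"])
```
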
AndreFR
Most Recent 3 months ago
Selected Answer: D
Min and max values of each Parquet file are stored in the Delta log. Delta data skipping automatically collects stats (min, max, etc.) for the first 32 columns of each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files and speed up queries. https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
upvoted 1 times
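
The 32-column default that AndreFR cites is governed by the Delta table property delta.dataSkippingNumIndexedCols. A short sketch of tuning it and running the question's filter, reusing the hypothetical posts table from the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect statistics for the first 7 columns only (this table has 7),
# instead of the default 32; "posts" is the hypothetical table above.
spark.sql("""
    ALTER TABLE posts
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '7')
""")

# At query time the engine compares each file's recorded longitude
# min/max against the predicate and opens only files whose range can
# overlap (-20, 20); everything else is skipped using the log alone.
spark.sql(
    "SELECT count(*) FROM posts WHERE longitude > -20 AND longitude < 20"
).show()
```
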
AziLa
10 months ago
Correct answer is D
upvoted 2 times
Quadronoid
1 year ago
Selected Answer: C
I guess option C is right: since the transaction log contains max/min values for the first 32 columns, it can be used to filter files.
upvoted 1 times
Quadronoid
1 year ago
I reread the question and think I made a mistake: option C talks about row-level statistics, but, I guess, the statistics in the Delta log are more about columns. So now D looks fine to me.
upvoted 3 times
sturcu
1 year, 1 month ago
Selected Answer: D
D is correct
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other