Exam Certified Data Engineer Professional topic 1 question 36 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 36
Topic #: 1

A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20
Which statement describes how data will be filtered?

  • A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
  • B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
  • C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
  • D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
  • E. The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
Suggested Answer: D
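To make the idea behind answer D concrete, here is a minimal sketch in plain Python of how per-file min/max statistics could be used to decide which data files might contain rows with longitude between -20 and 20. It is not Delta Lake's actual implementation: the `FileStats` class, the `may_contain_match` helper, the file paths, and all values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max statistics for the longitude column."""
    path: str
    min_longitude: float
    max_longitude: float


def may_contain_match(stats: FileStats, low: float, high: float) -> bool:
    """Return True if the file might hold a row with low < longitude < high.

    A file can be skipped only when its [min, max] range cannot overlap
    the open interval (low, high).
    """
    return stats.max_longitude > low and stats.min_longitude < high


# Made-up files inside date partitions; the values are illustrative only.
files = [
    FileStats("date=2024-01-01/part-0.parquet", -120.5, -75.0),
    FileStats("date=2024-01-01/part-1.parquet", -30.0, 5.0),
    FileStats("date=2024-01-02/part-0.parquet", 25.0, 140.0),
    FileStats("date=2024-01-02/part-1.parquet", -10.0, 15.0),
]

# The filter is on longitude, not on the partition column, so the decision
# is made per file rather than per partition.
files_to_read = [f.path for f in files if may_contain_match(f, -20.0, 20.0)]
print(files_to_read)
```

In this sketch, one file is skipped in each date partition. The files that remain only might contain matching rows, so they still have to be read and filtered row by row.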

Comments

Enduresoul
Highly Voted 1 year, 2 months ago
Selected Answer: D
D is correct. A partition can include multiple files. And the statistics are collected for each file.
upvoted 10 times
...
AlejandroU
Most Recent 1 month, 3 weeks ago
Selected Answer: B
Answer B. Single Comparison Filter (e.g., latitude > 66.3): File skipping is highly efficient because Delta can use min/max statistics to directly eliminate files that don't meet the condition. Range Filters (e.g., longitude < 20 AND longitude > -20): File skipping is still possible but less efficient, because Delta has to evaluate whether any records in the file might meet the condition, even if the min and max values of the column in the file overlap with the filter range. So in summary, file skipping works best with single comparisons like latitude > 66.3 but is less effective with range filters like longitude < 20 AND longitude > -20.
upvoted 1 times
...
Sriramiyer92
1 month, 3 weeks ago
Selected Answer: D
Do not confuse options C and D. The given answer is correct.
upvoted 1 times
...
hebied
2 months, 1 week ago
Selected Answer: D
D is more suitable
upvoted 1 times
...
AndreFR
5 months, 2 weeks ago
Selected Answer: D
Min and max values of each Parquet file are stored in the Delta log. Delta data skipping automatically collects the stats (min, max, etc.) for the first 32 columns of each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files in order to speed up queries. https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
upvoted 2 times
...
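As a rough illustration of the per-file statistics described in the comment above, the sketch below parses a single hypothetical `add` action shaped like an entry in a Delta transaction log and extracts the min/max range for one column. The path and all values are made up, the entry is simplified (real entries carry more fields), and the `column_range` helper is an assumption for illustration, not part of any Delta Lake API.

```python
import json

# Hypothetical, simplified "add" action shaped like a Delta transaction log entry.
# The path and all values are made up for illustration.
log_line = json.dumps({
    "add": {
        "path": "date=2024-01-01/part-0.parquet",
        "partitionValues": {"date": "2024-01-01"},
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"longitude": -30.0, "latitude": 10.5},
            "maxValues": {"longitude": 5.0, "latitude": 48.2},
            "nullCount": {"longitude": 0, "latitude": 0},
        }),
    }
})


def column_range(line: str, column: str):
    """Return (min, max) for a column from one add action, or None if absent."""
    action = json.loads(line).get("add")
    if action is None or not action.get("stats"):
        return None
    stats = json.loads(action["stats"])  # stats is itself a JSON-encoded string
    mins = stats.get("minValues", {})
    maxs = stats.get("maxValues", {})
    if column not in mins or column not in maxs:
        return None
    return mins[column], maxs[column]


longitude_range = column_range(log_line, "longitude")
print(longitude_range)
```

Note that the statistics describe a whole file, one min and one max per column, which is why option C's "row-level statistics" wording does not fit.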
AziLa
1 year ago
The correct answer is D.
upvoted 2 times
...
Quadronoid
1 year, 3 months ago
Selected Answer: C
I think option C is right: the transaction log contains the min/max values of the first 32 columns, which can be used to filter files.
upvoted 1 times
Quadronoid
1 year, 3 months ago
I reread the question and think I made a mistake. Option C refers to row-level statistics, but I believe the statistics in the Delta log are more or less at the column level. So D now looks right to me.
upvoted 4 times
...
...
sturcu
1 year, 3 months ago
Selected Answer: D
D is Correct
upvoted 3 times
...