Exam Certified Data Engineer Professional topic 1 question 36 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 36
Topic #: 1

A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20
Which statement describes how data will be filtered?

  • A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
  • B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
  • C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
  • D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
  • E. The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
Suggested Answer: D
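
To make the scenario concrete, here is a minimal PySpark sketch of the table and query from the question (the table name posts and the session setup are assumptions, not given in the question). Because the filter is on longitude rather than the partition column date, partition pruning alone cannot narrow the scan; skipping relies on the per-file column statistics recorded in the Delta log, which is why D is the suggested answer.

```python
# Minimal sketch of the question's scenario; table name "posts" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS posts (
        user_id   LONG,
        post_text STRING,
        post_id   STRING,
        longitude FLOAT,
        latitude  FLOAT,
        post_time TIMESTAMP,
        date      DATE
    )
    USING DELTA
    PARTITIONED BY (date)
""")

# The filter targets longitude, not the partition column "date", so the
# engine cannot prune partitions. Instead it consults the per-file
# min/max statistics in the Delta transaction log and skips any data
# file whose longitude range cannot overlap (-20, 20).
df = spark.sql("SELECT * FROM posts WHERE longitude > -20 AND longitude < 20")
```
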

Comments

Enduresoul
Highly Voted 1 year ago
Selected Answer: D
D is correct. A partition can include multiple files. And the statistics are collected for each file.
upvoted 7 times
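
A quick way to see what Enduresoul describes (a partition holding multiple files, with statistics kept per file) is to read a commit file under the table's _delta_log directory: each add action records one data file along with a stats JSON string containing per-column minValues and maxValues. A minimal sketch, with a hypothetical table path:

```python
# Sketch: print per-file longitude min/max from a Delta commit file.
# The log path below is hypothetical; adjust it to your table location.
import json

log_file = "/data/posts/_delta_log/00000000000000000000.json"

with open(log_file) as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:  # one "add" action per data file written
            add = action["add"]
            stats = json.loads(add["stats"])  # stats are stored as a JSON string
            print(add["path"],
                  stats["minValues"]["longitude"],
                  stats["maxValues"]["longitude"])
```
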
AndreFR
Most Recent 3 months ago
Selected Answer: D
Min and max values of each Parquet file are stored in the Delta log. Delta data skipping automatically collects stats (min, max, etc.) for the first 32 columns of each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files and speed up queries. https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
upvoted 1 times
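
The 32-column default that AndreFR cites is governed by the Delta table property delta.dataSkippingNumIndexedCols. A short sketch of tuning it and running the question's filter, reusing the hypothetical posts table from the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collect statistics for the first 7 columns only (this table has 7),
# instead of the default 32; "posts" is the hypothetical table above.
spark.sql("""
    ALTER TABLE posts
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '7')
""")

# At query time the engine compares each file's recorded longitude
# min/max against the predicate and opens only files whose range can
# overlap (-20, 20); everything else is skipped using the log alone.
spark.sql(
    "SELECT count(*) FROM posts WHERE longitude > -20 AND longitude < 20"
).show()
```
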
AziLa
10 months ago
Correct answer is D
upvoted 2 times
Quadronoid
1 year ago
Selected Answer: C
I guess option C is right: since the transaction log contains max/min values for the first 32 columns, it can be used to filter files.
upvoted 1 times
Quadronoid
1 year ago
I reread the question and think I made a mistake: option C talks about row-level statistics, but, I guess, the statistics in the Delta log are more about columns. So now D looks fine to me.
upvoted 3 times
sturcu
1 year, 1 month ago
Selected Answer: D
D is correct
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other