Exam Certified Data Engineer Professional topic 1 question 67 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 67
Topic #: 1

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

  • A. Delta Lake statistics are not optimized for free text fields with high cardinality.
  • B. Text data cannot be stored with Delta Lake.
  • C. ZORDER ON review will need to be run to see performance gains.
  • D. The Delta log creates a term matrix for free text fields to support selective filtering.
  • E. Delta Lake statistics are only collected on the first 4 columns in a table.
Suggested Answer: A

Comments

aragorn_brego
Highly Voted 1 year ago
Selected Answer: A
Delta Lake uses file-level statistics and data skipping to improve query performance, but these optimizations are most effective for columns with low to medium cardinality (i.e., columns with a limited set of distinct values). Free-form text fields such as the review column typically have high cardinality: each review text is unique or nearly unique. Consequently, statistics on such columns do not significantly improve the performance of queries that search for specific keywords within the text.
upvoted 5 times
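As a rough illustration of the reasoning in this comment, here is a toy model in plain Python. It is not Delta Lake's implementation: the file contents and the names `FileStats`, `collect_stats`, `can_skip_equality`, and `can_skip_contains` are hypothetical, and the only assumption taken from the discussion is that data skipping relies on per-file minimum and maximum values. Under that assumption, an equality filter on `rating` can rule files out, while a keyword search inside `review` cannot.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max values, loosely modelled on data-skipping statistics."""
    min_value: object
    max_value: object


def collect_stats(values):
    # Summarise one (hypothetical) data file by its smallest and largest value.
    return FileStats(min(values), max(values))


def can_skip_equality(stats, target):
    # An equality filter can skip a file when the target lies outside [min, max].
    return target < stats.min_value or target > stats.max_value


def can_skip_contains(stats, keyword):
    # A substring filter gets no help from min/max: the keyword may appear
    # anywhere inside a string that sorts anywhere within the range.
    return False


# Hypothetical files: one list per file, holding that file's column values.
rating_files = [[1.0, 1.5, 2.0], [2.5, 3.0], [4.5, 5.0]]
review_files = [
    ["awful battery", "broke quickly"],
    ["great value", "zippy and cheap battery"],
]

skipped_rating = sum(can_skip_equality(collect_stats(f), 5.0) for f in rating_files)
skipped_review = sum(can_skip_contains(collect_stats(f), "battery") for f in review_files)

print(f"rating = 5.0: skipped {skipped_rating} of {len(rating_files)} files")
print(f"review contains 'battery': skipped {skipped_review} of {len(review_files)} files")
```

In this sketch every review file still has to be scanned for the keyword, which is why converting the table to Delta Lake alone would not be expected to speed up this kind of search.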
Dileepvikram
Most Recent 1 year ago
answer is A
upvoted 2 times
mouad_attaqi
1 year, 1 month ago
Selected Answer: A
A is correct
upvoted 2 times
sturcu
1 year, 1 month ago
Selected Answer: A
Collecting statistics on long strings is an expensive operation
upvoted 2 times