Welcome to ExamTopics


Exam Certified Data Engineer Professional topic 1 question 31 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 31
Topic #: 1

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

  • A. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
  • B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
  • C. Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
  • D. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
  • E. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
Suggested Answer: D
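Option D reflects how inference actually behaves: Spark picks the widest type that accommodates every observed value (for example, a mostly-numeric field with one stray string is inferred as string), so nothing is ever rejected at read time. A manually declared schema, by contrast, surfaces the non-conforming records. A minimal pure-Python sketch of that difference (the helpers below are hypothetical illustrations, not the Spark API):

```python
import json

def infer_type(values):
    """Mimic permissive inference: pick the widest type that fits all values."""
    types = set()
    for v in values:
        if isinstance(v, bool):
            types.add("boolean")
        elif isinstance(v, int):
            types.add("long")
        elif isinstance(v, float):
            types.add("double")
        else:
            types.add("string")
    if types <= {"long"}:
        return "long"
    if types <= {"long", "double"}:
        return "double"
    return "string"  # widest fallback: any value can be read as a string

def validate(values, declared):
    """Mimic a declared schema: collect values that violate the expected type."""
    expected = {"long": int, "double": (int, float), "string": str}[declared]
    return [v for v in values
            if not isinstance(v, expected) or isinstance(v, bool)]

records = [json.loads(s) for s in (
    '{"device_id": 1}', '{"device_id": 2}', '{"device_id": "n/a"}')]
ids = [r["device_id"] for r in records]

# Inference silently widens device_id to string to fit the bad record;
# a declared long type surfaces it instead.
print(infer_type(ids))        # prints: string
print(validate(ids, "long"))  # prints: ['n/a']
```

With a declared schema the bad record can be quarantined or rejected; with inference it quietly changes the column's type for every downstream consumer.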

Comments

Chosen Answer:
RafaelCFC
Highly Voted 10 months, 3 weeks ago
Selected Answer: D
A is wrong because Tungsten is a project for improving Spark's memory and CPU efficiency, not a string-optimized storage encoding. B is wrong because Parquet files are immutable; they can only be created or overwritten, not edited in place. C is wrong because fully automating schema declaration reduces the predictability of data types and weakens data quality guarantees. E is false because an unlucky sample can lead Spark to a bad inference.
upvoted 10 times
...
hal2401me
Most Recent 8 months, 2 weeks ago
From my exam today, both C and D were no longer available, so they can't be correct. E and A were available. E states "always accurate," so I hesitated to choose it. There was a new option stating something like "Delta Lake indexes the first 32 columns in the Delta log for Z-ordering and optimization" (I'm not sure I remember it exactly, but as a statement it sounds correct), and I chose this new option, because it should impact the schema decision: high-usage fields should be placed among the first 32 columns.
upvoted 2 times
...
guillesd
9 months, 3 weeks ago
Selected Answer: D
Only answer that makes sense
upvoted 1 times
...
AziLa
10 months ago
Correct Ans is D
upvoted 1 times
...
sturcu
1 year, 1 month ago
Selected Answer: D
correct
upvoted 2 times
...
hammer_1234_h
1 year, 2 months ago
D is correct. We can use schema hints (`cloudFiles.schemaHints` in Auto Loader) to enforce the types we know and expect on top of an inferred schema.
upvoted 2 times
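As that comment notes, Auto Loader's `cloudFiles.schemaHints` option pins the types of specific fields while leaving the rest to inference. A rough pure-Python sketch of that merge behavior (the helper below is a hypothetical illustration, not the Databricks API):

```python
def apply_schema_hints(inferred, hints):
    """Hinted fields take the declared type; all others keep the inferred one."""
    return {field: hints.get(field, dtype) for field, dtype in inferred.items()}

# Suppose inference widened device_id to string and read ts as a plain string.
inferred = {"device_id": "string", "temp": "double", "ts": "string"}
hints = {"device_id": "long", "ts": "timestamp"}

print(apply_schema_hints(inferred, hints))
# prints: {'device_id': 'long', 'temp': 'double', 'ts': 'timestamp'}
```

This gives a middle ground for a 100-field nested source: declare the 45 fields the dashboards and model depend on, and let inference handle the rest.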
...
Community vote distribution: A (35%), C (25%), B (20%), Other