Exam Certified Data Engineer Professional topic 1 question 136 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 136
Topic #: 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
Suggested Answer: D
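To make the suggested answer concrete, here is a minimal PySpark sketch of option D. The SparkSession (`spark`), paths, and column names are hypothetical and not part of the question; note that repartition() itself performs a full shuffle, which is exactly what several commenters below object to.

```python
from pyspark.sql import functions as F

# Hypothetical 1 TB JSON source path (not from the question).
df = spark.read.json("/mnt/raw/events/")

# Narrow transformations only: a filter and a map-like derived column, no shuffle.
transformed = (
    df.filter(F.col("event_type").isNotNull())
      .withColumn("ingest_date", F.current_date())
)

# Option D: repartition to 2,048 partitions (1 TB / 512 MB = 2,048) so each
# written part-file lands near 512 MB. repartition() triggers a full shuffle.
(
    transformed.repartition(2048)
    .write.mode("overwrite")
    .parquet("/mnt/curated/events_parquet/")   # hypothetical target path
)
```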

Comments

cf56faf
1 week, 6 days ago
Selected Answer: D
It's D, because A primarily affects the reading of the data
upvoted 1 times
...
Jugiboss
1 month ago
Selected Answer: A
A does not shuffle while D shuffles
upvoted 2 times
...
m79590530
1 month ago
Selected Answer: A
Answer A, as narrow transformations such as union, filter, and map do not cause a shuffle across partitions.
upvoted 1 times
...
Colje
1 month, 1 week ago
Selected Answer: D
The correct answer is D: ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB * 1024 * 1024 / 512), and then write to Parquet.

Explanation: the goal is to write a 1 TB dataset to Parquet with a target file size of 512 MB without incurring the overhead of data shuffling. To achieve optimal performance, we must balance the number of partitions against the file size requirement while avoiding expensive shuffle operations.

Narrow transformations: these (such as map and filter) don't require shuffling the data, which keeps the operation efficient.

Repartition to 2,048 partitions: given that the desired part-file size is 512 MB and the total dataset size is 1 TB, repartitioning the dataset into 2,048 partitions ensures that each partition will be approximately 512 MB in size, which matches the target file size. This avoids shuffle operations and allows for an efficient write.
upvoted 1 times
...
pk07
2 months ago
Selected Answer: D
Not A because spark.sql.files.maxPartitionBytes primarily affects the reading of data, not the writing. It determines the maximum size of a partition when reading files, not when writing them.
upvoted 2 times
...
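For contrast, here is a rough sketch of what option A's approach looks like, assuming an existing SparkSession `spark` and hypothetical paths and column name. spark.sql.files.maxPartitionBytes caps the size of input splits at read time, so with a splittable source the read yields roughly 512 MB partitions; whether those carry through to roughly 512 MB output files depends on the transformations staying narrow so the read partitioning is preserved.

```python
# Option A sketch: the config below controls how the JSON files are split when READ.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")        # hypothetical source path
print(df.rdd.getNumPartitions())                # roughly 1 TB / 512 MB = 2,048 if splittable

(
    df.filter("event_type IS NOT NULL")         # narrow transformation, partitioning preserved
      .write.mode("overwrite")
      .parquet("/mnt/curated/events_parquet/")  # hypothetical target path
)
```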
shaojunni
2 months ago
Selected Answer: C
Neither A nor D prevents shuffling data. C uses coalesce, which reduces data shuffling.
upvoted 1 times
...
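A quick sketch of the coalesce vs. repartition distinction this comment leans on, again with an assumed SparkSession `spark` and a hypothetical path: coalesce() merges existing partitions without a full shuffle but can only reduce the partition count, while repartition() always performs a full shuffle.

```python
df = spark.read.json("/mnt/raw/events/")    # hypothetical source path

merged = df.coalesce(2048)       # narrow: merges existing partitions, no full shuffle;
                                 # only helps if the read produced more than 2,048 partitions
shuffled = df.repartition(2048)  # wide: full shuffle into exactly 2,048 partitions

print(merged.rdd.getNumPartitions(), shuffled.rdd.getNumPartitions())
```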
03355a2
5 months ago
Selected Answer: A
best performance without shuffling data
upvoted 3 times
...
hpkr
5 months, 2 weeks ago
Selected Answer: D
option D
upvoted 1 times
...
Freyr
5 months, 3 weeks ago
Selected Answer: D
Correct answer D: repartition to 2,048 partitions and write to Parquet.

This option directly controls the number of output files by repartitioning the data into 2,048 partitions, assuming that 1 TB / 512 MB per file roughly translates to 2,048 files. Repartitioning the data involves shuffling, but it's a deliberate shuffle designed to achieve a specific partitioning beneficial for writing. After repartitioning, the data is written to Parquet files, each expected to be approximately 512 MB if the data is uniformly distributed across partitions.
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other