Exam Certified Data Engineer Professional topic 1 question 136 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 136
Topic #: 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
Suggested Answer: A
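
For readers who want to see what the suggested answer (option A) looks like in practice, here is a minimal PySpark sketch. Only the config key spark.sql.files.maxPartitionBytes and the 512 MB target come from the question; the paths, column name, and transformations are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Cap each input partition at 512 MB so the scan produces ~512 MB partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")  # hypothetical source path

# Narrow transformations only: no shuffle, so the ~512 MB read partitions
# carry straight through to the write.
cleaned = (
    df.filter(F.col("event_type").isNotNull())  # hypothetical column
      .withColumn("ingest_date", F.current_date())
)

cleaned.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical target
```

Note that Parquet is compressed and columnar, so the actual part files will usually come out smaller than the 512 MB read partitions; the sketch only mirrors the steps described in option A.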

Comments

RandomForest
2 weeks, 6 days ago
Selected Answer: D
Correct answer is D. Why not the other options?
A. spark.sql.files.maxPartitionBytes: this configuration controls how many bytes Spark reads per input partition during a file scan, not the output file size. It does not help in controlling Parquet file sizes during writing.
B. spark.sql.shuffle.partitions plus sorting: while sorting data can optimize performance in some cases, it introduces unnecessary overhead for this scenario. Additionally, spark.sql.shuffle.partitions controls the number of shuffle partitions, not the output partitioning of the data directly.
C. spark.sql.adaptive.advisoryPartitionSizeInBytes: Adaptive Query Execution (AQE) optimizes queries at runtime, but this configuration does not directly control Parquet file sizes. It dynamically adjusts partition sizes for shuffle stages, not for the write output.
upvoted 1 times
...
_lene_
3 weeks, 2 days ago
Selected Answer: A
I agree with arekm's explanation.
upvoted 1 times
...
arekm
1 month ago
Selected Answer: A
Definitely A. There is no repartitioning and thus no subsequent shuffle (which is what the question asks about). The parameter defines how many bytes each input partition reads, so tasks read the data in ~512 MB chunks; since only narrow operations are performed (by definition, no shuffle), we simply write out what we read. The target file size is 512 MB and we did not shuffle.
upvoted 2 times
...
temple1305
2 months ago
Selected Answer: C
I think, "execute the narrow transformations, coalesce to" is key words here - because coalesce is not cause shuffling.
upvoted 1 times
...
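A rough sketch of the coalesce-based approach described in the comment above (option C), with hypothetical paths; the 2,048 partition count and the 512 MB advisory setting come from the option text.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# This setting only guides AQE when it coalesces shuffle partitions; it does
# not by itself resize read partitions or output files.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")  # hypothetical source path

# coalesce() is a narrow operation: it merges existing partitions down to the
# requested count without a full shuffle. With the default 128 MB read
# partitions, a ~1 TB scan yields roughly 8,192 partitions, so coalescing to
# 2,048 targets ~512 MB per output file.
out = df.coalesce(2048)
out.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical target
```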
cf56faf
2 months, 3 weeks ago
Selected Answer: D
It's D, because A primarily affects the reading of the data
upvoted 2 times
...
Jugiboss
3 months, 2 weeks ago
Selected Answer: A
A does not shuffle while D shuffles
upvoted 2 times
...
m79590530
3 months, 2 weeks ago
Selected Answer: A
Answer A, as narrow transformations like union, filter, and map do not cause a shuffle across partitions.
upvoted 2 times
...
Colje
3 months, 3 weeks ago
Selected Answer: D
The correct answer is D: ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB * 1024 * 1024 / 512), and then write to Parquet. Explanation: the goal is to write a 1 TB dataset to Parquet with a target file size of 512 MB while keeping the write efficient, so the number of partitions must be balanced against the file-size requirement. Narrow transformations (such as map and filter) don't require shuffling the data, which keeps that part of the job efficient. Repartitioning to 2,048 partitions: given that the desired part-file size is 512 MB and the total dataset size is 1 TB, repartitioning into 2,048 partitions makes each partition approximately 512 MB, which matches the target file size and allows for an efficient write.
upvoted 1 times
arekm
1 month ago
All true, but not the correct answer. We are looking for a solution without a shuffle/repartition/coalesce.
upvoted 1 times
...
...
pk07
4 months, 1 week ago
Selected Answer: D
Not A because spark.sql.files.maxPartitionBytes primarily affects the reading of data, not the writing. It determines the maximum size of a partition when reading files, not when writing them.
upvoted 2 times
...
shaojunni
4 months, 2 weeks ago
Selected Answer: C
A and D will not prevent shuffling data; C uses coalesce to reduce shuffling.
upvoted 1 times
...
03355a2
7 months, 1 week ago
Selected Answer: A
best performance without shuffling data
upvoted 3 times
...
hpkr
7 months, 4 weeks ago
Selected Answer: D
option D
upvoted 1 times
...
Freyr
8 months, 1 week ago
Selected Answer: D
Correct answer D: repartition to 2,048 partitions and write to Parquet. This option directly controls the number of output files by repartitioning the data into 2,048 partitions, since 1 TB at 512 MB per file works out to roughly 2,048 files. Repartitioning involves a shuffle, but it is a deliberate shuffle designed to achieve a specific partitioning that benefits the write. After repartitioning, the data is written to Parquet files, each expected to be approximately 512 MB if the data is uniformly distributed across partitions. (A sketch of this approach follows below.)
upvoted 1 times
...
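For comparison, a minimal sketch of the repartition-based approach argued for in the option D comments above (paths hypothetical). Unlike coalesce, repartition is a wide operation and performs a full shuffle, which is the crux of the disagreement in this thread.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/mnt/raw/events/")  # hypothetical source path

# 1 TB = 1,048,576 MB; 1,048,576 / 512 = 2,048 partitions of roughly 512 MB each.
# repartition() triggers a full shuffle to redistribute the data evenly.
out = df.repartition(2048)
out.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical target
```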
Community vote distribution: A (35%, most voted), C (25%), B (20%), Other