
Certified Data Engineer Professional exam: Topic 1, Question 70 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 70
Topic #: 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
  • E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Suggested Answer: A 🗳️

Comments

aragorn_brego
Highly Voted 1 year ago
Selected Answer: A
This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
upvoted 8 times
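A minimal PySpark sketch of the approach described above. The paths, column names, and transformations are hypothetical, and since Parquet compresses, 512 MB read partitions will generally yield somewhat smaller output files:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Cap each read partition at 512 MB so the number of input partitions
# roughly matches the desired number of ~512 MB output part-files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

# Hypothetical input path.
df = spark.read.json("/mnt/raw/events_json/")

# Narrow transformations only -- no shuffle, so the partition count
# from the read is carried through to the write.
cleaned = (
    df.filter(F.col("event_type").isNotNull())
      .withColumn("ingest_date", F.current_date())
)

# Hypothetical output path; one part-file per partition.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
```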
...
Def21
Highly Voted 10 months ago
Selected Answer: D
D is the only one that does the trick. Note, we cannot do shuffling. Wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing. (The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.)
B: spark.sql.adaptive.advisoryPartitionSizeInBytes takes effect while shuffling, and sorting does not make sense. (The advisory size in bytes of the shuffle partition during adaptive optimization, when spark.sql.adaptive.enabled is true. It takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition.)
C: Would work, but spark.sql.adaptive.advisoryPartitionSizeInBytes would need shuffling.
E: spark.sql.shuffle.partitions (configures the number of partitions to use when shuffling data for joins or aggregations) is not about writing.
upvoted 6 times
azurefan777
2 weeks, 2 days ago
Answer D is wrong -> repartition does perform shuffling in Spark. When you use repartition, Spark redistributes the data across the specified number of partitions, which requires moving data between nodes to achieve the new partitioning. Answer A should be correct
upvoted 1 times
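A quick way to check this is to compare physical plans: repartition(n) inserts an Exchange (shuffle) node, while narrow transformations do not. A small sketch on a synthetic DataFrame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-check").getOrCreate()

df = spark.range(1_000_000).withColumn("doubled", F.col("id") * 2)

# Narrow transformation only: the physical plan contains no Exchange.
df.explain()

# repartition(2048) adds an Exchange (round-robin partitioning),
# i.e. a full shuffle of the data across the cluster.
df.repartition(2048).explain()
```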
...
...
nedlo
Most Recent 4 weeks ago
Selected Answer: A
I thought D, but the default number of partitions is 200, so you can't do coalesce(2048) (you can't increase the number of partitions through coalesce), so it's not possible to do it without repartitioning and a shuffle. Only A can be done without a shuffle.
upvoted 1 times
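A small sketch illustrating the point that coalesce() can only reduce the partition count, never raise it (numbers are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-check").getOrCreate()

df = spark.range(0, 10_000, numPartitions=200)
print(df.rdd.getNumPartitions())                 # 200

# Asking coalesce for more partitions than currently exist is a no-op.
print(df.coalesce(2048).rdd.getNumPartitions())  # still 200

# Reducing the count works, and does so without a shuffle.
print(df.coalesce(8).rdd.getNumPartitions())     # 8
```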
...
sdas1
2 months, 1 week ago
Option A: spark.sql.files.maxPartitionBytes controls the maximum size of partitions during reading on the Spark cluster, and reducing this value could lead to more partitions and thus potentially more output files. The key point is that it works best when no shuffles occur, which aligns with the scenario of having narrow transformations only.
upvoted 2 times
sdas1
2 months, 1 week ago
Given that no shuffle occurs and you're aiming to control the file sizes during output, adjusting spark.sql.files.maxPartitionBytes could help indirectly by determining the partition size for reading. Since the number of input partitions can influence the size of the output files when no shuffle occurs, the partition size may closely match the size of the files being written out.
upvoted 1 times
sdas1
2 months, 1 week ago
If the transformations remain narrow, then Spark won't repartition the data unless explicitly instructed to do so (e.g., through a repartition or coalesce operation). In this case, using spark.sql.files.maxPartitionBytes to adjust the read partition size to 512 MB could indirectly control the number of output files and ensure they align with the target file size.
upvoted 1 times
sdas1
2 months, 1 week ago
Thus, Option A is also a valid strategy: Set spark.sql.files.maxPartitionBytes to 512 MB, process the data with narrow transformations, and write to Parquet. By reducing the value of spark.sql.files.maxPartitionBytes, you ensure more partitions are created during the read phase, leading to output files closer to the desired size, assuming the transformations are narrow and no shuffling occurs.
upvoted 1 times
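One way to sanity-check this is to inspect the partition count right after the read: with maxPartitionBytes at 512 MB, a ~1 TB input should land in roughly 2,048 partitions, each becoming one part-file if only narrow transformations follow. The path is hypothetical, and the exact count also depends on file boundaries and spark.sql.files.openCostInBytes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events_json/")  # hypothetical input path

# Roughly total_input_bytes / 512 MB partitions; with no shuffle,
# each partition is written out as one Parquet part-file.
print(df.rdd.getNumPartitions())
```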
...
...
...
...
vikram12apr
8 months, 2 weeks ago
Selected Answer: A
D is not correct, as it will create 2,048 target files of 0.5 MB each. Only A will do the job, as it will read this file in 2 partitions (1 TB = 512*2 MB) and, since we are not doing any shuffling (not mentioned in the option), it will create that many partition files, i.e. 2 part files.
upvoted 1 times
hal2401me
8 months, 2 weeks ago
Hey, 1 TB = 1,000 GB = 10^6 MB.
upvoted 4 times
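Back-of-the-envelope check of the partition math (binary units assumed):

```python
tb_in_mb = 1024 * 1024      # 1 TB is roughly 1,048,576 MB
target_file_mb = 512

# ~2,048 part-files of 512 MB each -- not 2 partitions,
# and not 2,048 files of 0.5 MB.
print(tb_in_mb / target_file_mb)  # 2048.0
```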
...
...
hal2401me
8 months, 3 weeks ago
Selected Answer: D
ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.
upvoted 1 times
...
Curious76
8 months, 4 weeks ago
Selected Answer: D
D is most suitable.
upvoted 1 times
...
vctrhugo
9 months, 2 weeks ago
Selected Answer: A
This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.
upvoted 3 times
...
adenis
9 months, 4 weeks ago
Selected Answer: C
C is correct.
upvoted 1 times
...
spaceexplorer
10 months ago
Selected Answer: A
The rest of the answers trigger shuffles.
upvoted 2 times
...
divingbell17
10 months, 3 weeks ago
Selected Answer: A
A is correct. The question asks which strategy will yield the best performance without shuffling data. The other options involve shuffling, either manually or through AQE.
upvoted 2 times
...
911land
11 months, 1 week ago
C is the correct answer.
upvoted 1 times
...
alexvno
11 months, 1 week ago
Selected Answer: A
spark.sql.files.maxPartitionBytes defaults to 128 MB (the maximum number of bytes to pack into a single partition when reading files; this configuration is effective only when using file-based sources such as Parquet, JSON and ORC).
upvoted 1 times
...
petrv
11 months, 4 weeks ago
Selected Answer: C
Here's a breakdown of the reasons:
- spark.sql.adaptive.advisoryPartitionSizeInBytes: This configuration parameter is designed to provide advisory partition sizes for the adaptive query execution framework. It can help in controlling the partition sizes without triggering unnecessary shuffling.
- coalesce(2048): Coalescing to a specific number of partitions after the narrow transformations allows you to control the number of output files without triggering a shuffle. This helps achieve the target part-file size without incurring the overhead of a full shuffle.
- Setting a specific target: The strategy outlines the goal of achieving a target part-file size of 512 MB, which aligns with the requirement.
upvoted 3 times
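For reference, a sketch of how these AQE settings are typically enabled. Note that, per the Spark documentation quoted earlier in the thread, advisoryPartitionSizeInBytes only takes effect when AQE coalesces or splits shuffle partitions, so a read, narrow transforms, write pipeline with no shuffle never reaches that code path:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("aqe-advisory-size")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Advisory target size for *shuffle* partitions under AQE.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512MB")
    .getOrCreate()
)

# These settings only influence partition sizes after a shuffle stage;
# they do not resize the partitions produced by a file scan.
```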
...
ocaj90
1 year ago
Obviously D. It allows you to control both the number of partitions and the final part-file size, which aligns with the requirements. Option B shuffles partitions, which is not allowed.
upvoted 1 times
...
sturcu
1 year, 1 month ago
Selected Answer: B
The number of output files saved to the disk is equal to the number of partitions in the Spark executors when the write operation is performed.
upvoted 2 times
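A tiny sketch of that relationship, writing to a local temporary path (hypothetical) and counting the resulting part-files:

```python
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-vs-partitions").getOrCreate()

out = "/tmp/parts_demo"                      # hypothetical output path
df = spark.range(0, 1_000, numPartitions=4)  # 4 partitions at write time

df.write.mode("overwrite").parquet(out)

# One part-file is produced per partition (plus a _SUCCESS marker).
print(len(glob.glob(f"{out}/part-*.parquet")))  # 4
```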
...
Community vote distribution: A (35%), C (25%), B (20%), Other.