
Certified Data Engineer Professional exam: Topic 1, Question 70 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 70
Topic #: 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
  • E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Suggested Answer: A 🗳️

Comments

aragorn_brego
Highly Voted 1 year ago
Selected Answer: A
This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
upvoted 8 times
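A minimal PySpark sketch of the approach described above. The paths, column names, and transformations are hypothetical, and since Parquet compresses, 512 MB read partitions will generally yield somewhat smaller output files:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Cap each read partition at 512 MB so the number of input partitions
# roughly matches the desired number of ~512 MB output part-files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

# Hypothetical input path.
df = spark.read.json("/mnt/raw/events_json/")

# Narrow transformations only -- no shuffle, so the partition count
# from the read is carried through to the write.
cleaned = (
    df.filter(F.col("event_type").isNotNull())
      .withColumn("ingest_date", F.current_date())
)

# Hypothetical output path; one part-file per partition.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
```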
...
Def21
Highly Voted 10 months ago
Selected Answer: D
D is the only one that does the trick. Note, we cannot do shuffling. Wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing. (The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.)
B: spark.sql.adaptive.advisoryPartitionSizeInBytes takes effect while shuffling, and sorting does not make sense. (The advisory size in bytes of the shuffle partition during adaptive optimization, when spark.sql.adaptive.enabled is true. It takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition.)
C: Would work, but spark.sql.adaptive.advisoryPartitionSizeInBytes would need shuffling.
E: spark.sql.shuffle.partitions (configures the number of partitions to use when shuffling data for joins or aggregations) is not about writing.
upvoted 6 times
azurefan777
2 weeks, 2 days ago
Answer D is wrong -> repartition does perform shuffling in Spark. When you use repartition, Spark redistributes the data across the specified number of partitions, which requires moving data between nodes to achieve the new partitioning. Answer A should be correct
upvoted 1 times
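A quick way to check this is to compare physical plans: repartition(n) inserts an Exchange (shuffle) node, while narrow transformations do not. A small sketch on a synthetic DataFrame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-check").getOrCreate()

df = spark.range(1_000_000).withColumn("doubled", F.col("id") * 2)

# Narrow transformation only: the physical plan contains no Exchange.
df.explain()

# repartition(2048) adds an Exchange (round-robin partitioning),
# i.e. a full shuffle of the data across the cluster.
df.repartition(2048).explain()
```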
...
...
nedlo
Most Recent 4 weeks ago
Selected Answer: A
I thought D, but the default number of partitions is 200, so you can't do coalesce(2048) (you can't increase the number of partitions through coalesce), so it's not possible to do it without repartitioning and a shuffle. Only A can be done without a shuffle.
upvoted 1 times
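A small sketch illustrating the point that coalesce() can only reduce the partition count, never raise it (numbers are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-check").getOrCreate()

df = spark.range(0, 10_000, numPartitions=200)
print(df.rdd.getNumPartitions())                 # 200

# Asking coalesce for more partitions than currently exist is a no-op.
print(df.coalesce(2048).rdd.getNumPartitions())  # still 200

# Reducing the count works, and does so without a shuffle.
print(df.coalesce(8).rdd.getNumPartitions())     # 8
```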
...
sdas1
2 months, 1 week ago
Option A: spark.sql.files.maxPartitionBytes controls the maximum size of partitions during reading on the Spark cluster, and reducing this value could lead to more partitions and thus potentially more output files. The key point is that it works best when no shuffles occur, which aligns with the scenario of having narrow transformations only.
upvoted 2 times
sdas1
2 months, 1 week ago
Given that no shuffle occurs and you're aiming to control the file sizes during output, adjusting spark.sql.files.maxPartitionBytes could help indirectly by determining the partition size for reading. Since the number of input partitions can influence the size of the output files when no shuffle occurs, the partition size may closely match the size of the files being written out.
upvoted 1 times
sdas1
2 months, 1 week ago
If the transformations remain narrow, then Spark won't repartition the data unless explicitly instructed to do so (e.g., through a repartition or coalesce operation). In this case, using spark.sql.files.maxPartitionBytes to adjust the read partition size to 512 MB could indirectly control the number of output files and ensure they align with the target file size.
upvoted 1 times
sdas1
2 months, 1 week ago
Thus, Option A is also a valid strategy: Set spark.sql.files.maxPartitionBytes to 512 MB, process the data with narrow transformations, and write to Parquet. By reducing the value of spark.sql.files.maxPartitionBytes, you ensure more partitions are created during the read phase, leading to output files closer to the desired size, assuming the transformations are narrow and no shuffling occurs.
upvoted 1 times
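One way to sanity-check this is to inspect the partition count right after the read: with maxPartitionBytes at 512 MB, a ~1 TB input should land in roughly 2,048 partitions, each becoming one part-file if only narrow transformations follow. The path is hypothetical, and the exact count also depends on file boundaries and spark.sql.files.openCostInBytes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events_json/")  # hypothetical input path

# Roughly total_input_bytes / 512 MB partitions; with no shuffle,
# each partition is written out as one Parquet part-file.
print(df.rdd.getNumPartitions())
```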
...
...
...
...
vikram12apr
8 months, 2 weeks ago
Selected Answer: A
D is not correct, as it will create 2,048 target files of 0.5 MB each. Only A will do the job, as it will read this file in 2 partitions (1 TB = 512*2 MB) and, since we are not doing any shuffling (not mentioned in the option), it will create that many partition files, i.e. 2 part files.
upvoted 1 times
hal2401me
8 months, 2 weeks ago
Hey, 1 TB = 1,000 GB = 10^6 MB.
upvoted 4 times
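Back-of-the-envelope check of the partition math (binary units assumed):

```python
tb_in_mb = 1024 * 1024      # 1 TB is roughly 1,048,576 MB
target_file_mb = 512

# ~2,048 part-files of 512 MB each -- not 2 partitions,
# and not 2,048 files of 0.5 MB.
print(tb_in_mb / target_file_mb)  # 2048.0
```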
...
...
hal2401me
8 months, 3 weeks ago
Selected Answer: D
ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.
upvoted 1 times
...
Curious76
8 months, 4 weeks ago
Selected Answer: D
D is most suitable.
upvoted 1 times
...
vctrhugo
9 months, 2 weeks ago
Selected Answer: A
This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.
upvoted 3 times
...
adenis
9 months, 4 weeks ago
Selected Answer: C
C is correct.
upvoted 1 times
...
spaceexplorer
10 months ago
Selected Answer: A
The rest of the answers trigger shuffles.
upvoted 2 times
...
divingbell17
10 months, 3 weeks ago
Selected Answer: A
A is correct. The question asks which strategy will yield the best performance without shuffling data. The other options involve shuffling, either manually or through AQE.
upvoted 2 times
...
911land
11 months, 1 week ago
C is the correct answer.
upvoted 1 times
...
alexvno
11 months, 1 week ago
Selected Answer: A
spark.sql.files.maxPartitionBytes defaults to 128 MB (the maximum number of bytes to pack into a single partition when reading files; this configuration is effective only when using file-based sources such as Parquet, JSON and ORC).
upvoted 1 times
...
petrv
11 months, 4 weeks ago
Selected Answer: C
Here's a breakdown of the reasons:
- spark.sql.adaptive.advisoryPartitionSizeInBytes: This configuration parameter is designed to provide advisory partition sizes for the adaptive query execution framework. It can help in controlling the partition sizes without triggering unnecessary shuffling.
- coalesce(2048): Coalescing to a specific number of partitions after the narrow transformations allows you to control the number of output files without triggering a shuffle. This helps achieve the target part-file size without incurring the overhead of a full shuffle.
- Setting a specific target: The strategy outlines the goal of achieving a target part-file size of 512 MB, which aligns with the requirement.
upvoted 3 times
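For reference, a sketch of how these AQE settings are typically enabled. Note that, per the Spark documentation quoted earlier in the thread, advisoryPartitionSizeInBytes only takes effect when AQE coalesces or splits shuffle partitions, so a read, narrow transforms, write pipeline with no shuffle never reaches that code path:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("aqe-advisory-size")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Advisory target size for *shuffle* partitions under AQE.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512MB")
    .getOrCreate()
)

# These settings only influence partition sizes after a shuffle stage;
# they do not resize the partitions produced by a file scan.
```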
...
ocaj90
1 year ago
Obviously D. It allows you to control both the number of partitions and the final part-file size, which aligns with the requirements. Option B shuffles partitions, which is not allowed.
upvoted 1 times
...
sturcu
1 year, 1 month ago
Selected Answer: B
The number of output files saved to the disk is equal to the number of partitions in the Spark executors when the write operation is performed.
upvoted 2 times
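A tiny sketch of that relationship, writing to a local temporary path (hypothetical) and counting the resulting part-files:

```python
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-vs-partitions").getOrCreate()

out = "/tmp/parts_demo"                      # hypothetical output path
df = spark.range(0, 1_000, numPartitions=4)  # 4 partitions at write time

df.write.mode("overwrite").parquet(out)

# One part-file is produced per partition (plus a _SUCCESS marker).
print(len(glob.glob(f"{out}/part-*.parquet")))  # 4
```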
...
Community vote distribution: A (35%), C (25%), B (20%), Other.