Exam DP-203 topic 1 question 28 discussion

Actual exam question from Microsoft's DP-203
Question #: 28
Topic #: 1

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?

  • A. Convert the files to JSON
  • B. Convert the files to Avro
  • C. Compress the files
  • D. Merge the files
Suggested Answer: D

Comments

VeroDon
Highly Voted 5 months ago
You cannot merge the files if you don't know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB and decrease performance, so B is the correct answer: convert to Avro.
upvoted 51 times
auwia
1 year, 8 months ago
Option B: Convert the files to Avro (WRONG FOR ME). While converting the files to Avro is a valid option for optimizing data storage and processing, it may not be the most suitable choice in this specific scenario. Avro is a binary serialization format that is efficient for compact storage and fast data processing. It provides schema-evolution support and is widely used in big data processing frameworks like Apache Hadoop and Apache Spark.

However, in the given scenario the files are already in CSV format. Converting them to Avro would require additional processing and potentially introduce complexity. Avro is better suited for scenarios where data is generated or consumed by systems that natively support Avro, or where schema evolution is a critical requirement.

On the other hand, merging the files (Option D) is a more straightforward and common approach to optimize batch processing. It helps reduce the overhead associated with managing a large number of small files, improves data-scanning efficiency, and enhances overall processing performance. Merging files is a recommended practice to achieve better performance and cost efficiency in scenarios where file sizes vary.
upvoted 16 times
...
Bouhdy
5 months, 3 weeks ago
Avro is often used when schema evolution and efficient serialization are needed, but merging files is the primary solution for optimizing batch processing when dealing with a large number of small files. ANSWER IS D!
upvoted 1 times
...
Massy
2 years, 10 months ago
I can understand why you say not to merge, but why Avro?
upvoted 4 times
anks84
2 years, 5 months ago
Because we need to ensure the files stored in the container are optimized for batch processing, and converting the files to Avro would be suitable for that. So the answer is "Convert to Avro".
upvoted 2 times
...
...
bhrz
2 years, 5 months ago
The information about the file size is already given (4 KB to 5 GB), so option D seems to be correct.
upvoted 3 times
...
...
Canary_2021
Highly Voted 5 months ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size). https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 27 times
...
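As a rough illustration of the merge/compaction approach the suggested answer and the comment above describe, the sketch below reads many small CSV files from a folder and writes them back as a handful of larger files with PySpark. The storage account, container, folder names, and partition count are hypothetical placeholders, not values from the question.

```python
# Minimal sketch of compacting many small CSV files into fewer, larger ones
# with PySpark. Account, container, and folder names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-csv").getOrCreate()

source = "abfss://events@examplestorage.dfs.core.windows.net/raw/"
target = "abfss://events@examplestorage.dfs.core.windows.net/compacted/"

# Read every small CSV file in the source folder as a single DataFrame.
df = spark.read.option("header", "true").csv(source)

# Repartition before writing so each output file is large; the partition
# count is illustrative and would be tuned toward the 256 MB - 100 GB
# range cited in the best-practices article above.
df.repartition(8).write.mode("overwrite").option("header", "true").csv(target)
```

The repartition count controls how many output files Spark produces, which is what keeps each merged file in the larger size range the best-practices article recommends.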
IMadnan
Most Recent 1 week, 2 days ago
Selected Answer: D
merging the files (Option D)
upvoted 1 times
...
samianae
3 weeks, 6 days ago
Selected Answer: D
So confusing for me; both C and D seem correct to me.
upvoted 1 times
...
RMK2000
1 month, 3 weeks ago
Selected Answer: D
Larger files lead to better performance and reduced costs. Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger-sized files for better performance (256 MB to 100 GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size. Increasing file size can also reduce transaction costs: read and write operations are billed in 4 MB increments, so you're charged for the operation whether the file contains 4 MB or only a few kilobytes. For pricing information, see Azure Data Lake Storage pricing. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
upvoted 1 times
...
moize
2 months, 3 weeks ago
Selected Answer: D
To optimize CSV files stored in Azure Data Lake Storage Gen2 for batch processing, the best option is to merge the files (option D). Merging the files reduces the number of small files, which improves batch-processing performance by cutting the overhead of managing many small files.
upvoted 1 times
...
EmnCours
2 months, 4 weeks ago
Selected Answer: D
Correct Answer: D
upvoted 1 times
...
roopansh.gupta2
5 months ago
Selected Answer: D
D. Merge the files.

Explanation:
File merging: merging small files into larger ones helps reduce the overhead associated with processing many small files, which can negatively impact performance. Large files are typically more efficient for batch processing in distributed systems like Azure Data Lake and Azure Databricks because they minimize the number of tasks and the amount of metadata the system needs to manage.
Compressing the files (Option C, incorrect): while compression can reduce storage costs and improve I/O efficiency, it doesn't address the issue of small file sizes, which can still lead to inefficiencies in distributed processing.
Converting to Avro (Option B, incorrect) or JSON (Option A, incorrect): converting to a different file format like Avro or JSON could be beneficial for specific use cases, especially where schema evolution or specific query optimizations are required. However, it does not address the fundamental issue of optimizing file size for batch processing.
Therefore, merging the files (Option D) is the most direct and effective way to optimize for batch processing in this scenario.
upvoted 1 times
...
d39f475
9 months ago
Compress the files is my choice.
upvoted 1 times
...
Dusica
9 months, 4 weeks ago
Pay attention to "the number of events that occur per hour." That means it is streaming, and you can group small files with streaming engines. It is D.
upvoted 1 times
...
f214eb2
10 months, 1 week ago
Selected Answer: A
Convert to Avro is legit.
upvoted 1 times
...
Charley92
10 months, 2 weeks ago
Selected Answer: C
C. Compress the files. Compression is beneficial for batch processing as it can significantly reduce file size, which leads to faster transfer rates and can improve performance during batch-processing tasks. It's particularly effective for large files, making them easier to handle and process efficiently.
upvoted 1 times
...
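For comparison, this is roughly what option C would involve: the same read, with a compression codec set on the write. Again a sketch with hypothetical paths; it shrinks each file but does not by itself reduce how many files there are.

```python
# Sketch of option C: write the CSV data back gzip-compressed.
# Paths and names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compress-csv").getOrCreate()

df = spark.read.option("header", "true").csv(
    "abfss://events@examplestorage.dfs.core.windows.net/raw/")

# Compression reduces bytes per file, but the number of output files is
# still governed by the DataFrame's partitioning, not by this option.
df.write.mode("overwrite") \
    .option("header", "true") \
    .option("compression", "gzip") \
    .csv("abfss://events@examplestorage.dfs.core.windows.net/compressed/")
```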
da257c2
10 months, 2 weeks ago
option d
upvoted 1 times
...
Khadija10
1 year ago
Selected Answer: C
The answer is compress the files.
- We can't merge because we don't know how many files we will receive, and performance will decrease if the merged files are too large.
- Converting the files to Avro will require additional processing, and if we receive too many files it may add complexity.
- Compressing the CSV files is the best choice in our scenario; compressing files is a common practice for optimizing batch processing. It helps reduce storage space, minimize data transfer times, and improve overall performance.
upvoted 3 times
...
alphilla
1 year, 1 month ago
Crazy how option D (which is wrong) has such a high amount of votes. I'm going for C, compress the files. Guys, you don't know how many files you have; you can't design a system where this kind of randomness can make your system fail.
upvoted 5 times
...
Joanna0
1 year, 1 month ago
Selected Answer: B
Binary serialization: Avro uses a compact binary format, making it more efficient in terms of storage and transmission compared to plain-text formats like CSV. This can be advantageous for batch processing scenarios, especially when dealing with large volumes of data.
Schema evolution: Avro supports schema evolution, allowing you to change the schema of your data without requiring modifications to the entire dataset or affecting backward compatibility. This flexibility is beneficial in scenarios where your data schema may evolve over time.
Compression: while Avro itself is a binary format that provides some level of compression, you can further enhance compression by applying additional compression algorithms. This is particularly useful when dealing with large files, and it helps reduce storage costs and improve data-transfer efficiency.
upvoted 1 times
...
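For completeness, here is a minimal sketch of what option B (converting the CSV data to Avro) would look like in Spark, assuming the spark-avro package is available on the cluster (it is an external module, not enabled in every distribution); paths are hypothetical placeholders.

```python
# Sketch of option B: rewrite the CSV data in Avro format.
# Requires the org.apache.spark:spark-avro package; paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-avro").getOrCreate()

df = spark.read.option("header", "true").csv(
    "abfss://events@examplestorage.dfs.core.windows.net/raw/")

# Avro gives a compact binary encoding and carries a schema, but by default
# this still produces one output file per partition, so file count is a
# separate concern from file format.
df.write.mode("overwrite").format("avro").save(
    "abfss://events@examplestorage.dfs.core.windows.net/avro/")
```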
ll94
1 year, 1 month ago
Selected Answer: C
A. Convert the files to JSON => makes no sense.
B. Convert the files to Avro => my understanding is that the CSV format is a given, so no.
C. Compress the files => for batch processing it's a win, and it's the only option you can assume is true given the available information.
D. Merge the files => this could be true, but not knowing how many files there are is a big issue.
upvoted 5 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other