Exam DP-203 topic 1 question 28 discussion

Actual exam question from Microsoft's DP-203
Question #: 28
Topic #: 1

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?

  • A. Convert the files to JSON
  • B. Convert the files to Avro
  • C. Compress the files
  • D. Merge the files
Suggested Answer: D

Comments

VeroDon
Highly Voted 7 months ago
You cannot merge the files if you don't know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB and decrease performance, so B is the correct answer: convert to Avro.
upvoted 51 times
auwia
1 year, 10 months ago
Option B, convert the files to Avro, is wrong in my view. While converting the files to Avro is a valid option for optimizing data storage and processing, it may not be the most suitable choice in this specific scenario. Avro is a binary serialization format that is efficient for compact storage and fast data processing. It provides schema evolution support and is widely used in big data processing frameworks like Apache Hadoop and Apache Spark. However, in the given scenario, the files are already in CSV format; converting them to Avro would require additional processing and potentially introduce complexity. Avro is better suited for scenarios where data is generated or consumed by systems that natively support Avro, or where schema evolution is a critical requirement.

Merging the files (Option D), on the other hand, is a more straightforward and common approach to optimizing batch processing. It reduces the overhead associated with managing a large number of small files, improves data scanning efficiency, and enhances overall processing performance. Merging files is a recommended practice for better performance and cost efficiency in scenarios where file sizes vary (a sketch of such a compaction job follows this comment).
upvoted 17 times
...
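A minimal PySpark sketch of the compaction job described in the comment above, assuming a Spark environment that can already reach the ADLS Gen2 account. The storage paths, container name, and output file count are illustrative assumptions, not details from the question.

# Read the many small hourly CSV drops as one DataFrame, then rewrite them
# as a handful of larger files. Paths and the count of 8 are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-csv").getOrCreate()

df = (spark.read
      .option("header", "true")
      .csv("abfss://events@mydatalake.dfs.core.windows.net/raw/*/*.csv"))

(df.coalesce(8)                                   # fewer, larger output files
   .write.mode("overwrite")
   .option("header", "true")
   .csv("abfss://events@mydatalake.dfs.core.windows.net/curated/"))

In practice the output file count would be derived from the total input size rather than hard-coded; see the sizing helper further down.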
Bouhdy
7 months, 2 weeks ago
Avro is often used when schema evolution and efficient serialization are needed, but merging files is the primary solution for optimizing batch processing when dealing with a large number of small files. The answer is D!
upvoted 1 times
...
Massy
2 years, 11 months ago
I can understand why you say not to merge, but why Avro?
upvoted 4 times
anks84
2 years, 7 months ago
Because we need to ensure the files stored in the container are optimized for batch processing, and converting the files to Avro is well suited to batch processing. So the answer is "Convert to Avro".
upvoted 2 times
...
...
bhrz
2 years, 7 months ago
The information about the file size is already given: between 4 KB and 5 GB. So option D seems to be correct.
upvoted 3 times
...
...
Canary_2021
Highly Voted 7 months ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger-sized files for better performance (256 MB to 100 GB in size); a small sizing helper is sketched below. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 27 times
...
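A small, purely illustrative Python helper for picking an output file count that lands inside the 256 MB to 100 GB band cited above; the 512 MB target is an arbitrary point within that band.

import math

def target_file_count(total_bytes, target_file_bytes=512 * 1024 * 1024):
    # How many output files to write so each ends up roughly target_file_bytes.
    # 512 MB is an arbitrary choice inside the documented 256 MB to 100 GB band.
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Example: 40 GB of small hourly CSVs -> 80 output files of about 512 MB each.
print(target_file_count(40 * 1024**3))

The result could feed the coalesce or repartition call in the compaction sketch above.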
Jolyboy
Most Recent 3 weeks, 1 day ago
Selected Answer: B
I choose B. To optimize the files stored in your Azure Data Lake Storage Gen2 container for batch processing, the best option is to convert the files to Avro (Option B). Here's why: Avro is a binary format that is highly efficient for both storage and processing, and it supports schema evolution, which is beneficial for handling changes in data structure over time. Avro files can be compressed, reducing storage costs and improving read/write performance, and they are splittable, which means they can be processed in parallel, enhancing batch-processing efficiency. While compressing the files (Option C) can also reduce storage size and improve transfer speeds, converting to Avro provides additional benefits like schema support and better performance for large-scale data processing. (A sketch of such a conversion follows this comment.)
upvoted 1 times
...
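For completeness, a hedged sketch of the Avro conversion this comment argues for (option B). It assumes Spark with the spark-avro package available (for example via --packages org.apache.spark:spark-avro_2.12:3.5.0); paths and the compression codec are illustrative.

# Convert the CSV input to compressed Avro output.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-avro").getOrCreate()

df = (spark.read
      .option("header", "true")
      .csv("abfss://events@mydatalake.dfs.core.windows.net/raw/"))

(df.write.mode("overwrite")
   .format("avro")
   .option("compression", "snappy")   # Avro files stay splittable when compressed
   .save("abfss://events@mydatalake.dfs.core.windows.net/avro/"))

Note that conversion alone still produces roughly one output object per input partition, so without a coalesce or repartition step it does not by itself fix the small-files problem the suggested answer targets.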
Januaz
4 weeks, 1 day ago
Selected Answer: D
Copilot says merge the files.
upvoted 1 times
...
IMadnan
2 months ago
Selected Answer: D
merging the files (Option D)
upvoted 1 times
...
samianae
2 months, 3 weeks ago
Selected Answer: D
So confusing; both C and D seem correct to me.
upvoted 1 times
...
RMK2000
3 months, 2 weeks ago
Selected Answer: D
Larger files lead to better performance and reduced costs. Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger-sized files for better performance (256 MB to 100 GB in size); some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size.

Increasing file size can also reduce transaction costs. Read and write operations are billed in 4 MB increments, so you're charged for an operation whether the file contains 4 MB or only a few kilobytes (a back-of-the-envelope illustration follows this comment). For pricing information, see Azure Data Lake Storage pricing. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
upvoted 1 times
...
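A back-of-the-envelope illustration of the 4 MB billing increment quoted above; the data volume and file sizes are hypothetical.

INCREMENT = 4 * 1024 * 1024                  # reads and writes billed in 4 MB increments
TOTAL = 1 * 1024**3                          # assume 1 GB of hourly event data

# Many tiny files: reading a 4 KB file still bills a full 4 MB increment.
tiny_files = TOTAL // (4 * 1024)             # 262,144 files
tiny_billed_increments = tiny_files * 1      # 262,144 billed increments

# The same data merged into 256 MB files, billed in 4 MB chunks of real data.
merged_files = TOTAL // (256 * 1024 * 1024)  # 4 files
merged_billed_increments = merged_files * (256 // 4)   # 256 billed increments

print(tiny_billed_increments, merged_billed_increments)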
moize
4 months, 2 weeks ago
Selected Answer: D
To optimize the CSV files stored in Azure Data Lake Storage Gen2 for batch processing, the best option is to merge the files (Option D). Merging the files reduces the number of small files, which improves batch-processing performance by cutting the overhead of managing many small files.
upvoted 1 times
...
EmnCours
4 months, 3 weeks ago
Selected Answer: D
Correct Answer: D
upvoted 1 times
...
roopansh.gupta2
7 months ago
Selected Answer: D
D. Merge the files.

File merging: merging small files into larger ones reduces the overhead associated with processing many small files, which can negatively impact performance. Large files are typically more efficient for batch processing in distributed systems like Azure Data Lake and Azure Databricks because they minimize the number of tasks and the amount of metadata the system needs to manage.

Compressing the files (Option C, incorrect): while compression can reduce storage costs and improve I/O efficiency, it doesn't address the issue of small file sizes, which can still lead to inefficiencies in distributed processing.

Converting to Avro (Option B, incorrect) or JSON (Option A, incorrect): converting to a different file format could be beneficial for specific use cases, especially where schema evolution or specific query optimizations are required, but it does not address the fundamental issue of optimizing file size for batch processing.

Therefore, merging the files (Option D) is the most direct and effective way to optimize for batch processing in this scenario.
upvoted 1 times
...
d39f475
10 months, 3 weeks ago
Compressing the files is my choice.
upvoted 1 times
...
Dusica
11 months, 3 weeks ago
Pay attention to "the number of events that occur per hour." That means the data is streaming in, and you can group small files with streaming engines, so it is D.
upvoted 1 times
...
f214eb2
1 year ago
Selected Answer: A
Converting to Avro is legit.
upvoted 1 times
...
Charley92
1 year ago
Selected Answer: C
C. Compress the files. Compression is beneficial for batch processing because it can significantly reduce file size, which leads to faster transfer rates and can improve performance during batch-processing tasks. It's particularly effective for large files, making them easier to handle and process efficiently.
upvoted 1 times
...
da257c2
1 year ago
option d
upvoted 1 times
...
Khadija10
1 year, 2 months ago
Selected Answer: C
The answer is compress the files.
- We can't merge because we don't know how many files we will receive, and performance will decrease if the merged files become too large.
- Converting the files to Avro would require additional processing, and if we receive too many files it may add complexity.
- Compressing the CSV files is the best choice in our scenario; compression is a common practice for optimizing batch processing. It helps reduce storage space, minimize data transfer times, and improve overall performance. (A minimal compression sketch follows this comment.)
upvoted 3 times
...
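A minimal sketch of the compression route this comment prefers (option C), using only the Python standard library; the staging folder and file names are hypothetical.

import glob
import gzip
import shutil

# Gzip each staged CSV before it is uploaded to the container.
for path in glob.glob("events/*.csv"):
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)         # produces events/<name>.csv.gz

One trade-off worth noting: a gzip-compressed CSV is not splittable, so a single 5 GB .csv.gz cannot be read in parallel, which is part of why the suggested answer favors merging instead.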
alphilla
1 year, 2 months ago
Crazy how Option D (which is wrong) has such a high number of votes. I'm going for C, compress the files. Guys, you don't know how many files you have; you can't design a system where this kind of randomness can make it fail.
upvoted 5 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other.