Exam DP-203 topic 1 question 28 discussion

Actual exam question from Microsoft's DP-203
Question #: 28
Topic #: 1

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number of events that occur per hour.
File sizes range from 4 KB to 5 GB.
You need to ensure that the files stored in the container are optimized for batch processing.
What should you do?

  • A. Convert the files to JSON
  • B. Convert the files to Avro
  • C. Compress the files
  • D. Merge the files
Suggested Answer: D

Comments

VeroDon
Highly Voted 5 months ago
You cannot merge the files if you don't know how many files exist in ADLS Gen2. In this case, you could easily create a file larger than 100 GB and decrease performance, so B is the correct answer: convert to Avro.
upvoted 51 times
auwia
1 year, 8 months ago
Option B: Convert the files to Avro (WRONG FOR ME). While converting the files to Avro is a valid option for optimizing data storage and processing, it may not be the most suitable choice in this specific scenario. Avro is a binary serialization format that is efficient for compact storage and fast data processing. It provides schema-evolution support and is widely used in big data processing frameworks like Apache Hadoop and Apache Spark.

However, in the given scenario the files are already in CSV format. Converting them to Avro would require additional processing and potentially introduce complexity. Avro is better suited for scenarios where data is generated or consumed by systems that natively support Avro, or where schema evolution is a critical requirement.

On the other hand, merging the files (Option D) is a more straightforward and common approach to optimize batch processing. It helps reduce the overhead associated with managing a large number of small files, improves data-scanning efficiency, and enhances overall processing performance. Merging files is a recommended practice to achieve better performance and cost efficiency in scenarios where file sizes vary.
upvoted 16 times
...
Bouhdy
5 months, 3 weeks ago
Avro is often used when schema evolution and efficient serialization are needed, but merging files is the primary solution for optimizing batch processing when dealing with a large number of small files. ANSWER IS D!
upvoted 1 times
...
Massy
2 years, 10 months ago
I can understand why you say not to merge, but why Avro?
upvoted 4 times
anks84
2 years, 5 months ago
Because we need to ensure the files stored in the container are optimized for batch processing, and converting the files to Avro would be suitable for that. So the answer is "Convert to Avro".
upvoted 2 times
...
...
bhrz
2 years, 5 months ago
The information about the file size is already given (4 KB to 5 GB), so option D seems to be correct.
upvoted 3 times
...
...
Canary_2021
Highly Voted 5 months ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size). https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 27 times
...
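As a rough illustration of the merge/compaction approach the suggested answer and the comment above describe, the sketch below reads many small CSV files from a folder and writes them back as a handful of larger files with PySpark. The storage account, container, folder names, and partition count are hypothetical placeholders, not values from the question.

```python
# Minimal sketch of compacting many small CSV files into fewer, larger ones
# with PySpark. Account, container, and folder names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-csv").getOrCreate()

source = "abfss://events@examplestorage.dfs.core.windows.net/raw/"
target = "abfss://events@examplestorage.dfs.core.windows.net/compacted/"

# Read every small CSV file in the source folder as a single DataFrame.
df = spark.read.option("header", "true").csv(source)

# Repartition before writing so each output file is large; the partition
# count is illustrative and would be tuned toward the 256 MB - 100 GB
# range cited in the best-practices article above.
df.repartition(8).write.mode("overwrite").option("header", "true").csv(target)
```

The repartition count controls how many output files Spark produces, which is what keeps each merged file in the larger size range the best-practices article recommends.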
IMadnan
Most Recent 1 week, 2 days ago
Selected Answer: D
merging the files (Option D)
upvoted 1 times
...
samianae
3 weeks, 6 days ago
Selected Answer: D
So confusing for me; both C and D seem correct to me.
upvoted 1 times
...
RMK2000
1 month, 3 weeks ago
Selected Answer: D
Larger files lead to better performance and reduced costs. Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger-sized files for better performance (256 MB to 100 GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size. Increasing file size can also reduce transaction costs: read and write operations are billed in 4 MB increments, so you're charged for the operation whether the file contains 4 MB or only a few kilobytes. For pricing information, see Azure Data Lake Storage pricing. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
upvoted 1 times
...
moize
2 months, 3 weeks ago
Selected Answer: D
To optimize CSV files stored in Azure Data Lake Storage Gen2 for batch processing, the best option is to merge the files (option D). Merging the files reduces the number of small files, which improves batch-processing performance by cutting the overhead of managing many small files.
upvoted 1 times
...
EmnCours
2 months, 4 weeks ago
Selected Answer: D
Correct Answer: D
upvoted 1 times
...
roopansh.gupta2
5 months ago
Selected Answer: D
D. Merge the files.

Explanation:
File merging: merging small files into larger ones helps reduce the overhead associated with processing many small files, which can negatively impact performance. Large files are typically more efficient for batch processing in distributed systems like Azure Data Lake and Azure Databricks because they minimize the number of tasks and the amount of metadata the system needs to manage.
Compressing the files (Option C, incorrect): while compression can reduce storage costs and improve I/O efficiency, it doesn't address the issue of small file sizes, which can still lead to inefficiencies in distributed processing.
Converting to Avro (Option B, incorrect) or JSON (Option A, incorrect): converting to a different file format like Avro or JSON could be beneficial for specific use cases, especially where schema evolution or specific query optimizations are required. However, it does not address the fundamental issue of optimizing file size for batch processing.
Therefore, merging the files (Option D) is the most direct and effective way to optimize for batch processing in this scenario.
upvoted 1 times
...
d39f475
9 months ago
Compress the files is my choice.
upvoted 1 times
...
Dusica
9 months, 4 weeks ago
Pay attention to "the number of events that occur per hour." That means it is streaming, and you can group small files with streaming engines. It is D.
upvoted 1 times
...
f214eb2
10 months, 1 week ago
Selected Answer: A
Convert to Avro is legit.
upvoted 1 times
...
Charley92
10 months, 2 weeks ago
Selected Answer: C
C. Compress the files. Compression is beneficial for batch processing as it can significantly reduce file size, which leads to faster transfer rates and can improve performance during batch-processing tasks. It's particularly effective for large files, making them easier to handle and process efficiently.
upvoted 1 times
...
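For comparison, this is roughly what option C would involve: the same read, with a compression codec set on the write. Again a sketch with hypothetical paths; it shrinks each file but does not by itself reduce how many files there are.

```python
# Sketch of option C: write the CSV data back gzip-compressed.
# Paths and names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compress-csv").getOrCreate()

df = spark.read.option("header", "true").csv(
    "abfss://events@examplestorage.dfs.core.windows.net/raw/")

# Compression reduces bytes per file, but the number of output files is
# still governed by the DataFrame's partitioning, not by this option.
df.write.mode("overwrite") \
    .option("header", "true") \
    .option("compression", "gzip") \
    .csv("abfss://events@examplestorage.dfs.core.windows.net/compressed/")
```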
da257c2
10 months, 2 weeks ago
option d
upvoted 1 times
...
Khadija10
1 year ago
Selected Answer: C
The answer is compress the files.
- We can't merge because we don't know how many files we will receive, and performance will decrease if the merged files are too large.
- Converting the files to Avro will require additional processing, and if we receive too many files it may add complexity.
- Compressing the CSV files is the best choice in our scenario; compressing files is a common practice for optimizing batch processing. It helps reduce storage space, minimize data transfer times, and improve overall performance.
upvoted 3 times
...
alphilla
1 year, 1 month ago
Crazy how option D (which is wrong) has such a high amount of votes. I'm going for C, compress the files. Guys, you don't know how many files you have; you can't design a system where this kind of randomness can make your system fail.
upvoted 5 times
...
Joanna0
1 year, 1 month ago
Selected Answer: B
Binary serialization: Avro uses a compact binary format, making it more efficient in terms of storage and transmission compared to plain-text formats like CSV. This can be advantageous for batch processing scenarios, especially when dealing with large volumes of data.
Schema evolution: Avro supports schema evolution, allowing you to change the schema of your data without requiring modifications to the entire dataset or affecting backward compatibility. This flexibility is beneficial in scenarios where your data schema may evolve over time.
Compression: while Avro itself is a binary format that provides some level of compression, you can further enhance compression by applying additional compression algorithms. This is particularly useful when dealing with large files, and it helps reduce storage costs and improve data-transfer efficiency.
upvoted 1 times
...
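For completeness, here is a minimal sketch of what option B (converting the CSV data to Avro) would look like in Spark, assuming the spark-avro package is available on the cluster (it is an external module, not enabled in every distribution); paths are hypothetical placeholders.

```python
# Sketch of option B: rewrite the CSV data in Avro format.
# Requires the org.apache.spark:spark-avro package; paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-avro").getOrCreate()

df = spark.read.option("header", "true").csv(
    "abfss://events@examplestorage.dfs.core.windows.net/raw/")

# Avro gives a compact binary encoding and carries a schema, but by default
# this still produces one output file per partition, so file count is a
# separate concern from file format.
df.write.mode("overwrite").format("avro").save(
    "abfss://events@examplestorage.dfs.core.windows.net/avro/")
```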
ll94
1 year, 1 month ago
Selected Answer: C
A. Convert the files to JSON => makes no sense.
B. Convert the files to Avro => my understanding is that the CSV format is a given, so no.
C. Compress the files => for batch processing it's a win, and it's the only option you can assume is true given the available information.
D. Merge the files => this could be true, but not knowing how many files there are is a big issue.
upvoted 5 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other