
Exam DP-203 topic 1 question 40 discussion

Actual exam question from Microsoft's DP-203
Question #: 40
Topic #: 1

You are implementing a batch dataset in the Parquet format.
Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.
You need to minimize storage costs for the solution.
What should you do?

  • A. Use Snappy compression for the files.
  • B. Use OPENROWSET to query the Parquet files.
  • C. Create an external table that contains a subset of columns from the Parquet files.
  • D. Store all data as strings in the Parquet files.
Suggested Answer: A

Comments

m2shines
Highly Voted 3 years, 2 months ago
Answer should be A, because this talks about minimizing storage costs, not querying costs
upvoted 75 times
Homer23
11 months, 2 weeks ago
I found this comparison of compression methods, which explained that A should not be the answer. https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/ "BROTLI: This is a relatively new codec which offers a very high compression ratio, but with lower compression and decompression speeds. This codec is useful when storage space is a major constraint. This technique also offers parallel processing that other methods don't."
upvoted 2 times
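For anyone who wants to check that trade-off themselves, here is a minimal sketch using pyarrow; the sample table and file names are made up, and exact ratios depend entirely on your data. Brotli typically produces smaller files than Snappy, at the cost of slower writes and reads.

```python
# Compare Parquet codec output sizes with pyarrow.
# The sample data is synthetic; ratios vary by dataset.
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
table = pa.table({
    "id": np.arange(1_000_000),
    "value": rng.normal(size=1_000_000),
    "category": rng.choice(["red", "green", "blue"], size=1_000_000),
})

for codec in ["none", "snappy", "brotli"]:
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>6}: {os.path.getsize(path):,} bytes")
```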
assU2
3 years, 1 month ago
Isn't Snappy the default compressionCodec for Parquet in Azure? https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 24 times
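It is: the ADF Parquet format docs list snappy as the default compressionCodec, and pyarrow defaults to Snappy as well. If you want to confirm which codec an existing file actually uses, the Parquet metadata records it per column chunk. A small sketch, where "sample.parquet" is a placeholder for one of your files:

```python
import pyarrow.parquet as pq

# "sample.parquet" is a placeholder; point it at a real file
# downloaded from ADLS. Each column chunk records its own codec.
meta = pq.ParquetFile("sample.parquet").metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, "->", col.compression)
```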
jongert
1 year, 2 months ago
Very confused at first; after thinking about it and rereading, this is what I found: the question says we are implementing the batch dataset in Parquet format, so we should think about a situation where we write the file and explicitly specify Snappy compression as an argument. The phrasing is very confusing, I have to say, but if you argue from a 'query externally' perspective, then B and C would yield the same benefit. Therefore, A makes the most sense and connects best with the question.
upvoted 2 times
Aslam208
Highly Voted 3 years, 2 months ago
C is the correct answer, as an external table that contains a subset of columns from the Parquet files would be cost-effective.
upvoted 23 times
Massy
2 years, 9 months ago
In serverless SQL pool you don't create a copy of the data, so how could it be cost-effective?
upvoted 2 times
Bro111
2 years, 2 months ago
Don't forget that transaction costs are part of storage costs, so taking a subset of columns will lower transaction costs and consequently storage costs.
upvoted 1 times
RehanRajput
2 years, 9 months ago
This is not correct. 1. External tables are not saved in the database (this is why they're external). 2. You're assuming that serverless SQL pools have local storage. They don't: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool
upvoted 5 times
Aditya0891
2 years, 8 months ago
Well, there is a possibility to create an external table and load only the required columns using OPENROWSET in serverless SQL pool to a different container in ADLS. Remember, serverless SQL pool does support CETAS with OPENROWSET, but dedicated pool doesn't support loading data using OPENROWSET. So basically the solution could be: load the required columns with CETAS using OPENROWSET to a different container, then delete the source data from the previous container after loading the filtered data.
upvoted 2 times
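The CETAS statement itself is T-SQL, but the storage effect described here, rewriting only the needed columns to a second location, can be illustrated locally with pyarrow. The paths and column names below are hypothetical:

```python
import os

import pyarrow.parquet as pq

# Read only the columns the consumers need and rewrite them,
# mimicking what CETAS with a column list does server-side.
# Paths and column names are placeholders.
subset = pq.read_table("source/all_columns.parquet", columns=["id", "value"])
pq.write_table(subset, "curated/subset.parquet", compression="snappy")

print("source:", os.path.getsize("source/all_columns.parquet"), "bytes")
print("subset:", os.path.getsize("curated/subset.parquet"), "bytes")
```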
Aditya0891
2 years, 8 months ago
Check this: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-cetas. Answer C is correct.
upvoted 4 times
IMadnan
Most Recent 1 week ago
Selected Answer: A
Parquet is a columnar storage format whose encodings already shrink the data; choosing Snappy as the compression codec reduces the storage footprint of the files in Azure Data Lake Storage Gen2 further. Snappy is a well-suited codec for analytical workloads, offering a good balance between compression ratio and decompression speed, which matters for the Azure Synapse Analytics serverless SQL pool to query the data efficiently. Options B, C, and D do not directly address minimizing storage costs for the Parquet files themselves: Option B is about query access, Option C is about query efficiency rather than storage, and Option D is counterproductive to storage cost minimization.
upvoted 1 times
moize
2 months, 3 weeks ago
Selected Answer: A
A. Use Snappy compression for the files. This approach reduces the size of the Parquet files, which minimizes storage costs in Azure Data Lake Storage Gen2 while remaining compatible with Azure Synapse Analytics.
upvoted 1 times
EmnCours
2 months, 3 weeks ago
Selected Answer: A
https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs-legacy
upvoted 1 times
a85becd
5 months, 3 weeks ago
Selected Answer: A
Using Snappy compression (Option A) is specifically designed to reduce the size of Parquet files, thereby directly minimizing storage costs.
upvoted 1 times
Danweo
7 months, 3 weeks ago
Selected Answer: C
The question is confusing, but I believe it is C, because you can use CETAS to store this external table in Gen2 (this is the storage solution); from there you will query it using the serverless SQL pool.
upvoted 1 times
Dusica
9 months, 4 weeks ago
A, B and C are all acceptable; D is just stupid. But pay attention to "You need to minimize storage costs for the solution": that means Snappy Parquet compression. A is correct.
upvoted 2 times
dgerok
10 months, 1 week ago
Selected Answer: A
"Use Snappy compression for the files" is the only answer that is about minimizing the cost of storage. While one is using a serverless SQL pool, external tables are available, but they are only metadata...
upvoted 3 times
Elanche
10 months, 3 weeks ago
Using Snappy compression for the Parquet files helps minimize storage costs while still maintaining good compression efficiency. Snappy is a compression library that offers a good balance between compression ratio and processing speed. By compressing the data using Snappy, you can significantly reduce the amount of storage required for your dataset.

Option B, using OPENROWSET to query the Parquet files, doesn't directly impact storage costs. It's a method for querying data but doesn't address storage optimization.

Option C, creating an external table with a subset of columns, may help reduce query costs by minimizing the amount of data that needs to be processed during queries. However, it doesn't directly address storage costs.

Option D, storing all data as strings in the Parquet files, would likely increase storage costs rather than minimize them. Storing data as strings without appropriate compression would result in larger file sizes compared to using efficient compression algorithms like Snappy.
upvoted 6 times
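The point about option D is easy to check empirically: writing the same values once with a native type and once cast to strings (a synthetic sketch; file names are made up) typically shows the string version costing more bytes even after Snappy compression.

```python
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = np.arange(1_000_000, dtype=np.int64)

# Same values, stored once as int64 and once as strings.
typed = pa.table({"v": values})
strings = pa.table({"v": values.astype(str)})

pq.write_table(typed, "typed.parquet", compression="snappy")
pq.write_table(strings, "strings.parquet", compression="snappy")

print("int64 :", os.path.getsize("typed.parquet"), "bytes")
print("string:", os.path.getsize("strings.parquet"), "bytes")
```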
ankeshpatel2112
10 months, 4 weeks ago
A. Use Snappy compression for the files.
upvoted 2 times
Zen9nez
12 months ago
The answer is C. Parquet has default SNAPPY compression, which cannot be overwritten, so why would I apply SNAPPY again?
upvoted 3 times
s_unsworth
12 months ago
Selected Answer: A
Further information is required for this question; there isn't enough to go on as to what is being asked. The initial question is about storage, which points to the Snappy compression answer. If it is asking about querying the data, then that should be clearly defined in the question. If someone were to create a user story for this (as a manager, I want to store data in the data lake at a reduced cost), you wouldn't be providing them with an external table; you would give them information on storage.
upvoted 2 times
Joanna0
1 year, 1 month ago
Selected Answer: A
Snappy compression can reduce the size of Parquet files by up to 70%. This can save you a significant amount of money on storage costs.
upvoted 1 times
[Removed]
1 year, 5 months ago
Selected Answer: A
Snappy
upvoted 2 times
kkk5566
1 year, 5 months ago
Selected Answer: A
using compression
upvoted 2 times
kkk5566
1 year, 6 months ago
To minimize storage costs for the solution, you should use Snappy compression for the files. Snappy is a fast and efficient data compression and decompression library that can be used to compress Parquet files. This will help reduce the size of the data files and minimize storage costs in Azure Data Lake Storage Gen2. So, the correct answer is A: use Snappy compression for the files.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other