
Exam DP-203 topic 1 question 8 discussion

Actual exam question from Microsoft's DP-203
Question #: 8
Topic #: 1

HOTSPOT
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area: [answer-area image not reproduced; per the discussion below it offers two selections: Copy behavior (Flatten hierarchy, Merge files, or Preserve hierarchy) and Sink file type (options including Parquet)]

Suggested Answer:
Box 1: Preserve hierarchy
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

Box 2: Parquet
The Parquet format is supported by Azure Data Factory for Azure Data Lake Storage Gen2, and Parquet files embed their own schema, so the schema can be inferred automatically from the underlying files.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
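For illustration only, a copy activity combining the two suggested selections might look like the sketch below. All names and dataset references are placeholders invented for this example; the copyBehavior value follows the suggested answer, while most comments below argue for MergeFiles instead.

{
  "name": "CopyJsonToParquet",
  "description": "Illustrative sketch only; all names are placeholders",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceJsonFiles", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "MergedParquet", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "JsonSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFileName": "*.json"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "PreserveHierarchy"
      }
    }
  }
}

The two hotspot boxes map to the sink's storeSettings.copyBehavior property and to the sink dataset's format (Parquet).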

Comments

alain2
Highly Voted 3 months, 2 weeks ago
1. Merge Files 2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 167 times
edba
3 years ago
Just to add a bit more reference regarding copyBehavior in the ADF copy activity, plus the info mentioned in the best-practices doc: it should be Merge files.
https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 12 times
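For reference, copyBehavior sits under the sink's storeSettings in the copy activity JSON and accepts PreserveHierarchy, FlattenHierarchy, or MergeFiles. A minimal fragment showing the value this thread argues for (a sketch, not a complete activity):

"sink": {
  "type": "ParquetSink",
  "storeSettings": {
    "type": "AzureBlobFSWriteSettings",
    "copyBehavior": "MergeFiles"
  }
}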
Ameenymous
3 years, 7 months ago
The smaller the files, the worse the performance, so Merge files and Parquet seem to be the right answers.
upvoted 25 times
kilowd
3 years, 2 months ago
Larger files lead to better performance and reduced costs. Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 10 times
ThiruthuvaRajan
Highly Voted 3 years, 7 months ago
It should be 1) Merge Files: the question clearly says the data is "initially ingested as 10 small JSON files", and there is no hint about hierarchy or partition information, so we clearly need to merge these files for better performance. 2) Parquet: always gives better performance for columnar-based data.
upvoted 11 times
Karthikyn
Most Recent 2 months, 1 week ago
1. Preserve hierarchy 2. Parquet. The Merge files behavior combines multiple files into a single output file; while this can be useful for reducing the number of files, it may introduce additional performance overhead due to the merging process, and the question asks for the fastest possible query time.
upvoted 1 times
7082935
5 months, 1 week ago
This question is somewhat confusing. The detail states "Each file contains the same data attributes and data from a subsidiary of your company". There is a possibility that each file is for a different subsidiary and would be stored in a different hierarchical folder, so the requirement to "preserve hierarchy" could be the right answer here.
upvoted 2 times
Bakachi55
9 months, 3 weeks ago
Dear Community, I would like to express my heartfelt gratitude for the thoughtful mock questions that have been shared. Your generosity in providing these valuable resources has been immensely helpful. As we engage in discussions and learn together, I am reminded of the strength and camaraderie that exists within our community. To everyone who has contributed, whether by creating questions, participating in discussions, or simply offering encouragement, thank you. Your collective efforts make this community a vibrant and supportive place for learning and growth. Let us continue to share knowledge, support one another, and celebrate our shared passion for learning.
upvoted 5 times
waghsvw53
10 months, 2 weeks ago
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory
upvoted 1 times
Azure_2023
10 months, 3 weeks ago
1. Flatten hierarchy (performance: fastest; schema inference: straightforward) 2. Parquet
upvoted 2 times
lisa710
1 year ago
Merge Files: This option combines the 10 JSON files into a single Parquet file, reducing overhead and improving query performance significantly. Parquet: This columnar format is optimized for fast queries, especially when dealing with large datasets and selective column reads. It also supports compression and schema inference.
upvoted 4 times
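On the schema-inference point: Parquet embeds its schema in the file footer, so the sink dataset can be defined without an explicit schema and the structure is read from the files themselves. A hypothetical dataset definition, with all names invented for this example:

{
  "name": "MergedParquet",
  "properties": {
    "description": "Illustrative sketch only; names are placeholders",
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "AdlsGen2LinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "curated",
        "folderPath": "merged"
      },
      "compressionCodec": "snappy"
    },
    "schema": []
  }
}

Leaving "schema" empty lets downstream readers (for example, a serverless SQL pool via OPENROWSET) pick up the column names and types directly from the Parquet files.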
Chemmangat
1 year, 3 months ago
My answer: Merge. Since there is no mention of preserving the hierarchy, and the need is to make the process more efficient, merge is the way to go.
upvoted 2 times
hassexat
1 year, 3 months ago
MERGE FILES, since you need to transform the data and the files have the same attributes. PARQUET, because it is the most efficient file format.
upvoted 2 times
kkk5566
1 year, 4 months ago
- Merge - Parquet
upvoted 2 times
ladistar
1 year, 4 months ago
ChatGPT confirms it's 1. Merge, 2. Parquet
upvoted 5 times
rocky48
1 year, 8 months ago
1. Merge Files 2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 5 times
Honour
1 year, 9 months ago
Hey, the answer here should be "Merge Files" and "Parquet". The question said nothing about hierarchies.
upvoted 2 times
bubby248
1 year, 10 months ago
Merging the small files will be best for fast retrieval. Parquet for the sink file type.
upvoted 1 times
INDEAVR
1 year, 10 months ago
Either preserving or flattening the hierarchy has little to no performance overhead, whereas merging files causes additional performance overhead. It's preserve.
upvoted 1 times
vigilante89
2 years ago
Copy behavior: MERGE FILES, because the small files already have the same data attributes, i.e. the same schema, so merging all the data into a single file and converting it to Parquet makes the most sense for query time, space, and cost efficiency. Sink/destination file type: PARQUET; this is a no-brainer, because Parquet is the most efficient file format in this case in terms of time, space, and cost.
upvoted 5 times