
Exam DP-203 topic 1 question 8 discussion

Actual exam question from Microsoft's DP-203
Question #: 8
Topic #: 1

HOTSPOT
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area: [answer-area image not reproduced; per the discussion below it offers two selections: Copy behavior (Flatten hierarchy, Merge files, or Preserve hierarchy) and Sink file type (options including Parquet)]

Suggested Answer:
Box 1: Preserve hierarchy
Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

Box 2: Parquet
The Parquet format is supported by Azure Data Factory for Azure Data Lake Storage Gen2, and Parquet files embed their own schema, so the schema can be inferred automatically from the underlying files.
Reference:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
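For illustration only, a copy activity combining the two suggested selections might look like the sketch below. All names and dataset references are placeholders invented for this example; the copyBehavior value follows the suggested answer, while most comments below argue for MergeFiles instead.

{
  "name": "CopyJsonToParquet",
  "description": "Illustrative sketch only; all names are placeholders",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceJsonFiles", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "MergedParquet", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "JsonSource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFileName": "*.json"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "PreserveHierarchy"
      }
    }
  }
}

The two hotspot boxes map to the sink's storeSettings.copyBehavior property and to the sink dataset's format (Parquet).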

Comments

alain2
Highly Voted 3 months, 2 weeks ago
1. Merge Files 2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 167 times
edba
3 years ago
Just to add a bit more reference regarding copyBehavior in the ADF copy activity, plus the info mentioned in the best-practices doc: it should be Merge files.
https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 12 times
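For reference, copyBehavior sits under the sink's storeSettings in the copy activity JSON and accepts PreserveHierarchy, FlattenHierarchy, or MergeFiles. A minimal fragment showing the value this thread argues for (a sketch, not a complete activity):

"sink": {
  "type": "ParquetSink",
  "storeSettings": {
    "type": "AzureBlobFSWriteSettings",
    "copyBehavior": "MergeFiles"
  }
}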
Ameenymous
3 years, 7 months ago
The smaller the files, the worse the performance, so Merge files and Parquet seem to be the right answers.
upvoted 25 times
kilowd
3 years, 2 months ago
Larger files lead to better performance and reduced costs. Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 10 times
ThiruthuvaRajan
Highly Voted 3 years, 7 months ago
It should be 1) Merge Files: the question clearly says the data is "initially ingested as 10 small JSON files", and there is no hint about hierarchy or partition information, so we clearly need to merge these files for better performance. 2) Parquet: always gives better performance for columnar-based data.
upvoted 11 times
Karthikyn
Most Recent 2 months, 1 week ago
1. Preserve hierarchy 2. Parquet. The Merge files behavior combines multiple files into a single output file; while this can be useful for reducing the number of files, it may introduce additional performance overhead due to the merging process, and the question asks for the fastest possible query time.
upvoted 1 times
7082935
5 months, 1 week ago
This question is somewhat confusing. The detail states "Each file contains the same data attributes and data from a subsidiary of your company". There is a possibility that each file is for a different subsidiary and would be stored in a different hierarchical folder, so the requirement to "preserve hierarchy" could be the right answer here.
upvoted 2 times
Bakachi55
9 months, 3 weeks ago
Dear Community, I would like to express my heartfelt gratitude for the thoughtful mock questions that have been shared. Your generosity in providing these valuable resources has been immensely helpful. As we engage in discussions and learn together, I am reminded of the strength and camaraderie that exists within our community. To everyone who has contributed, whether by creating questions, participating in discussions, or simply offering encouragement, thank you. Your collective efforts make this community a vibrant and supportive place for learning and growth. Let us continue to share knowledge, support one another, and celebrate our shared passion for learning.
upvoted 5 times
waghsvw53
10 months, 2 weeks ago
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory
upvoted 1 times
Azure_2023
10 months, 3 weeks ago
1. Flatten hierarchy (performance: fastest; schema inference: straightforward) 2. Parquet
upvoted 2 times
lisa710
1 year ago
Merge Files: This option combines the 10 JSON files into a single Parquet file, reducing overhead and improving query performance significantly. Parquet: This columnar format is optimized for fast queries, especially when dealing with large datasets and selective column reads. It also supports compression and schema inference.
upvoted 4 times
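On the schema-inference point: Parquet embeds its schema in the file footer, so the sink dataset can be defined without an explicit schema and the structure is read from the files themselves. A hypothetical dataset definition, with all names invented for this example:

{
  "name": "MergedParquet",
  "properties": {
    "description": "Illustrative sketch only; names are placeholders",
    "type": "Parquet",
    "linkedServiceName": {
      "referenceName": "AdlsGen2LinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "curated",
        "folderPath": "merged"
      },
      "compressionCodec": "snappy"
    },
    "schema": []
  }
}

Leaving "schema" empty lets downstream readers (for example, a serverless SQL pool via OPENROWSET) pick up the column names and types directly from the Parquet files.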
Chemmangat
1 year, 3 months ago
My answer: Merge. Since there is no mention of preserving the hierarchy, and the need is to make the process more efficient, merge is the way to go.
upvoted 2 times
hassexat
1 year, 3 months ago
MERGE FILES, since you need to transform the data and the files have the same attributes. PARQUET, because it is the most efficient file format.
upvoted 2 times
kkk5566
1 year, 4 months ago
- Merge - Parquet
upvoted 2 times
ladistar
1 year, 4 months ago
ChatGPT confirms it's 1. Merge, 2. Parquet
upvoted 5 times
rocky48
1 year, 8 months ago
1. Merge Files 2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 5 times
Honour
1 year, 9 months ago
Hey, the answer here should be "Merge Files" and "Parquet". The question said nothing about hierarchies.
upvoted 2 times
bubby248
1 year, 10 months ago
Merging the small files will be best for fast retrieval. Parquet for the sink file type.
upvoted 1 times
INDEAVR
1 year, 10 months ago
Either preserving or flattening the hierarchy has little to no performance overhead, whereas merging files causes additional performance overhead. It's preserve.
upvoted 1 times
vigilante89
2 years ago
Copy behavior: MERGE FILES, because the small files already have the same data attributes, i.e. the same schema, so merging all the data into a single file and converting it to Parquet makes the most sense for query time, space, and cost efficiency. Sink/destination file type: PARQUET; this is a no-brainer, because Parquet is the most efficient file format in this case in terms of time, space, and cost.
upvoted 5 times