Exam DP-600 topic 1 question 45 discussion

Actual exam question from Microsoft's DP-600

Question #: 45
Topic #: 1

HOTSPOT -
You have a Fabric workspace that uses the default Spark starter pool and runtime version 1.2.
You plan to read a CSV file named Sales_raw.csv in a lakehouse, select columns, and save the data as a Delta table to the managed area of the lakehouse. Sales_raw.csv contains 12 columns.
You have the following code.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.

by Momoanwar at Feb. 18, 2024, 1:30 a.m.

Comments

metiii

Highly Voted 1 year, 3 months ago

1. No. This relates to filter pushdown / predicate pushdown / column pruning, which is available when reading from a columnar format like Parquet. I didn't find anything related to CSV. I know you can push down a predicate on CSV so that only some rows are read, but that probably doesn't work for selecting columns, so Spark will read the entire file and then filter the columns.
2. Yes. Partitioning creates some overhead, since Spark needs to create more files.
3. Yes. inferSchema forces Spark to read the file twice: once for the schema and once for the data.

upvoted 40 times
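
The row-format argument above can be illustrated with a minimal sketch in plain Python. This is an analogy only, not Spark's CSV reader: the 12-column sample data and the `select_columns` helper are hypothetical. It shows that a row-oriented text file has to be scanned in full even when only a few fields per row are kept.

```python
import csv
import io

# Hypothetical 12-column sample standing in for Sales_raw.csv (not the real file).
header = [f"col{i}" for i in range(12)]
rows = [[f"r{r}c{c}" for c in range(12)] for r in range(3)]
buffer = io.StringIO()
csv.writer(buffer).writerows([header] + rows)
csv_text = buffer.getvalue()


def select_columns(text, wanted):
    """Return the wanted columns plus the number of characters scanned."""
    lines = text.splitlines(keepends=True)
    names = next(csv.reader([lines[0]]))
    indexes = [names.index(name) for name in wanted]
    scanned = len(lines[0])
    selected = []
    for line in lines[1:]:
        scanned += len(line)  # the whole line is consumed...
        fields = next(csv.reader([line]))
        selected.append([fields[i] for i in indexes])  # ...but few fields are kept
    return selected, scanned


selected, scanned = select_columns(csv_text, ["col0", "col3"])
print(selected[0], scanned == len(csv_text))
```

Selecting two columns or all twelve scans the same number of characters here, which is the contrast with a columnar format that the comment draws.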

...

282b85d

Highly Voted 1 year, 1 month ago

N-N-Y
• No: The Spark engine will initially read all columns from the CSV file because the .select() transformation is applied after the data has been read into memory. Therefore, all 12 columns from Sales_raw.csv are read before the selection of specific columns is applied.
• No: Removing the partition might not necessarily reduce the execution time. While there might be some overhead in writing data to partitions, the overall impact on read performance, especially for large datasets, is usually beneficial. The query execution time for saving might be higher due to partitioning, but the read performance improvement usually outweighs this cost.
• Yes: Adding inferSchema = 'true' will increase the execution time of the query because Spark will need to read through the entire dataset to determine the data types of each column. This extra pass over the data adds to the initial read time.

upvoted 21 times
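
The extra pass attributed to inferSchema can be sketched in plain Python. This is a toy model and not how Spark implements inference: the sample data, the `infer_type` and `load` helpers, and the int/float/str type ladder are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
csv_text = "SalesOrderNumber,Quantity,UnitPrice\nSO1,2,9.99\nSO2,5,20\n"


def infer_type(values):
    """Pick the narrowest type that fits every value: int, then float, else str."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def load(text, infer_schema):
    """Return (rows, scans), where scans counts full passes over the text."""
    rows = list(csv.DictReader(io.StringIO(text)))  # pass 1: read the values
    scans = 1
    if infer_schema:
        schema = {name: infer_type([row[name] for row in rows]) for name in rows[0]}
        # pass 2: read the data again, now converting with the inferred types
        rows = [
            {name: schema[name](value) for name, value in row.items()}
            for row in csv.DictReader(io.StringIO(text))
        ]
        scans += 1
    return rows, scans


typed_rows, typed_scans = load(csv_text, infer_schema=True)
string_rows, string_scans = load(csv_text, infer_schema=False)
print(typed_scans, string_scans)
```

Without inference every value stays a string and the text is scanned once; with inference the types must be known before conversion, so the sketch scans twice.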

Huepig

6 months, 3 weeks ago

This is not correct. Partitioning is not recommended because of both read and write overheads. More often than not, partitioning results in the "many small files" problem and data skew. Also, partitioning results in a reshuffle, which causes longer write durations. In this case, unless the CSV is so massive that each "year" partition is greater than 128 MB as Parquet, partitioning will increase both read and write durations. To add to this

upvoted 1 times
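
The "many small files" point can be made concrete with a toy writer in plain Python. This is only an analogy: a real Delta table stores Parquet files plus a transaction log, and the rows, the `write_table` helper, and the `part-0.csv` file name are hypothetical.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical rows; "Year" stands in for the partition column discussed above.
rows = [
    {"SalesOrderNumber": "SO1", "Year": "2021"},
    {"SalesOrderNumber": "SO2", "Year": "2022"},
    {"SalesOrderNumber": "SO3", "Year": "2022"},
    {"SalesOrderNumber": "SO4", "Year": "2023"},
]


def write_table(rows, root, partition_by=None):
    """Write rows as CSV files under root and return the relative file paths."""
    groups = defaultdict(list)
    for row in rows:
        key = f"{partition_by}={row[partition_by]}" if partition_by else ""
        groups[key].append(row)  # rows are regrouped by partition value first
    written = []
    for key, group in groups.items():
        folder = os.path.join(root, key)
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, "part-0.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(group[0]))
            writer.writeheader()
            writer.writerows(group)
        written.append(os.path.relpath(path, root))
    return sorted(written)


with tempfile.TemporaryDirectory() as flat_dir, tempfile.TemporaryDirectory() as part_dir:
    flat_files = write_table(rows, flat_dir)
    partitioned_files = write_table(rows, part_dir, partition_by="Year")

print(len(flat_files), len(partitioned_files))
```

Four rows become one file without partitioning and three files, each in its own folder, with it. Whether that overhead pays off depends on how much data lands in each partition, which is the trade-off debated in this thread.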

...

lelima

1 year, 1 month ago

This is correct

upvoted 1 times

...

testtaker45

Most Recent 5 months, 2 weeks ago

N, Y, Y. Lol, such a controversial question.
1. The key word is "read". Load will read all of the columns and only then select the specified ones. When Spark does a .select(), it will read everything into memory first.
2. Removing the partition would reduce compute requirements, making the query execute faster, but would likely increase the in-memory footprint, as ultimately more is loaded into memory, not just a part.
3. inferSchema will try to guess the type (int, date, etc.), but it requires compute. This will slow down the query.
I hope this helps someone!

upvoted 1 times

...

Rakesh16

7 months, 2 weeks ago

No, No, Yes

upvoted 1 times

...

Pegooli

11 months, 3 weeks ago

I'm going with Y-N-N

upvoted 1 times

...

calvintcy

1 year ago

I took the test today. This question was included, but the option 'Removing the partition will reduce the execution time of the query' has been replaced by 'Will the Year column replace the OrderDate column?'. My answer was No.

upvoted 11 times

radamantes

5 months, 3 weeks ago

Did you pass? Were most of the questions on this site? Can you share?

upvoted 1 times

...

mnc_1997

1 year, 1 month ago

just tried it, it only writes the columns that were selected. Answer: YNY

upvoted 2 times

mnc_1997

1 year, 1 month ago

Also, Spark uses lazy evaluation: it does not read anything into memory at each method call (read, load, option). The actual reading happens when an action is performed, such as display, show, or write. Spark creates a plan for executing the entire query and optimizes this plan for efficient execution.

upvoted 2 times
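
The lazy-evaluation behaviour described above can be sketched with a toy class in plain Python. `LazyFrame`, `read_source`, and the sample row are hypothetical and are not Spark's API; the sketch only shows that transformations are recorded as a plan and the source is read when an action runs.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that produces the rows when invoked
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def select(self, *names):
        """Record a projection and return a new plan without touching the data."""
        def step(rows):
            return [{name: row[name] for name in names} for row in rows]
        return LazyFrame(self._source, self._steps + (step,))

    def collect(self):
        """Action: read the source and apply every recorded step in order."""
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return rows


reads = []


def read_source():
    reads.append("read")  # track how often the source is actually read
    return [{"SalesOrderNumber": "SO1", "OrderDate": "2021-01-01", "Extra": "x"}]


frame = LazyFrame(read_source).select("SalesOrderNumber", "OrderDate")
reads_before_action = len(reads)  # only a plan exists at this point
result = frame.collect()          # the action triggers the read
reads_after_action = len(reads)
print(reads_before_action, reads_after_action, result)
```

Because the whole plan is known before anything runs, an engine has the chance to optimize it. How much of that optimization applies to a CSV source is exactly what the answers in this thread disagree on.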

...

stilferx

1 year, 1 month ago

IMHO: 1. N, 2. N (arguable), 3. Y.
1. No, because it is CSV. It will be read in full (in contrast to Parquet).
2. No. Well, maybe 0.5% slower due to creating new files, but effectively no.
3. Yes, because inferring the schema is an additional process.

upvoted 6 times

...

vish9

1 year, 1 month ago

No: CSV will be read in full and then filtered.
No: Using the partition by clause in Spark's Delta format can impact write performance in several ways. Increased write throughput: partitioning your data can potentially increase write throughput by distributing the write workload across multiple partitions. This parallelism allows Spark to write data to different partitions concurrently, improving overall write performance, especially when dealing with large datasets.
Yes: inferSchema will slow the performance.

upvoted 5 times

...

dp600

1 year, 2 months ago

I would go with NYY. It's a CSV, which is a row format; I don't think you can separate it by columns before reading the entire content. Partitioning takes extra work, so it may slow down the process. inferSchema requires an extra scan of the document, or so I think, so I will go with Yes.

upvoted 2 times

...

DilumD

1 year, 2 months ago

1. Yes. Reason: the code selects specific columns from the DataFrame using the select method. The selected columns are "SalesOrderNumber", "OrderDate", "CustomerName", and "UnitPrice".
2. Yes. Reason: removing the partitionBy will simplify the process. Partitioning data involves some overhead in organizing the data into separate folders/files based on the partitioning column.
3. No. Reason: potentially slower. Enabling inferSchema generally results in a slightly slower initial read operation. This is because Spark needs to do an additional scan of a portion of your data to analyze and determine data types before loading it.

upvoted 1 times

...

wellingtonluis

1 year, 4 months ago

After reading the whole file, the engine selects just some of the columns. But initially it reads the entire file.

upvoted 5 times

...

XiltroX

1 year, 4 months ago

The answer is probably YNY.
1. Those are exactly the columns that are being read. So Yes.
2. Removing the partitionBy line would not result in any performance changes. So No.
3. Adding inferSchema as True WILL result in extra execution time, as it makes the engine go over the data twice (once to read the schema and once to read the data). So Yes.

upvoted 11 times

...

estrelle2008

1 year, 4 months ago

Full of typos, this one. Anyhow, my guess: YNN. inferSchema=true helps automatically determine column data types, but it needs an extra pass over the data, which comes with a slight query performance cost. So last statement = No.

upvoted 2 times

...

Momoanwar

1 year, 4 months ago

It's "read", not "red". This question is ambiguous; I would say: No, No, Yes. For point 1: with case sensitivity, sales_raw is not Sales_raw.

upvoted 5 times

...