Exam DP-600 topic 1 question 45 discussion

Actual exam question from Microsoft's DP-600
Question #: 45
Topic #: 1

HOTSPOT -
You have a Fabric workspace that uses the default Spark starter pool and runtime version 1.2.
You plan to read a CSV file named Sales_raw.csv in a lakehouse, select columns, and save the data as a Delta table to the managed area of the lakehouse. Sales_raw.csv contains 12 columns.
You have the following code.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.
NOTE: Each correct selection is worth one point.

Suggested Answer:

Comments

metiii
Highly Voted 9 months, 1 week ago
1. No. This is called filter pushdown / predicate pushdown / column pruning. It is available when reading from a columnar format like Parquet; I didn't find anything related to CSV. I know that you can push down a predicate on CSV so that Spark only reads some rows, and that works, but it probably doesn't work for selecting columns, so Spark will read the entire file and then filter the columns.
2. Yes. Partitioning creates some overhead since Spark needs to create more files.
3. Yes. inferSchema forces Spark to read the file twice: once for the schema and once for the data.
upvoted 32 times
...
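The "read twice" point can be pictured with a toy reader. This is a minimal sketch in plain Python, not Spark's implementation: the sample data, `infer_type`, and `read_csv` are all hypothetical names made up for illustration. It only shows why inferring column types from a row-oriented text file needs a pass over the values before the data pass.

```python
import csv
import io

# Hypothetical sample data; not the contents of Sales_raw.csv.
SAMPLE = "SalesOrderNumber,OrderDate,UnitPrice\nSO1,2021-01-05,9.99\nSO2,2022-03-17,25\n"

def infer_type(values):
    """Return the narrowest type name that fits every value in a column."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"

def read_csv(text, infer_schema=False):
    """Return (schema, rows, number_of_passes_over_the_data)."""
    passes = 0
    header = next(csv.reader(io.StringIO(text)))
    # Without inference, every column is treated as a string.
    schema = {name: "string" for name in header}
    if infer_schema:
        # Extra pass: the values must be inspected before the types are known.
        passes += 1
        rows = list(csv.reader(io.StringIO(text)))[1:]
        columns = list(zip(*rows))
        schema = {name: infer_type(col) for name, col in zip(header, columns)}
    # Data pass: parse the rows that will actually be returned.
    passes += 1
    rows = list(csv.reader(io.StringIO(text)))[1:]
    return schema, rows, passes
```

Calling `read_csv(SAMPLE)` makes one pass and reports every column as a string, while `read_csv(SAMPLE, infer_schema=True)` makes two passes. Note also that every full line is parsed in both cases, even if only some columns are needed later, which is the row-format argument made for statement 1.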
282b85d
Highly Voted 6 months, 2 weeks ago
N-N-Y
• No: The Spark engine will initially read all columns from the CSV file because the .select() transformation is applied after the data has been read into memory. Therefore, all 12 columns from Sales_raw.csv are read before the selection of specific columns is applied.
• No: Removing the partition might not necessarily reduce the execution time. While there might be some overhead in writing data to partitions, the overall impact on read performance, especially for large datasets, is usually beneficial. The query execution time for saving might be higher due to partitioning, but the read performance improvement usually outweighs this cost.
• Yes: Adding inferSchema = 'true' will increase the execution time of the query because Spark will need to read through the entire dataset to determine the data types of each column. This extra pass over the data adds to the initial read time.
upvoted 16 times
Huepig
4 days, 21 hours ago
This is not correct. Partitioning is not recommended because of both read and write overheads. More often than not, partitioning results in the “many small files” problem and data skew. Also, partitioning will result in a reshuffle, which causes longer write durations. In this case, unless the CSV is so massive that each “year” partition is greater than 128 MB as Parquet, the partition will increase both read and write durations. To add to this
upvoted 1 times
...
lelima
6 months, 1 week ago
This is correct
upvoted 1 times
...
...
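The write-side overhead discussed in this thread can be pictured with a toy writer. This is a minimal sketch in plain Python, assuming made-up rows and a hypothetical `write_rows` helper; it is not how Spark or Delta write files (for example, Spark may write several files per partition value). It only shows that partitioning by a column turns one output file into one folder and file per distinct value.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical rows; the real file and its Year values are not shown in the question.
ROWS = [
    {"SalesOrderNumber": "SO1", "Year": "2021"},
    {"SalesOrderNumber": "SO2", "Year": "2021"},
    {"SalesOrderNumber": "SO3", "Year": "2022"},
    {"SalesOrderNumber": "SO4", "Year": "2023"},
]

def write_rows(rows, out_dir, partition_by=None):
    """Write rows as CSV, one file per partition value; return the file paths."""
    groups = defaultdict(list)
    for row in rows:
        key = row[partition_by] if partition_by else None
        groups[key].append(row)
    paths = []
    for key, group in groups.items():
        # Mimic a "column=value" folder layout when a partition column is given.
        folder = os.path.join(out_dir, f"{partition_by}={key}") if partition_by else out_dir
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, "part-0.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    flat = write_rows(ROWS, os.path.join(tmp, "flat"))
    by_year = write_rows(ROWS, os.path.join(tmp, "by_year"), partition_by="Year")
    print(len(flat), len(by_year))
```

With these four sample rows, the unpartitioned write produces a single file and the write partitioned by `Year` produces three. Whether that extra work is measurable for a real file depends on its size and on the number of distinct partition values.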
Rakesh16
Most Recent 4 weeks, 1 day ago
No, No, Yes
upvoted 1 times
...
Pegooli
4 months, 4 weeks ago
I'm going with Y-N-N
upvoted 1 times
...
calvintcy
6 months ago
I took the test today. This question was included, but the option 'Removing the partition will reduce the execution time of the query' has been replaced by 'Will the Year column replace the OrderDate column?'. My answer was No.
upvoted 9 times
...
mnc_1997
6 months, 2 weeks ago
Just tried it: it only writes the columns that were selected. Answer: YNY
upvoted 1 times
mnc_1997
6 months, 2 weeks ago
Also, Spark uses lazy evaluation: it does not read data into memory as each method (read, load, option) is called. The actual reading happens when an action such as display, show, or write is performed. Spark creates a plan for how to execute the entire query and optimizes this plan for efficient execution.
upvoted 2 times
...
...
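The lazy-evaluation idea described above can be pictured with Python generators. This is a toy model, not Spark's API: `read_rows`, `select`, and the sample rows are hypothetical names made up for illustration. Building the pipeline does no work; only consuming it triggers the reads.

```python
# Toy model of lazy evaluation using generators; not Spark's API.
log = []

def read_rows(rows):
    """Yield rows one at a time, recording each read in the log."""
    for row in rows:
        log.append(f"read {row['id']}")
        yield row

def select(rows, *columns):
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {c: row[c] for c in columns}

# Hypothetical sample rows.
source = [
    {"id": 1, "price": 9.99, "note": "a"},
    {"id": 2, "price": 25.0, "note": "b"},
]

# Building the pipeline (the "transformations") reads nothing yet.
pipeline = select(read_rows(source), "id", "price")
reads_before_action = len(log)

# Consuming the pipeline (the "action") is what triggers the reads.
result = list(pipeline)
reads_after_action = len(log)
```

Here `reads_before_action` is 0 and `reads_after_action` is 2. In a notebook, calling `explain()` on a DataFrame before the write prints the plan Spark intends to run, which is one way to check what it plans to read for a given source.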
stilferx
7 months, 1 week ago
IMHO: 1. N 2. N (arguable) 3. Y
1. No, because it is CSV. It will be read in full (in contrast to Parquet).
2. No. Well, maybe 0.5% slower due to creating new files, but in practice, no.
3. Yes, because inferring the schema is an additional process.
upvoted 6 times
...
vish9
7 months, 1 week ago
No: The CSV will be read in full and then filtered.
No: Using the partitionBy clause when writing in Spark's Delta format can affect write performance, in particular by increasing write throughput. Partitioning distributes the write workload across multiple partitions, and this parallelism allows Spark to write data to different partitions concurrently, improving overall write performance, especially when dealing with large datasets.
Yes: Inferring the schema will slow performance.
upvoted 5 times
...
dp600
7 months, 2 weeks ago
I would go with NYY. It's a CSV, which is a row format, so I don't think you can separate it by columns before reading the entire content. Partitioning takes extra work, so it may slow down the process. inferSchema requires an extra scan of the document, or so I think, so I will go with Yes.
upvoted 2 times
...
DilumD
7 months, 2 weeks ago
1. Yes. Reason: the code selects specific columns from the DataFrame using the select method. The selected columns are "SalesOrderNumber", "OrderDate", "CustomerName", and "UnitPrice".
2. Yes. Reason: removing the partitionBy will simplify the process. Partitioning data involves some overhead in organizing the data into separate folders/files based on the partitioning column.
3. No. Reason: potentially slower. Enabling inferSchema generally results in a slightly slower initial read operation, because Spark needs to do an additional scan of a portion of your data to analyze and determine data types before loading it.
upvoted 1 times
...
wellingtonluis
9 months, 2 weeks ago
After reading the whole file, the engine will select just some columns. But initially it reads the entire file.
upvoted 5 times
...
XiltroX
9 months, 2 weeks ago
The answer is probably YNY.
1. Those are exactly the columns that are being read. So Yes.
2. Removing the partitionBy line would not result in any performance changes. So No.
3. Adding inferSchema as True WILL result in extra execution time, as it makes the engine go over the data twice (once to read the data and once to read the schema). So Yes.
upvoted 11 times
...
estrelle2008
9 months, 3 weeks ago
Full of typos, this one. Anyhow, my guess: YNN. inferSchema=true helps automatically determine column data types, but it needs an extra pass over the data, which comes with a slight query performance cost. So last statement = No
upvoted 2 times
...
Momoanwar
10 months ago
It's "read", not "red". This question is ambiguous; I would say: No, No, Yes. For point 1: with case sensitivity, sales_raw is not Sales_raw.
upvoted 5 times
...