Exam Professional Machine Learning Engineer All Questions

View all questions & answers for the Professional Machine Learning Engineer exam

Exam Professional Machine Learning Engineer topic 1 question 82 discussion

Actual exam question from Google's Professional Machine Learning Engineer

Question #: 82
Topic #: 1

[All Professional Machine Learning Engineer Questions]

You are profiling the performance of your TensorFlow model training time and notice a performance issue caused by inefficiencies in the input data pipeline for a single 5 terabyte CSV file dataset on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?

A. Preprocess the input CSV file into a TFRecord file.
B. Randomly select a 10 gigabyte subset of the data to train your model.
C. Split into multiple CSV files and use a parallel interleave transformation.
D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.

Show Suggested Answer

Suggested Answer: C 🗳️

by LearnSodas at Dec. 11, 2022, 6:03 p.m.

Comments

Submit Cancel

phani49

2 days, 14 hours ago

Selected Answer: A

Based on the official documentation, Option A (converting to TFRecord format) is actually the correct first action to try, and the claim is incorrect. Why TFRecord is the Best First Option TFRecord format is specifically recommended for large datasets because: - It provides extremely high throughput when reading from Cloud Storage, especially for large-scale training[2] - It's the recommended format for structured data and large files[2] - It's designed for efficient serialization of structured data and optimal performance with TensorFlow

upvoted 1 times

...

AB_C

3 weeks, 4 days ago

Selected Answer: C

c is the right answer

upvoted 1 times

...

Prakzz

5 months, 3 weeks ago

Selected Answer: A

Preprocessing the input CSV file into a TFRecord file optimizes the input data pipeline by enabling more efficient reading and processing. TFRecord is a binary format that is faster to read and more efficient for TensorFlow to process compared to CSV, which is a text-based format. This change can significantly reduce the time spent on data input operations during model training.

upvoted 3 times

...

PhilipKoku

6 months, 2 weeks ago

Selected Answer: A

A) Convert CSV file into TFRecord is more effecient and processing CSV in parallel (C)

upvoted 1 times

...

pinimichele01

8 months ago

Selected Answer: C

Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.

upvoted 4 times

...

tavva_prudhvi

1 year, 1 month ago

Selected Answer: C

While preprocessing the input CSV file into a TFRecord file (Option A) can improve the performance of your input pipeline, it is not the first action to try in this situation. Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.

upvoted 1 times

...

andresvelasco

1 year, 3 months ago

Selected Answer: C

i think C based on the consideration: "Which action should you try first ", meaning it should be less impactful to continue using CSV.

upvoted 1 times

...

TNT87

1 year, 6 months ago

Selected Answer: C

https://www.tensorflow.org/guide/data_performance#best_practice_summary

upvoted 2 times

...

M25

1 year, 7 months ago

Selected Answer: C

Went with C

upvoted 1 times

...

e707

1 year, 8 months ago

Selected Answer: C

Option A, preprocess the input CSV file into a TFRecord file, is not as good because it requires additional processing time. Hence, I think C is the best choice.

upvoted 1 times

...

frangm23

1 year, 8 months ago

Selected Answer: A

I think it could be A. https://cloud.google.com/architecture/best-practices-for-ml-performance-cost#preprocess_the_data_once_and_save_it_as_a_tfrecord_file

upvoted 1 times

...

[Removed]

1 year, 8 months ago

Selected Answer: A

Clearly both A and C works here, but I can't find any documentation which suggests C is any better than A.

upvoted 1 times

...

Yajnas_arpohc

1 year, 9 months ago

"Which action should you try first" seems to be key -- C seems more intuitive as first step! A is valid as well (interleave works w TFRecords) & definitely more efficient IMO, but maybe 2nd step!

upvoted 2 times

...

shankalman717

1 year, 10 months ago

Selected Answer: A

Option B (randomly selecting a 10 gigabyte subset of the data) could lead to a loss of useful data and may not be representative of the entire dataset. Option C (splitting into multiple CSV files and using a parallel interleave transformation) may also improve the performance, but may be more complex to implement and maintain, and may not be as efficient as converting to TFRecord. Option D (setting the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method) is not directly related to the input data format and may not provide as significant a performance improvement as converting to TFRecord.

upvoted 3 times

tavva_prudhvi

1 year, 9 months ago

Please read this site https://www.tensorflow.org/tutorials/load_data/csv, its simple to implement in the same input pipeline, and we cannot judge the answer by implementation difficulties!

upvoted 1 times

...

SMASL

1 year, 10 months ago

Could anyone be kind to explain why C is preferred over A? My initial guess was on A, but everyone here seems to unanimously prefer C. Is it because it is not about optimizing I/O performance, but rather the input _pipeline_, which is about processing arrived data within that TF input pipeline (non-I/O)? I just try to understand here. Thanks for reply in advance!

upvoted 4 times

tavva_prudhvi

1 year, 9 months ago

Option C, splitting into multiple CSV files and using a parallel interleave transformation, could improve the pipeline efficiency by allowing multiple workers to read the data in parallel.

upvoted 1 times

[Removed]

1 year, 8 months ago

yes but how is it more efficient than converting to a TFRecord file?

upvoted 1 times

tavva_prudhvi

1 year, 5 months ago

A TFRecord file is a binary file format that is used to store TensorFlow data. It is more efficient than a CSV file because it can be read more quickly and it takes up less space. However, it is still a large file, and it would take a long time to read it into memory. Splitting the file into multiple smaller files would reduce the amount of time it takes to read the files into memory, and it would also make it easier to parallelize the reading process.

upvoted 1 times

...

enghabeth

1 year, 10 months ago

Selected Answer: C

split data it's best way in my opinion

upvoted 1 times

...

hiromi

2 years ago

Selected Answer: C

C Keywords -> You need to optimize the input pipeline performance https://www.tensorflow.org/guide/data_performance

upvoted 2 times

hiromi

1 year, 12 months ago

- https://www.tensorflow.org/tutorials/load_data/csv

upvoted 1 times

...

Load full discussion...

Exam Professional Machine Learning Engineer All Questions

View all questions & answers for the Professional Machine Learning Engineer exam

Exam Professional Machine Learning Engineer topic 1 question 82 discussion

Comments

phani49

AB_C

Prakzz

PhilipKoku

pinimichele01

tavva_prudhvi

andresvelasco

TNT87

M25

e707

frangm23

[Removed]

Yajnas_arpohc

shankalman717

tavva_prudhvi

SMASL

tavva_prudhvi

[Removed]

tavva_prudhvi

enghabeth

hiromi

hiromi

SY0-701