
Exam Professional Machine Learning Engineer topic 1 question 45 discussion

Actual exam question from Google's Professional Machine Learning Engineer
Question #: 45
Topic #: 1

You are training a TensorFlow model on a structured dataset with 100 billion records stored in several CSV files. You need to improve the input/output execution performance. What should you do?

  • A. Load the data into BigQuery, and read the data from BigQuery.
  • B. Load the data into Cloud Bigtable, and read the data from Bigtable.
  • C. Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage.
  • D. Convert the CSV files into shards of TFRecords, and store the data in the Hadoop Distributed File System (HDFS).
Suggested Answer: C

Comments

ralf_cc
Highly Voted 2 years, 11 months ago
C - not enough info in the question, but C is the "most correct" one
upvoted 26 times
...
theseawillclaim
Most Recent 2 days, 23 hours ago
Selected Answer: C
It's C. Bigtable would usually help with heavy I/O ops, but it is not suited for (semi-)structured data by design.
upvoted 1 times
...
PhilipKoku
2 weeks, 1 day ago
Selected Answer: C
C) The most suitable option for improving input/output execution performance in this scenario is C. Convert the CSV files into shards of TFRecords and store the data in Cloud Storage. This approach leverages the efficiency of TFRecords and the scalability of Cloud Storage, aligning with TensorFlow best practices.
upvoted 2 times
...
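To make the suggested approach concrete, here is a minimal sketch of converting a CSV file into sharded TFRecord files. The function name, shard naming pattern, and the choice to store every column as a bytes feature are assumptions for illustration; a real pipeline would use typed features per column and a distributed tool (e.g. Dataflow) at the 100-billion-record scale.

```python
# Hypothetical sketch: round-robin CSV rows into N TFRecord shards.
import csv

import tensorflow as tf


def csv_to_tfrecord_shards(csv_path, out_prefix, num_shards=10):
    """Distribute CSV rows across `num_shards` TFRecord files."""
    writers = [
        tf.io.TFRecordWriter(f"{out_prefix}-{i:05d}-of-{num_shards:05d}.tfrecord")
        for i in range(num_shards)
    ]
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for n, row in enumerate(reader):
            # For simplicity every column is stored as a UTF-8 bytes feature;
            # real pipelines would use float_list / int64_list where appropriate.
            features = {
                k: tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[v.encode("utf-8")])
                )
                for k, v in row.items()
            }
            example = tf.train.Example(features=tf.train.Features(feature=features))
            writers[n % num_shards].write(example.SerializeToString())
    for w in writers:
        w.close()
```

The resulting shard files can then be uploaded to a Cloud Storage bucket, where `tf.data` can read them in parallel during training.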
fragkris
6 months, 2 weeks ago
Selected Answer: C
C is the Google-recommended approach.
upvoted 1 times
...
Sum_Sum
7 months, 1 week ago
C is the correct one as BQ will not help you with performance
upvoted 1 times
...
peetTech
8 months, 3 weeks ago
Selected Answer: C
C https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards#:~:text=Splitting%20TFRecord%20files%20into%20shards,them%20through%20a%20training%20process.
upvoted 2 times
...
ftl
9 months, 1 week ago
bard: The correct answer is: C. Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage. TFRecords is a TensorFlow-specific binary format that is optimized for performance. Converting the CSV files into TFRecords will improve the input/output execution performance. Sharding the TFRecords will allow the data to be read in parallel, which will further improve performance. The other options are not as likely to improve performance. Loading the data into BigQuery or Cloud Bigtable will add an additional layer of abstraction, which can slow down performance. Storing the TFRecords in HDFS is not likely to improve performance, as HDFS is not optimized for TensorFlow.
upvoted 1 times
...
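The parallel-read benefit of sharding described above can be sketched with a `tf.data` input pipeline. The file pattern and batch size below are assumptions; the key point is that `interleave` pulls records from several shard files concurrently instead of reading one file end to end.

```python
# Hypothetical sketch: read many TFRecord shards in parallel with tf.data.
import tensorflow as tf


def make_dataset(file_pattern, batch_size=1024):
    """Build a dataset that interleaves reads across TFRecord shards."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    # Open several shard files at once; AUTOTUNE lets TF pick the parallelism.
    ds = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.AUTOTUNE,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

With shards stored in Cloud Storage, the same pattern works with a `gs://bucket/prefix-*` file pattern.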
tavva_prudhvi
10 months, 2 weeks ago
Using BigQuery or Bigtable may not be the most efficient option for input/output operations with TensorFlow. Storing the data in HDFS may be an option, but Cloud Storage is generally a more scalable and cost-effective solution.
upvoted 1 times
...
PST21
1 year ago
While Bigtable can offer high-performance I/O capabilities, it is important to note that it is primarily designed for structured data storage and real-time access patterns. In this scenario, the focus is on optimizing input/output execution performance, and using TFRecords in Cloud Storage aligns well with that goal.
upvoted 1 times
...
Voyager2
1 year ago
Selected Answer: A
A. Load the data into BigQuery, and read the data from BigQuery. https://cloud.google.com/blog/products/ai-machine-learning/tensorflow-enterprise-makes-accessing-data-on-google-cloud-faster-and-easier According to this link, provided in other comments, the best throughput with TFRecords is 18,752 records per second, while the same report shows BigQuery reading more than 40,000 records per second.
upvoted 2 times
tavva_prudhvi
11 months ago
BigQuery is designed for running large-scale analytical queries, not for serving input pipelines for machine learning models like TensorFlow. BigQuery's strength is in its ability to handle complex queries over vast amounts of data, but it may not provide the optimal performance for the specific task of feeding data into a TensorFlow model. On the other hand, converting the CSV files into shards of TFRecords and storing them in Cloud Storage (Option C) will provide better performance because TFRecords is a format designed specifically for TensorFlow. It allows for efficient storage and retrieval of data, making it a more suitable choice for improving the input/output execution performance. Additionally, Cloud Storage provides high throughput and low-latency data access, which is beneficial for training large-scale TensorFlow models.
upvoted 3 times
...
...
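On the "efficient storage and retrieval" point above: serialized `tf.train.Example` records are parsed back into typed tensors with a feature spec. The feature names and dtypes below are illustrative assumptions, not part of the question.

```python
# Hypothetical sketch: parse serialized tf.train.Example records into tensors.
import tensorflow as tf

# Assumed schema for a structured record; real columns would differ.
FEATURE_SPEC = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "amount": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse_batch(serialized):
    # Batched parsing is faster than per-record tf.io.parse_single_example.
    return tf.io.parse_example(serialized, FEATURE_SPEC)
```

In a training pipeline this would typically be applied with `dataset.map(parse_batch)` after batching.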
M25
1 year, 1 month ago
Selected Answer: C
Went with C
upvoted 2 times
...
shankalman717
1 year, 4 months ago
Selected Answer: C
Cloud Bigtable is typically used to process unstructured data, such as time-series data, logs, or other types of data that do not conform to a fixed schema. However, Cloud Bigtable can also be used to store structured data if necessary, such as in the case of a key-value store or a database that does not require complex relational queries.
upvoted 1 times
...
shankalman717
1 year, 4 months ago
Selected Answer: C
Option C, converting the CSV files into shards of TFRecords and storing the data in Cloud Storage, is the most appropriate solution for improving input/output execution performance in this scenario
upvoted 1 times
...
behzadsw
1 year, 5 months ago
Selected Answer: A
https://cloud.google.com/architecture/ml-on-gcp-best-practices#store-tabular-data-in-bigquery: BigQuery for structured data, Cloud Storage for unstructured data.
upvoted 4 times
ShePiDai
1 year, 1 month ago
Agree. BigQuery and Cloud Storage have effectively identical storage performance; BigQuery is optimised for structured datasets and GCS for unstructured ones.
upvoted 1 times
...
...
Mohamed_Mossad
2 years ago
Selected Answer: D
"100 billion records stored in several CSV files" means we are dealing with a distributed big-data problem, so HDFS is very suitable. Will choose D.
upvoted 1 times
hoai_nam_1512
1 year, 10 months ago
HDFS would require more resources; 100 billion records are handled fine with Cloud Storage objects.
upvoted 2 times
...
...
David_ml
2 years, 1 month ago
Answer is C. TFRecords in cloud storage for big data is the recommended practice by Google for training TF models.
upvoted 4 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other
