Exam Professional Data Engineer topic 1 question 87 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 87
Topic #: 1

You've migrated a Hadoop job from an on-premises cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (200-400 MB each on average). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.
What should you do?

  • A. Increase the size of your Parquet files to ensure they are at least 1 GB.
  • B. Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.
  • C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
  • D. Switch from HDDs to SSDs, and override the preemptible VMs' configuration to increase the boot disk size.
Suggested Answer: D 🗳️

Comments

rickywck
Highly Voted 4 years, 8 months ago
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files https://www.dremio.com/tuning-parquet/ C & D will improve performance but need to pay more $$
upvoted 69 times
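To make option A concrete, here is a minimal PySpark sketch of the compaction rickywck describes: read the existing 200-400 MB Parquet files and rewrite them as fewer, roughly 1 GB files. The bucket paths and the output-file count are placeholder assumptions, not values from the question.

```python
# Minimal sketch of option A: compact many ~200-400 MB Parquet files into
# files closer to 1 GB. Paths and num_output_files are placeholders --
# pick num_output_files roughly equal to total_dataset_bytes / 1 GB.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

source = "gs://your-bucket/raw-parquet/"        # hypothetical input prefix
target = "gs://your-bucket/compacted-parquet/"  # hypothetical output prefix
num_output_files = 100                          # e.g. ~100 GB of data / ~1 GB per file

df = spark.read.parquet(source)

# coalesce() merges partitions without a full shuffle; use repartition()
# instead if the data is skewed and evenly sized output files matter more.
df.coalesce(num_output_files).write.mode("overwrite").parquet(target)
```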
diluvio
3 years, 1 month ago
It is A. Please read the links above.
upvoted 5 times
...
odacir
1 year, 11 months ago
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 1 times
...
raf2121
3 years, 3 months ago
Point for discussion - another reason why it can't be C or D: SSDs are not available on preemptible worker nodes (the answers didn't say whether they wanted to switch from HDD to SSD for the master nodes). https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs
upvoted 8 times
rr4444
2 years, 4 months ago
You can have local SSDs on normal or preemptible Dataproc VMs: https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-pd-ssd
upvoted 1 times
...
raf2121
3 years, 3 months ago
Also, for shuffling operations, one needs to override the preemptible VMs' configuration to increase the boot disk size. (The second half of answer D is correct, but the first half is wrong.)
upvoted 1 times
...
...
zellck
1 year, 11 months ago
https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files Store data in larger file sizes, for example, file sizes in the 256MB–512MB range.
upvoted 3 times
...
...
madhu1171
Highly Voted 4 years, 8 months ago
Answer should be D
upvoted 12 times
jvg637
4 years, 8 months ago
D: By default, preemptible node disk sizes are limited to 100 GB or the size of the non-preemptible node disks, whichever is smaller. However, you can override the default preemptible disk size to any requested size. Since the majority of our cluster is using preemptible nodes, the disks used for caching operations will see a noticeable performance improvement from a larger size. Also, SSDs will perform better than HDDs. This will increase costs slightly, but it is the best option available while keeping costs down.
upvoted 15 times
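As a rough illustration of what jvg637 describes (SSD boot disks plus a larger-than-default boot disk for the preemptible secondary workers), here is a hedged sketch using the google-cloud-dataproc Python client. The project, region, cluster name, machine types, worker counts and disk sizes are all placeholder assumptions; verify the field names against the current client library before relying on this.

```python
# Hedged sketch: a Dataproc cluster with 2 non-preemptible workers and
# preemptible secondary workers on larger pd-ssd boot disks (option D).
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"   # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "shuffle-heavy-analytics",      # placeholder name
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        "worker_config": {  # the 2 non-preemptible workers from the question
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        "secondary_worker_config": {  # secondary workers are preemptible by default
            "num_instances": 8,
            # Override the small default boot disk so shuffle spills have room.
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready
```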
ch3n6
4 years, 5 months ago
C is correct, D is wrong. They are using 'Dataproc and GCS', so it is not related to the boot disk at all.
upvoted 2 times
VishalB
4 years, 4 months ago
Only C is recommended - 'If you have many small files, consider copying files for processing to the local HDFS and then copying the results back.'
upvoted 1 times
FARR
4 years, 3 months ago
File sizes are already within the expected range for GCS (128 MB-1 GB), so not C. D seems most feasible.
upvoted 3 times
...
...
...
...
...
Javakidson
Most Recent 2 weeks, 6 days ago
A is the answer
upvoted 1 times
...
SamuelTsch
1 month ago
Selected Answer: A
I think it is either A or C. The problem is caused by I/O performance. Option A is feasible: it reduces the number of files, leading to better parallel processing. Option C also tries to handle the I/O performance issue. Taking other factors into account, like the budget and the fact that HDD/SSD is never mentioned, option A is probably the correct answer.
upvoted 1 times
...
baimus
2 months ago
Selected Answer: A
There's no mention of a drive type used, only GCS. That means A is the only sensible option.
upvoted 1 times
...
987af6b
4 months ago
Selected Answer: A
The question doesn't actually say they are using HDDs in the scenario; for that reason I choose A.
upvoted 2 times
...
philli1011
9 months, 2 weeks ago
A. We don't know whether HDDs were used, so we can't know what to do about that, but we do know that the Parquet files are small and numerous, and we can act on that by increasing their size so there are fewer of them.
upvoted 2 times
...
rocky48
11 months, 3 weeks ago
Selected Answer: A
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files
upvoted 1 times
rocky48
11 months, 3 weeks ago
Given the scenario and the cost-sensitive nature of your organization, the best option would be: C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job, and copy results back to GCS. Option C allows you to leverage the benefits of SSDs and HDFS while minimizing costs by continuing to use Dataproc on preemptible VMs. This approach optimizes both performance and cost-effectiveness for your analytical workload on Google Cloud.
upvoted 1 times
...
...
Mathew106
1 year, 4 months ago
Selected Answer: A
https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files Cost-effectiveness is the key in the question.
upvoted 1 times
...
Nandhu95
1 year, 8 months ago
Selected Answer: D
Preemptible VMs can't be used for HDFS storage. As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads.
upvoted 1 times
...
midgoo
1 year, 8 months ago
Selected Answer: D
Should NOT be A because: 1. The file size is already in the optimal range. 2. If the current file size works well on the existing Hadoop cluster, it should give similar performance on Dataproc. The only difference between the current setup and Dataproc is that Dataproc is using preemptible nodes. So yes, using SSDs may incur a bit more cost, but since using preemptibles already saves most of it, we can give up a little of that saving to improve performance.
upvoted 1 times
Mathew106
1 year, 4 months ago
Optimal size is 1GB
upvoted 1 times
...
...
[Removed]
1 year, 9 months ago
Selected Answer: A
Cost sensitive is the keyword.
upvoted 1 times
...
musumusu
1 year, 9 months ago
This question is asked by Google, so option C is not correct; otherwise, it would be a good approach to keep the initial data in HDFS and switch from HDDs to SSDs for the 2 non-preemptible nodes. Option D is right, but they don't mention that they will stop using the 2 non-preemptible nodes - I just assume it :P
upvoted 2 times
...
PolyMoe
1 year, 10 months ago
Selected Answer: C
C. Ref: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance - The recommended size is 128 MB-1 GB, so it is not a size issue => not A. - There is no issue mentioned with the file format => not B. - D could be a good solution, but it requires overriding the preemptible VMs' configuration; however, the question asks to continue using preemptibles => not D. - C is a good solution.
upvoted 3 times
ayush_1995
1 year, 9 months ago
Agreed, C over D. Switching from HDDs to SSDs and overriding the preemptible VMs' configuration to increase the boot disk size may not be the best solution for improving performance in this scenario, because it doesn't address the main issue: the large number of shuffling operations causing the performance degradation. While SSDs have faster read and write speeds than HDDs, they may not provide a significant improvement for a workload that is primarily CPU-bound and heavily reliant on shuffling. Additionally, increasing the boot disk size of the preemptible VMs may not be necessary or cost-effective for this particular workload.
upvoted 1 times
...
...
slade_wilson
1 year, 11 months ago
Selected Answer: D
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance - 'Manage Cloud Storage file sizes: To get optimal performance, split your data in Cloud Storage into files with sizes from 128 MB to 1 GB. Using lots of small files can create a bottleneck. If you have many small files, consider copying files for processing to the local HDFS and then copying the results back.' - 'Switch to SSD disks: If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance.'
upvoted 2 times
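For completeness, here is a minimal PySpark sketch of the 'copy to local HDFS' pattern quoted above (the option C approach): stage the input from Cloud Storage onto the cluster's HDFS, run the shuffle-heavy job against the HDFS copy, and write the results back to GCS. The paths and the aggregation are placeholder assumptions.

```python
# Hedged sketch of the GCS -> HDFS -> GCS staging pattern (option C).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gcs-hdfs-staging").getOrCreate()

gcs_input = "gs://your-bucket/raw-parquet/"     # hypothetical paths
hdfs_staging = "hdfs:///staging/raw-parquet/"
gcs_output = "gs://your-bucket/results/"

# 1. Stage the input on the cluster's HDFS (Dataproc reads gs:// via the GCS connector).
spark.read.parquet(gcs_input).write.mode("overwrite").parquet(hdfs_staging)

# 2. Run the shuffle-heavy analytics against the HDFS copy.
df = spark.read.parquet(hdfs_staging)
result = df.groupBy("some_key").agg(F.count("*").alias("rows"))  # stand-in aggregation

# 3. Write the (much smaller) results straight back to Cloud Storage.
result.write.mode("overwrite").parquet(gcs_output)
```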
...
odacir
1 year, 11 months ago
Selected Answer: D
It's D, 100%. It's the recommended best practice for this scenario. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 3 times
...
zellck
1 year, 11 months ago
Selected Answer: D
D is the answer. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#switch_to_ssd_disks 'If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance.' https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#use_preemptible_vms 'As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads. For details, see the page on preemptible VMs in the Dataproc documentation.'
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other