Exam Professional Data Engineer topic 1 question 225 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 225
Topic #: 1

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

  • A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • B. Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • C. Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
  • D. Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.
Suggested Answer: A

Comments

hrishi19
5 days, 9 hours ago
Selected Answer: C
The question states that the data should be available in BigQuery, and only option C meets this requirement.
upvoted 1 times
...
JamesKarianis
3 months, 1 week ago
Selected Answer: A
A is correct
upvoted 1 times
...
Anudeep58
5 months, 2 weeks ago
Selected Answer: A
Option B: Registering the bucket as a Dataplex asset adds an additional layer of data governance and management. While useful, it may not be necessary for your immediate migration needs and can introduce additional complexity.
Option C: Migrating data directly to BigQuery would require significant changes to your Spark pipelines, since they would need to be refactored to read from and write to BigQuery instead of Parquet files. This approach could introduce higher costs due to BigQuery storage and querying.
Option D: Using BigLake and Dataproc on Compute Engine is more complex and requires more management compared to Dataproc Serverless. Additionally, it might not be as cost-effective as leveraging Cloud Storage and Dataproc Serverless.
upvoted 3 times
aoifneofi_ef
3 months ago
Just adding further commentary on why A is correct; why the other options are incorrect is explained above. Parquet files have the schema embedded in them, so the Spark pipelines on the Hadoop cluster may not have needed tables at all. The simplest solution is therefore to move the data to Cloud Storage instead of BigQuery; that way there are minimal changes to the ETL pipelines: just change the HDFS file system pointer to the GCS file system for reads and writes, with no need for any additional tables (see the sketch after this thread).
upvoted 2 times
...
...
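A minimal PySpark sketch of that path change, assuming a hypothetical bucket and placeholder job logic (Dataproc ships with the Cloud Storage connector, so gs:// paths work without extra setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-customer-job").getOrCreate()

# Before the migration the job read from HDFS, e.g.:
# df = spark.read.parquet("hdfs:///data/customers/")

# After the migration only the file system prefix changes:
df = spark.read.parquet("gs://my-bucket/data/customers/")  # hypothetical bucket

result = df.groupBy("country").count()  # placeholder for the existing transformations

result.write.mode("overwrite").parquet("gs://my-bucket/output/customers_daily/")
```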
josech
6 months, 1 week ago
Selected Answer: A
The question says "You want to use managed services, while minimizing ETL data processing changes and overhead costs". Dataproc is a managed service that doesn't require refactoring the data transformation Spark code you already have (you only have to refactor the write and read code), and it has a BigQuery connector for future use (see the serverless batch sketch after this thread). https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
upvoted 1 times
...
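A minimal sketch of running such a job as a Dataproc Serverless batch with the google-cloud-dataproc Python client; the project, region, and file URI are hypothetical:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Submit the existing PySpark job as a serverless batch; no cluster to manage.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/daily_customer_job.py"  # hypothetical
    )
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",  # hypothetical project
    batch=batch,
)
print(operation.result().state)  # blocks until the batch finishes
```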
52ed0e5
8 months, 2 weeks ago
Selected Answer: C
Migrate your data directly to BigQuery, refactor Spark pipelines to read from and write to BigQuery, and run the Spark jobs on Dataproc Serverless. This is the best choice for ensuring data availability in BigQuery: it allows seamless integration with BigQuery and minimizes ETL changes.
upvoted 3 times
...
Ramon98
8 months, 4 weeks ago
Selected Answer: C
A tricky one, because of "you need to ensure that your data is available in BigQuery". The easiest and most straightforward migration seems to be answer A to me, and then you can use external tables to make the Parquet data directly available in BigQuery (see the sketch after this thread). https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet However, creating the external tables is an extra step, so maybe C is the answer?
upvoted 3 times
...
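A minimal sketch of that extra step with the google-cloud-bigquery client, defining an external table over the Parquet files; the project, dataset, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table definition that points BigQuery at the Parquet files in
# Cloud Storage, so they can be queried without loading the data.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/data/customers/*.parquet"]  # hypothetical

table = bigquery.Table("my-project.my_dataset.customers_ext")  # hypothetical table id
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```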
Moss2011
9 months ago
Selected Answer: C
I think the key phrase here is "you need to ensure that your data is available in BigQuery"; that's why I think C is the best option.
upvoted 1 times
...
JyoGCP
9 months, 1 week ago
Selected Answer: C
I think it's C. Dataproc can use BigQuery to read and write data. Dataproc's BigQuery connector is a library that allows Spark and Hadoop applications to process and write data from BigQuery. Here's how Dataproc can be used with BigQuery (a connector sketch follows this thread):
Process large datasets: use Spark to process data stored in BigQuery.
Write results: write the results back to BigQuery or other data storage for further analysis.
Read data: the BigQuery connector can read data from BigQuery into a Spark DataFrame.
Write data: the connector writes data to BigQuery by buffering all the data into a Cloud Storage temporary table.
upvoted 3 times
JyoGCP
9 months, 1 week ago
As per the question: "BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services (DATAPROC), while minimizing ETL data processing changes and overhead costs."
upvoted 3 times
...
...
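A minimal PySpark sketch of that read/write path with the spark-bigquery connector (available on Dataproc Serverless runtimes); the dataset, table, and bucket names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-transform").getOrCreate()

# Read a BigQuery table into a Spark DataFrame through the connector.
df = spark.read.format("bigquery").load("my-project.my_dataset.customers")  # hypothetical

result = df.groupBy("country").count()  # placeholder for the existing transformations

# The indirect write method buffers the data in a temporary Cloud Storage
# bucket before loading it into BigQuery.
(result.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # hypothetical bucket
    .mode("overwrite")
    .save("my-project.my_dataset.customers_daily"))
```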
matiijax
9 months, 1 week ago
Selected Answer: B
I think it's B, and the reason is that registering the data as a Dataplex asset enables seamless integration with BigQuery later on. Dataplex simplifies data discovery and lineage tracking, making it easier to prepare your data for BigQuery transformations.
upvoted 3 times
...
saschak94
9 months, 2 weeks ago
Why would I select A here? Why not move the data to BigQuery and run Dataproc Serverless jobs that access the data in BigQuery?
upvoted 3 times
...
raaad
10 months, 3 weeks ago
Selected Answer: A
- This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs.
- Using Dataproc Metastore to manage metadata allows us to keep the Hadoop ecosystem's structural information.
- Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters.
- Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis (see the load-job sketch after this thread).
upvoted 4 times
...
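A minimal sketch of that later load step with the google-cloud-bigquery client; the bucket and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load the Parquet files already sitting in Cloud Storage into a native BigQuery table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data/customers/*.parquet",  # hypothetical bucket
    "my-project.my_dataset.customers",          # hypothetical table id
    job_config=job_config,
)
load_job.result()  # waits for the load job to complete
```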
e70ea9e
10 months, 3 weeks ago
Selected Answer: A
Managed services: leverages Dataproc Serverless for a fully managed Spark environment, reducing overhead and administrative tasks.
Minimal data processing changes: keeps Spark pipelines largely intact by working with Parquet files on Cloud Storage, minimizing refactoring efforts.
BigQuery integration: Dataproc Serverless can directly access BigQuery, enabling future transformation pipelines without additional data movement.
Cost-effective: the serverless model scales resources only when needed, optimizing costs for intermittent workloads.
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other.