Exam Professional Data Engineer topic 1 question 225 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 225
Topic #: 1

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

  • A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • B. Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • C. Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
  • D. Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.
Suggested Answer: A

Comments

hrishi19
5 days, 9 hours ago
Selected Answer: C
The question states that the data should be available in BigQuery, and only option C meets this requirement.
upvoted 1 times
...
JamesKarianis
3 months, 1 week ago
Selected Answer: A
A is correct
upvoted 1 times
...
Anudeep58
5 months, 2 weeks ago
Selected Answer: A
Option B: Registering the bucket as a Dataplex asset adds an additional layer of data governance and management. While useful, it may not be necessary for your immediate migration needs and can introduce additional complexity.
Option C: Migrating data directly to BigQuery would require significant changes to your Spark pipelines, since they would need to be refactored to read from and write to BigQuery instead of Parquet files. This approach could introduce higher costs due to BigQuery storage and querying.
Option D: Using BigLake and Dataproc on Compute Engine is more complex and requires more management compared to Dataproc Serverless. Additionally, it might not be as cost-effective as leveraging Cloud Storage and Dataproc Serverless.
upvoted 3 times
aoifneofi_ef
3 months ago
Just adding further commentary on why A is correct; why the other options are incorrect is explained above. Parquet files have the schema embedded in them, so the Spark pipelines on the Hadoop cluster may not have needed tables at all. The simplest solution is therefore to move the data to Cloud Storage instead of BigQuery; that way there are minimal changes to the ETL pipelines: just change the HDFS file system pointer to the GCS file system for reads and writes, with no need for any additional tables (see the sketch after this thread).
upvoted 2 times
...
...
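A minimal PySpark sketch of that path change, assuming a hypothetical bucket and placeholder job logic (Dataproc ships with the Cloud Storage connector, so gs:// paths work without extra setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-customer-job").getOrCreate()

# Before the migration the job read from HDFS, e.g.:
# df = spark.read.parquet("hdfs:///data/customers/")

# After the migration only the file system prefix changes:
df = spark.read.parquet("gs://my-bucket/data/customers/")  # hypothetical bucket

result = df.groupBy("country").count()  # placeholder for the existing transformations

result.write.mode("overwrite").parquet("gs://my-bucket/output/customers_daily/")
```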
josech
6 months, 1 week ago
Selected Answer: A
The question says "You want to use managed services, while minimizing ETL data processing changes and overhead costs". Dataproc is a managed service that doesn't require refactoring the data transformation Spark code you already have (you only have to refactor the write and read code), and it has a BigQuery connector for future use (see the serverless batch sketch after this thread). https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
upvoted 1 times
...
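A minimal sketch of running such a job as a Dataproc Serverless batch with the google-cloud-dataproc Python client; the project, region, and file URI are hypothetical:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Submit the existing PySpark job as a serverless batch; no cluster to manage.
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/daily_customer_job.py"  # hypothetical
    )
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",  # hypothetical project
    batch=batch,
)
print(operation.result().state)  # blocks until the batch finishes
```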
52ed0e5
8 months, 2 weeks ago
Selected Answer: C
Migrate your data directly to BigQuery, refactor Spark pipelines to read from and write to BigQuery, and run the Spark jobs on Dataproc Serverless. This is the best choice for ensuring data availability in BigQuery: it allows seamless integration with BigQuery and minimizes ETL changes.
upvoted 3 times
...
Ramon98
8 months, 4 weeks ago
Selected Answer: C
A tricky one, because of "you need to ensure that your data is available in BigQuery". The easiest and most straightforward migration seems to be answer A to me, and then you can use external tables to make the Parquet data directly available in BigQuery (see the sketch after this thread). https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet However, creating the external tables is an extra step, so maybe C is the answer?
upvoted 3 times
...
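A minimal sketch of that extra step with the google-cloud-bigquery client, defining an external table over the Parquet files; the project, dataset, and bucket names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table definition that points BigQuery at the Parquet files in
# Cloud Storage, so they can be queried without loading the data.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/data/customers/*.parquet"]  # hypothetical

table = bigquery.Table("my-project.my_dataset.customers_ext")  # hypothetical table id
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```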
Moss2011
9 months ago
Selected Answer: C
I think the key phrase here is "you need to ensure that your data is available in BigQuery"; that's why I think C is the best option.
upvoted 1 times
...
JyoGCP
9 months, 1 week ago
Selected Answer: C
I think it's C. Dataproc can use BigQuery to read and write data. Dataproc's BigQuery connector is a library that allows Spark and Hadoop applications to process and write data from BigQuery. Here's how Dataproc can be used with BigQuery (a connector sketch follows this thread):
Process large datasets: use Spark to process data stored in BigQuery.
Write results: write the results back to BigQuery or other data storage for further analysis.
Read data: the BigQuery connector can read data from BigQuery into a Spark DataFrame.
Write data: the connector writes data to BigQuery by buffering all the data into a Cloud Storage temporary table.
upvoted 3 times
JyoGCP
9 months, 1 week ago
As per the question: "BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services (DATAPROC), while minimizing ETL data processing changes and overhead costs."
upvoted 3 times
...
...
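A minimal PySpark sketch of that read/write path with the spark-bigquery connector (available on Dataproc Serverless runtimes); the dataset, table, and bucket names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-transform").getOrCreate()

# Read a BigQuery table into a Spark DataFrame through the connector.
df = spark.read.format("bigquery").load("my-project.my_dataset.customers")  # hypothetical

result = df.groupBy("country").count()  # placeholder for the existing transformations

# The indirect write method buffers the data in a temporary Cloud Storage
# bucket before loading it into BigQuery.
(result.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")  # hypothetical bucket
    .mode("overwrite")
    .save("my-project.my_dataset.customers_daily"))
```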
matiijax
9 months, 1 week ago
Selected Answer: B
I think it's B, and the reason is that registering the data as a Dataplex asset enables seamless integration with BigQuery later on. Dataplex simplifies data discovery and lineage tracking, making it easier to prepare your data for BigQuery transformations.
upvoted 3 times
...
saschak94
9 months, 2 weeks ago
Why would I select A here? Why not move the data to BigQuery and run Dataproc Serverless jobs that access the data in BigQuery?
upvoted 3 times
...
raaad
10 months, 3 weeks ago
Selected Answer: A
- This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs.
- Using Dataproc Metastore to manage metadata allows us to keep the Hadoop ecosystem's structural information.
- Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters.
- Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis (see the load-job sketch after this thread).
upvoted 4 times
...
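A minimal sketch of that later load step with the google-cloud-bigquery client; the bucket and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load the Parquet files already sitting in Cloud Storage into a native BigQuery table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data/customers/*.parquet",  # hypothetical bucket
    "my-project.my_dataset.customers",          # hypothetical table id
    job_config=job_config,
)
load_job.result()  # waits for the load job to complete
```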
e70ea9e
10 months, 3 weeks ago
Selected Answer: A
Managed services: leverages Dataproc Serverless for a fully managed Spark environment, reducing overhead and administrative tasks.
Minimal data processing changes: keeps Spark pipelines largely intact by working with Parquet files on Cloud Storage, minimizing refactoring efforts.
BigQuery integration: Dataproc Serverless can directly access BigQuery, enabling future transformation pipelines without additional data movement.
Cost-effective: the serverless model scales resources only when needed, optimizing costs for intermittent workloads.
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other.