Exam Professional Data Engineer topic 1 question 275 discussion

Actual exam question from Google's Professional Data Engineer

Question #: 275
Topic #: 1

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

A. Import the ORC files to Bigtable tables for the data scientist team.
B. Import the ORC files to BigQuery tables for the data scientist team.
C. Copy the ORC files on Cloud Storage, then deploy a Dataproc cluster for the data scientist team.
D. Copy the ORC files on Cloud Storage, then create external BigQuery tables for the data scientist team.

Suggested Answer: D

by Smakyel79 at Jan. 7, 2024, 5:17 p.m.

Comments

raaad

Highly Voted 1 year, 6 months ago

Selected Answer: D

- It leverages the strengths of BigQuery for SQL-based exploration while avoiding additional costs and complexity associated with data transformation or migration.
- The data remains in ORC format in Cloud Storage, and BigQuery's external tables feature allows direct querying of this data.

upvoted 8 times
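
As a rough illustration of the external-table approach described in the comment above, the sketch below builds a BigQuery DDL statement for ORC files in Cloud Storage with auto-detected Hive partition columns. The helper name `external_orc_table_ddl`, the table ID, and the bucket path are hypothetical placeholders, not anything from the question; it only generates the SQL text and does not call BigQuery.

```python
def external_orc_table_ddl(table_id: str, uri_prefix: str) -> str:
    """Return a CREATE EXTERNAL TABLE statement for ORC files under uri_prefix.

    table_id and uri_prefix are illustrative inputs (assumptions), e.g.
    "my-project.analytics.events" and "gs://example-bucket/warehouse/events".
    """
    prefix = uri_prefix.rstrip("/")
    return (
        f"CREATE EXTERNAL TABLE `{table_id}`\n"
        # No column list after WITH PARTITION COLUMNS: BigQuery infers the
        # partition keys from the key=value directory names.
        "WITH PARTITION COLUMNS\n"
        "OPTIONS (\n"
        "  format = 'ORC',\n"
        f"  uris = ['{prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{prefix}'\n"
        ");"
    )


# Hypothetical names, for illustration only.
print(external_orc_table_ddl(
    "my-project.analytics.events",
    "gs://example-bucket/warehouse/events",
))
```

The generated statement could then be run in the BigQuery console or through a client library; the ORC files themselves stay in Cloud Storage.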

nadavw

10 months, 3 weeks ago

There is a requirement to use a 'Hive query engine', and BQ uses only the Hive metastore and its own engine, so 'D' seems a better fit here.

upvoted 1 times

...

kaisarfarel

Highly Voted 1 year, 4 months ago

I think C is the correct answer: the data scientists want to explore the data in a "similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine". Dataproc makes it quick to spin up a Hadoop cluster. CMIIW

upvoted 7 times
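
For comparison, here is a minimal sketch of what option C might look like once a Dataproc cluster is running: a HiveQL statement that defines an external ORC table over the copied files, followed by `MSCK REPAIR TABLE` to register the partitions already present in storage. The helper `hive_orc_table_ddl`, the table name, the columns, and the bucket path are all hypothetical; the code only builds the HiveQL text.

```python
def hive_orc_table_ddl(table, columns, partitions, location):
    """Return HiveQL that defines an external, partitioned ORC table.

    columns and partitions are lists of (name, hive_type) pairs; every
    name used in the example call below is an assumption for illustration.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS ORC\n"
        f"LOCATION '{location}';\n"
        # Discover partition directories that already exist under LOCATION.
        f"MSCK REPAIR TABLE {table};"
    )


print(hive_orc_table_ddl(
    "events",
    [("user_id", "STRING"), ("amount", "DOUBLE")],
    [("dt", "STRING"), ("country", "STRING")],
    "gs://example-bucket/warehouse/events",
))
```

This keeps the Hive query engine itself, at the price of running and paying for a cluster while the team explores the data.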

apoio.certificacoes.closer

6 months, 3 weeks ago

I think "similar" is doing a lot of heavy lifting in the confusion here. If it said "equal", I'd say C. Since it says "similar", it can be GoogleSQL (BigQuery).

upvoted 2 times

...

56d02cd

Most Recent 2 weeks, 1 day ago

Selected Answer: C

It says that scientists need to "explore the data with SQL on the Hive query engine". That excludes BigQuery.

upvoted 1 times

...

Pime13

6 months, 1 week ago

Selected Answer: D

D. Copy the ORC files on Cloud Storage, then create external BigQuery tables for the data scientist team. This approach allows you to leverage the scalability and cost-effectiveness of Cloud Storage while enabling your data scientists to query the data using BigQuery's powerful SQL engine without the need to move or transform the data. This setup also minimizes the need for additional infrastructure and maintenance, making it a practical choice for your analytics environment.

upvoted 1 times

...

SamuelTsch

8 months, 2 weeks ago

Selected Answer: B

External tables always have limitations: reduced performance, no data preview, and no cost estimation. So why is option D correct?

upvoted 1 times

...

hanoverquay

1 year, 4 months ago

Selected Answer: D

Option D

upvoted 1 times

...

0725f1f

1 year, 4 months ago

Selected Answer: C

It is talking about partitioning as well.

upvoted 3 times
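
Since partitioning comes up here, a small sketch of what "multiple columns of Hive partitioning" usually looks like on disk may help: nested `key=value` directories above each ORC file. The function `parse_hive_partitions` and the example path are hypothetical and only illustrate the layout; both Hive and BigQuery external tables can read partition values from paths of this shape.

```python
def parse_hive_partitions(object_path: str, table_prefix: str) -> dict:
    """Extract Hive-style partition keys and values from a file path.

    Assumes the layout <table_prefix>/<key>=<value>/.../<file name>.
    """
    if not object_path.startswith(table_prefix):
        raise ValueError("object_path is not under table_prefix")
    relative = object_path[len(table_prefix):].strip("/")
    partitions = {}
    # The last path segment is the file name, so skip it.
    for segment in relative.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not key=value
            partitions[key] = value
    return partitions


# Hypothetical path with two partition columns, dt and country.
print(parse_hive_partitions(
    "gs://example-bucket/warehouse/events/dt=2024-01-07/country=US/part-0000.orc",
    "gs://example-bucket/warehouse/events",
))
# {'dt': '2024-01-07', 'country': 'US'}
```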

...

JyoGCP

1 year, 4 months ago

Selected Answer: D

Option D

upvoted 1 times

...

Matt_108

1 year, 6 months ago

Selected Answer: D

Option D - leverages BigQuery for SQL-based exploration by querying the data directly in Cloud Storage

upvoted 2 times

...

Smakyel79

1 year, 6 months ago

Selected Answer: D

This approach leverages BigQuery's powerful analytics capabilities without the overhead of data transformation or maintaining a separate cluster, while also allowing your team to use SQL for data exploration, similar to their experience with the on-premises Hadoop/Hive environment.

upvoted 3 times

...