Exam Professional Data Engineer topic 1 question 46 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 46
Topic #: 1

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible.
What should you do?

  • A. Load the data every 30 minutes into a new partitioned table in BigQuery.
  • B. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.
  • C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.
  • D. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
Suggested Answer: B
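Answer B corresponds to a BigQuery external (federated) table defined over a Cloud Storage object. As a rough sketch of what that setup could look like (the dataset, bucket, and file names below are made up for illustration):

```sql
-- Hypothetical names for illustration only.
-- The price feed overwrites gs://econ-prices/latest_prices.csv
-- every 30 minutes; BigQuery reads the current file at query
-- time, so no reload step is needed to stay up to date.
CREATE OR REPLACE EXTERNAL TABLE econ.latest_prices
OPTIONS (
  format = 'CSV',
  uris = ['gs://econ-prices/latest_prices.csv'],
  skip_leading_rows = 1
);
```

With this setup the data is billed at Cloud Storage rates rather than BigQuery storage rates, which is the cost argument made in the comments.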

Comments

mmarulli
Highly Voted 5 years, 1 month ago
This is one of the sample exam questions that Google has on its website. The correct answer is B.
upvoted 43 times
nadavw
8 months, 2 weeks ago
B - since it seems that not all the data is in BigQuery but the analysis is done using BigQuery, a federated query is the optimal approach.
upvoted 2 times
[Removed]
Highly Voted 5 years, 1 month ago
Answer: B Description: B is correct because regional storage is cheaper than BigQuery storage.
upvoted 13 times
funtoosh
4 years, 2 months ago
It's not only cheaper; the requirement is that the data keeps updating every 30 minutes and you need to combine it with data in BigQuery. Using external tables for that is the recommended practice.
upvoted 9 times
jatinbhatia2055
Most Recent 4 months, 2 weeks ago
Selected Answer: A
BigQuery is a powerful data warehouse designed for analyzing large datasets efficiently. Partitioning tables allows you to manage large datasets by splitting them into segments based on a key, such as time. By creating a partitioned table and updating it every 30 minutes, you can load the new price data directly into the correct partitions. BigQuery’s partitioned tables optimize both the storage and querying cost because BigQuery only scans the relevant partitions when querying, minimizing the amount of data read and hence reducing costs. Partitioning by time (e.g., timestamp or date columns) is particularly effective for datasets with periodic updates (like price data) since each batch of data will be loaded into the corresponding partition.
upvoted 2 times
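For contrast with the partitioning argument above, option A's approach might be sketched as follows (all names hypothetical); note that BigQuery's finest time-partitioning granularity is one hour, which is the limitation other commenters raise against A:

```sql
-- Hypothetical schema; each 30-minute load would land in an
-- hourly partition, so two consecutive refreshes share one
-- partition, and BigQuery storage rates apply to all of it.
CREATE TABLE IF NOT EXISTS econ.prices_history (
  good STRING,
  avg_price NUMERIC,
  updated_at TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(updated_at, HOUR);
```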
SamuelTsch
6 months, 1 week ago
Selected Answer: B
Actually, in this question, I think B is the most suitable. C and D are somewhat overkill; A fails due to the minimum partition granularity. However, with B, the data cannot be previewed and it is not possible to estimate the query cost.
upvoted 2 times
Pennepal
1 year ago
D. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage. Here's why this approach is ideal:
  • Cost-effective storage: Cloud Storage offers regional storage classes that are cost-effective for frequently accessed data. Storing the price data in a regional Cloud Storage bucket keeps it readily available.
  • Cloud Dataflow for updates: Cloud Dataflow is a managed service for building data pipelines. You can create a Dataflow job that runs every 30 minutes to download the latest economic data file from Cloud Storage, process and potentially transform the data as needed, and load the updated data into BigQuery.
  • BigQuery integration: BigQuery integrates seamlessly with Cloud Dataflow. The Dataflow job can load the processed data directly into a BigQuery table for further analysis with your customer data.
upvoted 2 times
TVH_Data_Engineer
1 year, 4 months ago
Selected Answer: A
BigQuery supports partitioned tables, which allow for efficient querying and management of large datasets that are updated frequently. By loading the updated data into a new partition every 30 minutes, you can ensure that only relevant partitions are queried, reducing the amount of data processed and thereby minimizing costs. What's wrong with B? While creating a federated data source in BigQuery pointing to a Google Cloud Storage bucket is feasible, it might not be the most efficient for data that is updated every 30 minutes. Querying federated data sources can sometimes be more expensive and less performant than querying data stored directly in BigQuery.
upvoted 3 times
Melampos
2 years ago
Selected Answer: D
Federated queries let you send a query statement to Cloud Spanner or Cloud SQL databases, not to Cloud Storage.
upvoted 1 times
sid_is_dis
1 year, 10 months ago
You are right about "federated queries", but option B says "federated data source". These are different concepts.
upvoted 3 times
Abhilash_pendyala
2 years ago
ChatGPT says partitioned tables are the best approach. The answers here contrast sharply with that. Even I thought it had to be option A; I am so confused now. Is there a proper, straightforward answer?
upvoted 1 times
musumusu
2 years, 2 months ago
Answer B: Loading data into staging tables / external tables or a federated source in BQ is the best approach. Option A is also a good approach; can anyone explain what is wrong with it?
upvoted 1 times
yoga9993
2 years, 1 month ago
We can't implement A because a BigQuery partitioned table has a minimum granularity of one hour, while the requirement says the data must be updated every 30 minutes, so A is an impossible option.
upvoted 7 times
AzureDP900
2 years, 3 months ago
B is right
upvoted 1 times
Krish6488
2 years, 4 months ago
Selected Answer: B
Discounting A due to limitations on partitions. Discounting C because Datastore does not fit the nature of the data we are talking about, and federation between BQ and Datastore is overkill. Between B and D, updating the price file on GCS and joining BQ tables with external tables sourcing data from GCS is the most cost-optimal way for this use case.
upvoted 2 times
ler_mp
2 years, 3 months ago
D is also overkill for this use case, so I'd pick B
upvoted 1 times
jkhong
2 years, 4 months ago
Selected Answer: B
Consideration: as cheaply as possible, while making sure the data stays up to date. Initially chose A. But in actuality there is no need to maintain or store past data, so storage of past data and partitioning doesn't seem like a key requirement. Instead we can connect to just a single Cloud Storage file, either by: i. replacing previous prices with the latest prices, or ii. storing previous prices in GCS if they need to be retained.
upvoted 1 times
DGames
2 years, 4 months ago
Selected Answer: B
B is the most inexpensive approach.
upvoted 1 times
odacir
2 years, 4 months ago
Selected Answer: B
The technical requirement is having frequently accessed info to join with other BQ data, as cheaply as possible. B fits perfectly. Recommended cases for external data sources:
  • Avoiding duplicate data in BigQuery storage
  • Queries that do not have strong performance requirements
  • Small amounts of frequently changing data to join with other tables in BigQuery
https://cloud.google.com/blog/products/gcp/accessing-external-federated-data-sources-with-bigquerys-data-access-layer
upvoted 2 times
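Once an external table exists, combining it with native BigQuery data is an ordinary join. A minimal sketch, assuming hypothetical table and column names:

```sql
-- econ.customer_activity is native BigQuery data;
-- econ.latest_prices is an external table over the GCS price
-- file, so this join always sees the latest 30-minute refresh.
SELECT
  c.customer_id,
  p.good,
  p.avg_price
FROM econ.customer_activity AS c
JOIN econ.latest_prices AS p
  ON c.good = p.good;
```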
assU2
2 years, 5 months ago
Selected Answer: D
I would say D: a regional Google Cloud Storage bucket is cheap.
A - not cheap.
B - NoSQL database for your web and mobile applications.
C - Federated queries let you send a query statement to Cloud Spanner or Cloud SQL databases.
And we need to combine data in BQ with data from the bucket.
upvoted 1 times
MisuLava
2 years, 6 months ago
According to this: https://cloud.google.com/bigquery/docs/external-data-sources, federated queries don't work with Cloud Storage. How can it be B?
upvoted 2 times
cloudmon
2 years, 5 months ago
Correct, it cannot be B because BQ federated queries only work with Cloud SQL or Spanner
upvoted 1 times
gudiking
2 years, 5 months ago
It seems to me that they do: https://cloud.google.com/bigquery/docs/external-data-cloud-storage
upvoted 1 times
ducc
2 years, 8 months ago
Selected Answer: B
I voted for B
upvoted 2 times
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
