Welcome to ExamTopics
ExamTopics Logo
- Expert Verified, Online, Free.
exam questions

Exam Professional Data Engineer All Questions

View all questions & answers for the Professional Data Engineer exam

Exam Professional Data Engineer topic 1 question 43 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 43
Topic #: 1
[All Professional Data Engineer Questions]

You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?

  • A. Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
  • B. Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
  • C. Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
  • D. Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
Show Suggested Answer Hide Answer
Suggested Answer: A 🗳️

Comments

Chosen Answer:
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.
Switch to a voting comment New
[Removed]
Highly Voted 4 years, 8 months ago
Answer will be A because when you create View it does not store extra space and its a logical representation, for rest of the option you need to write large code and extra processing for dataflow/dataproc
upvoted 67 times
[Removed]
4 years, 8 months ago
Because views are not materialized, the query that defines the view is run each time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query
upvoted 11 times
lgdantas
4 years, 3 months ago
Wouldn´t "total amount of data in all table fields referenced directly or indirectly by the top-level query" be FirstName and LastName?
upvoted 3 times
lollo1234
3 years, 6 months ago
You're right, BigQuery bills on number of bytes processed, regardless of them being materialized. If you don't create a new column and use a view instead, you will probably have a small performance hit but query costs would be the same and storage cost wouldn't increase (unlike storing a new column)
upvoted 4 times
yoshik
3 years, 2 months ago
You are asked to modify the schema and data. By using a view, the underlined table remains intact.
upvoted 11 times
HarshKothari21
2 years, 2 months ago
good catch, yoshik.
upvoted 1 times
...
...
...
...
alecuba16
2 years, 6 months ago
Views are cached the same as regular tables are, so I don't get the point of billing. It will cost the same as query to a regular table.
upvoted 3 times
ovokpus
1 year, 12 months ago
the point of billing is extra storage costs for a new concatenated column
upvoted 3 times
...
...
...
[Removed]
4 years, 8 months ago
Can't be A
upvoted 5 times
...
beowulf_kat
2 years ago
I agree that A is correct. Also, I think B is wrong as the UPDATE statement is used to update values in existing columns, not to create a new column.
upvoted 2 times
ovokpus
2 years ago
Of course, you use UPDATE after creating the new column, that is what the option said
upvoted 2 times
[Removed]
1 year, 9 months ago
What happen if there are new employees joining the company, update every single time?
upvoted 1 times
...
...
...
funtoosh
3 years, 9 months ago
cannot be 'A'as it clearly says that you need to change the schema and data.
upvoted 17 times
exnaniantwort
2 years, 10 months ago
Your primary task is to "make data available". Changing the schema is just the request from the member "A member of IT is building an application and ***asks you to modify the schema and data*** in BigQuery". You don't have to follow it if it does not make sense.
upvoted 15 times
YorelNation
2 years, 2 months ago
A yes, That make a lot of sense and also if you update the table only once with UPDATE if there is a new employee it will not be up to date with the new column, if the app use a view it will be up to date every time it query. But in any case the cost will not be minimized.
upvoted 4 times
...
exnaniantwort
2 years, 10 months ago
There is always different application requirement to use different format. That way you will just creating more and more redundant columns in different formats. That is tedious.
upvoted 6 times
...
...
...
...
BhupiSG
Highly Voted 3 years, 8 months ago
Correct: B BigQuery has no quota on the DML statements. (Search Google - does bigquery have quota for update). Why not C: This is a one time activity and SQL is the easiest way to program it. DataFlow is way overkill for this. You will need to find an engineer who can develop DataFlow pipelines. Whereas, SQL is so much more widely known and easier. One of the great features about BigQuery is its SQL interface. Even for BigQueryML services.
upvoted 47 times
lollo1234
3 years, 6 months ago
DML statements don't increase costs, but storing a new column does. I see A is correct (also see my comment above)
upvoted 5 times
exnaniantwort
2 years, 10 months ago
Exactly. Cost is the reason to reject B. How come so many people vote for this wrong option?
upvoted 2 times
ler_mp
1 year, 10 months ago
Storage is cheap compared to computation
upvoted 7 times
...
...
...
DGames
1 year, 11 months ago
But you need to maintain table means regularly you have to execute the update query whenever new data comes.
upvoted 2 times
...
lollo1234
3 years, 6 months ago
I will also add that B would imply changing upstream workloads to write the new field every time a records gets added
upvoted 8 times
...
...
iooj
Most Recent 3 months, 4 weeks ago
E. Say to the IT specialist to take care of it on the app side... B would work for historical data if we had an underlying change made to automate the concatenation for new records. It is not clear, so I would say A is a quick solution.
upvoted 1 times
...
Ramanaiah
6 months, 4 weeks ago
Selected Answer: B
Requirement is to be able to filter on full name. So, you would be querying all data unless you have materialized full name column.
upvoted 1 times
...
philli1011
10 months ago
Definitely A
upvoted 1 times
...
axantroff
1 year ago
Selected Answer: A
The question might be outdated, but I would like to offer my perspective: 1. Ideally, I would opt for a materialized view to avoid updating pipelines 2. In 2023, I see no concerns regarding the costs involved in storing denormalized data for analytical needs 3. Regarding this question I would choose option A, although the concern about extra costs due to recalculations is valid for me
upvoted 2 times
LaxmanTiwari
11 months, 1 week ago
Did u pass the exam ?
upvoted 2 times
...
...
steghe
1 year ago
Answer should be A 'cos the First request is: make that data available.
upvoted 1 times
...
alihabib
1 year, 3 months ago
Its A ..... "asked to change schema" is a trick to test your skills. Better to make use of MV's if anyhow the application is gonna query repeatedly. MV's will rebuild itself, if query invalidates from cache results
upvoted 1 times
...
nescafe7
1 year, 3 months ago
Selected Answer: A
In the case of B, the data pipeline that adds new employee information must also be modified, which is not the correct answer in terms of cost minimization.
upvoted 2 times
...
Mathew106
1 year, 4 months ago
Selected Answer: A
It's A. If you add a column to the table, you will be billed every time you query that new column. The same way you would be billed with the view created by A. B,C and D create a new column. A does not create a new column. It just provides the interface for the application to access the data. B,C and D will have to be rerun to compute the column value of new customers. A is done only once, costs 0 for storage, and is charged about the same as all the others when it comes to compute because even if you choose B C and D you would have to query the data in the end anyway.
upvoted 1 times
...
autumn2005
1 year, 4 months ago
Selected Answer: C
modify the schema
upvoted 1 times
...
theseawillclaim
1 year, 4 months ago
Can you code a script for a BQ Column? I don't think it's "B", it is pretty tricky
upvoted 1 times
...
KC_go_reply
1 year, 5 months ago
Selected Answer: A
Everything but A) new view is wrong. B) sounds okay, but introduces a new column which means more storage, thus increasing cost. C) Dataflow is obvious overkill for a simple task such as concatenating two strings. D) Starting up a Dataproc cluster just for string concatenation is super overkill.
upvoted 1 times
...
vaga1
1 year, 6 months ago
Selected Answer: A
if a new field is only necessary for one project, and it is only the concatenation of two existing fields, it is ok to create a view that gets used for a specific task.
upvoted 1 times
...
Jarek7
1 year, 7 months ago
Selected Answer: A
I'd go for A. The main issue with answers B,C,D is that they are just temporary solution. Whenever a new employee comes in (there are 400.000 of them at the moment, so we can expect every day a few new guys) we need to update the fullname table/field again. Additionally each of these answers need twice as much capacity (BigQuery stores data in a columnar format, so optimizing is not possible). Although the price for the needed capacity will be far below 0.01$/month. The main argument against A is that compute power costs more than capacity. Please look how BQ is priced: https://cloud.google.com/bigquery/pricing#query_pricing In the default On-demand compute pricing it is charged for "the number of bytes processed by each query" so there will be no any difference in computing costs for any option. Yeah, there is also this argument about modyfing schema in the requirements. Lets be professional - it is not a requirement for OUR schema. If you can resolve the issue with 0 change to YOUR schema then you are more than ok. And anyway, from requestor point of view, the schema HE uses in his app will be modifed as he needed.
upvoted 2 times
...
izekc
1 year, 7 months ago
Selected Answer: A
Should be A.
upvoted 1 times
...
Adswerve
1 year, 7 months ago
Selected Answer: B
I vote B. A is expensive, the requirements say we need to minimize cost. B works and meets the requirements. We create an empty column and then UPDATE it to set it to a desired values. https://cloud.google.com/bigquery/docs/managing-table-schemas#console https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#update_statement
upvoted 2 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...