Exam Professional Data Engineer topic 1 question 15 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 15
Topic #: 1

You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially designed the application to use streaming inserts for individual postings. Your application also performs data aggregations right after the streaming inserts. You discover that the queries after streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

  • A. Re-write the application to load accumulated data every 2 minutes.
  • B. Convert the streaming insert code to batch load for individual messages.
  • C. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
  • D. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.
Suggested Answer: D

Comments

MaxNRG
Highly Voted 3 years ago
B. Streams data into BigQuery one record at a time without needing to run a load job: https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll

Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata.insertAll method. This approach enables querying data without the delay of running a load job: https://cloud.google.com/bigquery/streaming-data-into-bigquery

The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. It combines the functionality of streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery that become available for query as they are written, or to batch process an arbitrarily large number of records and commit them in a single atomic operation.

Committed mode: records are available for reading immediately as you write them to the stream. Use this mode for streaming workloads that need minimal read latency. https://cloud.google.com/bigquery/docs/write-api
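As a rough illustration of the tabledata.insertAll path described above, here is a minimal sketch using the google-cloud-bigquery Java client; the dataset, table, and field names are all hypothetical:

```
// Sketch only: stream one posting via the legacy insertAll API.
// Dataset, table, and field names are hypothetical.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class StreamingInsertSketch {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("my_dataset", "postings"); // hypothetical table
    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(table)
            .addRow(Map.of("user", "alice", "message", "hello world"))
            .build());
    if (response.hasErrors()) {
      // Per-row errors are keyed by the row's index in the request.
      response.getInsertErrors().forEach((row, errs) ->
          System.err.println("row " + row + ": " + errs));
    }
  }
}
```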
upvoted 7 times
Abhi16820
3 years ago
Even with this approach, BigQuery has a buffer from which it slowly inserts records into the actual table. What you said is helpful for removing the application-side part.
upvoted 1 time
MarcoDipa
2 years, 11 months ago
Could you please elaborate on your argument?
upvoted 1 time
noob_master
Highly Voted 2 months ago
Selected Answer: D
Answer: D. It is the only option that describes a way to resolve the problem, by buffering the data. (The question is probably old; today the best approach for streaming data would be Pub/Sub + Dataflow Streaming + BigQuery instead of near real time.)
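For reference, a minimal sketch of the Pub/Sub + Dataflow (Apache Beam) + BigQuery approach mentioned here; the project, topic, and table names are hypothetical:

```
// Sketch only: read postings from Pub/Sub and stream them into BigQuery
// with a Dataflow/Beam pipeline. All resource names are hypothetical.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PostingsPipelineSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("ReadPostings",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/postings"))
     .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String msg) -> new TableRow().set("message", msg)))
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows().to("my-project:my_dataset.postings"));
    p.run();
  }
}
```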
upvoted 6 times
GHill1982
Most Recent 1 month ago
Selected Answer: A
To maintain data consistency while handling high-throughput streaming inserts and subsequent aggregations in Google BigQuery, the best approach is to rewrite the application to load accumulated data every 2 minutes.
upvoted 1 time
fire558787
2 months ago
"D" seems to use the typical approximate terminology of a wrong answer. "estimate the time" (how do you do that? do you do that over different times of the day?) and "wait twice as long" (who tells you that there are not a lot of cases when lag is twice as long?). Instead, "A" seems good. You don't need to show the exact results, but an approximation thereof, but you still want consistency. So an aggregation of the data every 2 minutes is a viable thing.
upvoted 5 times
Parth_P
2 months ago
Selected Answer: D
D is correct. The requirement is analytics on real-time data. You cannot do batch processing, because the business requires real time even if batch would make your job simpler, so B is incorrect. The other options are not streaming.
upvoted 2 times
jkhong
2 months ago
Selected Answer: D
There are assumptions about what quality of data is acceptable. If slight variations of the analytics against the actual values can be accepted, then D is a good choice. Many people chose B, but that also requires some form of waiting for late data to arrive. A combination of D and B could be applied, but for an initial fix, delaying the aggregation queries as in D makes more sense. If the variance is small and some late-data leakage is acceptable, we can stay with D; if problems arise, we can always proceed to B.
upvoted 2 times
korntewin
2 months ago
Selected Answer: D
The stream may be in pending or buffered mode, where the streaming data is not immediately available before committing or flushing. Thus, we need to wait before the data becomes available, or else switch to committed mode (which is not among the choices).
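For illustration, a minimal sketch of committed-mode semantics via the Storage Write API's default stream; the project, dataset, and table names are hypothetical, and the table is assumed to have a single STRING column named message:

```
// Sketch only: append to the Storage Write API default stream, which has
// committed semantics: rows are queryable as soon as the append succeeds.
// Project, dataset, table, and column names are hypothetical.
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import org.json.JSONArray;
import org.json.JSONObject;

public class CommittedWriteSketch {
  public static void main(String[] args) throws Exception {
    TableName table = TableName.of("my-project", "my_dataset", "postings");
    try (BigQueryWriteClient client = BigQueryWriteClient.create();
         JsonStreamWriter writer =
             JsonStreamWriter.newBuilder(table.toString(), client).build()) {
      JSONArray rows = new JSONArray()
          .put(new JSONObject().put("message", "hello world"));
      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get(); // blocking here keeps the sketch short
    }
  }
}
```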
upvoted 2 times
musumusu
2 months ago
Answer: D. What to learn or look for:

1. In-flight data = real-time data, i.e. data still in the streaming pipeline that has not yet landed in BigQuery.
2. Assume (in the best case) that a Dataflow streaming pipeline is sending data to BigQuery.

Why not option B: changing streaming to batch upload is not the business requirement; we have to stick with streaming and real-time analysis.

Option D: make BigQuery queries run only after waiting for some time (twice the average latency here). How will you do it? There is no setting in BigQuery to do it, so adjust it in your pipeline (Dataflow) instead. For example, add a fixed window so the aggregation runs over 2-minute windows:

```
pipeline
    .apply(...) // read postings from the streaming source
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(BigQueryIO.writeTableRows()
        .to("my_dataset.my_table"));
```
upvoted 5 times
philli1011
10 months ago
Answer: D. I agree with the first part of answer D, but for the second part I don't know how they came up with the 2 minutes. Is it from a calculation?
upvoted 1 time
imran79
1 year, 1 month ago
A. Re-write the application to load accumulated data every 2 minutes. By accumulating data and performing a batch load every 2 minutes, you can reduce the potential inconsistency caused by streaming inserts. While this introduces a slight delay, it provides a more consistent approach than streaming each individual message. This method can still meet the near real-time requirement, and the slight delay is often acceptable in scenarios where data consistency is paramount.
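A minimal sketch of what option A could look like, assuming postings are accumulated as newline-delimited JSON files in a hypothetical GCS bucket and batch-loaded on a 2-minute schedule:

```
// Sketch only: run a batch load job over accumulated files. Once the job
// succeeds, the loaded data is fully consistent for queries.
// Bucket, dataset, and table names are hypothetical.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BatchLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    LoadJobConfiguration config = LoadJobConfiguration.newBuilder(
            TableId.of("my_dataset", "postings"),
            "gs://my-bucket/accumulated/*.json", // newline-delimited JSON
            FormatOptions.json())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job.getStatus().getError() != null) {
      System.err.println("Load failed: " + job.getStatus().getError());
    }
  }
}
```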
upvoted 3 times
Nirca
1 year, 1 month ago
Selected Answer: B
B is the only option.
upvoted 1 time
ckanaar
1 year, 2 months ago
I'd argue that this question became outdated with the introduction of the BigQuery Storage Write API: https://cloud.google.com/bigquery/docs/write-api
upvoted 4 times
axantroff
1 year ago
Good point
upvoted 1 time
klughund
1 year, 3 months ago
Streaming inserts in BigQuery are not immediately available to be queried, which is causing the weak consistency you're observing. A better approach is to batch the data and load it at regular intervals. Loading the data every two minutes is still relatively real-time, and it should help solve the consistency problem. Answer A.
upvoted 3 times
NeoNitin
1 year, 3 months ago
All the options aim to address the challenge of strong consistency in the data and potential missing data that may occur with streaming inserts. Each approach has its pros and cons, so the best choice depends on the specific needs and requirements of the application. It's like having different strategies for keeping track of all the fun things the kids do and say on the playground, making sure nothing gets left behind!
upvoted 1 times
WillemHendr
1 year, 5 months ago
Streaming inserts are marked as legacy now: https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#dataavailability The documentation hints that it can take up to 90 minutes to process the buffered data. This question tests whether you are aware of how long the buffer can build up.
upvoted 3 times
izekc
1 year, 7 months ago
Selected Answer: B
In my experience, the estimation in D is not a technical solution; it is just a guess. You might still get caught when load gets higher and latency easily exceeds twice the estimate, and then the problem occurs again. So for a more permanent solution, you should definitely go with B.
upvoted 3 times
bha11111
1 year, 8 months ago
Selected Answer: D
The first line of the question requires near real-time queries, so D is the best option, as batch load is never near real time.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other