Exam Professional Data Engineer topic 1 question 263 discussion

Actual exam question from Google's Professional Data Engineer

Question #: 263
Topic #: 1


You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
B. Insert output sinks after each key processing step, and observe the writing throughput of each block.
C. Log debug information in each ParDo function, and analyze the logs at execution time.
D. Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.


Suggested Answer: A

by scaenruy at Jan. 3, 2024, 6:09 p.m.

Comments


raaad

Highly Voted 1 year, 6 months ago

Selected Answer: A

- The Reshuffle operation is used in Dataflow pipelines to break fusion and redistribute elements, which can sometimes help improve parallelization and identify bottlenecks.
- By inserting Reshuffle after each processing step and observing the pipeline's performance in the Dataflow console, you can potentially identify stages that are disproportionately slow or stalled.
- This can help in pinpointing the step where the bottleneck might be occurring.

upvoted 9 times

1 year, 5 months ago

It should be C

upvoted 2 times


tibuenoc

1 year, 5 months ago

Selected Answer: B

The best option is B, because creating an additional output after each step (for example, to capture and process error data) lets you observe the write throughput of each block, which can help identify the specific processing steps causing the bottleneck. Option A is also valid, but it may not directly address all bottlenecks, especially if the graph was merged.

upvoted 1 time


Sofiia98

1 year, 6 months ago

Selected Answer: A

From the Dataflow documentation: "There are a few cases in your pipeline where you may want to prevent the Dataflow service from performing fusion optimizations. These are cases in which the Dataflow service might incorrectly guess the optimal way to fuse operations in the pipeline, which could limit the Dataflow service's ability to make use of all available workers. You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the data, and performs deduplication of records. Reshuffle is supported by Dataflow even though it is marked deprecated in the Apache Beam documentation."
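To see why fusion hides the slow step, here is a plain-Python analogy. It is not Beam or Dataflow code, and all names in it (`run_fused`, `run_with_breaks`, `fast_step`, `slow_step`) are made up for illustration. When the steps run fused, only one overall duration can be observed; when each step's output is materialized before the next step runs, which is roughly what a Reshuffle forces, each step gets its own timing.

```python
import time
from typing import Callable, Dict, Iterable, List

Step = Callable[[int], int]


def run_fused(steps: List[Step], data: Iterable[int]) -> Dict[str, float]:
    """Run every step on each element in one pass; only a total is observable."""
    start = time.perf_counter()
    for item in data:
        for step in steps:
            item = step(item)
    return {"fused": time.perf_counter() - start}


def run_with_breaks(steps: List[Step], data: Iterable[int]) -> Dict[str, float]:
    """Materialize each step's output before the next step, timing each one."""
    timings: Dict[str, float] = {}
    current = list(data)
    for step in steps:
        start = time.perf_counter()
        current = [step(item) for item in current]
        timings[step.__name__] = time.perf_counter() - start
    return timings


def fast_step(x: int) -> int:
    return x + 1


def slow_step(x: int) -> int:
    time.sleep(0.002)  # stand-in for an expensive operation
    return x * 2


if __name__ == "__main__":
    steps = [fast_step, slow_step]
    print(run_fused(steps, range(50)))  # a single number, no per-step detail
    timings = run_with_breaks(steps, range(50))
    print(max(timings, key=timings.get))  # name of the slowest step
```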

upvoted 4 times


scaenruy

1 year, 6 months ago

Selected Answer: A

A. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.

upvoted 2 times
