
Exam Professional Data Engineer topic 1 question 84 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 84
Topic #: 1

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?

  • A. Select random samples from the tables using the RAND() function and compare the samples.
  • B. Select random samples from the tables using the HASH() function and compare the samples.
  • C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
  • D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.
Suggested Answer: C
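The sorted-hash technique behind option C can be sketched in plain Python. This is a minimal illustration, not BigQuery or Dataproc code; the column names, separator bytes, and the SHA-256 choice are all assumptions made for the example:

```python
import hashlib

def table_fingerprint(rows, timestamp_cols=()):
    """Fingerprint a table: drop timestamp columns, sort rows, hash the result.

    `rows` is a list of dicts (column name -> value); `timestamp_cols` names
    columns to exclude from the comparison.
    """
    cleaned = [
        tuple(str(v) for k, v in sorted(row.items()) if k not in timestamp_cols)
        for row in rows
    ]
    digest = hashlib.sha256()
    for row in sorted(cleaned):           # sorting makes the row order deterministic
        digest.update("\x1f".join(row).encode("utf-8"))
        digest.update(b"\x1e")            # row separator
    return digest.hexdigest()

# Two tables with the same contents in a different row order (and different
# load timestamps) produce the same fingerprint:
original = [{"id": 1, "v": "a", "loaded_at": "2020-01-01"},
            {"id": 2, "v": "b", "loaded_at": "2020-01-01"}]
migrated = [{"id": 2, "v": "b", "loaded_at": "2020-02-02"},
            {"id": 1, "v": "a", "loaded_at": "2020-02-02"}]

assert table_fingerprint(original, {"loaded_at"}) == table_fingerprint(migrated, {"loaded_at"})
```

Excluding timestamp columns matters because load times will differ between the original and migrated runs even when the business data is identical, which is why option C says "non-timestamp columns".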

Comments

rickywck
Highly Voted 4 years, 8 months ago
C is the only option in which all records will be compared.
upvoted 33 times
odacir
1 year, 11 months ago
Agree with your argument
upvoted 2 times
[Removed]
Highly Voted 4 years, 8 months ago
Answer: C. This option performs a full comparison; the rest compare only samples, which does not ensure that all the data is correct.
upvoted 16 times
SamuelTsch
Most Recent 1 month ago
Selected Answer: C
Hashing is always a good idea for comparing data.
upvoted 1 times
midgoo
1 year, 8 months ago
In practice, I would do B. That means it may have errors due to randomness, but that is how we normally do validation/QA in general, i.e. we test random samples. For this question, though, I would do C.
upvoted 3 times
musumusu
1 year, 9 months ago
Key words here: hash or collect a value on EACH table, after sorting the table. Option C.
upvoted 1 times
samdhimal
1 year, 9 months ago
C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
This approach ensures that the data is read in a consistent order, and the hash provides a quick and efficient way to compare the contents of the tables and verify that they are identical.
upvoted 1 times
samdhimal
1 year, 9 months ago
A. Selecting random samples from the tables using the RAND() function may not provide an accurate representation of the data, and there is a risk that the comparison will not identify any differences between the tables.
B. Selecting random samples from the tables using the HASH() function may not be an effective method for comparison, as the HASH() function may return different results for equivalent data.
D. Creating stratified random samples using the OVER() function may not provide a comprehensive comparison between the tables, as there is a risk that important differences could be missed in the sample data.
upvoted 2 times
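The objection to sampling can be made concrete: with a single differing row in a large table, a random sample almost never lands on the difference. A minimal sketch, where the table sizes, sample size, and row values are all made up for illustration:

```python
import random

random.seed(7)  # fixed seed so the demo is repeatable

# Hypothetical data: a million rows, identical except for one migrated row.
original = list(range(1_000_000))
migrated = list(range(1_000_000))
migrated[123_456] = -1  # the single discrepancy

def samples_match(a, b, k=1_000):
    """Compare k randomly sampled positions; True means 'no difference seen'."""
    idx = random.sample(range(len(a)), k)
    return all(a[i] == b[i] for i in idx)

# Each 1,000-row sample has only a ~0.1% chance of hitting the bad row,
# so sampling will usually report the tables as identical...
print(samples_match(original, migrated))

# ...while a full comparison always catches the difference:
print(original == migrated)  # False
```

This is why the exam wording "show that they are identical" rules out the sampling options A, B, and D: a passing sample only shows the tables *might* be identical.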
zellck
1 year, 11 months ago
Selected Answer: C
C is the answer.
upvoted 2 times
odacir
1 year, 11 months ago
All records need to be checked to be sure, so C is the answer
upvoted 1 times
Leeeeee
1 year, 12 months ago
Selected Answer: C
All records
upvoted 1 times
hfuihe
2 years, 1 month ago
Selected Answer: B
B is the only option in which all records will be compared.
upvoted 1 times
cloudmon
2 years ago
You must have meant to say C
upvoted 2 times
medeis_jar
2 years, 10 months ago
Selected Answer: C
HASH() to compare data skipping dates and timestamps
upvoted 1 times
stefanop
2 years, 6 months ago
The hash in answer C is used to select a sample of the table, not to compare them
upvoted 1 times
stefanop
2 years, 6 months ago
Ignore my comment, it was about answer B. I suggest you go with answer C, which is the only solution that compares all of the rows/tables.
upvoted 1 times
MaxNRG
2 years, 11 months ago
Selected Answer: C
Options A, B, and D will only determine that the tables "might" be identical, since they compare only a sample. HASH() can be helpful when doing bulk comparisons, but you still have to compare field by field to get the final answer. The only one left is C, which looks good to me.
upvoted 2 times
JayZeeLee
3 years ago
C. The rest use RAND() at some point, which makes it hard to compare for consistency, unless there's a 'seed' option, which wasn't mentioned. So C.
upvoted 1 times
u_t_s
3 years, 1 month ago
Since there is no PK, it is possible that some records share the same set of values, which would result in the same hash key for those records. But the answer is still C.
upvoted 3 times
sumanshu
3 years, 4 months ago
Vote for 'C'.
upvoted 1 times
daghayeghi
3 years, 9 months ago
B: Because it says the jobs were migrated to BigQuery, we don't need Dataproc, and samples don't mean you don't compare all of the data.
upvoted 3 times
yoshik
3 years, 2 months ago
A sample is a subset of the data; you would then have to ensure that the union of the samples covers the whole data set, which is excessively complicated. You migrated to BigQuery but need to check BigQuery's output, which is why you should use another tool, Dataproc in this case. Agreed that you would then have to validate Dataproc's output as well, but the suppositions are becoming too many.
upvoted 1 times
atnafu2020
4 years, 3 months ago
C. Using Cloud Storage with big data: Cloud Storage is a key part of storing and working with big data on Google Cloud. Examples include:
  • Loading data into BigQuery.
  • Using Dataproc, which automatically installs the HDFS-compatible Cloud Storage connector, enabling the use of Cloud Storage buckets in parallel with HDFS.
  • Using a bucket to hold staging files and temporary data for Dataflow pipelines.
For Dataflow, a Cloud Storage bucket is required. For BigQuery and Dataproc, using a Cloud Storage bucket is optional but recommended. gsutil is a command-line tool that enables you to work with Cloud Storage buckets and objects easily and robustly, in particular in big data scenarios. For example, with gsutil you can copy many files in parallel with a single command, copy large files efficiently, calculate checksums on your data, and measure performance from your local computer to Cloud Storage.
upvoted 3 times
haroldbenites
4 years, 3 months ago
C is correct
upvoted 4 times
haroldbenites
4 years, 3 months ago
It says "...that they are identical.", so you must not use a sample.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other