Welcome to ExamTopics
ExamTopics Logo
- Expert Verified, Online, Free.
exam questions

Exam Professional Data Engineer All Questions

View all questions & answers for the Professional Data Engineer exam

Exam Professional Data Engineer topic 1 question 3 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 3
Topic #: 1
[All Professional Data Engineer Questions]

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

  • A. Add capacity (memory and disk space) to the database server by the order of 200.
  • B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
  • C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
  • D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
Show Suggested Answer Hide Answer
Suggested Answer: C 🗳️

Comments

Chosen Answer:
This is a voting comment (?) , you can switch to a simple comment.
Switch to a voting comment New
MaxNRG
Highly Voted 3 years ago
C is correct because this option provides the least amount of inconvenience over using pre-specified date ranges or one table per clinic while also increasing performance due to avoiding self-joins. A is not correct because adding additional compute resources is not a recommended way to resolve database schema problems. B is not correct because this will reduce the functionality of the database and make running reports more difficult. D is not correct because this will likely increase the number of tables so much that it will be more difficult to generate reports vs. the correct option. https://cloud.google.com/bigquery/docs/best-practices-performance-patterns https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#explicit-alias-visibility
upvoted 6 times
...
balseron99
Highly Voted 3 years, 9 months ago
A is incorrect because adding space won't solve the problem of query performance. B is incorrect because there is nothing related to the report generation which is specified and sharding tables on date ranges is not a good option as it will create many tables. C is CORRECT because the statement says "the scope of the project has expanded. The database must now store 100 times more patient records". As the data increases there would be difficulty in managing the tables and querying it. Hence creating different table is correct as per the need. D is Incorrect as it Partitions on each clinic. We have to adjust the database design so that it performs optimally when generating reports. Also nothing is specified for generation of reports in the required statement.
upvoted 5 times
...
SamuelTsch
Most Recent 1 month ago
Selected Answer: C
C is the most suitable solution for this situation. It provides a better way for scalability and monitoring. B has a constraint on predefined date range, which is usually not suitable for reporting.
upvoted 1 times
...
rocky48
1 year ago
Selected Answer: C
Normalization is a technique used to organize data in a relational database to reduce data redundancy and improve data integrity. Breaking the patient records into separate tables (patient and visits) and eliminating self-joins will make the database more scalable and improve query performance. It also helps maintain data integrity and makes it easier to manage large datasets efficiently. Options A, B, and D may provide some benefits in specific cases, but for a scenario where the project scope has expanded significantly and there are performance issues with self-joins, normalization (Option C) is the most robust and scalable solution.
upvoted 3 times
...
rtcpost
1 year, 1 month ago
Selected Answer: C
Normalization is a technique used to organize data in a relational database to reduce data redundancy and improve data integrity. Breaking the patient records into separate tables (patient and visits) and eliminating self-joins will make the database more scalable and improve query performance. It also helps maintain data integrity and makes it easier to manage large datasets efficiently. Options A, B, and D may provide some benefits in specific cases, but for a scenario where the project scope has expanded significantly and there are performance issues with self-joins, normalization (Option C) is the most robust and scalable solution.
upvoted 3 times
...
vaga1
1 year, 6 months ago
Selected Answer: C
"100 times more patient records"immediately brings to create a patient dimensional table to save space on disk if a generical relational database is mentioned.
upvoted 1 times
...
maurilio_cardoso_multiedro
1 year, 8 months ago
C - https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
upvoted 1 times
...
bha11111
1 year, 8 months ago
Selected Answer: C
C- This is correct have verified from different sources
upvoted 1 times
...
Morock
1 year, 9 months ago
Selected Answer: C
Should be C. Basic ER design...
upvoted 1 times
...
GCPpro
1 year, 10 months ago
c - is the correct one.
upvoted 1 times
...
testoneAZ
1 year, 10 months ago
C should be the correct answer
upvoted 1 times
...
Brillianttyagi
1 year, 11 months ago
Selected Answer: C
C- Is the correct answer!
upvoted 1 times
...
Arkon88
2 years, 8 months ago
Selected Answer: C
C - based on Google documentation, self-join is an anti-pattern: https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
upvoted 2 times
...
ch1nczyk
2 years, 9 months ago
Selected Answer: C
Correct
upvoted 1 times
...
samdhimal
2 years, 10 months ago
correct answer -> Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join. Avoid self-join at all cost because that's what google says. Reference: https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
upvoted 3 times
samdhimal
1 year, 10 months ago
Normalizing the database design will help to minimize data redundancy and improve the efficiency of the queries. By separating the patient and visit information into separate tables, the database will be able to handle the increased number of records and generate reports more efficiently, because the self-joins will no longer be required. Option A is not a good solution because adding more capacity to the server will not address the underlying problem of the database design, and it may not be sufficient to handle the increased data volume. Option B is not a good solution because it limits the flexibility of the queries and reports, and it may not be sufficient to handle the increased data volume. Option D is not a good solution because partitioning the table into smaller tables may lead to data redundancy and it may not be sufficient to handle the increased data volume.
upvoted 3 times
...
...
anji007
3 years, 1 month ago
Ans: C
upvoted 2 times
...
sumanshu
3 years, 4 months ago
Vote for C. I read this question in 'google 'sample question' of PDE.
upvoted 3 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...