exam questions

Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 4 question 32 discussion

Actual exam question from Microsoft's DP-203
Question #: 32
Topic #: 4
[All DP-203 Questions]

HOTSPOT -
You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey.
There are 120 unique product keys and 65 unique region keys.

Queries that use the data warehouse take a long time to complete.
You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew.
What should you recommend? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:

Show Suggested Answer Hide Answer
Suggested Answer:
Box 1: Hash-distributed -

Box 2: ProductKey -
ProductKey is used extensively in joins.
Hash-distributed tables improve query performance on large fact tables.

Box 3: Hash-distributed -

Box 4: RegionKey -
Round-robin tables are useful for improving loading speed.
Consider using the round-robin distribution for your table in the following scenarios:
✑ When getting started as a simple starting point since it is the default
✑ If there is no obvious joining key
✑ If there is not good candidate column for hash distributing the table
✑ If the table does not share a common join key with other tables
✑ If the join is less significant than other joins in the query
✑ When the table is a temporary staging table
Note: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute

Comments

Chosen Answer:
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.
Switch to a voting comment New
lara_mia1
Highly Voted 3 years, 10 months ago
1. Hash Distributed, ProductKey because >2GB and ProductKey is extensively used in joins 2. Hash Distributed, RegionKey because "The table size on disk is more than 2 GB." and you have to chose a distribution column which: "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions." source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column
upvoted 107 times
Marcello83
3 years, 9 months ago
I agree with lara_mia1
upvoted 3 times
...
niceguy0371
3 years, 8 months ago
Disagree on nr. 1 because of the reason you give for nr. 2. (choose a distribution column that is not used in where clauses. A join is also a where clause
upvoted 5 times
sdokmak
2 years, 10 months ago
nah mate, check out his link: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.
upvoted 4 times
...
...
vblessings
3 years, 8 months ago
i agree
upvoted 3 times
...
...
Rob77
Highly Voted 3 years, 11 months ago
Both hash as both are > 2GB. In the 2nd table RegionKey cannot be used with round_robin distribution as round_robin does not take a distribution key...
upvoted 29 times
ploer
3 years, 2 months ago
Correct: "A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution." https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 1 times
...
...
EvanG
Most Recent 6 months, 3 weeks ago
My 2 cents. if you are choosing the productkey for the sales table and this table is used extensively in joins as stated, and since the regionkey is skewed, we should choose the productkey as a distribution column for the table invoice as well.
upvoted 2 times
...
evangelist
9 months ago
Based on the votes, the best agreement for the configuration is: Sales Table: Hash Distributed, ProductKey (134 votes) Invoices Table: Hash Distributed, RegionKey (147 votes)
upvoted 1 times
...
Dusica
11 months, 3 weeks ago
Hash - Product Key x2
upvoted 1 times
...
kkk5566
1 year, 7 months ago
1. Hash Distributed, ProductKey 2. Hash Distributed, RegionKey
upvoted 2 times
...
[Removed]
1 year, 8 months ago
When two large fact tables have frequent joins - in this case one is large and another is a small dimension table. Hence highlighted answer is correct
upvoted 1 times
Dusica
11 months, 3 weeks ago
both are fact tables
upvoted 1 times
...
...
dom271219
2 years, 7 months ago
"Choose a distribution column with data that distributes evenly" ProductKey is more relevant in both cases
upvoted 4 times
...
Deeksha1234
2 years, 8 months ago
1. Hash Distributed, ProductKey because table size >2GB and ProductKey is extensively used in joins . another, region key could have been considered (after join key which is product key) since its being used in grouping but 75% records belongs to one region so - NO for region key. 2. Hash Distributed, RegionKey because the table size on disk is more than 2 GB and Its being used in grouping (for this table more than 75% record doesn't fall in same region) and you have to chose a distribution column which is not used in WHERE clause.
upvoted 5 times
...
Nishikag
2 years, 9 months ago
To minimize data movement, select a distribution column that: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.
upvoted 2 times
...
Remedios79
2 years, 9 months ago
the provided aswers are correct
upvoted 1 times
...
kiranSargar
3 years, 1 month ago
Generally facts table are hash distributed. so both the table should use hash distribution and distribution key would be product_key for both.
upvoted 1 times
...
DarioEtna
3 years, 8 months ago
as for me i guess this is the right choice: 1. Hash Distributed, RegionKey because 2. Hash Distributed, RegionKey because "When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns" [Microsoft Documentation] If we use for one ProductKey and for one RegionKey maybe the data movements would increase...or not?
upvoted 3 times
Lucky_me
3 years, 3 months ago
If we choose RegionKey for Sales, we would have a processing skew.
upvoted 4 times
...
Aditya0891
2 years, 10 months ago
DarioEtna where in the question is it mentioned that both tables will be used together in a join query? They have different set of columns in where and group by, so why are you so sure that they will be used together? Answers provided are correct here
upvoted 1 times
...
DarioEtna
3 years, 8 months ago
But we cannot use ProductKey in both because in Invoice table it is used in WHERE condition
upvoted 4 times
...
...
Amalbenrebai
3 years, 8 months ago
Regarding the invoces table, we can use the Round-robin distribution because there is no obvious joining key in the table
upvoted 2 times
...
zarga
3 years, 9 months ago
1. Hash on product key 2. Hash on region key (used on group by and have 65 unique values)
upvoted 9 times
...
BrennaFrenna
3 years, 10 months ago
The sales table makes sense with hashing distribution on ProductKey and since there is no obvious joining key for invoices, you should use round robin distribution on RegionKey. When it would be a smaller table you should use replicated.
upvoted 3 times
...
tubis
3 years, 10 months ago
When it says 75% of records related to one of the 40 regions, if we partition the Sales by Region, isn't it improve the reading process drastically in compare to productKey?
upvoted 1 times
Preben
3 years, 10 months ago
That's 75 % of 61 % of the regions that will be done effectively. That's only efficient for 45 % of the queries. Not a whole lot.
upvoted 2 times
...
patricka95
3 years, 9 months ago
No, if 75% relate to one region and we hash on region, that means that those will all be on one node and there will be skew. Correct answers are Hash, Product, Hash, Region.
upvoted 3 times
...
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...
exam
Someone Bought Contributor Access for:
SY0-701
London, 1 minute ago