Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 4 question 32 discussion

Actual exam question from Microsoft's DP-203

Question #: 32
Topic #: 4

HOTSPOT -
You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey.
There are 120 unique product keys and 65 unique region keys.

Queries that use the data warehouse take a long time to complete.
You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew.
What should you recommend? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:

Show Suggested Answer

Suggested Answer:

Box 1: Hash-distributed -

Box 2: ProductKey -
ProductKey is used extensively in joins.
Hash-distributed tables improve query performance on large fact tables.

Box 3: Hash-distributed -

Box 4: RegionKey -
Round-robin tables are useful for improving loading speed.
Consider using the round-robin distribution for your table in the following scenarios:
✑ When getting started as a simple starting point since it is the default
✑ If there is no obvious joining key
✑ If there is not good candidate column for hash distributing the table
✑ If the table does not share a common join key with other tables
✑ If the join is less significant than other joins in the query
✑ When the table is a temporary staging table
Note: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute

by Rob77 at May 17, 2021, 4:09 p.m.

Comments

Submit Cancel

lara_mia1

Highly Voted 3 years, 10 months ago

1. Hash Distributed, ProductKey because >2GB and ProductKey is extensively used in joins 2. Hash Distributed, RegionKey because "The table size on disk is more than 2 GB." and you have to chose a distribution column which: "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions." source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column

upvoted 107 times

Marcello83

3 years, 9 months ago

I agree with lara_mia1

upvoted 3 times

...

niceguy0371

3 years, 8 months ago

Disagree on nr. 1 because of the reason you give for nr. 2. (choose a distribution column that is not used in where clauses. A join is also a where clause

upvoted 5 times

sdokmak

2 years, 10 months ago

nah mate, check out his link: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

upvoted 4 times

...

vblessings

3 years, 8 months ago

i agree

upvoted 3 times

...

Rob77

Highly Voted 3 years, 11 months ago

Both hash as both are > 2GB. In the 2nd table RegionKey cannot be used with round_robin distribution as round_robin does not take a distribution key...

upvoted 29 times

ploer

3 years, 2 months ago

Correct: "A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution." https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

upvoted 1 times

...

EvanG

Most Recent 6 months, 3 weeks ago

My 2 cents. if you are choosing the productkey for the sales table and this table is used extensively in joins as stated, and since the regionkey is skewed, we should choose the productkey as a distribution column for the table invoice as well.

upvoted 2 times

...

evangelist

9 months ago

Based on the votes, the best agreement for the configuration is: Sales Table: Hash Distributed, ProductKey (134 votes) Invoices Table: Hash Distributed, RegionKey (147 votes)

upvoted 1 times

...

Dusica

11 months, 3 weeks ago

Hash - Product Key x2

upvoted 1 times

...

kkk5566

1 year, 7 months ago

1. Hash Distributed, ProductKey 2. Hash Distributed, RegionKey

upvoted 2 times

...

[Removed]

1 year, 8 months ago

When two large fact tables have frequent joins - in this case one is large and another is a small dimension table. Hence highlighted answer is correct

upvoted 1 times

Dusica

11 months, 3 weeks ago

both are fact tables

upvoted 1 times

...

dom271219

2 years, 7 months ago

"Choose a distribution column with data that distributes evenly" ProductKey is more relevant in both cases

upvoted 4 times

...

Deeksha1234

2 years, 8 months ago

1. Hash Distributed, ProductKey because table size >2GB and ProductKey is extensively used in joins . another, region key could have been considered (after join key which is product key) since its being used in grouping but 75% records belongs to one region so - NO for region key. 2. Hash Distributed, RegionKey because the table size on disk is more than 2 GB and Its being used in grouping (for this table more than 75% record doesn't fall in same region) and you have to chose a distribution column which is not used in WHERE clause.

upvoted 5 times

...

Nishikag

2 years, 9 months ago

To minimize data movement, select a distribution column that: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

upvoted 2 times

...

Remedios79

2 years, 9 months ago

the provided aswers are correct

upvoted 1 times

...

kiranSargar

3 years, 1 month ago

Generally facts table are hash distributed. so both the table should use hash distribution and distribution key would be product_key for both.

upvoted 1 times

...

DarioEtna

3 years, 8 months ago

as for me i guess this is the right choice: 1. Hash Distributed, RegionKey because 2. Hash Distributed, RegionKey because "When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns" [Microsoft Documentation] If we use for one ProductKey and for one RegionKey maybe the data movements would increase...or not?

upvoted 3 times

Lucky_me

3 years, 3 months ago

If we choose RegionKey for Sales, we would have a processing skew.

upvoted 4 times

...

Aditya0891

2 years, 10 months ago

DarioEtna where in the question is it mentioned that both tables will be used together in a join query? They have different set of columns in where and group by, so why are you so sure that they will be used together? Answers provided are correct here

upvoted 1 times

...

DarioEtna

3 years, 8 months ago

But we cannot use ProductKey in both because in Invoice table it is used in WHERE condition

upvoted 4 times

...

Amalbenrebai

3 years, 8 months ago

Regarding the invoces table, we can use the Round-robin distribution because there is no obvious joining key in the table

upvoted 2 times

...

zarga

3 years, 9 months ago

1. Hash on product key 2. Hash on region key (used on group by and have 65 unique values)

upvoted 9 times

...

BrennaFrenna

3 years, 10 months ago

The sales table makes sense with hashing distribution on ProductKey and since there is no obvious joining key for invoices, you should use round robin distribution on RegionKey. When it would be a smaller table you should use replicated.

upvoted 3 times

...

tubis

3 years, 10 months ago

When it says 75% of records related to one of the 40 regions, if we partition the Sales by Region, isn't it improve the reading process drastically in compare to productKey?

upvoted 1 times

Preben

3 years, 10 months ago

That's 75 % of 61 % of the regions that will be done effectively. That's only efficient for 45 % of the queries. Not a whole lot.

upvoted 2 times

...

patricka95

3 years, 9 months ago

No, if 75% relate to one region and we hash on region, that means that those will all be on one node and there will be skew. Correct answers are Hash, Product, Hash, Region.

upvoted 3 times

...

Load full discussion...

Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 4 question 32 discussion

Comments

lara_mia1

Marcello83

niceguy0371

sdokmak

vblessings

Rob77

ploer

EvanG

evangelist

Dusica

kkk5566

[Removed]

Dusica

dom271219

Deeksha1234

Nishikag

Remedios79

kiranSargar

DarioEtna

Lucky_me

Aditya0891

DarioEtna

Amalbenrebai

zarga

BrennaFrenna

tubis

Preben

patricka95

SY0-701