exam questions

Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 1 question 39 discussion

Actual exam question from Microsoft's DP-203
Question #: 39
Topic #: 1
[All DP-203 Questions]

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.

SELECT -
SupplierKey, StockItemKey, COUNT(*)

FROM FactPurchase -

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -
GROUP By SupplierKey, StockItemKey
Which table distribution will minimize query times?

  • A. replicated
  • B. hash-distributed on PurchaseKey
  • C. round-robin
  • D. hash-distributed on DateKey
Show Suggested Answer Hide Answer
Suggested Answer: B 🗳️

Comments

Chosen Answer:
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.
Switch to a voting comment New
AugustineUba
Highly Voted 3 years, 6 months ago
From the documentation the answer is clear enough. B is the right answer. When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 65 times
YipingRuan
3 years, 3 months ago
To minimize data movement, select a distribution column that: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. "PurchaseKey" is not used in the group by
upvoted 8 times
cem_kalender
2 years, 3 months ago
A distribution column should have high cardinality to ensure even distribution over nodes.
upvoted 3 times
...
...
YipingRuan
3 years, 3 months ago
Consider using the round-robin distribution for your table in the following scenarios: When getting started as a simple starting point since it is the default If there is no obvious joining key If there is no good candidate column for hash distributing the table If the table does not share a common join key with other tables If the join is less significant than other joins in the query
upvoted 7 times
...
...
waterbender19
Highly Voted 3 years, 6 months ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.
upvoted 21 times
waterbender19
3 years, 6 months ago
and thet statement that Fact table will be added 1 million rows daily means that each datekey value has an equal amount of rows associated with that value.
upvoted 5 times
Lucky_me
3 years, 1 month ago
But the DateKey is used in the WHERE clause.
upvoted 4 times
kamil_k
2 years, 11 months ago
I agree, date key is int, and besides, even if it was a date, when you query a couple days then 1 million rows per distribution is not that much. So what if you are going to use only a couple distributions to do the job? Isn't it still faster than using all distributions to process all of the records to get the required date range?
upvoted 1 times
...
...
...
AnandEMani
3 years, 5 months ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute this link says date filed , NOT a date Data type. B is correct
upvoted 7 times
...
kamil_k
2 years, 11 months ago
n.b. if we look at the example query itself the date range is 31 days so we will use 31 distributions out of 60, and only process ~31 million records
upvoted 2 times
...
...
dippip123
Most Recent 2 days, 6 hours ago
Selected Answer: B
In addition to not using a Date column, the azure documentation also says: To minimize data movement, select a distribution column or set of columns that: Is not used in WHERE clauses. When a query's WHERE clause and the table's distribution columns are on the same column, the query could encounter high data skew, leading to processing load falling on only few distributions. This impacts query performance, ideally many distributions share the processing load.
upvoted 1 times
...
samirarian
3 days, 5 hours ago
Selected Answer: D
d is correct
upvoted 1 times
...
JustImperius
3 weeks, 4 days ago
Selected Answer: B
Hi Community, it's B. We are not talking about partitions. We are talking about distribution and we know that its a poor choice to choose a low cardinality column like datekey for a distribution.
upvoted 3 times
...
nockda
1 month, 3 weeks ago
Selected Answer: D
In the Where clause, DateKey is the one which can affect to query. Partition should be with DateKey
upvoted 1 times
...
Imtiaz_alum
2 months ago
Selected Answer: D
Hash-distributed on DateKey Characteristics: Rows are distributed based on the hash of DateKey, which is also the column used in the query filter (WHERE DateKey >= ... AND DateKey <= ...). This ensures that rows satisfying the query predicate are grouped on fewer nodes, minimizing data movement and speeding up the query. Suitability: Perfect for the FactPurchase table in this scenario because: The query filters on DateKey. Filtering and aggregation on DateKey avoids significant data movement across nodes
upvoted 1 times
...
nockda
2 months ago
Selected Answer: D
Answer should be D. PurchaseKey is not related with this query.
upvoted 1 times
...
Daniel627
2 months, 1 week ago
Selected Answer: B
Even if Datekey is int, it is used in "WHERE" clause so it is not suitable as well. 1: "Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column or column set that is frequently in the GROUP BY clause." 2: "Is not used in WHERE clauses. When a query's WHERE clause and the table's distribution columns are on the same column, the query could encounter high data skew, leading to processing load falling on only few distributions. This impacts query performance, ideally many distributions share the processing load."
upvoted 1 times
...
f7c717f
2 months, 1 week ago
Selected Answer: D
The answer is "D" since the question is referring to "three years of data" so they are asking for date partitioning
upvoted 1 times
...
moize
2 months, 1 week ago
Selected Answer: D
D. Distribué par hachage sur DateKey Cette distribution optimise les requêtes qui filtrent par DateKey en minimisant les mouvements de données entre les nœuds.
upvoted 1 times
...
EmnCours
2 months, 2 weeks ago
Selected Answer: D
Correct Answer: D
upvoted 1 times
...
nockda
2 months, 2 weeks ago
Selected Answer: D
It should be D.
upvoted 1 times
...
nockda
2 months, 3 weeks ago
Selected Answer: D
The answer should be D. because this query is searched by datekey
upvoted 1 times
...
renan_ineu
5 months, 3 weeks ago
To optimize parallel processing in Azure Synapse Analytics, it's important to select a distribution column or set of columns that: Has many unique values. Does not have many nulls. Is not a date column. Partitioning by DateKey does not meet these criteria. Date columns typically have limited unique values, can contain many nulls, and, by nature, are date fields. Given these constraints, DateKey is not suitable for distribution. For fact tables, which usually exceed 2 GB in size, a hash distribution is recommended. In this scenario, using PurchaseKey as the distribution key is ideal as it aligns with all the specified criteria: it has many unique values, is unlikely to have nulls, and is not a date column. This approach ensures more efficient data distribution and query performance. https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 1 times
...
evangelist
6 months, 3 weeks ago
Selected Answer: D
The real approach in actual project is limiting the dateKey: DistributionKey AS HASHBYTES('MD5', CONCAT(YEAR(CAST(DateKey AS DATE)), FORMAT(CAST(DateKey AS DATE), 'MM'))) ) WITH ( DISTRIBUTION = HASH(DistributionKey), CLUSTERED COLUMNSTORE INDEX );
upvoted 1 times
...
evangelist
7 months ago
Selected Answer: B
total votes on B:85 on D; 23
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...
exam
Someone Bought Contributor Access for:
SY0-701
London, 1 minute ago