exam questions

Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 1 question 61 discussion

Actual exam question from Microsoft's DP-203
Question #: 61
Topic #: 1
[All DP-203 Questions]

HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a table named FactInternetSales that will be a large fact table in a dimensional model. FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for a specific product. The solution must minimize the data size and query execution time.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Show Suggested Answer Hide Answer
Suggested Answer:
Box 1: (CLUSTERED COLUMNSTORE INDEX

CLUSTERED COLUMNSTORE INDEX -
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to
10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1, columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.

To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows. The deltastore operations are handled behind the scenes. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore.
Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]). Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don't require data movement. Replication requires extra storage, though, and isn't practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Comments

Chosen Answer:
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.
Switch to a voting comment New
ted0809
Highly Voted 2 years, 6 months ago
you don't hash the date.. never..
upvoted 55 times
tlb_20
1 year ago
Check the question 39, which is quite similar: SELECT SupplierKey, StockItemKey, COUNT(*) FROM FactPurchase WHERE DateKey >= 20210101 AND DateKey <= 20210131 GROUP By SupplierKey, StockItemKey B. hash-distributed on PurchaseKey D. hash-distributed on DateKey Bear in mind the type of the orderDateKey column, it's int, not date, not datetime... Doesn't it make a difference? In this case I'd choose the ProductKey since it says queries will be done "for specific products", but in the previous case the WHERE clause uses the DateKey field.
upvoted 1 times
hypersam
3 months, 3 weeks ago
you don't hash dateKey because same dateKey goes to same distribution, thus only a few out of total 60 distributions are working in a query. Instead, within each distribution we can partition by dateKey, so that all distributions are used in a query and each distribution can filter data efficiently because of dateKey partition
upvoted 1 times
...
Dusica
12 months ago
Date key is same as date, it would have distributed by date and that is a no no
upvoted 1 times
...
...
Data_Analytics
1 year, 6 months ago
This sentence helped me soooo much - Hash the date NEVER.. thank you
upvoted 6 times
Euanm28
1 year, 6 months ago
If you cant read don't bother commenting
upvoted 3 times
...
...
Gikan
1 year, 3 months ago
Usually you don't use the hash date, because if the query is made daily, all record contains the same date, so the same node will do the job. In this example " from the last year for a specific product", so we have 365 separate node from date, but we have only one product!!
upvoted 3 times
...
...
Lestrang
Highly Voted 2 years, 3 months ago
By using the product key as the distribution key, the data for a specific product will be stored on the same node, allowing for faster aggregation of the values in SalesAmount and OrderQuantity for that product.
upvoted 11 times
...
Pey1nkh
Most Recent 2 months, 1 week ago
correct answer! why Clustered Columnstore Index ? cause its best for large fact tables. Reduces data storage by up to 10x. Speeds up aggregation queries.
upvoted 1 times
...
Dusica
12 months ago
distriubtion hash(product key) clustered index (date key)
upvoted 1 times
17lan
5 months ago
Answer to 1. Question and why other options are wrong: Correct Answer: CLUSTERED COLUMNSTORE INDEX Reasoning: A clustered columnstore index (CCI) is the default and most efficient index for large fact tables in Synapse Analytics because it compresses data significantly, reducing storage and improving query performance for analytical workloads. It is optimized for aggregations, such as summing up SalesAmount and OrderQuantity, which are typical in analytical queries. Why Others Are Incorrect: CLUSTERED INDEX ([OrderDateKey]): A clustered index is better suited for transactional systems, not for analytical systems with large fact tables. HEAP: Heap tables lack any indexing, making them inefficient for queries that require aggregation or filtering on large datasets. INDEX on [ProductKey]: A non-clustered index might help in specific queries but doesn't optimize overall performance for large-scale analytics.
upvoted 1 times
...
...
dgerok
1 year ago
The answer is right
upvoted 1 times
...
kkk5566
1 year, 7 months ago
Clustered columnstore Hash(ProductKey)
upvoted 6 times
...
smsme323
2 years, 7 months ago
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choose-a-distribution-column-with-data-that-distributes-evenly To balance the parallel processing, select a distribution column or set of columns that: Has many unique values. The distribution column(s) can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values. Does not have NULLs, or has only a few NULLs. For an extreme example, if all values in the distribution column(s) are NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one distribution, and does not benefit from parallel processing. Is not a date column. All data for the same date lands in the same distribution, or will cluster records by date. If several users are all filtering on the same date (such as today's date), then only 1 of the 60 distributions do all the processing work. Ans: Hash(ProductKey)
upvoted 9 times
...
Phund
2 years, 7 months ago
must hash on OrderDateKey because that field was not a date and it was used for filter condition "from the last year for a specific product"
upvoted 8 times
Lestrang
2 years, 7 months ago
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute the example in the sample-test is in this page and they used productkey for hashing, so yeah, the answer is productkey
upvoted 14 times
...
...
anks84
2 years, 7 months ago
correct
upvoted 4 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...
exam
Someone Bought Contributor Access for:
SY0-701
London, 1 minute ago