Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 1 question 61 discussion

Actual exam question from Microsoft's DP-203

Question #: 61
Topic #: 1

HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool.
You need to create a table named FactInternetSales that will be a large fact table in a dimensional model. FactInternetSales will contain 100 million rows and two columns named SalesAmount and OrderQuantity. Queries executed on FactInternetSales will aggregate the values in SalesAmount and OrderQuantity from the last year for a specific product. The solution must minimize the data size and query execution time.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Show Suggested Answer

Suggested Answer:

Box 1: (CLUSTERED COLUMNSTORE INDEX

CLUSTERED COLUMNSTORE INDEX -
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage. You can also achieve gains up to
10 times the data compression over the uncompressed data size. Beginning with SQL Server 2016 (13.x) SP1, columnstore indexes enable operational analytics: the ability to run performant real-time analytics on a transactional workload.
Note: Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table.

To reduce fragmentation of the column segments and improve performance, the columnstore index might store some data temporarily into a clustered index called a deltastore and a B-tree list of IDs for deleted rows. The deltastore operations are handled behind the scenes. To return the correct query results, the clustered columnstore index combines query results from both the columnstore and the deltastore.
Box 2: HASH([ProductKey])
A hash distributed table distributes rows based on the value in the distribution column. A hash distributed table is designed to achieve high performance for queries on large tables.
Choose a distribution column with data that distributes evenly
Incorrect:
* Not HASH([OrderDateKey]). Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work
* A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables don't require data movement. Replication requires extra storage, though, and isn't practical for large tables.
* A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. Keep in mind that queries can require more data movement than the other distribution methods.
Reference:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

by anks84 at Sept. 6, 2022, 3:43 a.m.

Comments

Submit Cancel

ted0809

Highly Voted 2 years, 6 months ago

you don't hash the date.. never..

upvoted 55 times

tlb_20

1 year ago

Check the question 39, which is quite similar: SELECT SupplierKey, StockItemKey, COUNT(*) FROM FactPurchase WHERE DateKey >= 20210101 AND DateKey <= 20210131 GROUP By SupplierKey, StockItemKey B. hash-distributed on PurchaseKey D. hash-distributed on DateKey Bear in mind the type of the orderDateKey column, it's int, not date, not datetime... Doesn't it make a difference? In this case I'd choose the ProductKey since it says queries will be done "for specific products", but in the previous case the WHERE clause uses the DateKey field.

upvoted 1 times

hypersam

3 months, 3 weeks ago

you don't hash dateKey because same dateKey goes to same distribution, thus only a few out of total 60 distributions are working in a query. Instead, within each distribution we can partition by dateKey, so that all distributions are used in a query and each distribution can filter data efficiently because of dateKey partition

upvoted 1 times

...

Dusica

12 months ago

Date key is same as date, it would have distributed by date and that is a no no

upvoted 1 times

...

Data_Analytics

1 year, 6 months ago

This sentence helped me soooo much - Hash the date NEVER.. thank you

upvoted 6 times

Euanm28

1 year, 6 months ago

If you cant read don't bother commenting

upvoted 3 times

...

Gikan

1 year, 3 months ago

Usually you don't use the hash date, because if the query is made daily, all record contains the same date, so the same node will do the job. In this example " from the last year for a specific product", so we have 365 separate node from date, but we have only one product!!

upvoted 3 times

...

Lestrang

Highly Voted 2 years, 3 months ago

By using the product key as the distribution key, the data for a specific product will be stored on the same node, allowing for faster aggregation of the values in SalesAmount and OrderQuantity for that product.

upvoted 11 times

...

Pey1nkh

Most Recent 2 months, 1 week ago

correct answer! why Clustered Columnstore Index ? cause its best for large fact tables. Reduces data storage by up to 10x. Speeds up aggregation queries.

upvoted 1 times

...

Dusica

12 months ago

distriubtion hash(product key) clustered index (date key)

upvoted 1 times

17lan

5 months ago

Answer to 1. Question and why other options are wrong: Correct Answer: CLUSTERED COLUMNSTORE INDEX Reasoning: A clustered columnstore index (CCI) is the default and most efficient index for large fact tables in Synapse Analytics because it compresses data significantly, reducing storage and improving query performance for analytical workloads. It is optimized for aggregations, such as summing up SalesAmount and OrderQuantity, which are typical in analytical queries. Why Others Are Incorrect: CLUSTERED INDEX ([OrderDateKey]): A clustered index is better suited for transactional systems, not for analytical systems with large fact tables. HEAP: Heap tables lack any indexing, making them inefficient for queries that require aggregation or filtering on large datasets. INDEX on [ProductKey]: A non-clustered index might help in specific queries but doesn't optimize overall performance for large-scale analytics.

upvoted 1 times

...

dgerok

1 year ago

The answer is right

upvoted 1 times

...

kkk5566

1 year, 7 months ago

Clustered columnstore Hash(ProductKey)

upvoted 6 times

...

smsme323

2 years, 7 months ago

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choose-a-distribution-column-with-data-that-distributes-evenly To balance the parallel processing, select a distribution column or set of columns that: Has many unique values. The distribution column(s) can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values. Does not have NULLs, or has only a few NULLs. For an extreme example, if all values in the distribution column(s) are NULL, all the rows are assigned to the same distribution. As a result, query processing is skewed to one distribution, and does not benefit from parallel processing. Is not a date column. All data for the same date lands in the same distribution, or will cluster records by date. If several users are all filtering on the same date (such as today's date), then only 1 of the 60 distributions do all the processing work. Ans: Hash(ProductKey)

upvoted 9 times

...

Phund

2 years, 7 months ago

must hash on OrderDateKey because that field was not a date and it was used for filter condition "from the last year for a specific product"

upvoted 8 times

Lestrang

2 years, 7 months ago

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute the example in the sample-test is in this page and they used productkey for hashing, so yeah, the answer is productkey

upvoted 14 times

...

anks84

2 years, 7 months ago

correct

upvoted 4 times

...

Exam DP-203 All Questions

View all questions & answers for the DP-203 exam

Exam DP-203 topic 1 question 61 discussion

Comments

ted0809

tlb_20

hypersam

Dusica

Data_Analytics

Euanm28

Gikan

Lestrang

Pey1nkh

Dusica

17lan

dgerok

kkk5566

smsme323

Phund

Lestrang

anks84

SY0-701