Exam AWS Certified Data Analytics - Specialty All Questions

View all questions & answers for the AWS Certified Data Analytics - Specialty exam

Exam AWS Certified Data Analytics - Specialty topic 1 question 65 discussion

Exam question from Amazon's AWS Certified Data Analytics - Specialty

Question #: 65
Topic #: 1

A university intends to use Amazon Kinesis Data Firehose to collect JSON-formatted batches of water quality readings in Amazon S3. The readings are from 50 sensors scattered across a local lake. Students will query the stored data using Amazon Athena to observe changes in a captured metric over time, such as water temperature or acidity. Interest has grown in the study, prompting the university to reconsider how data will be stored.
Which data format and partitioning choices will MOST significantly reduce costs? (Choose two.)

A. Store the data in Apache Avro format using Snappy compression.
B. Partition the data by year, month, and day.
C. Store the data in Apache ORC format using no compression.
D. Store the data in Apache Parquet format using Snappy compression.
E. Partition the data by sensor, year, month, and day.

Suggested Answer: BD

by Priyanka_01 at Aug. 14, 2020, 8:17 a.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

Priyanka_01

Highly Voted 3 years, 6 months ago

D: You can save from 30% to 90% on your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. B: For partition
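To make the per-query cost point concrete, here is a minimal sketch of how the bill scales with bytes scanned. It assumes Athena charges per terabyte scanned at 5 USD per TB (a commonly listed price, so check your region), and the function names and byte counts are made up for illustration.

```python
# Assumption: Athena bills per byte scanned; 5 USD per TB is a placeholder price.
PRICE_PER_TB_USD = 5.0
TB = 1024 ** 4


def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * price_per_tb


def saving_fraction(raw_bytes, optimized_bytes):
    """Fraction of the scan (and so of the cost) avoided by the optimized layout."""
    return 1 - optimized_bytes / raw_bytes


# Hypothetical sizes: raw JSON scan vs. a single compressed Parquet column.
raw = 100 * 1024 ** 3
optimized = 10 * 1024 ** 3
print(f"raw: ${query_cost_usd(raw):.4f}, optimized: ${query_cost_usd(optimized):.4f}")
print(f"saving: {saving_fraction(raw, optimized):.0%}")
```

The real saving depends on how many columns a query reads and how well the data compresses, so treat the numbers as an illustration only.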

upvoted 44 times

...

jay1ram2

Highly Voted 3 years, 6 months ago

B and D are the right answers. Some background: Snappy compresses the data to reduce I/O, and it achieves roughly the same level of compression for both Parquet and Avro. Avro stores the data in row format and does not compress it by itself. Parquet, however, is a columnar store, and even without an additional compression algorithm such as Snappy it natively compresses the data by 2x to 5x on average.
A) Incorrect, since Parquet does a better job of compression.
B) Correct, since the data is partitioned on keys (year, month, day) with medium cardinality.
C) ORC and Parquet are both columnar storage formats and both are supported by Athena, but since this option uses no compression we can safely ignore it.
D) Correct, since Parquet with Snappy is a better choice than ORC with no compression.
E) Adding the sensor (ID) to the partition key creates high cardinality and may lead to multiple small files under each partition, which will slow down performance. So B is the better option, as you can keep the data from all 50 sensors in a single file per day.
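The cardinality argument is easy to check with arithmetic. This is a minimal sketch that only counts partitions per year for the two layouts. The 50 sensors come from the question, and everything else (names, a 365-day year) is an assumption for illustration.

```python
SENSORS = 50          # from the question
DAYS_PER_YEAR = 365   # assumption: non-leap year


def partitions_per_year(include_sensor):
    """Number of partitions one year of data creates under each layout."""
    return DAYS_PER_YEAR * (SENSORS if include_sensor else 1)


print(partitions_per_year(False))  # option B: year/month/day -> 365
print(partitions_per_year(True))   # option E: sensor/year/month/day -> 18250
```

With option E each partition also holds only one sensor's readings for one day, which is where the small-files concern comes from.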

upvoted 30 times

wally_1995

1 year, 10 months ago

I found this at the link below: "Columns that are used as filters are good candidates for partitioning. Partitioning has a cost. As the number of partitions in your table increases, the higher the overhead of retrieving and processing the partition metadata, and the smaller your files. Partitioning too finely can wipe out the initial benefit." https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ So I'd also go with B!

upvoted 1 times

...

Sai3596

Most Recent 1 year, 1 month ago

The question says students will use "Athena to observe changes in a captured metric over time, such as water temperature or acidity." There is no sign that they will pick a specific sensor and then look at its metrics. So if we add the sensor to the partition key but never filter on it in the query, we are just adding one more partition level to search. Given what the question asks, B makes sense.
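Here is a rough sketch of that reasoning: it counts how many partitions a date-range query has to visit under each layout. The sensor IDs and function names are hypothetical, and the model ignores file sizes and metadata costs.

```python
from datetime import date, timedelta

# Hypothetical sensor IDs; the question only says there are 50 sensors.
SENSOR_IDS = [f"sensor-{i:02d}" for i in range(1, 51)]


def days_between(start, end):
    """All calendar days from start to end, inclusive."""
    n = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n)]


def partitions_touched(start, end, by_sensor, sensor_filter=None):
    """Partitions a query over [start, end] must visit under a given layout."""
    days = days_between(start, end)
    if not by_sensor:
        return len(days)  # layout B: one partition per day
    sensors = [sensor_filter] if sensor_filter else SENSOR_IDS
    return len(days) * len(sensors)  # layout E: one partition per sensor per day


start, end = date(2020, 8, 1), date(2020, 8, 31)
print(partitions_touched(start, end, by_sensor=False))  # 31
print(partitions_touched(start, end, by_sensor=True))   # 1550
print(partitions_touched(start, end, by_sensor=True, sensor_filter="sensor-07"))  # 31
```

Without a sensor filter, both layouts read the same days of data, so the extra partition level only adds partitions to list. It pays off only for queries that actually filter on the sensor.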

upvoted 1 times

...

MLCL

1 year, 8 months ago

Selected Answer: BD

BD : Makes the most sense

upvoted 1 times

...

pk349

1 year, 12 months ago

DE: I passed the test

upvoted 1 times

GCPereira

1 year, 4 months ago

I also passed the test, but the sensor ID increases the cardinality of the dataset, so the best option is to partition the data by year, month, and day, then compress the data and convert the JSON to a columnar format, in this case Parquet.

upvoted 1 times

...

2 years, 4 months ago

Selected Answer: DE

D is an obvious choice. E has the highest potential to save costs, including for queries that filter by sensor, and the task is to find the solution with the greatest cost savings.

upvoted 2 times

rocky48

2 years, 4 months ago

Adding the sensor (ID) to the partition key creates high cardinality and may lead to multiple small files under each partition, which will slow down performance. But the question asks about saving costs, not performance.

upvoted 2 times

...

cloudlearnerhere

2 years, 5 months ago

Selected Answer: BD

The correct answers are B and D. Option B: the data can be partitioned by year, month, and day because it will be analyzed by captured metric over time, not by specific sensor. Option D: a columnar data format helps to improve query performance. Options A and C are wrong because Avro and uncompressed ORC would not provide query performance similar to Parquet with compression. Option E is wrong because the data needs to be analyzed by metric, not by a particular sensor.
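For reference, a year/month/day layout like option B is usually expressed as Hive-style key prefixes in S3. This is a minimal sketch of what such a prefix could look like. The `readings` base folder and the function name are assumptions, not something the question specifies.

```python
from datetime import datetime, timezone


def partition_prefix(ts, base="readings"):
    """Hive-style S3 key prefix (year=/month=/day=) for a reading's timestamp."""
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


# Hypothetical reading timestamp.
ts = datetime(2020, 8, 14, 8, 17, tzinfo=timezone.utc)
print(partition_prefix(ts))  # readings/year=2020/month=08/day=14/
```

A query that filters on year, month, and day then only needs to read objects under the matching prefixes.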

upvoted 3 times

2 years, 9 months ago

Selected Answer: BD

B and D are the right answers.

upvoted 2 times

rocky48

2 years, 4 months ago

E has the highest potential to save costs, including for queries that filter by sensor, and the task is to find the solution with the greatest cost savings.

upvoted 1 times

...

ru4aws

2 years, 9 months ago

Selected Answer: DE

Temperature and acidity may vary between different locations on the lake, and students may want to check for issues at a particular location based on the metrics. So it's advisable to partition by sensor.

upvoted 2 times

...

[Removed]

2 years, 9 months ago

I vote A & E. I vote for A because "gather JSON-formatted batches of water quality values in Amazon S3" is the requirement. We can't compress JSON-format files using Parquet or ORC.

upvoted 1 times

...
