Exam AWS Certified Data Analytics - Specialty topic 1 question 65 discussion

A university intends to use Amazon Kinesis Data Firehose to collect JSON-formatted batches of water quality readings in Amazon S3. The readings are from 50 sensors scattered across a local lake. Students will query the stored data using Amazon Athena to observe changes in a captured metric over time, such as water temperature or acidity. Interest has grown in the study, prompting the university to reconsider how data will be stored.
Which data format and partitioning choices will MOST significantly reduce costs? (Choose two.)

  • A. Store the data in Apache Avro format using Snappy compression.
  • B. Partition the data by year, month, and day.
  • C. Store the data in Apache ORC format using no compression.
  • D. Store the data in Apache Parquet format using Snappy compression.
  • E. Partition the data by sensor, year, month, and day.
Suggested Answer: BD

Comments

Priyanka_01
Highly Voted 3 years, 6 months ago
D: You can save from 30% to 90% on your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. B: for the partitioning.
upvoted 44 times
...
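To make the per-query savings concrete, here is a rough sketch of how scanned bytes translate into cost. It assumes Athena's commonly cited on-demand price of $5 per TB scanned with a 10 MB minimum per query, and treats 1 TB as 2**40 bytes; check current pricing for your region. The function names and the 100 GiB / 10 GiB figures are hypothetical.

```python
TIB = 1024 ** 4  # assumption: 1 TB treated as 2**40 bytes
MIB = 1024 ** 2


def athena_query_cost(bytes_scanned, price_per_tb=5.0, min_bytes=10 * MIB):
    """Estimate the cost in USD of one query from the bytes it scans.

    price_per_tb and min_bytes are assumptions, not values from the thread.
    """
    billed = max(bytes_scanned, min_bytes)
    return billed / TIB * price_per_tb


def savings_fraction(before_bytes, after_bytes):
    """Fraction of per-query cost saved when a query scans fewer bytes."""
    before = athena_query_cost(before_bytes)
    after = athena_query_cost(after_bytes)
    return 1 - after / before


# Hypothetical numbers: a query that scans 100 GiB of raw JSON versus
# 10 GiB after conversion to compressed columnar files.
raw_json = 100 * 1024 ** 3
columnar = 10 * 1024 ** 3

print(f"JSON:     ${athena_query_cost(raw_json):.4f}")
print(f"Columnar: ${athena_query_cost(columnar):.4f}")
print(f"Saved:    {savings_fraction(raw_json, columnar):.0%}")
```

Because the bill is driven by bytes scanned, anything that shrinks the scan (compression, columnar layout, partition pruning) lowers the cost roughly in proportion.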
jay1ram2
Highly Voted 3 years, 6 months ago
B and D are the right answers. Some background: Snappy compresses the data to help with I/O, and it achieves roughly the same level of compression for both Parquet and Avro. Avro stores the data in row format and does not compress the data on its own. Parquet, however, is a columnar store, and even without an additional compression algorithm like Snappy applied, it natively compresses the data by 2x to 5x on average.
A) Since Parquet does a better job at compression, this option is incorrect.
B) This is correct, since the data is partitioned with keys (year, month, day) of medium cardinality.
C) ORC and Parquet are both columnar storage formats and both are supported by Athena, but since no compression is used in this option, we can safely ignore it.
D) Parquet with Snappy is a better choice than ORC with no compression, so this is correct.
E) Adding the sensor ID to the partition key creates high cardinality and may lead to multiple small files under each partition, which will slow down performance. So B is the better option, as you can keep the data from all 50 sensors in a single file for a day.
upvoted 30 times
wally_1995
1 year, 10 months ago
I found this at this link: Columns that are used as filters are good candidates for partitioning. Partitioning has a cost. As the number of partitions in your table increases, the higher the overhead of retrieving and processing the partition metadata, and the smaller your files. Partitioning too finely can wipe out the initial benefit. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ So I'd also go with B!
upvoted 1 times
...
...
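The cardinality argument in this thread can be checked with simple arithmetic. The sketch below counts daily partitions for one year under option B and option E. The Hive-style key names (`sensor=`, `year=`, `month=`, `day=`) and the helper functions are hypothetical and only illustrate the two layouts.

```python
from datetime import date


def partition_count(start, end, sensors=1):
    """Number of daily partitions between start and end (inclusive),
    multiplied by the number of sensor values in the partition key."""
    days = (end - start).days + 1
    return days * sensors


def partition_prefix(day, sensor=None):
    """Hive-style S3 prefix for one partition (hypothetical key names)."""
    parts = []
    if sensor is not None:
        parts.append(f"sensor={sensor}")
    parts += [f"year={day.year}", f"month={day.month:02d}", f"day={day.day:02d}"]
    return "/".join(parts) + "/"


start, end = date(2023, 1, 1), date(2023, 12, 31)
option_b = partition_count(start, end)              # year/month/day
option_e = partition_count(start, end, sensors=50)  # sensor/year/month/day

print(option_b, option_e)
print(partition_prefix(date(2023, 6, 1)))
print(partition_prefix(date(2023, 6, 1), sensor=7))
```

With 50 sensors, option E produces 50 times as many partitions for the same data, so each partition holds roughly one fiftieth of a day's readings.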
Sai3596
Most Recent 1 year, 1 month ago
The question says students will use "Athena to observe changes in a captured metric over time, such as water temperature or acidity." There is no sign that they will pick a specific sensor and then observe its metrics. So if we add the sensor to the partition key and do not filter on it in the query, we only add extra partitions to search. Given what the question asks for, it makes sense to use B.
upvoted 1 times
...
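The point about filtering can be sketched with a toy model of partition pruning. The code below is not Athena's implementation; it only counts how many partitions match a query's filters under each layout. The key names and the `matching_partitions` helper are hypothetical.

```python
from itertools import product


def matching_partitions(partitions, **filters):
    """Return the partitions whose key values match every filter given."""
    return [
        p for p in partitions
        if all(p.get(key) == value for key, value in filters.items())
    ]


# One month of daily partitions (option B layout).
days = [{"year": 2024, "month": 1, "day": d} for d in range(1, 32)]
layout_b = days

# The same month with the sensor added to the partition key (option E layout).
layout_e = [{"sensor": s, **d} for s, d in product(range(1, 51), days)]

time_only = {"year": 2024, "month": 1, "day": 15}

print(len(matching_partitions(layout_b, **time_only)))            # time filter, layout B
print(len(matching_partitions(layout_e, **time_only)))            # time filter, layout E
print(len(matching_partitions(layout_e, sensor=7, **time_only)))  # time and sensor filter
```

For a query that filters on time only, both layouts cover the same readings, but layout E spreads them over 50 partitions instead of one. Layout E narrows the scan only when the query also filters on the sensor.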
MLCL
1 year, 8 months ago
Selected Answer: BD
BD: makes the most sense.
upvoted 1 times
...
pk349
1 year, 12 months ago
DE: I passed the test
upvoted 1 times
GCPereira
1 year, 4 months ago
I also passed the test, but the sensor ID increases the cardinality of the dataset. So the best option is to partition the data by year, month, and day, and to compress and convert the JSON to a columnar file format, in this case Parquet.
upvoted 1 times
...
...
rags1482
2 years, 1 month ago
Partitioning by sensor, year, month, and day (option E) would likely increase costs compared to partitioning by only year, month, and day (option B), because it would create a larger number of smaller partitions. Each partition would contain data from a single sensor for a single day, resulting in more small files that Athena would need to scan for each query. So B is a better answer than E.
upvoted 2 times
...
murali12180
2 years, 2 months ago
Selected Answer: DE
Partitioning by sensor and then by year/month/day makes sense. Parquet with Snappy gives the best compression.
upvoted 1 times
...
Gabba
2 years, 2 months ago
Selected Answer: BD
B partition strategy better than E. D for sure.
upvoted 4 times
...
rocky48
2 years, 4 months ago
Selected Answer: DE
D is an obvious choice. E has the highest potential to save costs, including for queries that filter on the sensor, and the task is to find the solution with the most cost savings.
upvoted 2 times
rocky48
2 years, 4 months ago
Adding the sensor ID to the partition key creates high cardinality on the partitions and may lead to multiple small files under each partition, which will slow down performance. But the question asks about saving costs, not performance.
upvoted 2 times
...
...
cloudlearnerhere
2 years, 5 months ago
Selected Answer: BD
The correct answers are B and D. Option B is correct because the data needs to be analyzed by captured metric over time, not by specific sensor, so it can be partitioned by year, month, and day. Option D is correct because a columnar data format helps to improve query performance. Options A and C are wrong because Avro, and ORC without compression, would not provide query performance similar to Parquet with compression. Option E is wrong because the data needs to be analyzed by metric and not by a particular sensor.
upvoted 3 times
Saneeda
2 years, 5 months ago
(A. Store the data in Apache Avro format using Snappy compression.) Option A includes compression, but Parquet with Snappy compression is the better option because Avro stores data in row format, per @jay1ram2. Correct me if I am wrong.
upvoted 1 times
...
...
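The row versus columnar difference discussed in this thread can be illustrated with a toy model. The sketch below does not use the real Avro or Parquet encodings. It serializes hypothetical readings as JSON in a row layout and in a column layout, then compares how many bytes a query on one metric over time would need to read. The field names and values are made up.

```python
import json

# Hypothetical readings: 50 sensors, one reading per minute for an hour.
readings = [
    {
        "sensor": sensor,
        "ts": f"2024-01-15T00:{minute:02d}:00Z",
        "temperature": 12.5,
        "ph": 7.1,
        "turbidity": 3.2,
    }
    for sensor in range(1, 51)
    for minute in range(60)
]

# Row layout: every record is stored whole, so reading one field
# still means reading each record in full.
row_bytes = sum(len(json.dumps(record)) for record in readings)

# Column layout: the values of each field are stored together,
# so a query can read only the columns it needs.
columns = {key: [record[key] for record in readings] for key in readings[0]}
column_bytes = {key: len(json.dumps(values)) for key, values in columns.items()}

# "Temperature over time" needs only the timestamp and temperature columns.
needed_bytes = column_bytes["ts"] + column_bytes["temperature"]

print(f"row layout reads:    {row_bytes} bytes")
print(f"column layout reads: {needed_bytes} bytes")
```

Real columnar formats add encoding and compression on top of this, but the basic saving comes from skipping the columns a query does not touch.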
aefuen1
2 years, 5 months ago
Selected Answer: BD
B and D. They will query by time, not by sensor ID.
upvoted 1 times
...
LukeTran3206
2 years, 6 months ago
Selected Answer: BD
If possible, avoid having a large number of small files – Amazon S3 has a limit of 5500 requests per second. Athena queries share the same limit. https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html
upvoted 1 times
...
Arka_01
2 years, 7 months ago
Selected Answer: DE
Because option E allows a larger and more optimal number of partitions. Snappy compression with the Parquet format allows easy integration and maximum storage savings.
upvoted 1 times
...
ryuhei
2 years, 8 months ago
Selected Answer: BD
Answer is B & D
upvoted 1 times
...
rocky48
2 years, 9 months ago
Selected Answer: BD
B and D are the right answers.
upvoted 2 times
rocky48
2 years, 4 months ago
E has the highest potential to save costs, including for queries that filter on the sensor, and the task is to find the solution with the most cost savings.
upvoted 1 times
...
...
ru4aws
2 years, 9 months ago
Selected Answer: DE
Temperature and acidity may vary between different locations in the lake, and students may want to see whether there are any issues at a particular location based on the metrics. So it's advisable to partition by sensor.
upvoted 2 times
...
[Removed]
2 years, 9 months ago
I vote A & E. I vote for A because "gather JSON-formatted batches of water quality values in Amazon S3" is the requirement. We can't compress a JSON-format file using Parquet or ORC.
upvoted 1 times
...