Exam AWS Certified Data Analytics - Specialty topic 1 question 58 discussion

A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift cluster.
Which solution meets these requirements?

  • A. Use AWS Glue to convert all the files from .csv to a single large Apache Parquet file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • B. Use Amazon EMR to convert each .csv file to Apache Avro. COPY the files into Amazon Redshift and query the file with Athena from Amazon S3.
  • C. Use AWS Glue to convert the files from .csv to a single large Apache ORC file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • D. Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.
Suggested Answer: D

Comments

ali_baba_acs
Highly Voted 3 years, 6 months ago
D is the correct answer. In fact, each node has 2 slices, so ideally we can parallelize the COPY process by sending a multiple of 20 files.
upvoted 27 times
LMax
3 years, 5 months ago
D for sure
upvoted 3 times
...
...
Paitan
Highly Voted 3 years, 6 months ago
Trick question. Since we have 10 nodes with 2 slices each, a multiple of 20 files should help parallelize the COPY process. So D is the right answer.
upvoted 8 times
roymunson
1 year, 5 months ago
And why is it a trick question? IMO it's an obvious hint.
upvoted 1 times
...
...
pk349
Most Recent 1 year, 11 months ago
D: I passed the test
upvoted 1 times
...
cloudlearnerhere
2 years, 5 months ago
Selected Answer: D
Correct answer is D, as AWS Glue can be used to convert the .csv files into 20 Parquet files. This allows even processing across all 20 slices, and multiple files enable rapid, parallel loading of data into Redshift. Options A and C are wrong because a single large file is not efficient: it would be loaded by only a single slice. Option B is wrong because Glue is more cost-effective than EMR for this conversion. (A Glue conversion sketch follows this thread.)
upvoted 4 times
bill1214
1 year, 11 months ago
I agree with your response; however, EMR can be more cost-effective than Glue. Glue is serverless, while EMR is just a managed service.
upvoted 1 times
...
...
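As an illustration of the conversion described in the thread above, a minimal AWS Glue (PySpark) job might look like the sketch below. The S3 paths are placeholders, the CSV schema is simply inferred rather than taken from a crawler, and the only step that matters for this question is the repartition to 20 output files, a multiple of the cluster's total slice count.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Placeholder S3 locations -- substitute the real bucket and prefixes.
SOURCE_PATH = "s3://example-bucket/raw-sensor-csv/"
TARGET_PATH = "s3://example-bucket/sensor-parquet/"

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the many small .csv files (header row assumed, schema inferred).
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(SOURCE_PATH)
)

# Repartition to 20 output files -- one per slice of the
# 10-node x 2-slice Redshift cluster -- and write them as Parquet.
(
    df.repartition(20)
    .write
    .mode("overwrite")
    .parquet(TARGET_PATH)
)

job.commit()
```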
thirukudil
2 years, 5 months ago
Selected Answer: D
Option D is the right solution for both requirements: it reduces the cost of querying (queries against Parquet files scan less data, which in turn reduces cost) and improves the speed of data loading into the Amazon Redshift cluster (split large files into a number equal to a multiple of the total number of slices, so 20*n files is the correct split here). (An Athena query sketch follows this thread.)
upvoted 1 times
...
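To illustrate the query-cost side of the argument, the same Parquet files can be queried from Athena, for example via boto3. The database, table, and column names below are hypothetical; the point is only that Athena bills by data scanned, and a columnar scan over Parquet reads far less data than the same query over raw .csv.

```python
import boto3

# Hypothetical Glue Data Catalog database/table over the Parquet prefix,
# e.g. created by a crawler or a CREATE EXTERNAL TABLE statement.
DATABASE = "sensor_db"
OUTPUT_LOCATION = "s3://example-bucket/athena-results/"

# Selecting a few columns from Parquet scans only those columns,
# which is where the query-cost saving over .csv comes from.
query = """
SELECT truck_id, AVG(speed_kmh) AS avg_speed
FROM sensor_parquet
WHERE event_date = DATE '2024-01-01'
GROUP BY truck_id
"""

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```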
Arka_01
2 years, 7 months ago
Selected Answer: D
Best practices for copying data from S3 to Redshift: 1) Use a columnar data format, i.e. Parquet or ORC. 2) Split large files wherever possible into a number equal to a multiple of the total number of slices. So here 20*n files would be the correct split of the data. (A COPY sketch follows this thread.)
upvoted 3 times
...
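The load-speed side can be sketched the same way: a single COPY pointed at the prefix holding the 20 Parquet files lets Redshift assign files to slices and load them in parallel. The cluster identifier, table name, and IAM role ARN below are placeholders; FORMAT AS PARQUET is the documented COPY option for Parquet input. Here the statement is submitted through the Redshift Data API, but any SQL client would do.

```python
import boto3

# Placeholder identifiers -- substitute the real cluster, database, and user.
CLUSTER_ID = "example-redshift-cluster"
DATABASE = "dev"
DB_USER = "awsuser"

# One COPY against the prefix loads all 20 Parquet files in parallel,
# roughly one file per slice of the 10-node x 2-slice cluster.
copy_sql = """
COPY sensor_data
FROM 's3://example-bucket/sensor-parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
FORMAT AS PARQUET;
"""

client = boto3.client("redshift-data")
response = client.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    DbUser=DB_USER,
    Sql=copy_sql,
)
print(response["Id"])  # statement id; poll describe_statement for completion
```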
Hruday
2 years, 8 months ago
Selected Answer: D
Ans is D
upvoted 1 times
...
rocky48
2 years, 9 months ago
Selected Answer: D
upvoted 1 times
...
Ramshizzle
2 years, 10 months ago
Selected Answer: D
D is the right answer. 20 files = one per slice. If you use COPY on a dataset, all the files will be divided over the available nodes/slices.
upvoted 1 times
...
somenath
2 years, 11 months ago
Per the link https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html , the answer should be D: splitting the load data into multiple files lets a single COPY command load them in parallel across the slices, which works better than loading one large file. Since all the options involve converting and compressing the files, option D is the best choice here.
upvoted 1 times
...
arun004
3 years, 4 months ago
Not sure whether D is correct, but the only reason I chose D is that A and C are almost the same and B won't work.
upvoted 1 times
...
aws2019
3 years, 5 months ago
D is the right answer
upvoted 1 times
...
iconara
3 years, 5 months ago
D is the answer, but I doubt the total time will be shorter. The load will be quicker, sure, but it's not as if Spark reads CSV files any quicker, so all you get is overhead. The Athena queries will run faster on fewer files, though, and if that was the focus, this question would have made sense.
upvoted 1 times
...
afantict
3 years, 5 months ago
Is an Athena query cheaper than the existing Redshift query?
upvoted 1 times
...
lostsoul07
3 years, 5 months ago
D is the right answer
upvoted 3 times
...
sanjaym
3 years, 6 months ago
D is the correct answer.
upvoted 2 times
...
apuredol
3 years, 6 months ago
Is D ok? I understand you should avoid multiple concurrent COPY commands: "We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT or CREATE TABLE AS to improve performance. For information, see INSERT or CREATE TABLE AS". https://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html
upvoted 1 times
CHRIS12722222
3 years, 1 month ago
A single COPY command loads multiple files into Redshift in parallel.
upvoted 2 times
...
...