Exam AWS Certified Data Analytics - Specialty topic 1 question 58 discussion

A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift cluster.
Which solution meets these requirements?

  • A. Use AWS Glue to convert all the files from .csv to a single large Apache Parquet file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • B. Use Amazon EMR to convert each .csv file to Apache Avro. COPY the files into Amazon Redshift and query the file with Athena from Amazon S3.
  • C. Use AWS Glue to convert the files from .csv to a single large Apache ORC file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
  • D. Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.
Suggested Answer: D

Comments

ali_baba_acs
Highly Voted 3 years, 6 months ago
D is the correct answer. In fact, each node has 2 slices, so ideally we can parallelize the COPY process by sending a multiple of 20 files.
upvoted 27 times
LMax
3 years, 5 months ago
D for sure
upvoted 3 times
...
...
Paitan
Highly Voted 3 years, 6 months ago
Trick question. Since we have 10 nodes with 2 slices each, a multiple of 20 files should help parallelize the COPY process. So D is the right answer.
upvoted 8 times
roymunson
1 year, 5 months ago
And why is it a trick question? IMO it's an obvious hint.
upvoted 1 times
...
...
pk349
Most Recent 1 year, 11 months ago
D: I passed the test
upvoted 1 times
...
cloudlearnerhere
2 years, 5 months ago
Selected Answer: D
Correct answer is D, as AWS Glue can be used to convert the .csv files into 20 Parquet files. This allows even processing across all 20 slices, and multiple files enable rapid, parallel loading of data into Redshift. Options A and C are wrong because a single large file is not efficient: it would be loaded by only a single slice. Option B is wrong because Glue is more cost-effective than EMR for this conversion. (A Glue conversion sketch follows this thread.)
upvoted 4 times
bill1214
1 year, 11 months ago
I agree with your response; however, EMR can be more cost-effective than Glue. Glue is serverless, while EMR is just a managed service.
upvoted 1 times
...
...
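As an illustration of the conversion described in the thread above, a minimal AWS Glue (PySpark) job might look like the sketch below. The S3 paths are placeholders, the CSV schema is simply inferred rather than taken from a crawler, and the only step that matters for this question is the repartition to 20 output files, a multiple of the cluster's total slice count.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Placeholder S3 locations -- substitute the real bucket and prefixes.
SOURCE_PATH = "s3://example-bucket/raw-sensor-csv/"
TARGET_PATH = "s3://example-bucket/sensor-parquet/"

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the many small .csv files (header row assumed, schema inferred).
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(SOURCE_PATH)
)

# Repartition to 20 output files -- one per slice of the
# 10-node x 2-slice Redshift cluster -- and write them as Parquet.
(
    df.repartition(20)
    .write
    .mode("overwrite")
    .parquet(TARGET_PATH)
)

job.commit()
```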
thirukudil
2 years, 5 months ago
Selected Answer: D
Option D is the right solution for both requirements: it reduces the cost of querying (queries against Parquet files scan less data, which in turn reduces cost) and improves the speed of data loading into the Amazon Redshift cluster (split large files into a number equal to a multiple of the total number of slices, so 20*n files is the correct split here). (An Athena query sketch follows this thread.)
upvoted 1 times
...
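To illustrate the query-cost side of the argument, the same Parquet files can be queried from Athena, for example via boto3. The database, table, and column names below are hypothetical; the point is only that Athena bills by data scanned, and a columnar scan over Parquet reads far less data than the same query over raw .csv.

```python
import boto3

# Hypothetical Glue Data Catalog database/table over the Parquet prefix,
# e.g. created by a crawler or a CREATE EXTERNAL TABLE statement.
DATABASE = "sensor_db"
OUTPUT_LOCATION = "s3://example-bucket/athena-results/"

# Selecting a few columns from Parquet scans only those columns,
# which is where the query-cost saving over .csv comes from.
query = """
SELECT truck_id, AVG(speed_kmh) AS avg_speed
FROM sensor_parquet
WHERE event_date = DATE '2024-01-01'
GROUP BY truck_id
"""

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status
```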
Arka_01
2 years, 7 months ago
Selected Answer: D
Best practices for copying data from S3 to Redshift: 1) Use a columnar data format, i.e. Parquet or ORC. 2) Split large files wherever possible into a number equal to a multiple of the total number of slices. So here 20*n files would be the correct split of the data. (A COPY sketch follows this thread.)
upvoted 3 times
...
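The load-speed side can be sketched the same way: a single COPY pointed at the prefix holding the 20 Parquet files lets Redshift assign files to slices and load them in parallel. The cluster identifier, table name, and IAM role ARN below are placeholders; FORMAT AS PARQUET is the documented COPY option for Parquet input. Here the statement is submitted through the Redshift Data API, but any SQL client would do.

```python
import boto3

# Placeholder identifiers -- substitute the real cluster, database, and user.
CLUSTER_ID = "example-redshift-cluster"
DATABASE = "dev"
DB_USER = "awsuser"

# One COPY against the prefix loads all 20 Parquet files in parallel,
# roughly one file per slice of the 10-node x 2-slice cluster.
copy_sql = """
COPY sensor_data
FROM 's3://example-bucket/sensor-parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
FORMAT AS PARQUET;
"""

client = boto3.client("redshift-data")
response = client.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    DbUser=DB_USER,
    Sql=copy_sql,
)
print(response["Id"])  # statement id; poll describe_statement for completion
```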
Hruday
2 years, 8 months ago
Selected Answer: D
Ans is D
upvoted 1 times
...
rocky48
2 years, 9 months ago
Selected Answer: D
upvoted 1 times
...
Ramshizzle
2 years, 10 months ago
Selected Answer: D
D is the right answer. 20 files = one per slice. If you use COPY on a dataset, all the files will be divided over the available nodes/slices.
upvoted 1 times
...
somenath
2 years, 11 months ago
Per the link https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html , the answer should be D: splitting the load data into multiple files lets a single COPY command load them in parallel across the slices, which works better than loading one large file. Since all the options involve converting and compressing the files, option D is the best choice here.
upvoted 1 times
...
arun004
3 years, 4 months ago
Not sure whether D is correct, but the only reason I chose D is that A and C are almost the same and B won't work.
upvoted 1 times
...
aws2019
3 years, 5 months ago
D is the right answer
upvoted 1 times
...
iconara
3 years, 5 months ago
D is the answer, but I doubt the total time will be shorter. The load will be quicker, sure, but it's not as if Spark reads CSV files any quicker, so all you get is overhead. The Athena queries will run faster on fewer files, though, and if that was the focus, this question would have made sense.
upvoted 1 times
...
afantict
3 years, 5 months ago
Is an Athena query cheaper than the existing Redshift query?
upvoted 1 times
...
lostsoul07
3 years, 5 months ago
D is the right answer
upvoted 3 times
...
sanjaym
3 years, 6 months ago
D is the correct answer.
upvoted 2 times
...
apuredol
3 years, 6 months ago
Is D ok? I understand you should avoid multiple concurrent COPY commands: "We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT or CREATE TABLE AS to improve performance. For information, see INSERT or CREATE TABLE AS". https://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html
upvoted 1 times
CHRIS12722222
3 years, 1 month ago
A single COPY command loads multiple files into Redshift in parallel.
upvoted 2 times
...
...