
Exam AWS Certified Data Analytics - Specialty topic 1 question 102 discussion

An operations team notices that a few AWS Glue jobs for a given ETL application are failing. The AWS Glue jobs read a large number of small JSON files from an
Amazon S3 bucket and write the data to a different S3 bucket in Apache Parquet format with no major transformations. Upon initial investigation, a data engineer notices the following error message in the History tab on the AWS Glue console: `Command Failed with Exit Code 1.`
Upon further investigation, the data engineer notices that the driver memory profile of the failed jobs crosses the safe threshold of 50% usage quickly and reaches
90–95% soon after. The average memory usage across all executors continues to be less than 4%.
The data engineer also notices the following error while examining the related Amazon CloudWatch Logs.

What should the data engineer do to solve the failure in the MOST cost-effective way?

  • A. Change the worker type from Standard to G.2X.
  • B. Modify the AWS Glue ETL code to use the 'groupFiles': 'inPartition' feature.
  • C. Increase the fetch size setting by using AWS Glue dynamic frames.
  • D. Modify maximum capacity to increase the total maximum data processing units (DPUs) used.
Suggested Answer: B
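For quick reference, a minimal sketch of what answer B looks like in a Glue PySpark script. The bucket paths and the optional 'groupSize' value are illustrative assumptions, not part of the original question:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Grouping coalesces many small S3 objects into larger read groups,
# so the driver tracks far fewer tasks and holds less state in memory.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/input/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # optional target group size (~128 MB)
    },
    format="json",
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/output/"},  # placeholder
    format="parquet",
)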

Comments

jyrajan69
Highly Voted 3 years, 6 months ago
Based on the link, I will go for B: https://awsfeed.com/whats-new/big-data/optimize-memory-management-in-aws-glue
upvoted 33 times
lakeswimmer
3 years, 4 months ago
B it is: https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 7 times
cloudlearnerhere
Highly Voted 2 years, 5 months ago
Correct answer is B as the key issue is the driver memory problem caused by the Glue job processing many small files. Grouping the files lets the Spark driver store significantly less state in memory.

In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service (Amazon S3), converting them to Apache Parquet format, and then writing them out to Amazon S3. The Spark driver is running out of memory because the input S3 data has more than 1 million files across different S3 partitions.

You can fix the processing of the multiple files by using the grouping feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and the input dataset has a large number of files (more than 50,000). Grouping lets you coalesce multiple files together into a group, so a task processes the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks.
upvoted 16 times
cloudlearnerhere
2 years, 5 months ago
# Read the small JSON files with grouping enabled ('groupFiles': 'inPartition'),
# then write the data back out to S3 in Parquet format.
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://input_path"], "recurse": True, "groupFiles": "inPartition"},
    format="json")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="datasink")
upvoted 3 times
cloudlearnerhere
2 years, 5 months ago
The driver runs below the threshold of 50 percent memory usage over the entire duration of the AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3. As a result, they consume less than 5 percent memory at any point in time.
upvoted 2 times
MLCL
Most Recent 1 year, 8 months ago
Selected Answer: B
B is the correct answer https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 1 times
pk349
1 year, 11 months ago
B: I passed the test
upvoted 1 times
Mang2000
2 years, 2 months ago
B; you can find more details here: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
upvoted 2 times
[Removed]
2 years, 4 months ago
B https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
upvoted 4 times
he11ow0rId
2 years, 7 months ago
Selected Answer: B
D would work, but it's not the most COST-EFFECTIVE way. So B.
upvoted 3 times
rocky48
2 years, 9 months ago
Selected Answer: B
upvoted 1 times
samsanta2012
2 years, 10 months ago
Selected Answer: D
A data processing unit (DPU) is a relative measure of processing power that consists of vCPUs and memory. Change the maximum capacity parameter value and set it to a higher number.
upvoted 1 times
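For contrast, a hypothetical boto3 sketch of what option D would involve; the job name is a placeholder. Note that Glue's UpdateJob overwrites the whole job definition, so the existing Role and Command are carried over, and newer Glue versions size jobs with WorkerType and NumberOfWorkers rather than MaxCapacity:

import boto3

glue = boto3.client("glue")

# UpdateJob replaces the entire job definition, so start from the
# current one and carry its required fields forward.
job = glue.get_job(JobName="my-etl-job")["Job"]  # hypothetical job name

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "MaxCapacity": 20.0,  # e.g. double the default 10 DPUs
    },
)

As he11ow0rId notes above, this would likely work but simply adds paid capacity, whereas answer B fixes the driver-side bookkeeping at no extra cost.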
CloudTimes
2 years, 10 months ago
Selected Answer: B
B: groupFiles will help here.
upvoted 1 times
Bik000
2 years, 11 months ago
Selected Answer: D
Answer should be D
upvoted 1 times
Bik000
2 years, 11 months ago
Selected Answer: D
Answer is D
upvoted 1 times
MWL
2 years, 11 months ago
Selected Answer: B
No doubt B.
upvoted 2 times
jrheen
2 years, 11 months ago
Answer - B
upvoted 2 times
yusnardo
3 years, 1 month ago
Selected Answer: B
https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-debug-oom-fix
upvoted 3 times
RSSRAO
3 years, 2 months ago
Selected Answer: B
B is the correct answer
upvoted 3 times
lakeswimmer
3 years, 4 months ago
B https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other