
Exam AWS Certified Data Analytics - Specialty topic 1 question 102 discussion

An operations team notices that a few AWS Glue jobs for a given ETL application are failing. The AWS Glue jobs read a large number of small JSON files from an
Amazon S3 bucket and write the data to a different S3 bucket in Apache Parquet format with no major transformations. Upon initial investigation, a data engineer notices the following error message in the History tab on the AWS Glue console: `Command Failed with Exit Code 1.`
Upon further investigation, the data engineer notices that the driver memory profile of the failed jobs crosses the safe threshold of 50% usage quickly and reaches
90–95% soon after. The average memory usage across all executors continues to be less than 4%.
The data engineer also notices the following error while examining the related Amazon CloudWatch Logs.

What should the data engineer do to solve the failure in the MOST cost-effective way?

  • A. Change the worker type from Standard to G.2X.
  • B. Modify the AWS Glue ETL code to use the 'groupFiles': 'inPartition' feature.
  • C. Increase the fetch size setting by using AWS Glue dynamic frames.
  • D. Modify maximum capacity to increase the total maximum data processing units (DPUs) used.
Suggested Answer: B
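For quick reference, a minimal sketch of what answer B looks like in a Glue PySpark script. The bucket paths and the optional 'groupSize' value are illustrative assumptions, not part of the original question:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Grouping coalesces many small S3 objects into larger read groups,
# so the driver tracks far fewer tasks and holds less state in memory.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/input/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # optional target group size (~128 MB)
    },
    format="json",
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/output/"},  # placeholder
    format="parquet",
)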

Comments

jyrajan69
Highly Voted 3 years, 6 months ago
Based on the link, I will go for B: https://awsfeed.com/whats-new/big-data/optimize-memory-management-in-aws-glue
upvoted 33 times
lakeswimmer
3 years, 4 months ago
B it is: https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 7 times
cloudlearnerhere
Highly Voted 2 years, 5 months ago
Correct answer is B as the key issue is the driver memory problem caused by the Glue job processing many small files. Grouping the files lets the Spark driver store significantly less state in memory.

In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service (Amazon S3), converting them to Apache Parquet format, and then writing them out to Amazon S3. The Spark driver is running out of memory because the input S3 data has more than 1 million files across different S3 partitions.

You can fix the processing of the multiple files by using the grouping feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and the input dataset has a large number of files (more than 50,000). Grouping lets you coalesce multiple files together into a group, so a task processes the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks.
upvoted 16 times
cloudlearnerhere
2 years, 5 months ago
# Read the small JSON files with grouping enabled ('groupFiles': 'inPartition'),
# then write the data back out to S3 in Parquet format.
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://input_path"], "recurse": True, "groupFiles": "inPartition"},
    format="json")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="datasink")
upvoted 3 times
cloudlearnerhere
2 years, 5 months ago
The driver runs below the threshold of 50 percent memory usage over the entire duration of the AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3. As a result, they consume less than 5 percent memory at any point in time.
upvoted 2 times
MLCL
Most Recent 1 year, 8 months ago
Selected Answer: B
B is the correct answer https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 1 times
pk349
1 year, 11 months ago
B: I passed the test
upvoted 1 times
Mang2000
2 years, 2 months ago
B; you can find more details here: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
upvoted 2 times
[Removed]
2 years, 4 months ago
B https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/
upvoted 4 times
he11ow0rId
2 years, 7 months ago
Selected Answer: B
D would work, but it's not the most COST-EFFECTIVE way. So B.
upvoted 3 times
rocky48
2 years, 9 months ago
Selected Answer: B
upvoted 1 times
samsanta2012
2 years, 10 months ago
Selected Answer: D
A data processing unit (DPU) is a relative measure of processing power that consists of vCPUs and memory. Change the maximum capacity parameter value and set it to a higher number.
upvoted 1 times
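For contrast, a hypothetical boto3 sketch of what option D would involve; the job name is a placeholder. Note that Glue's UpdateJob overwrites the whole job definition, so the existing Role and Command are carried over, and newer Glue versions size jobs with WorkerType and NumberOfWorkers rather than MaxCapacity:

import boto3

glue = boto3.client("glue")

# UpdateJob replaces the entire job definition, so start from the
# current one and carry its required fields forward.
job = glue.get_job(JobName="my-etl-job")["Job"]  # hypothetical job name

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "MaxCapacity": 20.0,  # e.g. double the default 10 DPUs
    },
)

As he11ow0rId notes above, this would likely work but simply adds paid capacity, whereas answer B fixes the driver-side bookkeeping at no extra cost.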
CloudTimes
2 years, 10 months ago
Selected Answer: B
B: groupFiles will help here.
upvoted 1 times
Bik000
2 years, 11 months ago
Selected Answer: D
Answer should be D
upvoted 1 times
Bik000
2 years, 11 months ago
Selected Answer: D
Answer is D
upvoted 1 times
MWL
2 years, 11 months ago
Selected Answer: B
No doubt B.
upvoted 2 times
jrheen
2 years, 11 months ago
Answer - B
upvoted 2 times
yusnardo
3 years, 1 month ago
Selected Answer: B
https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-debug-oom-fix
upvoted 3 times
RSSRAO
3 years, 2 months ago
Selected Answer: B
B is the correct answer
upvoted 3 times
lakeswimmer
3 years, 4 months ago
B https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other