Exam AWS Certified Solutions Architect - Associate SAA-C03 topic 1 question 103 discussion

Exam question from Amazon's AWS Certified Solutions Architect - Associate SAA-C03

Question #: 103
Topic #: 1


A company has an AWS Glue extract, transform, and load (ETL) job that runs every day at the same time. The job processes XML data that is in an Amazon S3 bucket. New data is added to the S3 bucket every day. A solutions architect notices that AWS Glue is processing all the data during each run.
What should the solutions architect do to prevent AWS Glue from reprocessing old data?

A. Edit the job to use job bookmarks.
B. Edit the job to delete data after the data is processed.
C. Edit the job by setting the NumberOfWorkers field to 1.
D. Use a FindMatches machine learning (ML) transform.


Suggested Answer: A

by LeGloupier at Oct. 18, 2022, 10:20 a.m.


Comments


123jhl0

Highly Voted 2 years, 2 months ago

Selected Answer: A

This is the purpose of bookmarks: "AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data." https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
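The idea of persisted state in the quote above can be illustrated with a toy model. This is only a sketch and not Glue's actual implementation: `S3Object`, `ToyBookmark`, and `run_job` are hypothetical names, and the model assumes the bookmark is a single "newest timestamp seen" value compared against each object's last-modified time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for an object in the input bucket."""
    key: str
    last_modified: datetime


@dataclass
class ToyBookmark:
    """State persisted between runs: newest timestamp already processed."""
    high_water_mark: datetime = datetime.min


def run_job(objects, bookmark):
    """Process only objects newer than the bookmark, then advance it."""
    new_objects = [o for o in objects if o.last_modified > bookmark.high_water_mark]
    if new_objects:
        bookmark.high_water_mark = max(o.last_modified for o in new_objects)
    return [o.key for o in new_objects]


bucket = [
    S3Object("day1.xml", datetime(2022, 10, 17)),
    S3Object("day2.xml", datetime(2022, 10, 18)),
]
bookmark = ToyBookmark()

# First run: nothing has been processed yet, so every object is picked up.
first_run = run_job(bucket, bookmark)

# A new file arrives; the next run sees only that file.
bucket.append(S3Object("day3.xml", datetime(2022, 10, 19)))
second_run = run_job(bucket, bookmark)

# No new data: the run has nothing to do.
third_run = run_job(bucket, bookmark)
```

Without the bookmark, every run would return all keys in the bucket, which is the behavior described in the question.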

upvoted 51 times

...

cookieMr

Highly Voted 1 year, 6 months ago

Selected Answer: A

A. Job bookmarks let AWS Glue track which data a job has already processed. With bookmarks enabled, Glue persists that state and, on each subsequent run, resumes from where the previous run left off.

B. Deleting the data permanently removes it from the S3 bucket, making it unavailable for future job runs. That is undesirable if the data must be retained or used for later analysis.

C. Setting NumberOfWorkers to 1 only affects the job's parallelism. It provides no mechanism to track processed data or to skip it.

D. FindMatches is an ML transform for identifying duplicate or matching records in a dataset. It can be used in data processing pipelines, but it does not prevent Glue from reprocessing old data.

upvoted 12 times

...

awsgeek75

Most Recent 11 months, 3 weeks ago

Selected Answer: A

B: Glue can delete data, but this option is too vague to evaluate and could mean almost anything.

C: Won't help with the repeated ETL. This property only affects parallelism.

D: Too vague.

upvoted 1 times

...

2 years ago

Freely deleting files from S3 is not a good idea, so B is not correct.

upvoted 1 times

...

techhb

2 years ago

Selected Answer: A

A is correct

upvoted 1 times

...

Buruguduystunstugudunstuy

2 years ago

Selected Answer: A

Option A. Edit the job to use job bookmarks. Job bookmarks in AWS Glue let the ETL job track the data it has already processed and skip that data in subsequent runs. This prevents AWS Glue from reprocessing old data and improves the job's performance, because each run handles only new data. To use job bookmarks, the solutions architect can edit the job and set the "Job bookmark" option to "Enable", which corresponds to the job parameter --job-bookmark-option with the value job-bookmark-enable.
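As a rough illustration of enabling bookmarks outside the console, the sketch below builds the `--job-bookmark-option` job argument and passes it to boto3's `start_job_run`. The helper names `bookmark_arguments` and `start_run_with_bookmarks` are hypothetical, and the boto3 call assumes credentials and an existing job, so it is defined here but not invoked.

```python
BOOKMARK_ARG = "--job-bookmark-option"
VALID_MODES = {
    "job-bookmark-enable",   # track state and skip already-processed data
    "job-bookmark-disable",  # always process the whole dataset
    "job-bookmark-pause",    # process new data without advancing the state
}


def bookmark_arguments(mode="job-bookmark-enable"):
    """Build the job-argument dict for the requested bookmark mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown bookmark mode: {mode!r}")
    return {BOOKMARK_ARG: mode}


def start_run_with_bookmarks(job_name):
    """Start a run of an existing Glue job with bookmarks enabled.

    Requires boto3 and AWS credentials; imported lazily so the helper
    above can be used without either.
    """
    import boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=bookmark_arguments())
    return response["JobRunId"]


args = bookmark_arguments()
```

Enabling the option is only half of the setup. The job script itself also has to call `job.init(...)` at the start and `job.commit()` at the end, and give its sources a `transformation_ctx`, for the bookmark state to be saved.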

upvoted 3 times

...