Exam AWS Certified Machine Learning - Specialty topic 1 question 225 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 225
Topic #: 1

A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal.

What should the data scientist do to identify and address training issues with the LEAST development effort?

A. Use CPU utilization metrics that are captured in Amazon CloudWatch. Configure a CloudWatch alarm to stop the training job early if low CPU utilization occurs.
B. Use high-resolution custom metrics that are captured in Amazon CloudWatch. Configure an AWS Lambda function to analyze the metrics and to stop the training job early if issues are detected.
C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
D. Use the SageMaker Debugger confusion and feature_importance_overweight built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.

Suggested Answer: C

by Amit11011996 at Feb. 6, 2023, 6:18 p.m.

Comments

Amit11011996

Highly Voted 1 year, 4 months ago

It has to be C.

upvoted 5 times

Mickey321

Most Recent 10 months, 2 weeks ago

Selected Answer: C

Option C uses the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. This option is the most efficient because it relies on existing SageMaker Debugger features to monitor and troubleshoot the training job, so it needs little additional development effort.

upvoted 1 time

oso0348

1 year, 3 months ago

Selected Answer: C

The answer is C. The best way for the data scientist to identify and address training issues with the least development effort is to use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. SageMaker Debugger is a tool for debugging machine learning training jobs. It provides several built-in rules that detect and diagnose common training problems. Here, the data scientist suspects that training is not converging and that resource utilization is not optimal. The vanishing_gradient rule addresses the convergence concern, and the LowGPUUtilization rule addresses the resource utilization concern.
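
To make the convergence check concrete, here is a simplified pure-Python sketch of the kind of test a vanishing-gradient rule performs: flag a layer when the mean absolute value of its gradients falls below a threshold. This is not Debugger's actual implementation. The function names, the dictionary input, and the `1e-7` threshold are assumptions chosen for illustration.

```python
def mean_abs(values):
    """Mean of the absolute values of a non-empty list of numbers."""
    return sum(abs(v) for v in values) / len(values)


def vanishing_gradient_layers(gradients_by_layer, threshold=1e-7):
    """Return the names of layers whose mean absolute gradient is below threshold.

    gradients_by_layer maps a (hypothetical) layer name to a flat list of
    gradient values captured at one training step.
    """
    return [
        name
        for name, grads in gradients_by_layer.items()
        if mean_abs(grads) < threshold
    ]


# Illustrative values only, not real training output.
step_gradients = {
    "fc1": [1e-9, -2e-9, 5e-10],   # effectively zero: the layer has stopped learning
    "fc2": [0.01, -0.02, 0.005],   # healthy magnitude
}
flagged = vanishing_gradient_layers(step_gradients)
```

With the built-in rule, this check runs as a managed job alongside training, so nothing like the code above has to be written or maintained.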

upvoted 3 times

AjoseO

1 year, 4 months ago

Selected Answer: C

C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected. SageMaker Debugger is a built-in tool for debugging and profiling machine learning models trained in SageMaker. In this scenario, the data scientist suspects issues with the training process, so SageMaker Debugger is the most appropriate solution. The vanishing_gradient and LowGPUUtilization built-in rules detect common training issues, namely vanishing gradients and low GPU utilization, which affect training convergence and resource utilization respectively. Launching the StopTrainingJob action when an issue is detected stops the training job early, which saves resources and time on a job that otherwise runs for about 10 hours. This approach requires the least development effort because it is built into SageMaker and does not require the data scientist to create custom metrics or configure CloudWatch alarms.
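
For the resource-utilization side, a low-GPU-utilization check can be pictured as a percentile test over sampled utilization values. The sketch below is a simplified illustration, not the Debugger rule itself. The function names and the 70% / 10% thresholds are assumptions, and the percentile uses a basic nearest-rank method.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q% of n
    return ordered[rank - 1]


def gpu_utilization_issue(samples, threshold_p95=70, threshold_p5=10):
    """Classify GPU utilization samples (percent values).

    Returns "underutilized" if even the 95th percentile is below threshold_p95,
    "fluctuating" if the peak is fine but the 5th percentile drops below
    threshold_p5, and None otherwise.
    """
    if percentile(samples, 95) < threshold_p95:
        return "underutilized"
    if percentile(samples, 5) < threshold_p5:
        return "fluctuating"
    return None


# Illustrative samples only.
steady_low = [20] * 100            # GPU never gets busy
spiky = [95] * 90 + [2] * 10       # busy most of the time, with idle gaps
healthy = [90] * 100

results = [gpu_utilization_issue(s) for s in (steady_low, spiky, healthy)]
```

A finding like "underutilized" usually points toward a larger batch size or a faster data loader, while "fluctuating" often points toward an input pipeline bottleneck.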

upvoted 3 times

Jerry84

1 year, 4 months ago

Selected Answer: C

https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html

upvoted 4 times
