Exam AWS Certified Machine Learning - Specialty topic 1 question 33 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 33
Topic #: 1

A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.
The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns.
Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.
Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.)

A. Add more deep trees to the random forest to enable the model to learn more features.
B. Include a copy of the samples in the test dataset in the training dataset.
C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
D. Change the cost function so that false negatives have a higher impact on the cost value than false positives.
E. Change the cost function so that false positives have a higher impact on the cost value than false negatives.

Suggested Answer: CD

by heihei at Dec. 16, 2019, 2:56 a.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

Phong

Highly Voted 3 years, 9 months ago

I think it should be CD. C: because we need a balanced dataset. D: the number of negative samples is so large that the model tends to predict 0 (negative) for all cases, which leads to a false-negative problem. We should minimize that. My opinion.

upvoted 30 times

...

Phong

Highly Voted 3 years, 9 months ago

I think it should be CD. C: because we need a balanced dataset. D: the number of negative samples is so large that the model tends to predict 0 (negative) for all cases, which leads to a false-negative problem. We should minimize that. My opinion.

upvoted 24 times

...

JonSno

Most Recent 4 months, 2 weeks ago

Selected Answer: CD

Why are these the best choices?

C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This balances the dataset by increasing the number of positive samples. Adding noise keeps the copies from being exact duplicates, which can reduce overfitting and help the model generalize better. Alternative: use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic positive examples.

D. Change the cost function so that false negatives have a higher impact on the cost value than false positives. Since missing a potential paying user (false negative) is more critical than misclassifying a non-paying user, penalizing false negatives more will improve recall for paid users. Methods: use weighted loss functions (e.g., weighted cross-entropy), or adjust class weights in the random forest or another algorithm. Use AUC-ROC or F1-score instead of accuracy for evaluation.
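
To make option C concrete, here is a minimal sketch of duplicating minority-class rows and jittering the copies. It assumes every feature is numeric; categorical fields such as device or location would need different handling. The function name, its parameters, and the toy values are all made up for illustration and are not from the question.

```python
import random

def oversample_with_noise(positives, copies_per_sample, noise_scale, seed=0):
    """Return the original rows plus jittered copies of each row.

    positives: list of numeric feature vectors (lists of floats).
    copies_per_sample, noise_scale and seed are illustrative parameters.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in positives]
    for row in positives:
        for _ in range(copies_per_sample):
            # Add zero-mean Gaussian noise to every feature of the copy.
            augmented.append([x + rng.gauss(0.0, noise_scale) for x in row])
    return augmented

# Toy minority class with two numeric features (made-up values).
toy_positives = [[25.0, 3.5], [31.0, 1.2]]
augmented = oversample_with_noise(toy_positives, copies_per_sample=3, noise_scale=0.05)
print(len(augmented))  # 8 rows: 2 originals + 2 * 3 noisy copies
```

This kind of augmentation should only ever be applied to the training split, so the test set keeps the real class distribution.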

upvoted 2 times

...

dinITExam

8 months ago

Think C and D

upvoted 1 times

...

John_Pongthorn

3 years, 4 months ago

Selected Answer: CD

C and D are correct (the percentage of the positive class is the key to deciding which kind of error we care about). In this question, the positive class (pay) is 0.1% compared with 99.9% (not pay), so we have to pay attention to the paying users: if we miss that 0.1%, we don't get the revenue. That is a false negative. In contrast, in questions where the positive class (pay) is 40% and the negative class (not pay) is 60%, the emphasis shifts to false positives (the model predicts payment but in reality the customer does not pay), because the revenue predicted for those users never arrives.

upvoted 5 times

...

apprehensive_scar

3 years, 4 months ago

I think it is CD

upvoted 1 times

...

cloud_trail

3 years, 7 months ago

C and D. Hopefully, no one honestly thinks that B is a good answer. Never expose test data to the training set or vice versa. C is right because of the highly imbalanced training set. D is right because you want to minimize false negatives, maximize true positives, maximize recall of the positive class. I'm not sure why anyone's worried about precision in this case.

upvoted 4 times

...

felbuch

3 years, 8 months ago

CD The model has 99% accuracy because it's simply predicting that everyone's a negative. Since almost everyone's a negative, it will get almost everyone right. So we need to penalize the model for predicting that someone is a negative when it is not (i.e. penalize false negatives). So that's D. Also, it would be really nice to have more positives -- one way to do that is to follow option C.
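
The arithmetic behind this can be checked quickly. The sketch below assumes a hypothetical model that predicts "negative" for every user and uses the class sizes stated in the question; it is not the team's actual model.

```python
# Confusion counts for a hypothetical model that predicts "negative" for every user,
# using the class sizes stated in the question.
positives = 1_000
negatives = 999_000

true_negatives = negatives   # every non-paying user is labelled correctly
false_negatives = positives  # every paying user is missed
true_positives = 0
false_positives = 0

total = positives + negatives
accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.3%}")  # accuracy = 99.900%
print(f"recall   = {recall:.3%}")    # recall   = 0.000%
```

So accuracy above 99% is reachable without identifying a single paying user, which is why accuracy alone says little here.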

upvoted 8 times

...

engomaradel

3 years, 8 months ago

CD 100%

upvoted 1 times

...

ybad

3 years, 8 months ago

CD. C: the training set is imbalanced (1,000 positive, 999,000 negative = 0.1% positive), so C increases the share of positives. D: also to stop the model from defaulting to the majority class. Since almost everyone is a "no", the model would default to "no", but increasing the penalty of a false negative would reduce that.

upvoted 2 times

...

Omar_Cascudo

3 years, 8 months ago

FP need to be reduced, because they are players predicted to pay who in reality will not pay. So FP should impact the cost metric more. CE should be the answer.

upvoted 2 times

...

bidds

3 years, 8 months ago

CD are correct for sure.

upvoted 3 times

...

hans1234

3 years, 8 months ago

It is C,E... we want to find all paying customers, which are positives, so we have to punish incorrectly finding negatives, which is E

upvoted 2 times

...

Wira

3 years, 8 months ago

CD, although I am worried about the noise being introduced, as it could skew the data. Nevertheless, no better answer is given.

upvoted 2 times

...

aws_razor

3 years, 8 months ago

CD. We need high recall so that we do not miss many positive cases. For that we need fewer false negatives (FN), so FN should have a high impact on the cost function.
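
One way to picture option D is a binary cross-entropy in which errors on positive examples are scaled up. This is only a sketch: the function name and the weights `fn_weight` and `fp_weight` are hypothetical, and the default values are arbitrary. For tree ensembles such as a random forest, the analogous knob is usually a class weight or sample weight rather than an explicit loss.

```python
import math

def weighted_log_loss(y_true, p_pred, fn_weight=10.0, fp_weight=1.0):
    """Mean binary cross-entropy with separate weights per class.

    fn_weight scales the loss on positive examples (errors there are false negatives);
    fp_weight scales the loss on negative examples (errors there are false positives).
    Both default values are arbitrary, chosen only for illustration.
    """
    eps = 1e-12  # avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)

# A missed payer (true label 1, predicted 0.1) versus a false alarm (true label 0, predicted 0.9).
missed_payer = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
print(missed_payer > false_alarm)  # True
```

With these weights, an equally confident mistake costs about ten times more when it hides a paying user than when it flags a non-paying one.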

upvoted 3 times

...

roytruong

3 years, 8 months ago

In my view, C and D are the answers. C: of course, to handle the imbalanced dataset. D: right now model accuracy is 99%, which means the model predicts everything as negative, leading to an FN problem, so we need to penalize FN more in the cost function.
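
As a rough guide to how strongly the minority class might be weighted, the sketch below applies the inverse-frequency heuristic n_samples / (n_classes * count), the same formula scikit-learn documents for `class_weight="balanced"`. The helper name and the class labels are made up for illustration, and this is one possible weighting, not the only one.

```python
def balanced_class_weights(class_counts):
    """Map each class label to n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Class sizes taken from the question; the label names are hypothetical.
weights = balanced_class_weights({"paid": 1_000, "not_paid": 999_000})
print(weights["paid"])  # 500.0
```

Under this heuristic a paying user counts roughly a thousand times as much as a non-paying one, which offsets the 1:999 imbalance.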

upvoted 3 times

...

wuha5086

3 years, 9 months ago

CD. FN are valuable players, so we should care more about FN.

upvoted 8 times

...
