Exam AWS Certified Machine Learning - Specialty All Questions

View all questions & answers for the AWS Certified Machine Learning - Specialty exam

Exam AWS Certified Machine Learning - Specialty topic 1 question 130 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 130
Topic #: 1

A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company's products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist.

How should the data scientist split the dataset into a training and test set for this use case?

A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set.
B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set.
D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.

Suggested Answer: B 🗳️

by [deleted] at Feb. 6, 2021, 1:07 a.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

[Removed]

Highly Voted 3 years, 1 month ago

I would select B, straight from this AWS example: https://aws.amazon.com/blogs/machine-learning/building-a-customized-recommender-system-in-amazon-sagemaker/

upvoted 26 times

ttsun

2 years, 11 months ago

The blog didn't mention anything about sample selection. How did you arrive at B?

upvoted 3 times

...

NicZ1111

Highly Voted 2 years, 11 months ago

I think the answer is D because customers buy only 4-5 products every 5-10 years, so it doesn't make sense to take 10% of each user's interactions as a test set.

upvoted 8 times

jrff

1 year, 12 months ago

Yes, agree. Answer should be D

upvoted 2 times

...

VinceCar

1 year, 11 months ago

B. A recommendation model should use historical data to predict future actions. B uses the older records to predict the newer ones. D uses 90% of the users to predict the other 10%, and those 90% are irrelevant to the other 10%.

upvoted 2 times

...

kawaimahiro

Most Recent 5 months ago

There is no difference between A and D, so I prefer B as the answer

upvoted 3 times

...

kyuhuck

8 months, 2 weeks ago

Selected Answer: D

The best way to split the dataset into a training and test set for this use case is to randomly select 10% of the users and split off all interaction data from these users for the test set. This is because the company relies on a steady stream of new customers, so the test set should reflect the behavior of new customers who have not been seen by the model before. The other options are not suitable because they either mix old and new customers in the test set (A and B), or they bias the test set towards users with less interaction data (C). References: Amazon SageMaker Developer Guide: Train and Test Datasets; Amazon Personalize Developer Guide: Preparing and Importing Data.

upvoted 1 times

...

praveenaws

9 months, 2 weeks ago

Selected Answer: D

If the primary concern is to evaluate the model's performance on completely new users, then option D would be more appropriate.

upvoted 2 times

...

u_b

11 months, 2 weeks ago

I'd also take time into consideration, since even for such long-lived products there might be trends, regulations, or other factors that make customers prefer one product over another. => A and D are out. C will not give you a test set of the desired size => out. => B

upvoted 2 times

...

sonoluminescence

11 months, 4 weeks ago

Selected Answer: D

If the primary concern is to evaluate the model's performance on completely new users (which seems to be the case for the company in question), then option D would be more appropriate.

upvoted 2 times

...

DimLam

12 months ago

Selected Answer: D

I would choose D. According to the question, because of the nature of the products, the company doesn't rely on historical customer-product interactions for recommendations. It relies on customers' explicit preferences, which are gathered at sign-up. The company wants to make recommendations for these new users, who are its main source of revenue. To test thoroughly, the company needs to simulate new users, not existing ones. To do that, we need to randomly choose some percentage of users, remove all of their transactions from the training set, and use those transactions only in the test set.
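A minimal sketch of the user-level holdout described in this comment (option D), in plain Python. The field names `user_id`, `item_id`, and `year`, the toy data, and the helper `split_by_user` are all hypothetical; the question's actual sample data is not reproduced in this thread.

```python
import random

def split_by_user(interactions, test_fraction=0.1, seed=0):
    """Hold out ALL interactions of a random subset of users (hypothetical helper)."""
    users = sorted({row["user_id"] for row in interactions})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    # A user's rows go entirely to one side, so no user appears in both sets.
    train = [r for r in interactions if r["user_id"] not in test_users]
    test = [r for r in interactions if r["user_id"] in test_users]
    return train, test, test_users

# Toy data (assumed shape): 20 users with 4 purchases each.
interactions = [
    {"user_id": u, "item_id": f"item{(u + k) % 7}", "year": 2010 + 2 * k}
    for u in range(20)
    for k in range(4)
]

train, test, test_users = split_by_user(interactions)
print(len(train), len(test), sorted(test_users))
```

With this kind of split the test users look like brand-new customers to the model, which is the scenario this comment argues the company cares about.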

upvoted 2 times

...

Rejju

1 year ago

Selected Answer: B

By selecting the most recent interactions for each user, you are simulating the scenario of having new customers in your test set. This method allows you to assess how well the model generalizes to both existing and new users.

upvoted 3 times

...

loict

1 year, 1 month ago

Selected Answer: D

A. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off interactions, we introduce leakage as the same user will be present in train & test
B. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off based on the interaction, we introduce leakage as the same user will be present in train & test
C. NO - bias
D. YES - no bias and user based

upvoted 3 times

...

loict

1 year, 1 month ago

Selected Answer: B

A. NO - introduces a bias in the training set (old interactions) vs. test set (new interactions)
C. NO - will have a very sparse test set
B. NO - the same user will be present in the training and test set; we want a user-based model, not an interaction-based one, so a user should belong to only one set
D. YES - last remaining option.

upvoted 3 times

...

Mickey321

1 year, 1 month ago

Selected Answer: B

Changing to B

upvoted 2 times

...

Mickey321

1 year, 1 month ago

Selected Answer: D

It is between B and D, but the issue is that there are only 4-5 transactions every 5-10 years, so taking the last 10% of transactions is difficult. So I am going for D.

upvoted 2 times

...

AmitGSL

1 year, 4 months ago

Selected Answer: B

I would select B as it is time series data, so order might be important. For each user, the last 10% of transactions ordered by date could be a good answer.
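A rough sketch of the per-user, time-ordered split this comment describes (option B), in plain Python. The field names, the ISO date strings, the toy data, and the helper `split_latest_per_user` are assumptions made for illustration. Rounding up to at least one interaction and keeping single-interaction users entirely in training are also choices made here, not something the question specifies.

```python
from collections import defaultdict

def split_latest_per_user(interactions, test_percent=10):
    """Move each user's most recent interactions to the test set (hypothetical helper)."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row["user_id"]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r["date"])  # ISO dates sort chronologically
        if len(rows) < 2:
            train.extend(rows)  # nothing left to train on if we held this one out
            continue
        # Integer ceiling of test_percent% of the user's rows (at least 1).
        n_test = (len(rows) * test_percent + 99) // 100
        train.extend(rows[:-n_test])
        test.extend(rows[-n_test:])
    return train, test

# Toy data (assumed shape): users with 5, 4, and 1 purchases.
toy = (
    [{"user_id": "a", "item_id": f"a{i}", "date": f"{2012 + 2 * i}-01-01"} for i in range(5)]
    + [{"user_id": "b", "item_id": f"b{i}", "date": f"{2013 + 2 * i}-01-01"} for i in range(4)]
    + [{"user_id": "c", "item_id": "c0", "date": "2021-01-01"}]
)

train, test = split_latest_per_user(toy)
print([(r["user_id"], r["date"]) for r in test])
```

Every user with held-out rows still appears in the training set, so this split measures prediction of an existing user's next purchase rather than behavior for unseen users.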

upvoted 2 times

...

cox1960

1 year, 5 months ago

Selected Answer: D

You want different users in the training and testing datasets, which means C or D. In addition, B is wrong since you cannot take 10% of 4-5 transactions per customer. Actually, among B, C and D, only D lets you get exactly 10%.
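The arithmetic behind this objection can be checked with a short sketch. The helper name `smallest_holdout_share` is hypothetical, and the per-customer counts of 4 and 5 come from the question text.

```python
from fractions import Fraction

def smallest_holdout_share(n_interactions):
    """Share of a user's data removed when exactly one interaction is held out."""
    return Fraction(1, n_interactions)

for n in (4, 5):
    ten_percent = Fraction(n, 10)  # 10% of n interactions, less than one whole row
    share = smallest_holdout_share(n)
    print(f"n={n}: 10% is {float(ten_percent)} interactions; "
          f"holding out one row removes {float(share):.0%}")
```

For a customer with 4 or 5 purchases, 10% is less than one interaction, so any per-user split has to hold out at least 20-25% of that customer's history.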

upvoted 2 times

...

AjoseO

1 year, 8 months ago

Selected Answer: B

This method is appropriate because it takes into account the unique buying behavior of each customer and is likely to reflect the latest preferences of the customer. It ensures that the test set contains a representative sample of the most recent customer preferences, which is important in this use case where customer preferences change infrequently over time.

upvoted 1 times

...

aScientist

1 year, 11 months ago

Selected Answer: B

B makes the most business sense. Since customers buy only 4-5 products every 5-10 years, it makes sense to be able to predict future sales from really old data. Splitting off only the recent interactions as the test set is the best way to test how the model performs on historically 'recent' data.

upvoted 1 times

...
