Exam AWS Certified Machine Learning - Specialty topic 1 question 130 discussion

A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company's products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer's preferences. Below is a sample of the data available to the data scientist.

How should the data scientist split the dataset into a training and test set for this use case?

  • A. Shuffle all interaction data. Split off the last 10% of the interaction data for the test set.
  • B. Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
  • C. Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set.
  • D. Randomly select 10% of the users. Split off all interaction data from these users for the test set.
Suggested Answer: B

Comments

[Removed]
Highly Voted 3 years, 1 month ago
I would select B, straight from this AWS example: https://aws.amazon.com/blogs/machine-learning/building-a-customized-recommender-system-in-amazon-sagemaker/
upvoted 26 times
ttsun
2 years, 11 months ago
The blog didn't mention anything about sample selection. How is B arrived at?
upvoted 3 times
...
...
NicZ1111
Highly Voted 2 years, 11 months ago
I think the answer is D because customers buy only 4-5 products every 5-10 years, so it doesn't make sense to take 10% of each user's interactions as a test set.
upvoted 8 times
jrff
1 year, 12 months ago
Yes, agree. Answer should be D
upvoted 2 times
...
VinceCar
1 year, 11 months ago
B. Recommendations should use historical data to predict future actions. B uses the older records to predict the newer records. D uses 90% of the users to predict the other 10%, and that 90% is irrelevant to the other 10%.
upvoted 2 times
...
...
kawaimahiro
Most Recent 5 months ago
There is no difference between A and D, so I prefer B as the answer
upvoted 3 times
...
kyuhuck
8 months, 2 weeks ago
Selected Answer: D
The best way to split the dataset into a training and test set for this use case is to randomly select 10% of the users and split off all interaction data from these users for the test set. This is because the company relies on a steady stream of new customers, so the test set should reflect the behavior of new customers who have not been seen by the model before. The other options are not suitable because they either mix old and new customers in the test set (A and B), or they bias the test set towards users with less interaction data (C). References: Amazon SageMaker Developer Guide: Train and Test Datasets; Amazon Personalize Developer Guide: Preparing and Importing Data.
upvoted 1 times
...
praveenaws
9 months, 2 weeks ago
Selected Answer: D
If the primary concern is to evaluate the model's performance on completely new users, then option D would be more appropriate.
upvoted 2 times
...
u_b
11 months, 2 weeks ago
I'd also take time into consideration, since even for such long-lived products there might be trends, regulations, or other factors that make customers prefer one product over another. That rules out A and D. C will not give you a test set of the desired size, so it is out too. That leaves B.
upvoted 2 times
...
sonoluminescence
11 months, 4 weeks ago
Selected Answer: D
If the primary concern is to evaluate the model's performance on completely new users (which seems to be the case for the company in question), then option D would be more appropriate.
upvoted 2 times
...
DimLam
12 months ago
Selected Answer: D
I would choose D. According to the question, because of the nature of the products, the company doesn't rely on historical customer-product interactions for recommendations. It relies on customers' explicit preferences, which are gathered at sign-up. The company wants to make recommendations for these new users, who are its main source of revenue. To conduct thorough testing, the company needs to simulate new users, not existing ones. To do that, we need to randomly choose some percentage of users, remove all of their transactions from the training set, and use those transactions only in the test set.
upvoted 2 times
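To make the user-level holdout described above concrete, here is a minimal sketch in plain Python. The tuple layout `(user_id, item_id, year)`, the function name `split_by_user`, and the generated toy data are all assumptions for illustration; the question's actual data sample is not reproduced here.

```python
import random

def split_by_user(interactions, test_fraction=0.1, seed=0):
    """Hold out every interaction of a randomly chosen subset of users."""
    users = sorted({row[0] for row in interactions})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(rng.sample(users, n_test))
    train = [row for row in interactions if row[0] not in test_users]
    test = [row for row in interactions if row[0] in test_users]
    return train, test, test_users

# Hypothetical toy data: 20 users with 4 purchases each.
interactions = [
    (f"u{u}", f"p{(u + k) % 7}", 2010 + k)
    for u in range(20)
    for k in range(4)
]

train, test, test_users = split_by_user(interactions)

# No held-out user appears in the training set.
print(len(test_users), len(train), len(test))
```

With this kind of split the model never sees a test user during training, which is the property the "new customer" argument depends on.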
...
Rejju
1 year ago
Selected Answer: B
By selecting the most recent interactions for each user, you are simulating the scenario of having new customers in your test set. This method allows you to assess how well the model generalizes to both existing and new users.
upvoted 3 times
...
loict
1 year, 1 month ago
Selected Answer: D
A. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off interactions, we introduce leakage as the same user will be present in train & test
B. NO - the data is denormalized and users' preferences are present in multiple rows in the interactions; if we split off based on the interaction, we introduce leakage as the same user will be present in train & test
C. NO - bias
D. YES - no bias and user-based
upvoted 3 times
...
loict
1 year, 1 month ago
Selected Answer: B
A. NO - introduces a bias in the training set (old interactions) vs. test set (new interactions)
C. NO - will have a very sparse test set
B. NO - the same user will be present in the training and test set; we want a user-based model, not an interaction-based one, so a user should belong to only one set
D. YES - last remaining option
upvoted 3 times
...
Mickey321
1 year, 1 month ago
Selected Answer: B
Changing to B
upvoted 2 times
...
Mickey321
1 year, 1 month ago
Selected Answer: D
It's between B and D, but the issue is the 4-5 transactions every 5-10 years, so taking the last 10% of each user's transactions is difficult. So going for D.
upvoted 2 times
...
AmitGSL
1 year, 4 months ago
Selected Answer: B
I would select B as it is time series data. Order might be important. So for each user, last 10% of transactions ordered by date could be a good answer.
upvoted 2 times
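A rough sketch of the per-user, most-recent split described above, in plain Python. The tuple layout `(user_id, item_id, year)`, the function name, and the toy rows are hypothetical. Holding out at least one interaction per user, and leaving single-interaction users entirely in training, are assumptions of this sketch, not something option B specifies.

```python
from collections import defaultdict

def split_most_recent_per_user(interactions, test_percent=10):
    """Move each user's most recent interactions into the test set."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r[2])  # oldest first
        if len(rows) < 2:
            train.extend(rows)  # assumption: nothing to hold out
            continue
        # Assumption: hold out at least one row even when 10% rounds to 0.
        n_test = max(1, len(rows) * test_percent // 100)
        train.extend(rows[:-n_test])
        test.extend(rows[-n_test:])
    return train, test

# Hypothetical toy data.
interactions = [
    ("alice", "sofa", 2012), ("alice", "bed", 2015),
    ("alice", "desk", 2019), ("alice", "lamp", 2023),
    ("bob", "bed", 2014), ("bob", "sofa", 2021),
]

train, test = split_most_recent_per_user(interactions)
print(test)  # [('alice', 'lamp', 2023), ('bob', 'sofa', 2021)]
```

Every user in the test set also appears in the training set here, which is exactly the point the D voters object to.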
...
cox1960
1 year, 5 months ago
Selected Answer: D
You want different users in training and in testing datasets, which is C or D. In addition, B is wrong since you cannot take 10% of 4-5 transactions per customer. Actually, between B, C and D, only in D you can get exactly 10%.
upvoted 2 times
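The arithmetic behind the "cannot take 10% of 4-5 transactions" point can be checked directly. This is only an illustrative calculation; the helper name `holdout_options` is made up for this sketch.

```python
def holdout_options(n_interactions, test_fraction=0.1):
    """Return the exact 10% count, its floor, and the share if one row is held out."""
    exact = n_interactions * test_fraction
    rounded_down = int(exact)
    share_if_one = 1 / n_interactions
    return exact, rounded_down, share_if_one

for n in (4, 5):
    exact, floor_count, share = holdout_options(n)
    print(f"{n} interactions: 10% = {exact:.1f} rows, "
          f"floor = {floor_count}, holding out 1 row = {share:.0%}")
# 4 interactions: 10% = 0.4 rows, floor = 0, holding out 1 row = 25%
# 5 interactions: 10% = 0.5 rows, floor = 0, holding out 1 row = 20%
```

So a per-user split either holds out nothing or holds out 20-25% of each user's history, never 10%.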
...
AjoseO
1 year, 8 months ago
Selected Answer: B
This method is appropriate because it takes into account the unique buying behavior of each customer and is likely to reflect the latest preferences of the customer. It ensures that the test set contains a representative sample of the most recent customer preferences, which is important in this use case where customer preferences change infrequently over time.
upvoted 1 times
...
aScientist
1 year, 11 months ago
Selected Answer: B
B makes the most business sense. Since customers buy only 4-5 products every 5-10 years, it makes sense to be able to predict future sales from really old data. Splitting off only the recent interactions for the test set is the best way to test model performance on historically 'recent' data.
upvoted 1 times
...