
Exam AWS Certified Machine Learning - Specialty topic 1 question 80 discussion

A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, that the large number of features slows down training significantly, and that there are some overfitting issues.
The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset.
Which feature engineering technique should the Data Scientist use to meet the objectives?

  • A. Run self-correlation on all features and remove highly correlated features
  • B. Normalize all numerical values to be between 0 and 1
  • C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features
  • D. Cluster raw data using k-means and use sample data from each cluster to build a new dataset
Suggested Answer: C
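The suggested answer can be sketched numerically. Below is a minimal PCA-via-SVD demonstration (NumPy only; the dataset, its sizes, and the 99% variance target are hypothetical stand-ins for the scenario in the question), showing that a handful of principal components can replace many correlated raw features while retaining almost all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the applicant dataset: 200 rows, 50 raw attributes,
# deliberately built so many columns are highly correlated.
base = rng.normal(size=(200, 5))                       # 5 underlying signals
mixing = rng.normal(size=(5, 50))
X = base @ mixing + 0.01 * rng.normal(size=(200, 50))  # 50 correlated features

# PCA via SVD on the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)                        # explained-variance ratios

k = int(np.searchsorted(np.cumsum(var_ratio), 0.99) + 1)  # components for 99% variance
Z = Xc @ Vt[:k].T                                      # new, uncorrelated features

print(X.shape, "->", Z.shape)                          # far fewer columns, >=99% variance kept
```

In practice the same reduction is usually done with `sklearn.decomposition.PCA(n_components=0.99)`, which picks the number of components needed to reach a target explained-variance ratio.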

Comments

ahquiceno
Highly Voted 3 years, 7 months ago
Answer C. We need to reduce the features while preserving the information in them; this is achieved using PCA.
upvoted 26 times
Dr_Kiko
3 years, 5 months ago
"without losing a lot of information from the original dataset"? Since when does PCA retain all the information?
upvoted 3 times
...
VinceCar
2 years, 5 months ago
PCA helps to speed up the training
upvoted 4 times
...
...
[Removed]
Highly Voted 3 years, 6 months ago
Answer is A, because one must avoid the information loss that PCA or autoencoders introduce through new features (https://www.i2tutorials.com/what-are-the-pros-and-cons-of-the-pca/). Otherwise, I would go with C.
upvoted 6 times
SophieSu
3 years, 6 months ago
If you REMOVE highly correlated features (that means in pairs), the model loses a lot of information.
upvoted 4 times
...
rodrigus
2 years, 1 month ago
A doesn't make sense. Self-correlation is for time series data, not for pairwise correlation.
upvoted 2 times
...
...
xicocaio
Most Recent 7 months ago
Selected Answer: A
This question can be misleading. I would choose A if self-correlation here means pairwise correlation, which is the most typical approach in real life. But if self-correlation means autocorrelation as in time-series analysis, then it is wrong. Issues with answer C: autoencoders are notorious for being hard to interpret. With PCA it is possible, but definitely not easy if you have a large dataset. In real life, in this scenario you would usually go with pairwise correlation as the simplest yet effective approach.
upvoted 1 times
...
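For comparison, the pairwise-correlation filter behind option A (the approach the comment above describes) can be sketched like this; the threshold, the toy data, and the greedy drop rule are illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
# Three toy features: column 1 is a near-copy of column 0, column 2 is independent.
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=300), rng.normal(size=300)])

corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute pairwise correlation matrix

# Greedy filter: walk the upper triangle and drop the second feature of any
# pair whose |correlation| exceeds the threshold.
threshold = 0.95
drop = set()
n = corr.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if i not in drop and j not in drop and corr[i, j] > threshold:
            drop.add(j)

keep = [i for i in range(n) if i not in drop]
X_reduced = X[:, keep]   # the near-duplicate column is gone
```

Note that with thousands of raw attributes this pairwise scan is O(d²) in the number of features, which is part of why several commenters prefer PCA for this scenario.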
Giodefa96
8 months, 4 weeks ago
Selected Answer: C
Answer is C
upvoted 1 times
...
geoan13
1 year, 5 months ago
Answer C PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables. Therefore, PCA can effectively eliminate multicollinearity between features. https://towardsdatascience.com/how-do-you-apply-pca-to-logistic-regression-to-remove-multicollinearity-10b7f8e89f9b#:~:text=PCA%20(Principal%20Component%20Analysis)%20takes,effectively%20eliminate%20multicollinearity%20between%20features.
upvoted 1 times
...
Mickey321
1 year, 8 months ago
Selected Answer: C
Option C
upvoted 1 times
...
Mickey321
1 year, 8 months ago
Selected Answer: C
An autoencoder is a type of neural network that can learn a compressed representation of the input data, called the latent space, by encoding and decoding the data through multiple hidden layers. PCA is a statistical technique that can reduce the dimensionality of the data by finding a set of orthogonal axes, called the principal components, that capture the most variance in the data. Both methods can transform the original features into new features that are lower-dimensional, uncorrelated, and informative.
upvoted 1 times
...
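The autoencoder half of answer C can be illustrated with a minimal linear autoencoder trained by plain gradient descent (NumPy only; the architecture, learning rate, and iteration count are arbitrary choices for this sketch, and a real credit-scoring pipeline would use a deep-learning framework with nonlinear layers):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: 20 correlated features generated from 5 underlying factors.
raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 20))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # standardized features

d, k = X.shape[1], 5                 # bottleneck width (hypothetical choice)
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc                    # compressed representation (the new features)
    X_hat = Z @ W_dec                # reconstruction from the bottleneck
    err = X_hat - X
    # Gradients of the mean squared reconstruction error
    g_dec = Z.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# Reconstruction error shrinks as the bottleneck learns a compressed representation.
mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

A linear autoencoder like this learns essentially the same subspace as PCA; the practical appeal of autoencoders is that adding nonlinear activations lets them capture structure PCA cannot.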
kaike_reis
1 year, 8 months ago
Selected Answer: C
C is correct. Self-correlation is for time series, which is not mentioned here. Besides that, even if it meant plain correlation, try doing that with thousands of features...
upvoted 1 times
...
vbal
1 year, 11 months ago
A. Run a correlation matrix and remove highly correlated features.
upvoted 1 times
...
JK1977
1 year, 11 months ago
Selected Answer: C
PCA for feature reduction
upvoted 1 times
...
GOSD
1 year, 11 months ago
is it just me or is every 15th answer here PCA?
upvoted 2 times
...
oso0348
2 years ago
Selected Answer: C
Using an autoencoder or PCA can help reduce the dimensionality of the dataset by creating new features that capture the most important information in the original dataset while discarding some of the noise and highly correlated features. This can help speed up the training time and reduce overfitting issues without losing a lot of information from the original dataset. Option A may remove too many features and may not capture all the important information in the dataset, while option B only rescales the data and does not address the issue of highly correlated features. Option D is not a feature engineering technique and may not be an effective way to reduce the dimensionality of the dataset.
upvoted 1 times
...
Paolo991
2 years, 1 month ago
Selected Answer: C
PCA builds new features starting from highly correlated ones, so it matches the question.
upvoted 1 times
...
Sneep
2 years, 3 months ago
It's C. The Data Scientist should use principal component analysis (PCA) to replace the original features with new features. PCA is a technique that reduces the dimensionality of a dataset by projecting it onto a lower-dimensional space, while preserving as much of the original variation as possible. This can help to speed up the training time of the model and reduce overfitting issues, without losing a significant amount of information from the original dataset.
upvoted 1 times
...
Aninina
2 years, 3 months ago
Selected Answer: C
C: PCA is the solution
upvoted 1 times
...
ovokpus
2 years, 10 months ago
Selected Answer: C
Correction to C. Removing correlated features from hundreds of columns will be tedious and time-consuming. PCA is the way to go here. Apologies for the flip.
upvoted 2 times
...
ovokpus
2 years, 10 months ago
Selected Answer: A
Answer is A. Eliminate features that are highly correlated. This will not compromise the quality of the feature space as much as PCA would.
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other (20%)
