While conducting an exploratory analysis of a dataset, you discover that categorical feature A has substantial predictive power, but it is sometimes missing. What should you do?
A.
Drop feature A if more than 15% of values are missing. Otherwise, use feature A as-is.
B.
Compute the mode of feature A and then use it to replace the missing values in feature A.
C.
Replace the missing values with the values of the feature with the highest Pearson correlation with feature A.
D.
Add an additional class to categorical feature A for missing values. Create a new binary feature that indicates whether feature A is missing.
ans: D
A => No, you don't want to drop a feature with high predictive power.
B => I think this could confuse the model. A better option would be to fill the missing values with an algorithm like Expectation-Maximization; using the mode seems like a bad idea here, because with a significant share of missing values (say, >10%) it would distort the feature's predictive power. You don't want to dilute the predictive power of a feature; you want to guide the model to learn when to use that feature and when to ignore it.
C => This doesn't make any sense to me; I'm not sure why I would do that.
D => I think this could be a really good approach, and I'm pretty sure it would work well with a lot of models. The model would learn that when "is_available_feat_A" == True it should use feature A, and whenever it is missing it should fall back on other features.
I guess I would go with D, but what confuses me is that option D doesn't say the NaN values are replaced (only that a new column is added), and that could lead to problems such as exploding gradients (a quick sketch addressing this is below).
Plus, Google encourages replacing missing values. https://developers.google.com/machine-learning/testing-debugging/common/data-errors
Any thoughts on this?
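For what it's worth, here is a minimal pandas sketch of what option D could look like. The DataFrame, column names, and values are made up for illustration; the point is that the NaNs end up replaced by an explicit "Missing" class, so no NaNs ever reach the model:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; "feature_A" and "target" are
# hypothetical column names used only for this sketch.
df = pd.DataFrame({
    "feature_A": ["red", "blue", np.nan, "red", np.nan, "green"],
    "target":    [1, 0, 1, 1, 0, 0],
})

# New binary feature: 1 where feature_A is missing, 0 otherwise.
df["feature_A_missing"] = df["feature_A"].isna().astype(int)

# Additional class for missing values: after this step every row has
# a valid category and there are no NaNs left in the column.
df["feature_A"] = df["feature_A"].fillna("Missing")

print(df)
```

A tree-based model (or a linear model after one-hot encoding) can then split on feature_A_missing and effectively learn when to rely on feature A and when to ignore it.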
B
"For categorical variables, we can usually replace missing values with mean, median, or most frequent values"
Dr. Logan Song - Journey to Become a Google Cloud Machine Learning Engineer - Page 48
While this approach may seem reasonable, it can introduce bias in the dataset by over-representing the mode, especially if the missing values are not missing at random.
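For comparison, a minimal scikit-learn sketch of what option B (mode imputation) could look like; the column name and data are hypothetical, and the value counts at the end make the over-representation of the mode visible:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data; "feature_A" is a placeholder column name.
X = pd.DataFrame({"feature_A": ["red", "blue", np.nan, "red", np.nan]})

# strategy="most_frequent" replaces NaN with the mode of the column.
imputer = SimpleImputer(strategy="most_frequent")
X["feature_A"] = imputer.fit_transform(X[["feature_A"]]).ravel()

print(X["feature_A"].value_counts())  # the mode ("red") is now over-represented
```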
Options B or D
But isn't there an inconsistency in option D? If you replace missing values with a new category ("missing"), why would you have to create an extra feature?
By creating a new class for the missing values, you explicitly capture the absence of data, which can provide valuable information for predictive modeling. Additionally, creating a binary feature allows the model to distinguish between cases where feature A is present and cases where it is missing, which can be useful for identifying potential patterns or relationships in the data.
By imputing the missing values with the mode (the most frequent value), you retain the original feature's predictive power while handling the missing values.
I think it's D.
Option B of imputing the missing values of feature A with the mode of feature A could be a reasonable approach if the mode provides a good representation of the distribution of feature A. However, this method may lead to biased results if the mode is not representative of the missing values. This could be the case if the missing values have a different distribution than the observed values.
That said, when a categorical feature has substantial predictive power, it is important not to discard it. Instead, missing values can be handled by adding an additional class for them and creating a new binary feature that indicates whether feature A is missing. This approach retains the predictive power of feature A while accounting for the missing values. Computing the mode of feature A and replacing missing values with it may distort the distribution of the feature and introduce bias into the analysis. Likewise, replacing missing values with values from another feature may introduce noise and lead to incorrect results.
If our objective were to produce a complete dataset, then we might use some average value to fill in the gaps (option B), but in this case we want to predict an outcome, so inventing our own data is not going to help, in my view.
Option D is the most sensible approach to let the model choose the best features.