Exam Certified Machine Learning Associate topic 1 question 7 discussion

Actual exam question from Databricks's Certified Machine Learning Associate

Question #: 7
Topic #: 1

[All Certified Machine Learning Associate Questions]

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?

A. One-hot encoding is not supported by most machine learning libraries.
B. One-hot encoding is dependent on the target variable’s values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Show Suggested Answer

Suggested Answer: E 🗳️

by EricP99 at June 14, 2024, 11:39 p.m.

Comments

Submit Cancel

Shuttle

3 weeks ago

Selected Answer: E

E for sure. Some algorithms do not support or work less efficient with one-hot encoding.

upvoted 1 times

...

oliver29

7 months, 1 week ago

Selected Answer: C

One-hot encoding introduces a significant computational overhead, particularly when there are many categories in the feature variables. This is because it creates a new binary feature for each category, resulting in a high-dimensional representation. Such transformations are typically more efficient when performed on smaller, task-specific training datasets rather than on a global feature repository that serves multiple applications. Encoding in the feature repository also risks being rigid and unable to adapt to new categories, which further supports your argument.

upvoted 2 times

...

jaydip650

8 months, 3 weeks ago

. One-hot encoding is dependent on the target variable’s values which differ for each application. This one doesn't hold up because one-hot encoding isn't dependent on the target variable at all. It only represents categorical features and their individual categories as binary vectors. E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms. Certain algorithms, like tree-based models, are less sensitive to the one-hot encoding and might perform better with other encoding techniques. However, the primary issue often boils down to the high dimensionality that one-hot encoding can introduce, which can affect algorithms that don't handle sparse data well. So, you're still left with C as the best justification.

upvoted 2 times

...

8605246

1 year, 1 month ago

It might actually be E. According to these docs, this is the reason why the change was introduced was to allow algorithms that expect continuous features, such as logistic regression to use categorical features

upvoted 3 times

...

EricP99

1 year, 1 month ago

Correct answer B

upvoted 1 times

...