Exam Certified Machine Learning Associate topic 1 question 21 discussion

Actual exam question from Databricks's Certified Machine Learning Associate

Question #: 21
Topic #: 1

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?

A. One-hot encoding categorical features
B. Target encoding categorical features
C. Imputing missing feature values with the mean
D. Imputing missing feature values with the true median
E. Creating binary indicator features for missing values

Suggested Answer: D

by EricP99 at June 15, 2024, 12:18 a.m.

Comments

weslleylc

7 months, 1 week ago

Selected Answer: D

D. Computing the true (exact) median is expensive in a distributed system because it depends on the global ordering of all values, which means sorting, shuffling data across partitions, and coordinating between nodes. In contrast, the mean is cheap to distribute: each partition only needs to report its sum and count, and those partial results are combined at the end.
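To make the contrast concrete, here is a minimal pure-Python sketch that simulates partitions as plain lists. The data and the helper names (`partial_mean_state`, `merge_mean`, `exact_median`) are hypothetical and only illustrate the idea; this is not how Spark or any particular engine implements these operations.

```python
from statistics import median

# Hypothetical data split across three partitions (workers).
partitions = [[1.0, 9.0, 4.0], [7.0, 2.0], [8.0, 3.0, 6.0, 5.0]]


def partial_mean_state(values):
    """Per-partition summary for the mean: a small (sum, count) pair."""
    return (sum(values), len(values))


def merge_mean(states):
    """Combine the per-partition (sum, count) pairs into the global mean."""
    total = sum(s for s, _ in states)
    count = sum(n for _, n in states)
    return total / count


def exact_median(parts):
    """The exact median depends on the global order of the values, so in
    this sketch every value is gathered in one place before sorting."""
    all_values = [v for part in parts for v in part]
    return median(all_values)


# Mean: each partition sends back only two numbers.
global_mean = merge_mean([partial_mean_state(p) for p in partitions])

# Median: all values have to be brought together.
global_median = exact_median(partitions)

# Combining per-partition medians does not give the true median in general.
median_of_medians = median(median(p) for p in partitions)
```

In this toy data the median of the per-partition medians differs from the true median, which is why an exact median cannot be assembled from small per-partition summaries the way a mean can.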

upvoted 1 time

ricorosol

9 months, 1 week ago

B. Target encoding involves replacing each category of a categorical variable with a statistic related to the target variable (like the mean of the target for that category).

upvoted 2 times

EricP99

1 year ago

I would argue that the answer is B. Target encoding (also known as mean encoding) replaces each category in a categorical feature with the mean of the target variable for that category. This process is more complex and challenging to distribute efficiently because it requires calculating the mean target value for each category and then applying it back to every row.
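For reference, the per-category mean described here can be sketched in pure Python with partitions simulated as lists. The rows and the helper names (`partial_stats`, `merge_stats`) are hypothetical, and the sketch leaves out the smoothing and out-of-fold computation that practical target encoders usually add to limit target leakage.

```python
from collections import defaultdict

# Hypothetical (category, target) rows split across two partitions.
partitions = [
    [("red", 1), ("blue", 0), ("red", 0)],
    [("blue", 1), ("red", 1), ("red", 1)],
]


def partial_stats(rows):
    """Per-partition summary: {category: [target_sum, row_count]}."""
    stats = defaultdict(lambda: [0, 0])
    for category, target in rows:
        stats[category][0] += target
        stats[category][1] += 1
    return stats


def merge_stats(all_stats):
    """Add up the per-partition sums and counts, then divide per category."""
    merged = defaultdict(lambda: [0, 0])
    for stats in all_stats:
        for category, (target_sum, row_count) in stats.items():
            merged[category][0] += target_sum
            merged[category][1] += row_count
    return {c: s / n for c, (s, n) in merged.items()}


# Step 1: compute the category -> mean-target mapping from partial results.
encoding = merge_stats(partial_stats(p) for p in partitions)

# Step 2: apply the mapping to each partition independently.
encoded = [[encoding[c] for c, _ in rows] for rows in partitions]
```

Note that the first step only exchanges one sum and one count per category per partition, and the second step is a lookup applied row by row. Readers can weigh that against the cost of an exact median when comparing options B and D.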

upvoted 3 times
