Exam Professional Machine Learning Engineer topic 1 question 48 discussion

Actual exam question from Google's Professional Machine Learning Engineer

You started working on a classification problem with time series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of
99% for training data after just a few experiments. You haven't explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?

  • A. Address the model overfitting by using a less complex algorithm.
  • B. Address data leakage by applying nested cross-validation during model training.
  • C. Address data leakage by removing features highly correlated with the target value.
  • D. Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
Suggested Answer: B

Comments

Paul_Dirac
Highly Voted 3 years, 5 months ago
Ans: B. (Ref: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9) As for (C): high correlation doesn't mean leakage. The question may suggest target leakage, and the defining point of that kind of leakage is that the feature's data only becomes available after the target is available. (https://www.kaggle.com/dansbecker/data-leakage)
upvoted 27 times
Jarek7
1 year, 4 months ago
This ref doesn't explain WHY we should use NCV in this case - it just explains HOW to use NCV when dealing with time series. Cross-validation, including nested cross-validation, is a powerful tool for model evaluation and hyperparameter tuning, but it does NOT DIRECTLY ADDRESS data leakage. Data leakage refers to a situation where information from the test dataset leaks into the training dataset, causing the model to have an unrealistically high performance. Nested cross-validation can indeed help provide a more accurate estimation of the model's performance on unseen data, but IT DOESN'T SOLVE the underlying issue of data leakage if it's already present.
upvoted 6 times
...
...
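For readers unfamiliar with the technique the thread debates: a minimal sketch (not from any commenter, assuming scikit-learn) of nested cross-validation with `TimeSeriesSplit`, so that every validation fold is strictly later in time than its training data and the inner hyperparameter search never sees the outer test folds:

```python
# Sketch: nested cross-validation for time-series classification.
# Toy data stands in for the question's real dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                          # time-ordered features
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)   # toy binary target

inner_cv = TimeSeriesSplit(n_splits=3)   # folds for the hyperparameter search
outer_cv = TimeSeriesSplit(n_splits=5)   # folds for the performance estimate

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)
# Each outer score is computed on data the inner search never touched,
# and TimeSeriesSplit guarantees test folds come after their training data.
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(scores.mean())
```

Note that, as Jarek7 points out above, this guards against look-ahead bias and optimistic estimates; it does not remove a leaky feature that is already in the dataset.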
John_Pongthorn
Highly Voted 1 year, 8 months ago
Selected Answer: C
C: this is the correct choice, 1000000000%. This is a data leakage issue in the training data: https://cloud.google.com/automl-tables/docs/train#analyze (the question is drawn from this content). "If a column's Correlation with Target value is high, make sure that is expected, and not an indication of target leakage." To explain it in my own way: sometimes a feature in the training data is unintentionally calculated from the target value, which results in a high correlation between the two. For instance, you predict a stock price using moving average, MACD, and RSI, despite the fact that all three features have been calculated from the price (the target).
upvoted 8 times
black_scissors
1 year, 5 months ago
I agree. Besides, when CV is done randomly (not split by time point), it can make things worse.
upvoted 2 times
...
...
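The correlation check that the AutoML Tables doc cited above recommends can be sketched in a few lines (an illustration with made-up column names, assuming pandas): rank features by absolute correlation with the target, and treat a near-perfect correlation as a leakage red flag rather than a lucky find.

```python
# Sketch: flagging possible target leakage via feature-target correlation.
# Toy DataFrame; `leaky_feature` is derived from the target itself,
# mimicking the MACD/RSI-computed-from-price example above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"honest_feature": rng.normal(size=200)})
df["target"] = (df["honest_feature"] + rng.normal(size=200) > 0).astype(int)
df["leaky_feature"] = df["target"] + rng.normal(scale=0.01, size=200)

# Absolute correlation of every feature column with the target, descending.
corr = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr)  # leaky_feature sits near 1.0; honest_feature is well below
```

As several commenters note, a high correlation is only a signal to investigate: you still have to check whether the feature would actually be available at prediction time.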
Foxy2021
Most Recent 1 month, 1 week ago
Selected answer: C. Reason: while B (nested cross-validation) helps improve the evaluation process and prevents over-optimistic performance estimates, it doesn't tackle the root cause of data leakage. Data leakage is often caused by features that are too closely tied to the target; in this case, the unusually high AUC suggests that the model is gaining unfair information.
upvoted 2 times
...
chirag2506
5 months ago
Selected Answer: B
B is the correct option
upvoted 1 times
...
PhilipKoku
5 months, 2 weeks ago
Selected Answer: C
C) Is the best answer
upvoted 1 times
...
girgu
6 months ago
Selected Answer: C
Nested cross-validation will not work for time series data; time series data require an expanding-window training set. It seems most likely the issue is high correlation in columns.
upvoted 1 times
...
AnnaR
7 months ago
B: correct. I considered C, but why should we remove a feature of a highly predictive nature? For me, that does not explain the problem of overfitting... a highly predictive feature is also useful for good performance evaluated on the test set. --> Decided for B!
upvoted 2 times
...
gscharly
7 months, 1 week ago
Selected Answer: B
agree with Paul_Dirac
upvoted 1 times
...
b1a8fae
11 months ago
Selected Answer: B
I initially went with B; however, after reading this: https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/ I think C is right. Quoted from the link: "Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset." Overfitting is exactly our problem here. Correlated features in the dataset may be a sign of data leakage, but not necessarily.
upvoted 1 times
...
Sum_Sum
1 year ago
Selected Answer: B
I think it's B. GPT-4 makes a good argument about C: while removing correlated features is a valid approach to handling data leakage, it might not be sufficient if the leakage is due to reasons other than high correlation, such as temporal leakage in time-series data.
upvoted 1 times
...
pico
1 year, 2 months ago
Selected Answer: A
Option A: This option is a reasonable choice. Switching to a less complex algorithm can help reduce overfitting, and using k-fold cross-validation can provide a better estimate of how well the model will generalize to unseen data. It's essential to ensure that the high performance isn't solely due to overfitting.
upvoted 1 times
pico
1 year, 2 months ago
Option B: Nested cross-validation is primarily used to estimate model performance accurately and select the best model hyperparameters. While it's a good practice, it doesn't directly address the overfitting issue. It helps prevent over-optimistic model performance estimates but doesn't necessarily fix the overfitting problem.

Option C: Removing features highly correlated with the target value can be a valid step in feature selection or preprocessing. However, it doesn't directly address the overfitting issue or explain why the model is performing exceptionally well on the training data. It's a separate step from mitigating overfitting.

Option D: This option is incorrect. Tuning hyperparameters should aim to improve model performance on the validation set, not reduce it.

In summary, the most appropriate next step is Option A.
upvoted 2 times
...
...
atlas_lyon
1 year, 3 months ago
Selected Answer: B
B: If splits are done chronologically (as is always advised), nested CV should work. C: High correlation with the target means we have to check whether it reflects strong explanatory power or data leakage. Dropping the features won't help us distinguish between those cases, but it may help reveal the independent contribution of the remaining features.
upvoted 1 times
...
tavva_prudhvi
1 year, 3 months ago
Selected Answer: B
Option C is a good step to avoid overfitting, but it's not necessarily the best approach to address data leakage. Data leakage occurs when information from the validation or test data leaks into the training data, leading to overly optimistic performance metrics. In time-series data, it's important to avoid using future information to predict past events. Removing features highly correlated with the target value may help to reduce overfitting, but it does not necessarily address data leakage. Therefore, applying nested cross-validation during model training is a better approach to address data leakage in this scenario.
upvoted 2 times
...
Jarek7
1 year, 4 months ago
Selected Answer: C
https://towardsdatascience.com/avoiding-data-leakage-in-timeseries-101-25ea13fcb15f directly says: "Dive straight into the MVP, cross-validate later!" (MVP stands for Minimum Viable Product.)
upvoted 1 times
...
Liting
1 year, 4 months ago
Selected Answer: B
Agree with Paul_Dirac. Also, it is recommended to use nested cross-validation to avoid data leakage in time series data.
upvoted 1 times
...
black_scissors
1 year, 5 months ago
Selected Answer: C
There can be a feature causing data leakage which might have been overlooked. In addition, when cross-validation is done randomly, the leakage can be even bigger.
upvoted 1 times
...
M25
1 year, 6 months ago
Selected Answer: B
Went with B
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other