Exam Professional Machine Learning Engineer topic 1 question 140 discussion

Actual exam question from Google's Professional Machine Learning Engineer

Question #: 140
Topic #: 1

You work for a retailer that sells clothes to customers around the world. You have been tasked with ensuring that ML models are built in a secure manner. Specifically, you need to protect sensitive customer data that might be used in the models. You have identified four fields containing sensitive data that are being used by your data science team: AGE, IS_EXISTING_CUSTOMER, LATITUDE_LONGITUDE, and SHIRT_SIZE. What should you do with the data before it is made available to the data science team for training purposes?

A. Tokenize all of the fields using hashed dummy values to replace the real values.
B. Use principal component analysis (PCA) to reduce the four sensitive fields to one PCA vector.
C. Coarsen the data by putting AGE into quantiles and rounding LATITUDE_LONGITUDE into single precision. The other two fields are already as coarse as possible.
D. Remove all sensitive data fields, and ask the data science team to build their models using non-sensitive data.

Suggested Answer: A 🗳️

by mil_spyro at Dec. 13, 2022, 6:54 p.m.

Comments

desertlotus1211

3 months, 4 weeks ago

Selected Answer: C

Answer A will strip out the ordinal and numerical relationships present in the data, which can be crucial for model performance.

upvoted 1 times

phani49

6 months, 2 weeks ago

Selected Answer: C

AGE into quantiles:
• Age is a continuous variable and highly sensitive. Converting it into quantiles (e.g., age ranges) reduces granularity and protects individual privacy while preserving utility for modeling.
Rounding LATITUDE_LONGITUDE:
• Latitude and longitude provide precise location information, which can lead to privacy risks. Rounding to single precision (e.g., reducing decimal places) anonymizes the data while retaining geographical relevance for modeling.
Existing fields (IS_EXISTING_CUSTOMER and SHIRT_SIZE):
• These fields are already coarse and unlikely to reveal sensitive information directly (e.g., boolean for IS_EXISTING_CUSTOMER and categorical for SHIRT_SIZE), so no further processing is required.
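The coarsening described in option C can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a reference implementation: the helper names `quantile_bin` and `coarsen_lat_lon`, the sample ages, and the sample coordinates are all hypothetical, and "single precision" is interpreted here as rounding to a fixed number of decimal places, which is an assumption.

```python
import bisect
import statistics


def quantile_bin(values, n_bins=4):
    """Map each value to the index (0 .. n_bins - 1) of its quantile bin."""
    # statistics.quantiles returns the n_bins - 1 interior cut points.
    cuts = statistics.quantiles(values, n=n_bins)
    # bisect_right counts how many cut points are <= the value.
    return [bisect.bisect_right(cuts, v) for v in values]


def coarsen_lat_lon(lat, lon, decimals=1):
    """Round coordinates; 0.1 degree of latitude is roughly 11 km."""
    return (round(lat, decimals), round(lon, decimals))


# Hypothetical sample data, for illustration only.
ages = [18, 22, 25, 31, 38, 44, 52, 67]
age_bins = quantile_bin(ages)
location = coarsen_lat_lon(37.42199, -122.08405)

print(age_bins)   # bin index per customer, ordered like the ages
print(location)   # coarsened (lat, lon) pair
```

Because the bin indices keep the ordering of the original ages, a model can still learn trends such as "older customers buy X", while the exact age is no longer exposed.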

upvoted 2 times

b7ef5e3

7 months, 2 weeks ago

Selected Answer: C

It is between A and C; however, A would not work well for linear data like age and lat/long. Hashing creates discrete categories rather than linear values, which makes it difficult to find trends against other data. A might be a more practical choice if it incorporated binning or something similar beforehand.
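A small sketch of why hashing turns a numeric field into unordered categories. It assumes tokenization means a salted SHA-256 digest truncated to 12 hex characters; the `tokenize` helper, the salt value, and the sample ages are hypothetical choices for illustration, not something specified in the question.

```python
import hashlib


def tokenize(value, salt="example-salt"):
    """Replace a value with a truncated salted SHA-256 digest (illustrative)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical ages: three neighbours and one distant value.
ages = [29, 30, 31, 64]
tokens = {age: tokenize(age) for age in ages}

for age, token in tokens.items():
    print(age, token)

# Equality survives: the same age always maps to the same token,
# so the field can still be used as a categorical feature.
same_token = tokenize(30) == tokenize(30)

# Order and distance do not survive: sorting by token gives an
# arrangement unrelated to sorting by age.
ages_sorted_by_token = sorted(ages, key=lambda a: tokens[a])
print(ages_sorted_by_token)
```

The tokens for 29, 30, and 31 are no closer to each other than to the token for 64, which is the information loss the comment describes.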

upvoted 1 times

bobjr

1 year ago

Selected Answer: C

The best approach is C. Coarsen the data by putting AGE into quantiles and rounding LATITUDE_LONGITUDE into single precision. The other two fields are already as coarse as possible. Here's why:
• Preserves Utility: Coarsening the data reduces its sensitivity while retaining some of its informational value for modeling. Age quantiles and approximate location can still be useful features for certain types of models.
• Minimizes Risk: By removing the exact age and precise location, you significantly reduce the risk of re-identification or misuse of sensitive information.
• Practicality: Coarsening is a relatively simple technique to implement and doesn't require complex transformations or additional model training.

upvoted 3 times

pico

1 year, 7 months ago

Selected Answer: D

This approach involves not providing the sensitive fields (AGE, IS_EXISTING_CUSTOMER, LATITUDE_LONGITUDE, and SHIRT_SIZE) to the data science team for model training. Instead, the team can focus on building models using non-sensitive data. This helps to mitigate the risk of exposing sensitive customer information during the development and training process. While options A, B, and C propose different methods of obfuscating or transforming the sensitive data, they may introduce complexities and potential risks. Tokenizing with hashed dummy values (option A) may not be foolproof in terms of security, and PCA (option B) may not effectively retain the necessary information for accurate modeling. Coarsening the data (option C) might still retain some level of identifiable information, and it may not be sufficient for ensuring the privacy of sensitive data.

upvoted 1 times

LFavero

1 year, 4 months ago

why would you remove potential important features from the training?

upvoted 2 times

M25

2 years, 1 month ago

Selected Answer: A

Went with A

upvoted 3 times

TNT87

2 years, 3 months ago

Selected Answer: D

D. Remove all sensitive data fields, and ask the data science team to build their models using non-sensitive data. This is the best approach to protect sensitive customer data. Removing the sensitive fields is the most secure option because it eliminates the risk of any potential data breaches. Tokenizing or coarsening the data may still reveal sensitive information if the hashed dummy values can be reversed or if the coarsening can be used to identify individual customers. PCA can also be a useful technique to reduce dimensionality and protect privacy, but it may not be appropriate in this case because it is not clear how the sensitive fields can be combined into a single PCA vector without losing information.
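The concern that hashed values can be reversed is easy to illustrate for low-cardinality fields. The sketch below assumes an unsalted SHA-256 hash and an attacker who knows the hash function; the helper names and the age range 0 to 120 are hypothetical. It shows a lookup-table attack under those assumptions, not a claim about how any particular tokenization service works.

```python
import hashlib


def sha256_hex(value):
    """Unsalted SHA-256 of the value's string form (assumed scheme)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Precompute hash -> original value for every plausible value."""
    return {sha256_hex(c): c for c in candidates}


# AGE has very few plausible values, so enumerating them all is trivial.
age_lookup = build_lookup(range(0, 121))

# A "protected" token as the data science team might see it.
leaked_token = sha256_hex(42)

# The original value is recovered with a single dictionary lookup.
recovered = age_lookup[leaked_token]
print(recovered)
```

The same enumeration works for IS_EXISTING_CUSTOMER and SHIRT_SIZE, which have even fewer possible values. A keyed hash with a secret the data science team does not hold (for example via the standard library's `hmac` module) would block this particular attack, but it does not restore the ordering that hashing removes.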

upvoted 1 times

tavva_prudhvi

1 year, 11 months ago

Removing all sensitive data fields (Option D) would likely limit the effectiveness of the machine learning model, as important predictive variables would be excluded from the training process. It is important to balance privacy considerations with the need to train accurate models that can provide valuable insights and predictions.

upvoted 1 times

pico

1 year, 7 months ago

But in option A, hashing can result in information loss. While the original values are hidden, the hashed values might not retain the same level of information, which can reduce the effectiveness of the machine learning models.

upvoted 1 times

Scipione_

2 years, 4 months ago

Selected Answer: A

B -> Possible in general, but not suitable in this case, since you don't know whether AGE, IS_EXISTING_CUSTOMER, LATITUDE_LONGITUDE, and SHIRT_SIZE are the first components in PCA.
C -> You are changing data that could be highly correlated with the output.
D -> Same reasoning as C.
Answer 'A' uses hashing, so you encrypt the data without losing relevant information.

upvoted 4 times
