Exam Professional Data Engineer topic 1 question 14 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 14
Topic #: 1

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

  • A. There are very few occurrences of mutations relative to normal samples.
  • B. There are roughly equal occurrences of both normal and mutated samples in the database.
  • C. You expect future mutations to have different features from the mutated samples in the database.
  • D. You expect future mutations to have similar features to the mutated samples in the database.
  • E. You already have labels for which samples are mutated and which are normal in the database.
Suggested Answer: AD

Comments

jvg637
Highly Voted 4 years, 8 months ago
I think that AD makes more sense. D is the explanation you gave. As for the rest, A makes more sense: any anomaly detection algorithm assumes a priori that you have many more "normal" samples than mutated ones, so that you can model the normal pattern and detect samples that are "off" that pattern. For that, the number of normal samples always needs to be much bigger than the number of mutated samples.
upvoted 73 times
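A minimal sketch of the point made in the comment above, not a definitive implementation. It assumes a single numeric feature and a robust score (distance from the median divided by the median absolute deviation); the measurements, the function name `flag_anomalies` and the threshold of 5.0 are all made up for illustration. When mutations are rare (option A), the median describes normal tissue and the mutated value stands out. When the database is balanced (option B), there is no majority to model and nothing is flagged.

```python
import statistics

def flag_anomalies(values, threshold=5.0):
    """Return the values whose robust score exceeds `threshold`.

    The score is |x - median| / MAD, where MAD is the median absolute
    deviation. The default threshold is an illustrative assumption.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the score is undefined for this data")
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical one-feature measurements (not from any real tissue database).
normal = [9.7, 9.8, 9.9, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3]
mutated = [14.8, 15.0, 15.2, 15.1, 14.9, 15.3, 14.7, 15.0, 15.1, 14.9]

rare_case = normal * 2 + mutated[:1]   # 20 normal, 1 mutated (option A)
balanced_case = normal + mutated       # 10 normal, 10 mutated (option B)

print(flag_anomalies(rare_case))       # only the mutated value is flagged
print(flag_anomalies(balanced_case))   # empty: no majority pattern to deviate from
```

In the balanced case the median falls between the two groups, so every sample sits at roughly the same distance from it and none looks unusual.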
BigQuery
2 years, 11 months ago
Guys, it's A & C. Anomaly detection has two basic assumptions: -> Anomalies only occur very rarely in the data. (A) -> Their features differ from the normal instances significantly. (C) Link -> https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Unsupervised%20Anomaly%20Detection%20for%20Univariate%20%26%20Multivariate%20Data.&text=Anomaly%20detection%20has%20two%20basic,from%20the%20normal%20instances%20significantly.
upvoted 19 times
szefco
2 years, 11 months ago
I don't agree on C. Anomaly detection assumes "Their features differ from the NORMAL INSTANCES significantly" and in the C option you have: "You expect future mutations to have different features from the MUTATED SAMPLES IN THE DATABASE". IMHO Answer D fits better: "D. You expect future mutations to have similar features to the mutated samples in the database." - in other words: Expect future anomalies to be similar to the anomalies that we already have in database
upvoted 27 times
...
...
...
jvg637
Highly Voted 4 years, 8 months ago
A instead of B: "anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data."
upvoted 21 times
...
SamuelTsch
Most Recent 1 month ago
Selected Answer: AC
The keyword is unsupervised anomaly detection. So A is correct. We think and should ensure the majority of data represents 'normal'. Unsupervised methods are good for detecting unknown patterns. Thus C could be correct.
upvoted 1 times
SamuelTsch
1 month ago
I'm correcting my answer. AD should be better. Unsupervised methods are usually used for grouping the data. So, if the future mutations have similar features to the mutated samples, our trained model should group them with the anomalies even though no labels exist.
upvoted 1 times
...
...
hendrixlives
2 months ago
Selected Answer: AD
AD: to use unsupervised anomaly detection the anomalies a) must be rare b) they must differ from the NORMAL. So... A: mutated samples must be scarce compared to normal tissue. D: yes, we expect the future mutated samples to have similar features to the mutated samples currently in the database. Why not C? If I train my model with mutated samples with specific characteristics, I do not expect it to find different mutations. In the future, when new mutations appear, I would retrain my model including those new samples.
upvoted 4 times
...
MaxNRG
2 months ago
Selected Answer: AD
Anomaly detection has two basic assumptions:
* Anomalies only occur very rarely in the data.
* Their features differ from the normal instances significantly.
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called “normal” instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: given unlabeled, mostly-normal data, identify the anomalies among them. https://www.science.gov/topicpages/u/unsupervised+anomaly+detection
A, because “Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal”. B is for supervised anomaly detection. https://en.wikipedia.org/wiki/Anomaly_detection
upvoted 4 times
...
gudiking
2 months ago
Selected Answer: AD
A - anomaly detection is used for detecting rare events, meaning it is expected that there are much less of those than of normal ones. D - you expect the future mutations to be similar to the mutations you already have, so that you can detect them (pattern recognition)
upvoted 2 times
...
jkhong
2 months ago
Selected Answer: AD
A makes sense. C and D both compare future mutations to the mutated samples in the database. The question is pretty badly worded… If we were to run a full unsupervised anomaly detection over the entire dataset, both C and D would be true, since some future mutations may be similar to current mutations and some will be significantly different from them. The question seems to suggest "labelling" tissue samples using unsupervised anomaly detection, and subsequently using the labels with a supervised algorithm to classify future samples. If this interpretation of the question is correct, then D makes sense.
upvoted 3 times
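The two-stage reading described in the comment above can be sketched roughly as follows. This is only one possible interpretation, under stated assumptions: a single numeric feature, a median/MAD outlier score for the unsupervised stage, and a nearest-centroid classifier for the supervised stage. The data, the threshold and the names `pseudo_label`, `train_centroids` and `classify` are hypothetical.

```python
import statistics

def pseudo_label(values, threshold=5.0):
    """Stage 1 (unsupervised): split unlabeled values into (normal, flagged)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the score is undefined for this data")
    normal, flagged = [], []
    for v in values:
        (flagged if abs(v - med) / mad > threshold else normal).append(v)
    return normal, flagged

def train_centroids(normal, flagged):
    """Stage 2 (supervised): nearest-centroid model built on the pseudo-labels."""
    return {"normal": statistics.mean(normal), "mutated": statistics.mean(flagged)}

def classify(value, centroids):
    """Assign the label whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Hypothetical unlabeled database: 20 normal-looking values and 2 outliers.
database = [9.7, 9.8, 9.9, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.3] * 2 + [14.8, 15.2]

normal, flagged = pseudo_label(database)
centroids = train_centroids(normal, flagged)

print(classify(14.5, centroids))  # resembles the flagged samples
print(classify(10.4, centroids))  # resembles the normal samples
print(classify(5.0, centroids))   # a different kind of outlier: labelled "normal"
```

The last call shows why option D matters under this interpretation: a future mutation that does not resemble the mutations already in the database ends up closer to the normal centroid and is missed by the second stage.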
...
korntewin
2 months ago
Selected Answer: AD
The answer should be AD. A: anomalies should be few; if there are many of them we should do classification instead, because an unsupervised method will give a lot of false positives. D: future anomalies should come from the same distribution as present anomalies, or else our anomaly detection will not generalize to future features.
upvoted 2 times
...
samdhimal
2 months ago
A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated. D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.
upvoted 2 times
...
azmiozgen
2 months ago
Selected Answer: AD
D should be correct. You expect future samples to correlate with the training samples. That's the whole point of the learning procedure. If you do not expect them to have similar features, then why would you use the features in the training samples in the first place? A is also correct, since anomaly labels would be seen rarely.
upvoted 5 times
...
rocky48
2 months ago
Selected Answer: AD
A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated. D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.
upvoted 2 times
...
Nittin
3 months, 2 weeks ago
Selected Answer: AC
A. There are very few occurrences of mutations relative to normal samples. Anomaly detection is well-suited for situations where anomalies (in this case, mutations) are rare compared to the normal cases. When the dataset is highly imbalanced, with far fewer mutated samples than normal samples, anomaly detection can be used to identify these rare cases as outliers or anomalies. C. You expect future mutations to have different features from the mutated samples in the database. Unsupervised anomaly detection works under the assumption that anomalies (mutations) will differ significantly from the majority of the data (normal samples). If future mutations are expected to exhibit different features, this method can help detect those anomalies as deviations from the normal samples.
upvoted 2 times
...
iooj
3 months, 3 weeks ago
Selected Answer: AC
A. There are very few occurrences of mutations relative to normal samples. Anomaly detection is particularly useful in scenarios where anomalies (mutations, in this case) are rare compared to normal instances. This aligns with the nature of anomaly detection, which focuses on identifying rare events that deviate significantly from the majority (normal) data. C. You expect future mutations to have different features from the mutated samples in the database. Unsupervised anomaly detection methods do not rely on prior knowledge of anomalies. They work on the assumption that anomalies will be different from normal instances in a significant way. If future mutations have different features from known mutations, it supports using an unsupervised method as it can detect novel anomalies not seen during training
upvoted 3 times
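The argument for C in the comment above can be illustrated with a distance-based detector. This is a hedged sketch, not a claim about which answer the exam intends: it assumes two numeric features and scores each sample by the distance to its k-th nearest neighbour in the unlabeled database. The points, `k=2`, the 0.5 threshold and the function names are all hypothetical.

```python
import math

def knn_score(sample, reference, k=2):
    """Distance from `sample` to its k-th nearest neighbour in `reference`."""
    distances = sorted(math.dist(sample, point) for point in reference)
    return distances[k - 1]

def is_anomalous(sample, reference, k=2, threshold=0.5):
    """Flag samples that have fewer than k close neighbours in the database."""
    return knn_score(sample, reference, k) > threshold

# Hypothetical unlabeled database: mostly normal, plus one known-type mutation.
reference = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0), (0.8, 0.9), (1.0, 1.2),
    (5.0, 5.0),
]

future_normal = (1.05, 1.0)
future_known_type = (5.1, 4.9)    # resembles the mutation already in the database
future_novel_type = (-3.0, 4.0)   # unlike anything in the database

for sample in (future_normal, future_known_type, future_novel_type):
    print(sample, is_anomalous(sample, reference))
```

Both future mutations are flagged because the score measures distance from the dense normal region, not similarity to known mutations. The sketch also depends on mutations being rare: if the database held `k` or more tightly clustered mutations of one type, a future sample of that type would have close neighbours and would no longer be flagged.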
...
Roulle
4 months, 2 weeks ago
That's A and D. The aim of unsupervised classification of anomalies is to identify sub-groups with characteristics in common that may resemble anomalies. So, when a new mutation appears, we can determine whether it shares characteristics with previously discovered anomaly subgroups. If this mutation is an anomaly and has very different characteristics from our detected anomaly subgroup, it is likely to be associated with an incorrect group.
upvoted 1 times
...
pandey_0307
5 months, 3 weeks ago
Selected Answer: AC
A. There are very few occurrences of mutations relative to normal samples. Unsupervised anomaly detection is particularly useful in situations where anomalies (mutations) are rare compared to the normal instances. This characteristic aligns well with unsupervised methods that can detect outliers or rare events in a dataset dominated by normal samples. C. You expect future mutations to have different features from the mutated samples in the database. Anomaly detection methods are effective when future anomalies do not follow the same patterns as the known anomalies. These methods aim to identify instances that significantly deviate from the norm, which suits the scenario where future mutations might exhibit different characteristics from those currently known.
upvoted 1 times
...
tdum76000
11 months, 1 week ago
Selected Answer: AC
A is a good answer, so I'd like to give my point of view on the second right answer. I initially thought D was the correct one, as you would normally train your model to detect mutations seen in the training dataset. But the goal of unsupervised learning is to detect unidentified patterns. If you were sure the mutations would always look the same, you'd rather use supervised learning and label the "normal" and "mutated" tissues, which would result in better performance in my view.
upvoted 2 times
...
spicebits
1 year ago
Selected Answer: AC
Unsupervised anomaly detection is best for scenarios without labels or when the anomalies are unknown or ever-changing
upvoted 2 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted