Exam AWS Certified Machine Learning - Specialty topic 1 question 27 discussion

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.
Here is an example from the dataset:
"The quck BROWN FOX jumps over the lazy dog."
Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)

  • A. Perform part-of-speech tagging and keep the action verb and the nouns only.
  • B. Normalize all words by making the sentence lowercase.
  • C. Remove stop words using an English stopword dictionary.
  • D. Correct the typography on "quck" to "quick."
  • E. One-hot encode all words in the sentence.
  • F. Tokenize the sentence into words.
Suggested Answer: BCF
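
The three suggested steps (B, C, F) can be sketched as one deterministic function, which is what makes them repeatable across all 1 million sentences. This is a minimal sketch, not a reference pipeline: the `preprocess` name, the regex tokenizer, and the tiny `STOP_WORDS` set are assumptions made for illustration. A real pipeline would load a full English stop-word dictionary.

```python
import re

# Assumption: a tiny illustrative stop-word set. A real pipeline would load
# a complete English stop-word dictionary instead.
STOP_WORDS = {"the", "a", "an", "over", "of", "and", "is", "in", "to"}


def preprocess(sentence):
    """Lowercase (B), tokenize into words (F), then drop stop words (C)."""
    lowered = sentence.lower()
    # Simple regex tokenizer: runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", lowered)
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The quck BROWN FOX jumps over the lazy dog.")
print(tokens)
```

Note that the misspelled "quck" passes through unchanged: none of the three steps tries to guess what the writer meant.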

Comments

ozan11
Highly Voted 2 years, 5 months ago
B C F should be correct.
upvoted 35 times
...
BigEv
Highly Voted 2 years, 5 months ago
I will select B, C, F: 1. Apply word stemming and lemmatization. 2. Remove stop words. 3. Tokenize the sentences.
https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
upvoted 26 times
...
Togy
Most Recent 4 days, 3 hours ago
Selected Answer: BDF
B. Normalize all words by making the sentence lowercase: Word2Vec treats words as distinct entities. If you don't convert everything to lowercase, "The" and "the" will be considered different words, which is generally not what you want. Lowercasing ensures consistency.
D. Correct the typography on "quck" to "quick": Misspellings need to be corrected. Word2Vec learns embeddings based on the words it encounters. If "quck" remains, it will be treated as a separate word from "quick," and you'll lose the relationship between them. Correcting typos is crucial for data quality.
F. Tokenize the sentence into words: Tokenization is the process of breaking down the sentence into individual words (or tokens). Word2Vec operates on individual words, so you need to split the sentence into its constituent parts. This is a fundamental step in NLP.
upvoted 1 times
...
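
The point in the comment above about "The" and "the" becoming separate vocabulary entries can be checked with a small count. This is only a sketch: the three-sentence `corpus` and the `vocab` helper are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only to show case variants of one word.
corpus = ["The quick brown fox", "the lazy dog", "THE END"]


def vocab(sentences, lowercase):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    counts = Counter()
    for s in sentences:
        if lowercase:
            s = s.lower()
        counts.update(s.split())
    return counts


raw = vocab(corpus, lowercase=False)
norm = vocab(corpus, lowercase=True)
print(len(raw), len(norm), norm["the"])
```

Without lowercasing, "The", "the", and "THE" each get their own entry (and would each get their own embedding); with lowercasing they collapse into one token with three occurrences.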
JonSno
3 weeks, 6 days ago
Selected Answer: BDF
C is debatable: removing stop words is not always necessary for Word2Vec, since stop words sometimes provide context.
For Word2Vec training, data preprocessing is essential to ensure that words are correctly represented, consistent, and free from unnecessary noise. The key steps are:
Lowercasing the text (B): Word embeddings treat "FOX" and "fox" as different words. Lowercasing the text avoids this redundancy and ensures consistency.
Correcting typos (D): "quck" should be corrected to "quick" to prevent incorrect word representations in Word2Vec. Misspelled words can create meaningless embeddings.
Tokenizing the sentence into words (F): Word2Vec operates at the word level, so breaking the sentence into individual tokens (words) is necessary.
upvoted 1 times
...
loict
6 months ago
Selected Answer: BCF
A. NO - word2vec works on raw data
B. YES - case here is not significant
C. YES - will help reduce dimensionality
D. NO - word2vec will do it by itself
E. NO - One-hot encoding is for classification
F. YES - word2vec takes tokens as input
upvoted 1 times
...
Valcilio
1 year ago
Selected Answer: BCF
Data need to be tokenized and cleaned!
upvoted 2 times
...
Aninina
1 year, 2 months ago
Selected Answer: BCF
B, C, F is correct
upvoted 2 times
...
SophieSu
2 years, 4 months ago
BCF correct. D is not correct (Pay attention to “in a repeatable manner” in the question.)
upvoted 2 times
...
cloud_trail
2 years, 4 months ago
B/C/F. D should not be performed because spell check is a subjective thing. You don't know for sure what the word was supposed to be if you have a typo.
upvoted 2 times
...
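
The ambiguity described in the comment above can be illustrated by listing every vocabulary word within one edit of "quck". This is a sketch under stated assumptions: `VOCAB` is a hypothetical mini dictionary, and `edits1` is a simplified edit-distance-1 generator (deletes, replacements, and insertions only), not the method of any particular spell checker.

```python
import string

# Hypothetical mini vocabulary, chosen for illustration only.
VOCAB = {"quick", "quack", "duck", "buck", "quirk", "brown", "fox"}


def edits1(word):
    """Return all strings one delete, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + replaces + inserts)


candidates = sorted(edits1("quck") & VOCAB)
print(candidates)
```

With this vocabulary, "quick", "quack", "duck", and "buck" are all equally close to "quck", so an automatic correction has to pick one by some extra rule, and a manual fix for this single word does not generalize to the rest of the dataset.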
harmanbirstudy
2 years, 4 months ago
I saw this exact question on "whizlabs" practice exam and correct options were B/C/F
upvoted 1 times
...
GeeBeeEl
2 years, 4 months ago
https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
Data Preparation: Define corpus, clean, normalise and tokenise words
To begin, we start with the following corpus: “natural language processing and machine learning is fun and exciting”
For simplicity, we have chosen a sentence without punctuation and capitalization. Also, we did not remove stop words “and” and “is”. In reality, text data are unstructured and can be “dirty”. Cleaning them will involve steps such as
  • removing stop words,
  • removing punctuations,
  • convert text to lowercase (actually depends on your use-case),
  • replacing digits, etc.
After preprocessing, we then move on to tokenising the corpus
Answer: B, C, F
upvoted 8 times
cnethers
2 years, 4 months ago
BCF is 100% correct
upvoted 2 times
...
...
Antriksh
2 years, 5 months ago
Correct answers are B, C and F
upvoted 2 times
...
TuanAnh
2 years, 5 months ago
The correct answer is B, C and F
A: POS tagging has nothing to do with word2vec
D: fixing "quck" to "quick" only works for that specific word
F: word2vec can use CBOW or skipgram, so no need to have one-hot decoding here
upvoted 4 times
TuanAnh
2 years, 5 months ago
sorry E: word2vec can use CBOW or skipgram, so no need to have one-hot decoding here
upvoted 4 times
...
...
PRC
2 years, 5 months ago
BCF is correct
upvoted 2 times
...
AKT
2 years, 5 months ago
B, C, F correct
upvoted 2 times
...
Phong
2 years, 5 months ago
B, C, and F are correct answers. I have done this question many times in many practice tests.
upvoted 12 times
...
tap123
2 years, 5 months ago
B, C, F are my choice. D is also possible but not as widely used as others.
upvoted 3 times
...