Exam AWS Certified Machine Learning - Specialty topic 1 question 27 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 27
Topic #: 1

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.
Here is an example from the dataset:
"The quck BROWN FOX jumps over the lazy dog."
Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)

A. Perform part-of-speech tagging and keep the action verb and the nouns only.
B. Normalize all words by making the sentence lowercase.
C. Remove stop words using an English stopword dictionary.
D. Correct the typography on "quck" to "quick."
E. One-hot encode all words in the sentence.
F. Tokenize the sentence into words.

Suggested Answer: BCF

by cybe001 at Jan. 12, 2020, 2:54 p.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

ozan11

Highly Voted 2 years, 9 months ago

B, C, and F should be correct.

upvoted 35 times

...

BigEv

Highly Voted 2 years, 9 months ago

I will select B, C, F:

1. Apply word stemming and lemmatization
2. Remove stop words
3. Tokenize the sentences

https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925

upvoted 26 times

...

Togy

Most Recent 3 months, 2 weeks ago

Selected Answer: BDF

B. Normalize all words by making the sentence lowercase: Word2Vec treats words as distinct entities. If you don't convert everything to lowercase, "The" and "the" will be considered different words, which is generally not what you want. Lowercasing ensures consistency.

D. Correct the typography on "quck" to "quick": Misspellings need to be corrected. Word2Vec learns embeddings based on the words it encounters. If "quck" remains, it will be treated as a separate word from "quick," and you'll lose the relationship between them. Correcting typos is crucial for data quality.

F. Tokenize the sentence into words: Tokenization is the process of breaking down the sentence into individual words (or tokens). Word2Vec operates on individual words, so you need to split the sentence into its constituent parts. This is a fundamental step in NLP.
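
To make the lowercasing point concrete, here is a minimal sketch that counts vocabulary entries for the example sentence with and without lowercasing. The whitespace split and the trailing-period strip are simplifying assumptions, not what any particular Word2Vec implementation does internally.

```python
from collections import Counter

sentence = "The quck BROWN FOX jumps over the lazy dog."

# Naive tokenization (an assumption for illustration): drop the final period
# and split on whitespace. A real pipeline would use a proper tokenizer.
raw_tokens = sentence.rstrip(".").split()
lower_tokens = sentence.lower().rstrip(".").split()

raw_vocab = Counter(raw_tokens)
lower_vocab = Counter(lower_tokens)

# Without lowercasing, "The" and "the" are two separate vocabulary entries.
print(raw_vocab["The"], raw_vocab["the"])  # 1 1
# After lowercasing they collapse into a single entry seen twice.
print(lower_vocab["the"])                  # 2
print(len(raw_vocab), len(lower_vocab))    # 9 8
```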

upvoted 1 times

...

JonSno

4 months, 2 weeks ago

Selected Answer: BDF

C is debatable: removing stop words is not always necessary for Word2Vec, since stop words sometimes provide context.

For Word2Vec training, data preprocessing is essential to ensure that words are correctly represented, consistent, and free from unnecessary noise. The key steps are:

- Lowercasing the text (B): Word embeddings treat "FOX" and "fox" as different words. To avoid redundancy, lowercasing the text ensures consistency.
- Correcting typos (D): "quck" should be corrected to "quick" to prevent incorrect word representations in Word2Vec. Misspelled words can create meaningless embeddings.
- Tokenizing the sentence into words (F): Word2Vec operates at the word level, so breaking the sentence into individual tokens (words) is necessary.

upvoted 2 times

...

loict

9 months, 3 weeks ago

Selected Answer: BCF

A. NO - word2vec works on raw data
B. YES - case here is not significant
C. YES - will help reduce dimensionality
D. NO - word2vec will do it by itself
E. NO - One-hot encoding is for classification
F. YES - word2vec takes tokens as input

upvoted 1 times

...

Valcilio

1 year, 3 months ago

Selected Answer: BCF

Data need to be tokenized and cleaned!

upvoted 2 times

...

Aninina

1 year, 6 months ago

Selected Answer: BCF

B, C, F is correct.

upvoted 2 times

...

SophieSu

2 years, 8 months ago

BCF correct. D is not correct (Pay attention to “in a repeatable manner” in the question.)

upvoted 2 times

...

cloud_trail

2 years, 8 months ago

B/C/F. D should not be performed because spell check is a subjective thing. You don't know for sure what the word was supposed to be if you have a typo.
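
A rough illustration of why automated spell correction is ambiguous, using Python's standard `difflib`. The mini-dictionary below is hypothetical and chosen only for this example; a real spell checker would use a full lexicon and probably a different similarity measure.

```python
import difflib

# Hypothetical mini-dictionary, for illustration only.
dictionary = ["quick", "quack", "duck", "tuck", "brown", "lazy"]

# Ask for every dictionary word that is reasonably close to the typo.
candidates = difflib.get_close_matches("quck", dictionary, n=5, cutoff=0.6)
print(candidates)

# More than one plausible correction comes back, so choosing "quick"
# is a judgment call rather than a mechanical, repeatable rule.
print(len(candidates) > 1)
```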

upvoted 2 times

...

harmanbirstudy

2 years, 8 months ago

I saw this exact question on "whizlabs" practice exam and correct options were B/C/F

upvoted 1 times

...

GeeBeeEl

2 years, 8 months ago

https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281

Data Preparation: Define corpus, clean, normalise and tokenise words

To begin, we start with the following corpus: “natural language processing and machine learning is fun and exciting”

For simplicity, we have chosen a sentence without punctuation and capitalization. Also, we did not remove stop words “and” and “is”. In reality, text data are unstructured and can be “dirty”. Cleaning them will involve steps such as:

- removing stop words,
- removing punctuations,
- convert text to lowercase (actually depends on your use-case),
- replacing digits, etc.

After preprocessing, we then move on to tokenising the corpus.

Answer: B, C, F
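
The cleaning steps quoted above (lowercase, strip punctuation, tokenize, remove stop words) can be sketched with the standard library alone. The `preprocess` helper and the tiny `STOP_WORDS` set are assumptions made for this example; in practice you would likely take a full English stop-word list from an NLP library.

```python
import re

# Tiny illustrative stop-word set (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "and", "is", "of", "over", "the", "to"}

def preprocess(sentence: str) -> list[str]:
    """Lowercase, tokenize on letters/digits/apostrophes, drop stop words."""
    lowered = sentence.lower()                   # option B
    tokens = re.findall(r"[a-z0-9']+", lowered)  # option F (also drops punctuation)
    return [t for t in tokens if t not in STOP_WORDS]  # option C

print(preprocess("The quck BROWN FOX jumps over the lazy dog."))
# ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Each step is a deterministic rule, so the same function can be applied unchanged to all 1 million sentences. The typo "quck" passes through untouched, since no step here tries to guess the intended word.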

upvoted 8 times

cnethers

2 years, 8 months ago

BCF is 100% correct

upvoted 2 times

...

Antriksh

2 years, 8 months ago

Correct answers are B, C and F

upvoted 2 times

...

TuanAnh

2 years, 8 months ago

The correct answer is B, C and F.

A: POS tagging has nothing to do with word2vec
D: fixing "quck" to "quick" only works for that specific word
E: word2vec can use CBOW or skipgram, so no need to have one-hot encoding here
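
For readers unfamiliar with skip-gram, here is a minimal sketch of how (center, context) training pairs can be derived directly from a token list. The `skipgram_pairs` helper and the `window` parameter are hypothetical names for this example, and real Word2Vec implementations add details such as subsampling and negative sampling that are omitted here. The point is only that training starts from tokens, so hand-made one-hot vectors are not a preprocessing step.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["quck", "brown", "fox", "jumps"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('quck', 'brown'), ('brown', 'quck'), ('brown', 'fox'),
#  ('fox', 'brown'), ('fox', 'jumps'), ('jumps', 'fox')]
```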

upvoted 4 times