Exam AWS Certified Machine Learning - Specialty topic 1 question 187 discussion

A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs. The data will be stored in Amazon S3. The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset. A portion of the schema is as follows:

TransactionTimestamp (Timestamp)
CardName (Varchar)
CardNo (Varchar)

The data engineer must provide the data so that any row with a CardNo value of NULL is removed. Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column. Finally, the CardName column must be renamed to NameOnCard.

The data will be extracted on a monthly basis and will be loaded into an S3 bucket. The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation. The solution also must be automated and must minimize the load on the Amazon Redshift cluster.

Which solution meets these requirements?

  • A. Set up an Amazon EMR cluster. Create an Apache Spark job to read the data from the Amazon Redshift cluster and transform the data. Load the data into the S3 bucket. Schedule the job to run monthly.
  • B. Set up an Amazon EC2 instance with a SQL client tool, such as SQL Workbench/J, to query the data from the Amazon Redshift cluster directly. Export the resulting dataset into a file. Upload the file into the S3 bucket. Perform these tasks monthly.
  • C. Set up an AWS Glue job that has the Amazon Redshift cluster as the source and the S3 bucket as the destination. Use the built-in transforms Filter, Map, and RenameField to perform the required transformations. Schedule the job to run monthly.
  • D. Use Amazon Redshift Spectrum to run a query that writes the data directly to the S3 bucket. Create an AWS Lambda function to run the query monthly.
Suggested Answer: C
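For reference, the required cleanup can also be expressed directly in the extraction query, whichever tool ends up running it. A minimal sketch in Redshift SQL, assuming the joined result is exposed under a hypothetical relation name `transactions` (the real query would join the underlying tables):

```sql
-- "transactions" is a hypothetical name for the joined result.
SELECT
    TransactionTimestamp::DATE AS TransactionDate,
    TransactionTimestamp::TIME AS TransactionTime,
    CardName                   AS NameOnCard,
    CardNo
FROM transactions
WHERE CardNo IS NOT NULL;   -- drop rows with a NULL CardNo
```

In the suggested answer C, the same row-level logic would instead be carried out by Glue's built-in transforms rather than pushed down as SQL.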

Comments

ystotest
Highly Voted 1 year, 3 months ago
Selected Answer: C
agreed with C
upvoted 12 times
...
2eb8df0
Most Recent 1 week, 1 day ago
Selected Answer: C
It's always Glue
upvoted 1 times
...
akgarg00
3 months, 3 weeks ago
Selected Answer: C
The answer was between C and D, but we are supposed to minimize the load on the Redshift cluster, so the answer is C. A and B are too much effort, so they are ruled out by the constraints of the question.
upvoted 2 times
...
teka112233
6 months, 1 week ago
Selected Answer: C
Simply put, the requirements describe a full ETL process: data is extracted from Redshift (E), then transformed by removing NULL rows, renaming a column, and splitting the timestamp column (T), and finally loaded into S3 (L), all with the least overhead, which makes AWS Glue ideal for these requirements.
upvoted 1 times
...
loict
6 months, 1 week ago
Selected Answer: C
A. NO - AWS Glue (serverless) is a simpler option than EMR to run Spark jobs.
B. NO - Spark is a better option for data pipelines; it avoids the need for intermediary files.
C. YES - Spark and AWS Glue are the best combination.
D. NO - Amazon Redshift Spectrum is a "Lake House" architecture, meant to run SQL against both the DW & S3; here, we want to query only from the DW.
upvoted 1 times
...
Mickey321
6 months, 3 weeks ago
Selected Answer: C
The reason is that this solution leverages the existing capabilities of AWS Glue, a fully managed service that helps users create, run, and manage ETL (extract, transform, and load) workflows. AWS Glue can connect to various data sources and destinations, such as Amazon Redshift and Amazon S3, and uses Apache Spark as the underlying processing engine. It also provides built-in transforms for common data manipulation operations, such as filtering, mapping, renaming, or joining. Moreover, AWS Glue supports scheduling and automation of ETL jobs using triggers or workflows.
upvoted 1 times
...
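The built-in transforms mentioned above operate record by record on a Glue DynamicFrame. A minimal pure-Python sketch of the same per-record logic (Filter → Map → rename), using plain dicts in place of DynamicFrame records so it runs anywhere; the field names come from the question's schema:

```python
from datetime import datetime

def transform(records):
    """Apply the Filter / Map / RenameField steps from answer C."""
    out = []
    for rec in records:
        # Filter: drop any row whose CardNo is NULL.
        if rec["CardNo"] is None:
            continue
        # Map: split TransactionTimestamp into date and time parts.
        ts = rec["TransactionTimestamp"]
        out.append({
            "TransactionDate": ts.date().isoformat(),
            "TransactionTime": ts.time().isoformat(),
            # RenameField: CardName -> NameOnCard.
            "NameOnCard": rec["CardName"],
            "CardNo": rec["CardNo"],
        })
    return out

rows = [
    {"TransactionTimestamp": datetime(2023, 5, 1, 14, 30, 0),
     "CardName": "A. Smith", "CardNo": "1234"},
    {"TransactionTimestamp": datetime(2023, 5, 2, 9, 0, 0),
     "CardName": "B. Jones", "CardNo": None},   # filtered out
]
print(transform(rows))
```

In an actual Glue job the same steps would be chained with `Filter.apply`, `Map.apply`, and a field rename on the DynamicFrame before writing to the S3 sink.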
Mickey321
6 months, 3 weeks ago
Selected Answer: C
agree with C
upvoted 1 times
...
kaike_reis
7 months, 1 week ago
Selected Answer: C
C. Reason: we want to minimize infrastructure effort, so we should prioritize serverless solutions; we also want something automated that minimizes the load on the Redshift cluster. That said, option A is wrong as it requires managing an EMR cluster, just like option B (EC2). Option D brings in Redshift Spectrum, but Spectrum is meant for querying data that already sits in S3 from Redshift, whereas here the source data is in Redshift, so this option is discarded. Option C is correct.
upvoted 1 times
...
Jerry84
1 year, 2 months ago
Selected Answer: C
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other
