Exam AWS Certified Machine Learning - Specialty topic 1 question 45 discussion

A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.
The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.
Which solution should the Data Scientist build to satisfy the requirements?

  • A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • C. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
  • D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Suggested Answer: A
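For readers who want to see what option A could look like in practice, below is a minimal boto3 sketch of a Kinesis Data Firehose delivery stream with Glue-backed record format conversion to Parquet. This is an illustration only: the stream name, bucket ARN, IAM role ARN, and Glue database/table names are placeholder assumptions, not values from the question.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::analytics-data-lake",                     # placeholder
        "Prefix": "events/",
        # Parquet applies its own compression, so the S3 object compression stays off
        "CompressionFormat": "UNCOMPRESSED",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming records are JSON
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # Write query-optimized, columnar Parquet to S3
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Schema comes from the AWS Glue Data Catalog
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "streaming_db",  # placeholder database
                "TableName": "events",           # placeholder table
                "Region": "us-east-1",
            },
        },
    },
)
```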

Comments

DonaldCMLIN
Highly Voted 3 years, 1 month ago
Kinesis Data Analytics has no Parquet output format; besides that, the JSON does not need to be staged in S3 first, and RDS isn't a serverless ingestion and analytics solution. Answer is A.
upvoted 32 times
georgeZ
Highly Voted 3 years, 1 month ago
I think it should be A. Please check https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
upvoted 14 times
JonSno
Most Recent 2 months, 1 week ago
Selected Answer: A
Amazon Kinesis Data Firehose: ingests real-time data with automatic buffering, supports built-in transformation to Apache Parquet/ORC before writing to Amazon S3, and requires minimal code and infrastructure. AWS Glue Data Catalog: catalogs the schema for structured querying and enables Athena to query the data in S3 directly. Amazon Athena: serverless SQL querying on S3-based datasets; can connect to BI tools (Tableau, QuickSight) via JDBC.
upvoted 1 times
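The comment above notes that the AWS Glue Data Catalog supplies the schema that both Firehose (for format conversion) and Athena (for SQL queries) rely on. As an illustration only, here is a hedged boto3 sketch of registering such a schema; the database name, table name, column list, and S3 location are assumed placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder database for the streaming dataset
glue.create_database(DatabaseInput={"Name": "streaming_db"})

glue.create_table(
    DatabaseName="streaming_db",
    TableInput={
        "Name": "events",  # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Columns must match the fields of the incoming JSON records (illustrative here)
            "Columns": [
                {"Name": "device_id", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
                {"Name": "temperature", "Type": "double"},
            ],
            # Points at the Firehose delivery prefix in S3 (placeholder)
            "Location": "s3://analytics-data-lake/events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```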
Alice1234
8 months, 3 weeks ago
A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use Amazon Kinesis Data Firehose to buffer and transform the streaming JSON data to a columnar format like Apache Parquet or ORC using the AWS Glue Data Catalog before delivering to Amazon S3. Analysts can then query the data using Amazon Athena and connect to BI dashboards using the Athena JDBC connector. This solution is serverless, manages high-velocity data streams, supports SQL queries, and connects to BI tools—all while being highly available.
upvoted 3 times
loict
1 year, 1 month ago
Selected Answer: C
A. YES - we need a catalog to create Parquet (https://docs.aws.amazon.com/firehose/latest/APIReference/API_SchemaConfiguration.html)
B. NO - no need for extra staging
C. NO - no need for extra staging
D. NO - we need a catalog
upvoted 1 times
Mickey321
1 year, 1 month ago
Selected Answer: A
Option A
upvoted 1 times
kaike_reis
1 year, 2 months ago
Selected Answer: A
A is correct. For those selecting B, answer me: how exactly will the JSON be stored in S3? It's not mentioned in the answer. To me it's an incomplete solution.
upvoted 2 times
AjoseO
1 year, 8 months ago
Selected Answer: A
This solution leverages AWS Glue to create a schema of the incoming data format, which helps to buffer and convert the records to a query-optimized, columnar format without data loss. The Amazon Kinesis Data Firehose delivery stream is used to stream the data and transform it to Apache Parquet or ORC format using the AWS Glue Data Catalog, and the data is stored in Amazon S3, which is highly available. The Analysts can then query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena JDBC connector. This solution provides a serverless, scalable, and cost-effective solution for real-time streaming data ingestion and analytics.
upvoted 3 times
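The question has Analysts connecting through the Athena JDBC connector and BI dashboards; as a rough illustration of the same query path, the following boto3 sketch runs an Athena SQL query against the cataloged table. The database, table, column names, and results bucket are assumptions carried over from the earlier example, not part of the original question.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against the Glue-cataloged table; results land in a scratch S3 prefix
qid = athena.start_query_execution(
    QueryString=(
        "SELECT device_id, avg(temperature) AS avg_temp "
        "FROM streaming_db.events GROUP BY device_id LIMIT 10"
    ),
    QueryExecutionContext={"Database": "streaming_db"},
    ResultConfiguration={"OutputLocation": "s3://analytics-data-lake/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (header row first)
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```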
sqavi
1 year, 8 months ago
Selected Answer: A
Since you need to buffer and convert the data, A is the correct answer. No other option fulfills this requirement.
upvoted 2 times
Peeking
1 year, 10 months ago
Selected Answer: A
I go for A. However, I am not sure why AWS Glue is very important here given that Firehose can convert JSON to parquet.
upvoted 2 times
Tony_1406
1 year, 6 months ago
If I remember correctly, Athena requires a schema for the S3 objects to run SQL queries. That's probably why we need Glue for the schema.
upvoted 1 times
ZSun
1 year, 5 months ago
Once you ingest the data using Kinesis Firehose, you can set "generate table" to automatically create the Glue schema. I think both Glue and Firehose can convert data from JSON to Parquet.
upvoted 1 times
itallomd
1 year, 10 months ago
Why is AWS Glue needed? Firehose can convert to Parquet directly...
upvoted 2 times
587df71
3 months, 3 weeks ago
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html - Amazon Data Firehose requires a schema to determine how to interpret that data. Use AWS Glue to create a schema in the AWS Glue Data Catalog; Amazon Data Firehose then references that schema and uses it to interpret your input data.
upvoted 1 times
Ccindy
1 year, 11 months ago
Selected Answer: B
Kinesis Data Analytics is near real-time, not real time
upvoted 1 times
ryuhei
2 years, 1 month ago
Selected Answer: A
Answer is "A"
upvoted 1 times
ovokpus
2 years, 4 months ago
Selected Answer: A
The difference between "real-time" and "near-real-time" is pretty semantic (about 60 seconds). The fact that the data comes through Kinesis Data Streams (real time) is implied as the only valid input to Firehose.
upvoted 1 times
ovokpus
2 years, 4 months ago
Mind you, "the ingestion process must buffer and transform incoming records from JSON to a query-optimized, columnar format" - that is exactly what Kinesis Firehose does. "Kinesis Data Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3." See link: https://aws.amazon.com/kinesis/data-firehose/faqs/#:~:text=Kinesis%20Data%20Firehose%20buffers%20incoming,data%20delivery%20to%20Amazon%20S3.
upvoted 3 times
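To make the buffering behavior quoted above concrete, here is a small illustrative snippet of the BufferingHints structure a delivery stream could use. The specific values are assumptions chosen from within the ranges quoted from the FAQ, not values given in the question.

```python
# Hypothetical buffering hints for the Firehose S3 destination: deliver when either
# 128 MB have accumulated or 60 seconds have elapsed, whichever comes first.
buffering_hints = {
    "SizeInMBs": 128,         # quoted allowed range: 1-128 MB
    "IntervalInSeconds": 60,  # quoted allowed range: 60-900 seconds
}

# This dict would be passed as "BufferingHints" inside
# ExtendedS3DestinationConfiguration when creating the delivery stream.
```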
TerrancePythonJava
2 years, 7 months ago
Selected Answer: B
Data Firehose is always near real time, not real time. The prompt clearly states that the process must be done in real time.
upvoted 1 times
anttan
2 years, 10 months ago
Why A? Firehose is near real-time, not real-time, which is a requirement.
upvoted 1 times
cpal012
1 year, 7 months ago
There is no requirement for real-time processing. It says the data arrives in real time, but the processing of that data should buffer.
upvoted 2 times
harmanbirstudy
3 years ago
ANSWER is A -- and every statement in it is accurate. Firehose does integrate with the Glue Data Catalog, and it also buffers the data. "When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply." This is achieved by the Glue Data Catalog and Athena, and it works on real-time data ingest. See link below: https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
upvoted 5 times
Community vote distribution: A (35%), C (25%), B (20%), Other