Exam AWS Certified Machine Learning - Specialty topic 1 question 45 discussion

A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.
The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.
Which solution should the Data Scientist build to satisfy the requirements?

  • A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
  • C. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
  • D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Suggested Answer: A
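For readers who want to see what option A could look like in practice, below is a minimal boto3 sketch of a Kinesis Data Firehose delivery stream with Glue-backed record format conversion to Parquet. This is an illustration only: the stream name, bucket ARN, IAM role ARN, and Glue database/table names are placeholder assumptions, not values from the question.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::analytics-data-lake",                     # placeholder
        "Prefix": "events/",
        # Parquet applies its own compression, so the S3 object compression stays off
        "CompressionFormat": "UNCOMPRESSED",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming records are JSON
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # Write query-optimized, columnar Parquet to S3
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Schema comes from the AWS Glue Data Catalog
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "streaming_db",  # placeholder database
                "TableName": "events",           # placeholder table
                "Region": "us-east-1",
            },
        },
    },
)
```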

Comments

DonaldCMLIN
Highly Voted 3 years, 1 month ago
Kinesis Data Analytics has no Parquet output format; besides that, the JSON does not need to be staged in S3 first, and RDS isn't a serverless ingestion and analytics solution. Answer is A.
upvoted 32 times
georgeZ
Highly Voted 3 years, 1 month ago
I think it should be A. Please check https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
upvoted 14 times
JonSno
Most Recent 2 months, 1 week ago
Selected Answer: A
Amazon Kinesis Data Firehose: ingests real-time data with automatic buffering, supports built-in transformation to Apache Parquet/ORC before writing to Amazon S3, and requires minimal code and infrastructure. AWS Glue Data Catalog: catalogs the schema for structured querying and enables Athena to query the data in S3 directly. Amazon Athena: serverless SQL querying on S3-based datasets; can connect to BI tools (Tableau, QuickSight) via JDBC.
upvoted 1 times
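The comment above notes that the AWS Glue Data Catalog supplies the schema that both Firehose (for format conversion) and Athena (for SQL queries) rely on. As an illustration only, here is a hedged boto3 sketch of registering such a schema; the database name, table name, column list, and S3 location are assumed placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder database for the streaming dataset
glue.create_database(DatabaseInput={"Name": "streaming_db"})

glue.create_table(
    DatabaseName="streaming_db",
    TableInput={
        "Name": "events",  # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Columns must match the fields of the incoming JSON records (illustrative here)
            "Columns": [
                {"Name": "device_id", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
                {"Name": "temperature", "Type": "double"},
            ],
            # Points at the Firehose delivery prefix in S3 (placeholder)
            "Location": "s3://analytics-data-lake/events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```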
Alice1234
8 months, 3 weeks ago
A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use Amazon Kinesis Data Firehose to buffer and transform the streaming JSON data to a columnar format like Apache Parquet or ORC using the AWS Glue Data Catalog before delivering to Amazon S3. Analysts can then query the data using Amazon Athena and connect to BI dashboards using the Athena JDBC connector. This solution is serverless, manages high-velocity data streams, supports SQL queries, and connects to BI tools—all while being highly available.
upvoted 3 times
loict
1 year, 1 month ago
Selected Answer: C
A. YES - we need a catalog to create Parquet (https://docs.aws.amazon.com/firehose/latest/APIReference/API_SchemaConfiguration.html)
B. NO - no need for extra staging
C. NO - no need for extra staging
D. NO - we need a catalog
upvoted 1 times
Mickey321
1 year, 1 month ago
Selected Answer: A
Option A
upvoted 1 times
kaike_reis
1 year, 2 months ago
Selected Answer: A
A is correct. For those selecting B, answer me: how exactly will the JSON be stored in S3? It's not mentioned in the answer. To me it's an incomplete solution.
upvoted 2 times
AjoseO
1 year, 8 months ago
Selected Answer: A
This solution leverages AWS Glue to create a schema of the incoming data format, which helps to buffer and convert the records to a query-optimized, columnar format without data loss. The Amazon Kinesis Data Firehose delivery stream is used to stream the data and transform it to Apache Parquet or ORC format using the AWS Glue Data Catalog, and the data is stored in Amazon S3, which is highly available. The Analysts can then query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena JDBC connector. This solution provides a serverless, scalable, and cost-effective solution for real-time streaming data ingestion and analytics.
upvoted 3 times
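The question has Analysts connecting through the Athena JDBC connector and BI dashboards; as a rough illustration of the same query path, the following boto3 sketch runs an Athena SQL query against the cataloged table. The database, table, column names, and results bucket are assumptions carried over from the earlier example, not part of the original question.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against the Glue-cataloged table; results land in a scratch S3 prefix
qid = athena.start_query_execution(
    QueryString=(
        "SELECT device_id, avg(temperature) AS avg_temp "
        "FROM streaming_db.events GROUP BY device_id LIMIT 10"
    ),
    QueryExecutionContext={"Database": "streaming_db"},
    ResultConfiguration={"OutputLocation": "s3://analytics-data-lake/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (header row first)
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```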
sqavi
1 year, 8 months ago
Selected Answer: A
Since you need to buffer and convert the data, A is the correct answer. No other option fulfills this requirement.
upvoted 2 times
Peeking
1 year, 10 months ago
Selected Answer: A
I go for A. However, I am not sure why AWS Glue is very important here given that Firehose can convert JSON to parquet.
upvoted 2 times
Tony_1406
1 year, 6 months ago
If I remember correctly, Athena requires a schema for the S3 objects to run SQL queries. That's probably why we need Glue for the schema.
upvoted 1 times
ZSun
1 year, 5 months ago
Once you ingest the data using Kinesis Firehose, you can set "generate table" to automatically create the Glue schema. I think both Glue and Firehose can convert data from JSON to Parquet.
upvoted 1 times
itallomd
1 year, 10 months ago
Why is AWS Glue needed? Firehose can convert to Parquet directly...
upvoted 2 times
587df71
3 months, 3 weeks ago
https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html - Amazon Data Firehose requires a schema to determine how to interpret that data. Use AWS Glue to create a schema in the AWS Glue Data Catalog; Amazon Data Firehose then references that schema and uses it to interpret your input data.
upvoted 1 times
Ccindy
1 year, 11 months ago
Selected Answer: B
Kinesis Data Analytics is near real-time, not real time
upvoted 1 times
ryuhei
2 years, 1 month ago
Selected Answer: A
Answer is "A"
upvoted 1 times
ovokpus
2 years, 4 months ago
Selected Answer: A
The difference between "real-time" and "near-real-time" is pretty semantic (about 60 seconds). The fact that the data comes through Kinesis Data Streams (real time) is implied as the only valid input to Firehose.
upvoted 1 times
ovokpus
2 years, 4 months ago
Mind you, "the ingestion process must buffer and transform incoming records from JSON to a query-optimized, columnar format" - that is exactly what Kinesis Firehose does. "Kinesis Data Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3." See link: https://aws.amazon.com/kinesis/data-firehose/faqs/#:~:text=Kinesis%20Data%20Firehose%20buffers%20incoming,data%20delivery%20to%20Amazon%20S3.
upvoted 3 times
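To make the buffering behavior quoted above concrete, here is a small illustrative snippet of the BufferingHints structure a delivery stream could use. The specific values are assumptions chosen from within the ranges quoted from the FAQ, not values given in the question.

```python
# Hypothetical buffering hints for the Firehose S3 destination: deliver when either
# 128 MB have accumulated or 60 seconds have elapsed, whichever comes first.
buffering_hints = {
    "SizeInMBs": 128,         # quoted allowed range: 1-128 MB
    "IntervalInSeconds": 60,  # quoted allowed range: 60-900 seconds
}

# This dict would be passed as "BufferingHints" inside
# ExtendedS3DestinationConfiguration when creating the delivery stream.
```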
TerrancePythonJava
2 years, 7 months ago
Selected Answer: B
Data Firehose is always near real time, not real time. The prompt clearly states that the process must be done in real time.
upvoted 1 times
anttan
2 years, 10 months ago
Why A? Firehose is near real-time, not real-time, which is a requirement.
upvoted 1 times
cpal012
1 year, 7 months ago
There is no requirement for real-time processing. It says the data arrives in real time, but the processing of that data should buffer.
upvoted 2 times
harmanbirstudy
3 years ago
ANSWER is A -- and every statement in it is accurate. Firehose does integrate with the Glue Data Catalog, and it also buffers the data. "When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply." This is achieved by the Glue Data Catalog and Athena, and it works on real-time data ingest. See link below: https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
upvoted 5 times
Community vote distribution: A (35%), C (25%), B (20%), Other