Exam AWS Certified Big Data - Specialty topic 1 question 63 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty

Question #: 63
Topic #: 1

A media advertising company handles a large number of real-time messages sourced from over 200 websites. The company's data engineer needs to collect and process records in real time for analysis using Spark Streaming on Amazon Elastic MapReduce (EMR). The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority.
Which Amazon Kinesis configuration meets these requirements?

A. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Pull messages off Firehose with Spark Streaming in parallel to persistence to Amazon S3.
B. Publish messages to Amazon Kinesis Streams. Pull messages off Streams with Spark Streaming in parallel to AWS Lambda pushing messages from Streams to Firehose backed by Amazon Simple Storage Service (S3).
C. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Use AWS Lambda to pull messages from Firehose to Streams for processing with Spark Streaming.
D. Publish messages to Amazon Kinesis Streams, pull messages off with Spark Streaming, and write row data to Amazon Simple Storage Service (S3) before and after processing.

Suggested Answer: C

by mattyb123 at Aug. 11, 2019, 12:48 a.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

Bulti

Highly Voted 3 years, 6 months ago

Not A: Spark Streaming, or any Spark component for that matter, cannot read directly from KFH. It is also not a real-time solution.
Not C: not a real-time solution, although doable if real time is not an issue.
Not D: it doesn't make sense to write the original row data one at a time to S3 when it's possible to configure the Kinesis stream destination as S3 and split the data into multiple files organized by date as prefix.
B is the right answer: Spark Streaming can read directly from Kinesis Streams, and so can a Lambda function, which then inserts each record into KFH to be delivered to S3.

upvoted 13 times
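
The Lambda leg of option B described above (Streams to Lambda to Firehose to S3) could look roughly like the sketch below. It is only an illustration under stated assumptions: the delivery stream name `raw-messages-to-s3` and the helper names are made up, the Firehose client is passed in so the logic can be tried without AWS access, and the per-call size limit of `PutRecordBatch` is not handled. Raising on a partial failure makes Lambda retry the whole batch, which can produce duplicates in S3.

```python
import base64

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def decode_kinesis_records(event):
    """Return the raw payload bytes of each Kinesis record in a Lambda event."""
    return [base64.b64decode(r["kinesis"]["data"]) for r in event.get("Records", [])]


def forward_to_firehose(event, firehose, delivery_stream):
    """Send every record in the event to Firehose unchanged; return the count sent."""
    payloads = decode_kinesis_records(event)
    for start in range(0, len(payloads), MAX_BATCH):
        batch = [{"Data": p} for p in payloads[start:start + MAX_BATCH]]
        response = firehose.put_record_batch(
            DeliveryStreamName=delivery_stream, Records=batch
        )
        # A partial failure must not be ignored when every raw message has to be kept.
        if response.get("FailedPutCount", 0):
            raise RuntimeError("Firehose rejected some records; let Lambda retry the batch")
    return len(payloads)


def handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported here so the
    # helpers above can be exercised locally with a stand-in client.
    import boto3
    # "raw-messages-to-s3" is a hypothetical delivery stream name.
    return forward_to_firehose(event, boto3.client("firehose"), "raw-messages-to-s3")
```

Spark Streaming on EMR would read the same Kinesis stream independently, so the raw copy in S3 does not depend on the Spark job being healthy.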

notcloudguru

3 years, 5 months ago

The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority.

upvoted 1 times

...

DerekKey

Most Recent 3 years, 5 months ago

A is wrong: Spark Streaming can only read from Kinesis Data Streams.
B doesn't make sense: Kinesis Data Firehose has direct integration with Kinesis Data Streams. https://docs.aws.amazon.com/firehose/latest/dev/writing-with-kinesis-streams.html
C doesn't make sense: it uses Lambda to integrate Firehose with Data Streams. I don't believe people design such crazy things.
D is correct and looks best compared to the others.

upvoted 1 times

...

guruguru

3 years, 6 months ago

D. The real-time requirement rules out A and C. For B, my concern is whether streaming data in parallel to both Lambda and Spark will reduce the performance of KDS or require more shards. Hence, I pick D.

upvoted 1 times
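
On the shard question raised above, a back-of-the-envelope estimate is possible from the per-shard limits of Kinesis Data Streams (1 MB/s or 1,000 records/s in, 2 MB/s out shared by standard consumers). The sketch below is a rough sizing aid, not an official formula; the function name and the traffic figures in the usage note are hypothetical, and it ignores enhanced fan-out and the per-shard cap on `GetRecords` calls, which affects latency rather than throughput.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # per-shard ingest limit, MB/s
WRITE_RECORDS_PER_SHARD = 1000   # per-shard ingest limit, records/s
READ_MB_PER_SHARD = 2.0          # per-shard read limit shared by standard consumers, MB/s


def shards_needed(ingest_mb_per_s, records_per_s, consumers):
    """Estimate the shard count from write and shared-read throughput limits."""
    by_write_bytes = math.ceil(ingest_mb_per_s / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    # Every consumer reads the full stream, so read traffic scales with consumers.
    by_read = math.ceil(ingest_mb_per_s * consumers / READ_MB_PER_SHARD)
    return max(1, by_write_bytes, by_write_count, by_read)
```

With a hypothetical 4 MB/s and 3,000 records/s, `shards_needed(4.0, 3000, 1)` and `shards_needed(4.0, 3000, 2)` both give 4, while a third consumer raises it to 6. Under these assumptions, adding Lambda next to Spark does not by itself force more shards.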

mbabu48

3 years, 5 months ago

I agree with D. Copying the stream messages twice does not degrade performance, but look at the number of services (Firehose, Lambda) we are bringing in, all just to save raw data to S3. That is not justified, so I believe D is the correct answer.

upvoted 1 times

...

esalas0691

3 years, 6 months ago

Option D. "The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority." It does not mention anything regarding batching, so it's OK to write single rows to S3.

upvoted 1 times

...

Sree_

3 years, 6 months ago

It has to be B.

upvoted 1 times

...

srirampc

3 years, 6 months ago

It could have been B, but we don't need a lambda to push messages, send the stream straight to FH. Because of this, going with D to keep the solution simple.

upvoted 2 times

Corram

3 years, 6 months ago

I like Bulti's point arguing against D (storing row-wise to S3 appears awful), but your point on B was also what I thought. So weird...

upvoted 1 times

...

san2020

3 years, 6 months ago

my selection B

upvoted 2 times

...

practicioner

3 years, 6 months ago

I vote for D. It looks like a simple, working solution.

upvoted 2 times

practicioner

3 years, 6 months ago

After researching, I changed to B. Why? Because we should collect batches for storing in S3 (FH is good for that) instead of storing single-row events (that would be awful).

upvoted 4 times
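
To put a rough number on the batching argument above, the sketch below compares how many S3 objects per day result from writing one object per message versus letting Firehose flush on a timer. The message rate is a made-up figure for illustration, and the buffered count assumes every flush is triggered by the time interval rather than by the buffer size; the 60-second interval matches the Firehose delay mentioned elsewhere in this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def objects_per_day_per_record(messages_per_s):
    """One S3 object per message, as in a row-at-a-time write."""
    return messages_per_s * SECONDS_PER_DAY


def objects_per_day_buffered(buffer_interval_s):
    """One S3 object per timed flush, assuming the size limit is never hit first."""
    return SECONDS_PER_DAY // buffer_interval_s


# Hypothetical load: 1,000 messages per second across all websites.
print(objects_per_day_per_record(1000))  # 86400000
print(objects_per_day_buffered(60))      # 1440
```

Fewer, larger objects mean fewer PUT requests and less per-file overhead for any Spark job that later reads the raw data back from S3.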

...

ME2000

3 years, 6 months ago

Answer C. Spark Streaming can't pull messages from Firehose or Streams, so options A, B and D are invalid. https://docs.aws.amazon.com/solutions/latest/real-time-analytics-spark-streaming/architecture.html

upvoted 1 times

practicioner

3 years, 6 months ago

https://spark.apache.org/docs/2.3.0/streaming-kinesis-integration.html Spark Streaming can pull messages from Kinesis, but it can't do so from Firehose.

upvoted 1 times

sergio1312

3 years, 6 months ago

Amazon Kinesis Firehose is not real-time

upvoted 1 times

practicioner

3 years, 6 months ago

I didn't mention FH. Amazon Kinesis Streams is used for real-time processing, and Spark Streaming has an integration with it.

upvoted 2 times

...

PK1234

3 years, 6 months ago

Apache Spark can be over-burdened with file operations if it is processing a large number of small files versus fewer larger files. Each of these files has its own overhead of a few milliseconds for opening, reading metadata information, and closing. This overhead of file operations on these large numbers of files results in slow processing. This blog post shows how to use Amazon Kinesis Data Firehose to merge many small messages into larger messages for delivery to Amazon S3. This results in faster processing with Amazon EMR running Spark. Option C.

upvoted 2 times

...

shwang

3 years, 6 months ago

Nobody has thoughts on A? Can Spark Streaming pull messages from KFH directly? Is a Lambda function necessary to pull data from KFH and feed Spark Streaming?

upvoted 1 times

...

SamP

3 years, 6 months ago

I think D is correct. B looks overkill.

upvoted 1 times

...

Percival

3 years, 6 months ago

In this context, gathering and processing don't need a managed service (FH, plus its fee), because FH needs buffering time for efficient bulk transfer. For real-time processing (as opposed to real-time gathering), FH isn't needed. For backing up to S3, though, FH is better.

upvoted 2 times

...

cybe001

3 years, 6 months ago

I choose D. A Kinesis stream can be used to first store the raw data in S3 and then analyze the data in real time.

upvoted 2 times

...

pkfe

3 years, 6 months ago

Funny, nobody bothered to look up the KCL checkpoint table.

upvoted 1 times

...

pra276

3 years, 7 months ago

FH cannot be the answer; it is not real time. B is correct.

upvoted 2 times

pra276

3 years, 6 months ago

Sorry. The question says "The company's data engineer needs to collect and process records in real time for analysis using Spark". Real time is for processing using Spark, so the answer in this case is C. Does anyone have any other thoughts?

upvoted 3 times

mattyb123

3 years, 6 months ago

I'm only thinking B because when I sat the test previously I used the current answers within this guide and scored very poorly in the collection section. Because of this, I am under the impression that C is incorrect and that the most likely answer is B, since Kinesis Data Streams is real-time and FH has the 60-second delay.

upvoted 5 times

pra276

3 years, 6 months ago

Agreed. I misread. It's really confusing.

upvoted 2 times

...
