Exam AWS Certified Big Data - Specialty topic 1 question 63 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 63
Topic #: 1

A media advertising company handles a large number of real-time messages sourced from over 200 websites.
The company's data engineer needs to collect and process records in real time for analysis using Spark
Streaming on Amazon Elastic MapReduce (EMR). The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority.
Which Amazon Kinesis configuration meets these requirements?

  • A. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Pull messages off Firehose with Spark Streaming in parallel to persistence to Amazon S3.
  • B. Publish messages to Amazon Kinesis Streams. Pull messages off Streams with Spark Streaming in parallel to AWS Lambda pushing messages from Streams to Firehose backed by Amazon Simple Storage Service (S3).
  • C. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Use AWS Lambda to pull messages from Firehose to Streams for processing with Spark Streaming.
  • D. Publish messages to Amazon Kinesis Streams, pull messages off with Spark Streaming, and write row data to Amazon Simple Storage Service (S3) before and after processing.
Suggested Answer: C

Comments

Bulti
Highly Voted 3 years, 6 months ago
Not A: Spark Streaming (or any Spark component, for that matter) cannot read directly from KFH, and it is also not a real-time solution. Not C: not a real-time solution, although doable if real time is not an issue. Not D: it doesn't make sense to write the original row data one record at a time to S3 when it's possible to configure the Kinesis stream destination as S3 and split the data into multiple files organized by date as prefix. B is the right answer: Spark Streaming can read directly from Kinesis Streams, and so can a Lambda function, which then inserts each record into KFH to be delivered to S3.
upvoted 13 times
notcloudguru
3 years, 5 months ago
The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority.
upvoted 1 times
...
...
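For anyone wondering what the Lambda half of option B might look like, here is a minimal sketch, not a reference implementation. It assumes a Lambda function triggered by the Kinesis stream and a Firehose delivery stream that already targets S3. The function names and the delivery stream name `raw-messages-to-s3` are made up for illustration; `put_record_batch` and its 500-records-per-call limit are the real boto3 Firehose API.

```python
import base64


def to_firehose_records(event):
    """Decode a Kinesis-triggered Lambda event into Firehose record dicts."""
    records = []
    for item in event.get("Records", []):
        # Lambda delivers Kinesis payloads base64-encoded.
        payload = base64.b64decode(item["kinesis"]["data"])
        records.append({"Data": payload})
    return records


def forward(event, firehose, stream_name, batch_size=500):
    """Send raw records to Firehose unchanged; return how many were rejected."""
    records = to_firehose_records(event)
    failed = 0
    # PutRecordBatch accepts at most 500 records per call.
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed


def handler(event, context):
    import boto3  # imported lazily so the helpers above can be exercised offline

    # "raw-messages-to-s3" is a hypothetical delivery stream name.
    failed = forward(event, boto3.client("firehose"), "raw-messages-to-s3")
    if failed:
        # Raising makes Lambda retry the whole batch instead of dropping records.
        raise RuntimeError(f"{failed} records were not accepted by Firehose")
    return {"forwarded": len(event.get("Records", []))}
```

Note that retrying the whole batch can write some raw messages to S3 more than once, which is probably acceptable when the mandate is to keep ALL messages. Spark Streaming would read the same stream independently of this function.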
DerekKey
Most Recent 3 years, 5 months ago
A is wrong: Spark Streaming can only read from Kinesis Data Streams. B doesn't make sense: Kinesis Data Firehose has direct integration with Kinesis Data Streams (https://docs.aws.amazon.com/firehose/latest/dev/writing-with-kinesis-streams.html). C doesn't make sense: Lambda to integrate Firehose with Data Streams? I don't believe people design such crazy things. D is correct and looks best compared to the others.
upvoted 1 times
...
guruguru
3 years, 6 months ago
D. The real-time requirement rules out A and C. For B, my concern is that streaming data in parallel to both Lambda and Spark might reduce the performance of KDS or require more shards. Hence, I pick D.
upvoted 1 times
mbabu48
3 years, 5 months ago
I agree with D. We are not degrading performance by copying the stream messages twice, but look at the number of services (Firehose, Lambda) we are bringing in, all just to save raw data to S3. That isn't justified, so I believe D is the correct answer.
upvoted 1 times
...
...
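The shard concern above can be checked with some back-of-the-envelope arithmetic. The sketch below uses the documented per-shard limits for Kinesis Data Streams (1 MB/s or 1,000 records/s in, 2 MB/s out shared by all standard consumers). The function name and the traffic figures are hypothetical, and the estimate ignores the limit of five read calls per second per shard as well as enhanced fan-out.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000   # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0          # read limit per shard, MB/s, shared by consumers


def shards_needed(ingest_mb_s, ingest_records_s, consumers):
    """Estimate the shard count that satisfies both write and shared-read limits."""
    by_write_bytes = math.ceil(ingest_mb_s / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(ingest_records_s / WRITE_RECORDS_PER_SHARD)
    # Every shared-throughput consumer reads the full stream once.
    by_read = math.ceil(ingest_mb_s * consumers / READ_MB_PER_SHARD)
    return max(1, by_write_bytes, by_write_count, by_read)


# Hypothetical load: 4 MB/s and 3,000 records/s.
spark_only = shards_needed(4, 3000, consumers=1)
spark_and_lambda = shards_needed(4, 3000, consumers=2)
three_consumers = shards_needed(4, 3000, consumers=3)
```

Under these assumptions, adding Lambda as a second consumer next to Spark does not raise the shard count, because the 2 MB/s read limit is exactly twice the 1 MB/s write limit. A third consumer is where read throughput would start to drive the number of shards.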
esalas0691
3 years, 6 months ago
Option D. "The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority." It does not mention anything regarding batching, so it's OK to write single rows to S3.
upvoted 1 times
...
Sree_
3 years, 6 months ago
It has to be B.
upvoted 1 times
...
srirampc
3 years, 6 months ago
It could have been B, but we don't need a Lambda to push messages; just send the stream straight to FH. Because of this, I'm going with D to keep the solution simple.
upvoted 3 times
...
srirampc
3 years, 6 months ago
It could have been B, but we don't need a Lambda to push messages; just send the stream straight to FH. Because of this, I'm going with D to keep the solution simple.
upvoted 2 times
Corram
3 years, 6 months ago
I like Bulti's point arguing against D (storing row-wise to S3 appears awful), but your point on B was also what I thought. So weird...
upvoted 1 times
...
...
san2020
3 years, 6 months ago
my selection B
upvoted 2 times
...
practicioner
3 years, 6 months ago
I vote for D. It looks like a simple, working solution.
upvoted 2 times
practicioner
3 years, 6 months ago
After researching, I changed to B. Why? Because we should collect batches for storing in S3 (FH is good for that) instead of storing single-row events (which would be awful).
upvoted 4 times
...
...
ME2000
3 years, 6 months ago
Answer C. Spark Streaming can't pull messages from Firehose or Streams, so options A, B and D are invalid. https://docs.aws.amazon.com/solutions/latest/real-time-analytics-spark-streaming/architecture.html
upvoted 1 times
practicioner
3 years, 6 months ago
https://spark.apache.org/docs/2.3.0/streaming-kinesis-integration.html Spark Streaming can pull messages from Kinesis, but it can't do it from Firehose.
upvoted 1 times
sergio1312
3 years, 6 months ago
Amazon Kinesis Firehose is not real-time
upvoted 1 times
practicioner
3 years, 6 months ago
I didn't mention FH. Amazon Kinesis Streams is used for real-time processing, and Spark Streaming has an integration with it.
upvoted 2 times
...
...
...
...
PK1234
3 years, 6 months ago
Apache Spark can be over-burdened with file operations if it is processing a large number of small files versus fewer larger files. Each of these files has its own overhead of a few milliseconds for opening, reading metadata information, and closing. This overhead of file operations on these large numbers of files results in slow processing. This blog post shows how to use Amazon Kinesis Data Firehose to merge many small messages into larger messages for delivery to Amazon S3. This results in faster processing with Amazon EMR running Spark. Option C.
upvoted 2 times
...
shwang
3 years, 6 months ago
Nobody has thoughts on A? Can Spark Streaming pull messages from KFH directly, or is a Lambda function necessary to pull data from KFH and feed Spark Streaming?
upvoted 1 times
...
SamP
3 years, 6 months ago
I think D is correct. B looks overkill.
upvoted 1 times
...
Percival
3 years, 6 months ago
In context, this case doesn't need a managed service (FH, plus its fee) for gathering and processing, because FH needs buffering time for efficient bulk transfer. For real-time processing (not real-time gathering), FH isn't needed yet in the gathering and processing path. But for backing up to S3, FH is better.
upvoted 2 times
...
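To put rough numbers on the buffering point, here is a small sketch. It relies on the fact that Firehose flushes to S3 when either its buffer size hint or its buffer interval hint is reached, whichever comes first. The 5 MB / 60 s settings and the traffic figures are assumptions picked for illustration, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400


def seconds_until_flush(throughput_mb_s, buffer_mb=5, buffer_seconds=60):
    """Approximate delay before Firehose writes one S3 object."""
    if throughput_mb_s <= 0:
        return float(buffer_seconds)
    # Delivery happens when the size hint or the interval hint is hit first.
    return min(float(buffer_seconds), buffer_mb / throughput_mb_s)


def s3_objects_per_day(throughput_mb_s, records_per_s, buffer_mb=5, buffer_seconds=60):
    """Compare batched Firehose delivery with writing one S3 object per record."""
    flush = seconds_until_flush(throughput_mb_s, buffer_mb, buffer_seconds)
    batched = math.ceil(SECONDS_PER_DAY / flush)
    unbatched = records_per_s * SECONDS_PER_DAY
    return batched, unbatched


# Hypothetical load: 0.5 MB/s spread over 500 records/s.
delay = seconds_until_flush(0.5)
batched, unbatched = s3_objects_per_day(0.5, 500)
```

With these made-up figures the buffer fills in about ten seconds, so the raw copy lands in S3 as a few thousand objects per day, while a one-object-per-record approach would create tens of millions. That delay only affects the S3 backup path; it says nothing about how quickly Spark sees records read straight from Kinesis Streams.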
cybe001
3 years, 6 months ago
I choose D. Kinesis Streams can be used to first store the raw data in S3 and then analyze the data in real time.
upvoted 2 times
...
pkfe
3 years, 6 months ago
Funny, nobody bothered to look up the KCL checkpoint table.
upvoted 1 times
...
pra276
3 years, 7 months ago
FH cannot be the answer; it is not real time. B is correct.
upvoted 2 times
pra276
3 years, 6 months ago
Sorry. The question says, "The company's data engineer needs to collect and process records in real time for analysis using Spark." Real time applies to processing with Spark, so the answer in this case is C. Does anyone have other thoughts?
upvoted 3 times
mattyb123
3 years, 6 months ago
I'm only thinking B because when I sat the test previously, I used the current answers in this guide and scored very poorly in the collection section. Because of this, I am under the impression that C is incorrect and that the most likely answer is B, since Kinesis Data Streams is real time and FH has the 60-second delay.
upvoted 5 times
pra276
3 years, 6 months ago
Agreed. I misread. It's really confusing.
upvoted 2 times
...
...
...
...