Exam AWS Certified Data Analytics - Specialty topic 1 question 72 discussion

A banking company wants to collect large volumes of transactional data using Amazon Kinesis Data Streams for real-time analytics. The company uses
PutRecord to send data to Amazon Kinesis, and has observed network outages during certain times of the day. The company wants to obtain exactly-once semantics for the entire processing pipeline.
What should the company do to obtain these characteristics?

  • A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
  • B. Rely on the processing semantics of Amazon Kinesis Data Analytics to avoid duplicate processing of events.
  • C. Design the data producer so events are not ingested into Kinesis Data Streams multiple times.
  • D. Rely on the exactly-once processing semantics of Apache Flink and Apache Spark Streaming included in Amazon EMR.
Suggested Answer: A
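The deduplication idea behind answer A can be sketched in a few lines. This is a minimal, self-contained simulation (no real Kinesis or boto3 calls): `FlakyStream` is a hypothetical stand-in for a stream whose acknowledgement is lost once, the way a real PutRecord call can time out after the record was already committed. The key point is that the unique ID is embedded once, before any retry, so every duplicate copy carries the same dedup key.

```python
import json
import uuid

class FlakyStream:
    """Hypothetical stand-in for Kinesis: the first put is committed,
    but its acknowledgement is lost, forcing the producer to retry."""
    def __init__(self):
        self.records = []
        self._fail_next_ack = True

    def put_record(self, data):
        self.records.append(data)              # record IS committed...
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise TimeoutError("network outage: ack lost")  # ...but the ack is lost

def send_with_retry(stream, payload):
    # Embed a unique ID once, BEFORE any retries, so every copy of the
    # record carries the same dedup key for downstream processing.
    record = json.dumps({"id": str(uuid.uuid4()), "payload": payload})
    while True:
        try:
            stream.put_record(record)
            return
        except TimeoutError:
            continue  # retry with the SAME record (same embedded ID)

stream = FlakyStream()
send_with_retry(stream, {"account": "123", "amount": 42})

ids = [json.loads(r)["id"] for r in stream.records]
print(len(stream.records))  # 2 copies reached the stream
print(len(set(ids)))        # but only 1 unique ID, so duplicates are removable
```

In a real producer the same pattern applies around `boto3`'s `put_record`: generate the ID before the first attempt, never per attempt.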

Comments

testtaker3434
Highly Voted 3 years, 5 months ago
Agree with A.
upvoted 17 times
awssp12345
3 years, 5 months ago
me too!
upvoted 2 times
...
...
vicks316
Highly Voted 3 years, 5 months ago
A. https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html "Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing."
upvoted 9 times
...
Palee
Most Recent 1 day, 6 hours ago
Selected Answer: D
Answer D. Option A doesn't address "obtain exactly-once semantics for the entire processing pipeline."
upvoted 1 times
...
daisyli
1 year, 4 months ago
D Apache Flink provides a powerful API to transform, aggregate, and enrich events, and supports exactly-once semantics. Apache Flink is therefore a good foundation for the core of your streaming architecture. https://aws.amazon.com/blogs/big-data/streaming-etl-with-apache-flink-and-amazon-kinesis-data-analytics/
upvoted 2 times
...
pk349
1 year, 10 months ago
A: I passed the test
upvoted 1 times
...
enoted
2 years ago
Selected Answer: A
A - exactly what is requested in the description
upvoted 1 times
...
cloudlearnerhere
2 years, 4 months ago
Selected Answer: A
Correct answer is A, as producer retries can result in duplicates in Kinesis Data Streams and must be handled by the producer by using a unique key for each message. There are two primary reasons why records may be delivered more than one time to your Amazon Kinesis Data Streams application: producer retries and consumer retries. Your application must anticipate and appropriately handle processing individual records multiple times.
Options B & C are wrong as they would not handle the exactly-once processing semantics. Option D is wrong as, although Apache Flink and Spark Streaming would work, it would need a complete change in the current application.
upvoted 3 times
...
rocky48
2 years, 7 months ago
Selected Answer: A
upvoted 1 times
...
certificationJunkie
2 years, 9 months ago
The application should be idempotent. This can be achieved by including a primary key in the record. Hence, A is the correct answer.
upvoted 1 times
...
MWL
2 years, 10 months ago
Selected Answer: A
The problem in the question is that the producer may put the same record several times. Donell explained this very well. KDA, Flink, or Spark can only guarantee 'exactly once' for every record already in the stream. If the record is duplicated by the producer, they will process each stream record exactly once, duplicates included. So the answer should be A.
upvoted 2 times
...
jrheen
2 years, 10 months ago
Answer : A
upvoted 1 times
...
aws2019
3 years, 3 months ago
A is right
upvoted 1 times
...
Donell
3 years, 4 months ago
Answer A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
Producer retries: consider a producer that experiences a network-related timeout after it makes a call to PutRecord, but before it can receive an acknowledgement from Amazon Kinesis Data Streams. The producer cannot be sure if the record was delivered to Kinesis Data Streams. Assuming that every record is important to the application, the producer would have been written to retry the call with the same data. If both PutRecord calls on that same data were successfully committed to Kinesis Data Streams, then there will be two Kinesis Data Streams records. Although the two records have identical data, they also have unique sequence numbers. Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing. Note that the number of duplicates due to producer retries is usually low compared to the number of duplicates due to consumer retries.
Reference: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
upvoted 8 times
...
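The consumer-side half of the pattern described in the comment above can be sketched as follows. This is a hedged illustration, assuming records are JSON with an embedded "id" field as the dedup key; a real consumer would keep the seen-ID set in a durable store (e.g. DynamoDB) checkpointed alongside its output, not in process memory.

```python
import json

def process_exactly_once(records, handler, seen_ids=None):
    """Apply handler to each record at most once per embedded ID.

    seen_ids is an in-memory set here purely for illustration; in
    production it must survive consumer restarts to keep the guarantee.
    """
    seen_ids = set() if seen_ids is None else seen_ids
    for raw in records:
        rec = json.loads(raw)
        if rec["id"] in seen_ids:
            continue               # duplicate from a producer/consumer retry
        seen_ids.add(rec["id"])
        handler(rec["payload"])

# Stream containing a producer-retry duplicate (same embedded ID).
stream = [
    '{"id": "tx-1", "payload": {"amount": 10}}',
    '{"id": "tx-1", "payload": {"amount": 10}}',  # duplicate copy
    '{"id": "tx-2", "payload": {"amount": 25}}',
]
processed = []
process_exactly_once(stream, processed.append)
print(len(processed))  # 2: the duplicate was dropped during processing
```

This is why option A works regardless of which processing engine sits downstream: the duplicate is detectable by its key, not by stream position or sequence number.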
Heer
3 years, 4 months ago
ANSWER: D. KDS and KDF have 'exactly once' semantics. Option A is a fail-safe mechanism for when there is a choppy network while sending data to KDS with PutRecord. Apache Flink and Apache Spark both guarantee 'exactly once' semantics, which is the requirement per the question.
upvoted 4 times
MWL
2 years, 10 months ago
The problem in the question is that the producer may put the same record several times. Donell explained this very well. So the answer should be A. Flink or Spark can only ensure 'exactly once' for every record already in the stream; if the record is duplicated by the producer, they don't help.
upvoted 1 times
...
...
lostsoul07
3 years, 4 months ago
A is the right answer
upvoted 3 times
...
Draco31
3 years, 5 months ago
A was a good choice until I searched EMR Flink: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html Apache Flink is a streaming dataflow engine that you can use to run real-time stream processing on high-throughput data sources. Flink supports event time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. It can be connected to KDS. So I will pick D.
upvoted 2 times
liyungho
3 years, 4 months ago
According to Flink kinesis connector doc -- https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html, in the Kinesis producer section, it states "Note that the producer is not participating in Flink’s checkpointing and doesn’t provide exactly-once processing guarantees. Also, the Kinesis producer does not guarantee that records are written in order to the shards (See here and here for more details). In case of a failure or a resharding, data will be written again to Kinesis, leading to duplicates. This behavior is usually called “at-least-once” semantics." So I think the answer is A.
upvoted 2 times
...
omar_bahrain
3 years, 4 months ago
There are documents that link streaming (Kafka/Kinesis) with EMR/Spark/Flink to perform deduplication in real time: https://blog.griddynamics.com/in-stream-deduplication-with-spark-amazon-kinesis-and-s3/ In addition to the difficulty of changing a running application, I would say D is a good potential candidate.
upvoted 1 times
...
...
syu31svc
3 years, 5 months ago
Link provided supports A as the answer
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other