Exam AWS Certified Data Analytics - Specialty topic 1 question 72 discussion

A banking company wants to collect large volumes of transactional data using Amazon Kinesis Data Streams for real-time analytics. The company uses
PutRecord to send data to Amazon Kinesis, and has observed network outages during certain times of the day. The company wants to obtain exactly-once semantics for the entire processing pipeline.
What should the company do to obtain these characteristics?

  • A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
  • B. Rely on the processing semantics of Amazon Kinesis Data Analytics to avoid duplicate processing of events.
  • C. Design the data producer so events are not ingested into Kinesis Data Streams multiple times.
  • D. Rely on the exactly-once processing semantics of Apache Flink and Apache Spark Streaming included in Amazon EMR.
Suggested Answer: A
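The deduplication idea behind answer A can be sketched in a few lines. This is a minimal, self-contained simulation (no real Kinesis or boto3 calls): `FlakyStream` is a hypothetical stand-in for a stream whose acknowledgement is lost once, the way a real PutRecord call can time out after the record was already committed. The key point is that the unique ID is embedded once, before any retry, so every duplicate copy carries the same dedup key.

```python
import json
import uuid

class FlakyStream:
    """Hypothetical stand-in for Kinesis: the first put is committed,
    but its acknowledgement is lost, forcing the producer to retry."""
    def __init__(self):
        self.records = []
        self._fail_next_ack = True

    def put_record(self, data):
        self.records.append(data)              # record IS committed...
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise TimeoutError("network outage: ack lost")  # ...but the ack is lost

def send_with_retry(stream, payload):
    # Embed a unique ID once, BEFORE any retries, so every copy of the
    # record carries the same dedup key for downstream processing.
    record = json.dumps({"id": str(uuid.uuid4()), "payload": payload})
    while True:
        try:
            stream.put_record(record)
            return
        except TimeoutError:
            continue  # retry with the SAME record (same embedded ID)

stream = FlakyStream()
send_with_retry(stream, {"account": "123", "amount": 42})

ids = [json.loads(r)["id"] for r in stream.records]
print(len(stream.records))  # 2 copies reached the stream
print(len(set(ids)))        # but only 1 unique ID, so duplicates are removable
```

In a real producer the same pattern applies around `boto3`'s `put_record`: generate the ID before the first attempt, never per attempt.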

Comments

testtaker3434
Highly Voted 3 years, 5 months ago
Agree with A.
upvoted 17 times
awssp12345
3 years, 5 months ago
me too!
upvoted 2 times
...
...
vicks316
Highly Voted 3 years, 5 months ago
A. https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html "Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing."
upvoted 9 times
...
Palee
Most Recent 1 day, 6 hours ago
Selected Answer: D
Answer D. Option A doesn't address "obtain exactly-once semantics for the entire processing pipeline."
upvoted 1 times
...
daisyli
1 year, 4 months ago
D Apache Flink provides a powerful API to transform, aggregate, and enrich events, and supports exactly-once semantics. Apache Flink is therefore a good foundation for the core of your streaming architecture. https://aws.amazon.com/blogs/big-data/streaming-etl-with-apache-flink-and-amazon-kinesis-data-analytics/
upvoted 2 times
...
pk349
1 year, 10 months ago
A: I passed the test
upvoted 1 times
...
enoted
2 years ago
Selected Answer: A
A - exactly what is requested in the description
upvoted 1 times
...
cloudlearnerhere
2 years, 4 months ago
Selected Answer: A
Correct answer is A, as producer retries can result in duplicates in Kinesis Data Streams and must be handled by the producer by using a unique key for each message. There are two primary reasons why records may be delivered more than one time to your Amazon Kinesis Data Streams application: producer retries and consumer retries. Your application must anticipate and appropriately handle processing individual records multiple times.
Options B & C are wrong as they would not handle the exactly-once processing semantics. Option D is wrong as, although Apache Flink and Spark Streaming would work, it would need a complete change in the current application.
upvoted 3 times
...
rocky48
2 years, 7 months ago
Selected Answer: A
upvoted 1 times
...
certificationJunkie
2 years, 9 months ago
The application should be idempotent. This can be achieved by including a primary key in the record. Hence, A is the correct answer.
upvoted 1 times
...
MWL
2 years, 10 months ago
Selected Answer: A
The problem in the question is that the producer may put the same record several times. Donell explained this very well. KDA, Flink, or Spark can only guarantee 'exactly once' for every record already in the stream. If the record is duplicated by the producer, they will process each stream record exactly once, duplicates included. So the answer should be A.
upvoted 2 times
...
jrheen
2 years, 10 months ago
Answer : A
upvoted 1 times
...
aws2019
3 years, 3 months ago
A is right
upvoted 1 times
...
Donell
3 years, 4 months ago
Answer A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
Producer retries: consider a producer that experiences a network-related timeout after it makes a call to PutRecord, but before it can receive an acknowledgement from Amazon Kinesis Data Streams. The producer cannot be sure if the record was delivered to Kinesis Data Streams. Assuming that every record is important to the application, the producer would have been written to retry the call with the same data. If both PutRecord calls on that same data were successfully committed to Kinesis Data Streams, then there will be two Kinesis Data Streams records. Although the two records have identical data, they also have unique sequence numbers. Applications that need strict guarantees should embed a primary key within the record to remove duplicates later when processing. Note that the number of duplicates due to producer retries is usually low compared to the number of duplicates due to consumer retries.
Reference: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
upvoted 8 times
...
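The consumer-side half of the pattern described in the comment above can be sketched as follows. This is a hedged illustration, assuming records are JSON with an embedded "id" field as the dedup key; a real consumer would keep the seen-ID set in a durable store (e.g. DynamoDB) checkpointed alongside its output, not in process memory.

```python
import json

def process_exactly_once(records, handler, seen_ids=None):
    """Apply handler to each record at most once per embedded ID.

    seen_ids is an in-memory set here purely for illustration; in
    production it must survive consumer restarts to keep the guarantee.
    """
    seen_ids = set() if seen_ids is None else seen_ids
    for raw in records:
        rec = json.loads(raw)
        if rec["id"] in seen_ids:
            continue               # duplicate from a producer/consumer retry
        seen_ids.add(rec["id"])
        handler(rec["payload"])

# Stream containing a producer-retry duplicate (same embedded ID).
stream = [
    '{"id": "tx-1", "payload": {"amount": 10}}',
    '{"id": "tx-1", "payload": {"amount": 10}}',  # duplicate copy
    '{"id": "tx-2", "payload": {"amount": 25}}',
]
processed = []
process_exactly_once(stream, processed.append)
print(len(processed))  # 2: the duplicate was dropped during processing
```

This is why option A works regardless of which processing engine sits downstream: the duplicate is detectable by its key, not by stream position or sequence number.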
Heer
3 years, 4 months ago
ANSWER: D. KDS and KDF have 'exactly once' semantics. Option A is a fail-safe mechanism for when there is a choppy network while sending data to KDS with PutRecord. Apache Flink and Apache Spark both guarantee 'exactly once' semantics, which is the requirement per the question.
upvoted 4 times
MWL
2 years, 10 months ago
The problem in the question is that the producer may put the same record several times. Donell explained this very well. So the answer should be A. Flink or Spark can only ensure 'exactly once' for every record already in the stream; if the record is duplicated by the producer, they don't help.
upvoted 1 times
...
...
lostsoul07
3 years, 4 months ago
A is the right answer
upvoted 3 times
...
Draco31
3 years, 5 months ago
A was a good choice until I searched EMR Flink: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html Apache Flink is a streaming dataflow engine that you can use to run real-time stream processing on high-throughput data sources. Flink supports event time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. It can be connected to KDS. So I will pick D.
upvoted 2 times
liyungho
3 years, 4 months ago
According to Flink kinesis connector doc -- https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html, in the Kinesis producer section, it states "Note that the producer is not participating in Flink’s checkpointing and doesn’t provide exactly-once processing guarantees. Also, the Kinesis producer does not guarantee that records are written in order to the shards (See here and here for more details). In case of a failure or a resharding, data will be written again to Kinesis, leading to duplicates. This behavior is usually called “at-least-once” semantics." So I think the answer is A.
upvoted 2 times
...
omar_bahrain
3 years, 4 months ago
There are documents that link streaming (Kafka/Kinesis) with EMR/Spark/Flink to perform deduplication in real time: https://blog.griddynamics.com/in-stream-deduplication-with-spark-amazon-kinesis-and-s3/ In addition to the difficulty of changing a running application, I would say D is a good potential candidate.
upvoted 1 times
...
...
syu31svc
3 years, 5 months ago
Link provided supports A as the answer
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other