A company hosts more than 300 global websites and applications. The company requires a platform to analyze more than 30 TB of clickstream data each day. What should a solutions architect do to transmit and process the clickstream data?
A.
Design an AWS Data Pipeline to archive the data to an Amazon S3 bucket and run an Amazon EMR cluster with the data to generate analytics.
B.
Create an Auto Scaling group of Amazon EC2 instances to process the data and send it to an Amazon S3 data lake for Amazon Redshift to use for analysis.
C.
Cache the data to Amazon CloudFront. Store the data in an Amazon S3 bucket. When an object is added to the S3 bucket, run an AWS Lambda function to process the data for analysis.
D.
Collect the data from Amazon Kinesis Data Streams. Use Amazon Kinesis Data Firehose to transmit the data to an Amazon S3 data lake. Load the data in Amazon Redshift for analysis.
Option D is the most appropriate solution for transmitting and processing the clickstream data in this scenario.
Amazon Kinesis Data Streams is a highly scalable and durable service that enables real-time processing of streaming data at a high volume and high rate. You can use Kinesis Data Streams to collect and process the clickstream data in real-time.
Amazon Kinesis Data Firehose is a fully managed service that loads streaming data into data stores and analytics tools. You can use Kinesis Data Firehose to transmit the data from Kinesis Data Streams to an Amazon S3 data lake.
Once the data is in the data lake, you can use Amazon Redshift to load the data and perform analysis on it. Amazon Redshift is a fully managed, petabyte-scale data warehouse service that allows you to quickly and efficiently analyze data using SQL and your existing business intelligence tools.
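As a rough illustration of the collection step, the sketch below shapes clickstream events into the entry format that the Kinesis `PutRecords` API expects (`Data` bytes plus a `PartitionKey`) and groups them into batches of at most 500 records, the per-request record limit. The event fields (`session_id`, `page`) and the choice of partition key are assumptions made for the example, not part of the question.

```python
import json

# PutRecords accepts at most 500 records per request.
MAX_RECORDS_PER_REQUEST = 500


def to_kinesis_record(event: dict) -> dict:
    """Shape one clickstream event as a PutRecords entry."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        # Assumed choice: partition by session so one session's clicks
        # land on the same shard and stay ordered.
        "PartitionKey": str(event["session_id"]),
    }


def batch_records(events, batch_size=MAX_RECORDS_PER_REQUEST):
    """Yield lists of PutRecords entries, each at most batch_size long."""
    batch = []
    for event in events:
        batch.append(to_kinesis_record(event))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical sample events, only for demonstration.
sample_events = [{"session_id": i % 7, "page": f"/item/{i}"} for i in range(1200)]
batches = list(batch_records(sample_events))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch could then be sent with `boto3.client("kinesis").put_records(StreamName=..., Records=batch)`, where the stream name is whatever you created. The service also enforces per-record and per-request size limits, which this sketch does not check.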
Option A, which involves using AWS Data Pipeline to archive the data to an Amazon S3 bucket and running an Amazon EMR cluster with the data to generate analytics, is not the most appropriate solution because it does not involve real-time processing of the data.
Option B, which involves creating an Auto Scaling group of Amazon EC2 instances to process the data and sending it to an Amazon S3 data lake for Amazon Redshift to use for analysis, is not the most appropriate solution because it does not involve a fully managed service for transmitting the data from the processing layer to the data lake.
Option C, which involves caching the data to Amazon CloudFront, storing the data in an Amazon S3 bucket, and running an AWS Lambda function to process the data for analysis when an object is added to the S3 bucket, is not the most appropriate solution because it does not involve a scalable and durable service for collecting and processing the data in real-time.
Unsure if this is the right URL for this scenario. Option D refers to S3 and then Redshift, whereas the URL discusses eliminating S3: We’re excited to launch Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables you to ingest data directly from the Kinesis data stream without having to stage the data in Amazon Simple Storage Service (Amazon S3). Streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your Amazon Redshift cluster.
Ans D - using Kinesis Data Streams / Firehose (data in/out) is fast and reliable. Using Redshift allows all sorts of permutations of data analysis and interfacing with user apps.
A: Not sure how recent this question is but Data Pipeline is not really a product AWS is recommending anymore https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
B: 30TB of clickstream data could be done with EC2 but it would be challenging
C: CloudFront is for CDN and caching and mostly outgoing data, not incoming.
D: Kinesis, S3 data lake and Redshift will work perfectly for this case
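To get a feel for the volume behind these comments, here is a back-of-the-envelope calculation of the average ingest rate for 30 TB per day. It assumes decimal terabytes, perfectly even traffic, and a provisioned-mode shard write limit of roughly 1 MB/s; real traffic is bursty, so treat the shard count as a lower bound rather than a sizing recommendation.

```python
import math

TB = 10**12                      # decimal terabyte (an assumption)
DAILY_BYTES = 30 * TB            # volume stated in the question
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if traffic were spread evenly over the day.
avg_bytes_per_second = DAILY_BYTES / SECONDS_PER_DAY
avg_mb_per_second = avg_bytes_per_second / 10**6

# A provisioned Kinesis Data Streams shard accepts about 1 MB/s of writes.
SHARD_WRITE_MB_PER_SECOND = 1
min_shards = math.ceil(avg_mb_per_second / SHARD_WRITE_MB_PER_SECOND)

print(round(avg_mb_per_second, 1), min_shards)  # 347.2 348
```

So the stream has to sustain roughly 350 MB/s on average, which is why a managed, horizontally scaled ingestion service is attractive here compared with hand-built EC2 fleets.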
The answer should be A. Clickstream does not mean real time; it just means they capture user interactions on the web page. Kinesis data streaming is not required. Furthermore, Redshift is a data warehousing solution; it can't run complex analysis as well as EMR. My vote goes for A.
Question asks how to "transmit and process the clickstream data", NOT how to analyze it. Also question does NOT ask how to archive the data (as is mentioned in A). Thus D.
The key reasons are:
Kinesis Data Streams can continuously capture and ingest high volumes of clickstream data in real-time. This handles the large 30TB daily data intake.
Kinesis Firehose can automatically load the streaming data into S3. This creates a data lake for further analysis.
Firehose can invoke a Lambda function to transform the data in flight before loading it to S3. This enables near-real-time processing.
The data in S3 can be easily loaded into Amazon Redshift for interactive analysis at scale.
Kinesis auto scales to handle the high data volumes. Minimal effort is needed for infrastructure management.
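The in-flight transformation mentioned in the list above can be sketched as a Lambda handler that follows the Firehose data-transformation contract: each incoming record carries a `recordId` and base64-encoded `data`, and the function returns the same `recordId` with a `result` status and re-encoded `data`. The transformation itself (lower-casing a `page` field) and the event fields are assumptions chosen purely for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: normalize the page path.
        payload["page"] = payload.get("page", "").lower()
        # Newline-delimited JSON is convenient for later loading from S3.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # other valid values: "Dropped", "ProcessingFailed"
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}


# Local smoke test with a fake Firehose event (no AWS access needed).
fake_event = {
    "records": [
        {
            "recordId": "1",
            "data": base64.b64encode(json.dumps({"page": "/Home"}).encode()).decode(),
        }
    ]
}
result = lambda_handler(fake_event, None)
```

Running the handler locally like this only checks the record shape; in a real delivery stream, Firehose supplies the event and buffers the returned records before writing them to S3.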
A. This option uses S3 for data storage and EMR for analytics, but Data Pipeline is not an ideal service for real-time streaming data ingestion and processing. It is better suited to batch processing scenarios.
B. This option involves managing and scaling EC2, which adds operational overhead. It is also not a real-time streaming solution. Additionally, using Redshift to analyze clickstream data might not be the most efficient or cost-effective approach.
C. CloudFront is a CDN service and is not designed for real-time data processing or analytics. While using Lambda to process data can be an option, it may not be the most efficient solution for processing large volumes of clickstream data.
Therefore, collecting the data from Kinesis Data Streams, using Kinesis Data Firehose to transmit it to an S3 data lake, and loading it into Redshift for analysis is the recommended approach. This combination provides a scalable, real-time streaming solution with storage and analytics capabilities that can handle a high volume of clickstream data.
I am going to be unpopular here and I'll go for A). Even if there are other services that offer a better experience, Data Pipeline can do the job here. "you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon EMR (Amazon EMR) cluster over those logs to generate traffic reports" https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html In the question there is no specific timing requirement for analytics. Also, the EMR cluster job can be scheduled to run daily.
Option D) is a valid answer too, however with Amazon Redshift Streaming Ingestion "you can connect to Amazon Kinesis Data Streams data streams and pull data directly to Amazon Redshift without staging data in S3" https://aws.amazon.com/redshift/redshift-streaming-ingestion. So in this scenario Kinesis Data Firehose and S3 are redundant.
I think I agree with you. It does not make sense in option D) to use Amazon Kinesis Data Firehose to transmit the data to an Amazon S3 data lake and then to Redshift, as you can send the data directly from Firehose to Redshift.
Also, the Kinesis family is for real-time or near-real-time services. This is not a requirement at all. We have to process data daily, but do not need to do it in real time.
Question asks how to "transmit and process the clickstream data", NOT how to analyze it. This picture shows exactly scenario D:
Producer - Kinesis - Intermediate S3 bucket - Redshift
https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/07/30/StreamTransformAnalyzeKinesisLambdaRedshift1.png
It is C.
The image here https://aws.amazon.com/kinesis/data-firehose/ shows how Kinesis can send the collected data to Firehose, which can send it to Redshift.
It is also possible to use an intermediary S3 bucket between Firehose and Redshift. See the image here:
https://aws.amazon.com/blogs/big-data/stream-transform-and-analyze-xml-data-in-real-time-with-amazon-kinesis-aws-lambda-and-amazon-redshift/