You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through AI models. What should you do?
A.
Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets.
B.
Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.
C.
Use BigQuery to ingest, prepare, and then analyze the data, and then run queries to create views.
D.
Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery.
Dataflow for Real-Time Processing: Dataflow allows you to process data in near-real time, making it well-suited for identifying longtail and outlier data points as they occur. You can use Dataflow to implement custom data cleansing and outlier detection algorithms that operate on streaming data.
BigQuery as a Sink: Using BigQuery as a sink allows you to store the cleaned and processed data efficiently for further analysis or use in AI models. Dataflow can write the cleaned data to BigQuery tables, enabling seamless integration with downstream processes.
B is the best option for cleansing the ads data in near real-time before running it through AI models.
The key reasons are:
• Dataflow allows for stream processing of data in near real-time. This allows you to identify and cleanse longtail and outlier data points as the data is streamed in.
• Dataflow has built-in capabilities for detecting and handling outliers and anomalies in streaming data. This makes it well-suited for programmatically identifying longtail and outlier data points.
• Using BigQuery as the output sink allows the cleansed data to be immediatley available for analysis and serving to AI models. BigQuery can act as a serving layer for the models.
• Options A, C, and D either don't provide real-time processing (A and C) or don't easily integrate with BigQuery for analysis and serving (D).
So B is the best architecture here to meet the needs of near real-time cleansing, identification of longtail/outlier data points, and integration with BigQuery for serving AI models.
- Dataflow is a fully managed service for stream and batch data processing and is well-suited for real-time data processing tasks like identifying longtail and outlier data points.
- Using BigQuery as a sink allows to efficiently store the cleansed and processed data for further analysis and serving it to AI models.
Real-time Data Processing: Dataflow excels at handling large-scale, streaming data with low latency, enabling near-real-time cleansing.
Scalability: Easily scales to handle growing data volumes and processing needs.
Programmatic Data Cleaning: Allows you to write custom logic in Apache Beam for identifying longtail and outlier data points accurately and efficiently.
Integration with BigQuery: Seamless integration with BigQuery enables you to store cleansed data for AI model training and historical analytics.
Cost-Effective: Dataflow's pay-as-you-go model optimizes costs for real-time data processing.
A voting comment increases the vote count for the chosen answer by one.
Upvoting a comment with a selected answer will also increase the vote count towards that answer by one.
So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.
Y___ash
7 months, 2 weeks agoJyoGCP
8 months, 2 weeks agodatapassionate
9 months, 2 weeks agoMatt_108
9 months, 2 weeks agoMaxNRG
9 months, 3 weeks agoMaxNRG
9 months, 3 weeks agoraaad
10 months agoe70ea9e
10 months ago