A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date contains all records that the source system processed on that date; some records may arrive late because they await moderator approval. Each entry represents a user review of a product and has the following schema:
user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?
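As a small illustration of the ingestion step described above, the sketch below builds the YYYY/MM/DD prefix for the previous date. The function name `previous_day_prefix` and the root prefix `reviews` are hypothetical; the question does not name the container or its root path.

```python
from datetime import date, timedelta


def previous_day_prefix(run_date: date, root: str = "reviews") -> str:
    """Return the YYYY/MM/DD storage prefix for the day before run_date.

    `root` is a hypothetical container path, not taken from the question.
    """
    target = run_date - timedelta(days=1)
    # date objects accept strftime codes in format specs
    return f"{root}/{target:%Y/%m/%d}"


# A job running on 1 March 2024 reads the 29 February partition.
print(previous_day_prefix(date(2024, 3, 1)))  # reviews/2024/02/29
```

Because records can be delayed by moderator approval, the folder for a given date reflects when the source system processed a record, not necessarily its `review_timestamp`.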