For me the correct answer is A.
In fact, the loop builds a single-row InsertAllRequest and sends it on every iteration.
But we can collect all the rows into a list and use InsertAllRequest.newBuilder(tableId).setRows(rows).build() to send them in one request, as sketched below.
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaminginsertexamples
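A minimal sketch of that batched approach, assuming the rows collection from the question and illustrative dataset/table names:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("datasetId", "tableId");

// Convert every source row up front, then send them all in one request.
List<InsertAllRequest.RowToInsert> rowsToInsert = rows.stream()
        .map(InsertAllRequest.RowToInsert::of)
        .collect(Collectors.toList());

InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId).setRows(rowsToInsert).build());
if (response.hasErrors()) {
    // Streaming inserts report per-row failures in the response, not as exceptions.
    response.getInsertErrors().forEach((index, errors) ->
            System.err.println("Row " + index + " failed: " + errors));
}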
The answer should be A, because the original code pushes one row at a time, which is more time-consuming than batch processing.
Proposed answer C is incorrect because we would still have more overhead sending each row in a separate request than using batch processing.
Correct answer A: Batching Rows: By batching multiple rows into a single request, you minimise the overhead of network communication and API call latency. BigQuery's InsertAllRequest supports inserting multiple rows in a single API call.
Parallel Inserts (Option B): While parallel processing can improve performance, it introduces complexity and may require handling concurrency issues. It's generally better to batch rows first.
Writing to Cloud Storage (Options C & D): Writing data to Cloud Storage and then loading it into BigQuery can be efficient for very large datasets, but it adds extra steps and complexity. It's more suitable for bulk load operations rather than small, frequent inserts.
For smaller datasets or when simplicity is paramount: Including multiple rows with each request is often sufficient.
For larger datasets or when performance is critical: Parallel inserts are the way to go.
A. Include multiple rows with each request:
This would be a very efficient way to batch the insert operations. BigQuery's insertAll method supports batched inserts, so instead of inserting each row in a separate request, you could group multiple rows into a single insertAll request. This approach reduces the number of HTTP requests made to the BigQuery service, which can improve throughput and reduce the risk of hitting rate limits.
B - I was between A and B. Both options require code changes, and Option B also requires changes to how you manage the Collection. If you insert multiple rows at a time, you still need to move through the rows in the collection one by one (remember, this is a loop) before inserting in bulk. If you first break the Collection into (n) subsets and then run the function in (n) threads, you would be moving through (n) subsets at a time, making (n) insertions at a time, all in parallel. That was my way of viewing it.
Option A would not actually change the insert performance itself (sort of); you would just be interacting with the database less. (If interacting less is faster, then you would see a small decrease in insert latencies.)
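A rough sketch of the Option B partitioning idea this commenter describes, assuming the same service client and rows collection from the question (the thread count and dataset/table names are illustrative):

import com.google.cloud.bigquery.InsertAllRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final int n = 4; // illustrative thread count
List<Map<String, String>> rowList = new ArrayList<>(rows);
int chunk = (rowList.size() + n - 1) / n; // ceiling division into n subsets
ExecutorService pool = Executors.newFixedThreadPool(n);

for (int i = 0; i < rowList.size(); i += chunk) {
    List<Map<String, String>> subset =
            rowList.subList(i, Math.min(i + chunk, rowList.size()));
    pool.submit(() -> {
        // Each thread batches its own subset into a single insertAll call.
        InsertAllRequest.Builder builder =
                InsertAllRequest.newBuilder("datasetId", "tableId");
        subset.forEach(row -> builder.addRow(InsertAllRequest.RowToInsert.of(row)));
        service.insertAll(builder.build());
    });
}
pool.shutdown();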
A. Include multiple rows with each request.
Batch inserts are more efficient than individual inserts and will increase write performance by reducing the overhead of creating and sending individual requests for each row. Parallel inserts could potentially lead to conflicting writes or cause resource exhaustion, and adding a step of writing to Cloud Storage and then loading into BigQuery can add additional overhead and complexity.
A. Include multiple rows with each request.
It is generally more efficient to insert multiple rows in a single request rather than making a separate request for each row. This reduces the overhead of making multiple HTTP requests and can also improve performance by allowing BigQuery to perform more efficient batch operations. You can use the InsertAllRequest.RowToInsert.of(row) method to add multiple rows to a single request.
For example, you could modify the code to collect the rows in a list and insert them in batches:
import com.google.cloud.bigquery.InsertAllRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final int BATCH_SIZE = 500; // tune for the balance between throughput and payload size

List<InsertAllRequest.RowToInsert> rowsToInsert = new ArrayList<>();
for (Map<String, String> row : rows) {
    rowsToInsert.add(InsertAllRequest.RowToInsert.of(row));
    if (rowsToInsert.size() == BATCH_SIZE) {
        // Send one request per full batch instead of one per row.
        InsertAllRequest insertRequest = InsertAllRequest.newBuilder(
                "datasetId", "tableId", rowsToInsert).build();
        service.insertAll(insertRequest);
        rowsToInsert.clear();
    }
}
// Flush any remaining rows that did not fill a complete batch.
if (!rowsToInsert.isEmpty()) {
    InsertAllRequest insertRequest = InsertAllRequest.newBuilder(
            "datasetId", "tableId", rowsToInsert).build();
    service.insertAll(insertRequest);
}
This will insert the rows in batches of BATCH_SIZE, which you can adjust based on the desired balance between performance and resource usage.
Options B and D, which involve using multiple threads to perform the inserts or write the rows to Cloud Storage, may not necessarily improve the efficiency of the code. These options could potentially increase the complexity of the code and introduce additional overhead, without necessarily improving the performance of the inserts.
Option C, writing each row to a Cloud Storage object before loading into BigQuery, would likely be less efficient than simply inserting the rows directly into BigQuery. It would involve additional steps and potentially increase the overall time it takes to write the rows to the table.
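For reference, the Cloud-Storage-then-load route from Options C and D would look roughly like this sketch; the bucket, object name, and JSON format are illustrative assumptions, and the rows would first have to be written to that file, which is exactly the extra step described above:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("datasetId", "tableId");

// Step 1 (not shown): write the rows as newline-delimited JSON to Cloud Storage.
// Step 2: run a load job that reads that file into the table.
LoadJobConfiguration loadConfig = LoadJobConfiguration
        .newBuilder(tableId, "gs://my-bucket/rows.json") // hypothetical object
        .setFormatOptions(FormatOptions.json())
        .build();
Job job = bigquery.create(JobInfo.of(loadConfig));
job.waitFor(); // blocks until the load job completes (throws InterruptedException)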
Parallel writes to the database can actually increase the total insert time, and their benefit depends on many system conditions, while batch inserts are optimized at the database engine level.