You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?
A.
Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
B.
Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
C.
Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
D.
Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
You are transferring sensitive patient information, so C and D are ruled out: exposing the data through a public URL is not secure. The choice comes down to A and B, and here it gets tricky. See when to choose Transfer Appliance: (https://cloud.google.com/transfer-appliance/docs/2.0/overview)
Without knowing the available bandwidth, it is not possible to determine whether the upload can be completed within 7 days, as Google recommends. So the safest and most consistently performant way is to use Transfer Appliance.
Therefore my choice is B.
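The "7 days" concern above can be quantified. A minimal sketch (assuming 10 TB means 10 * 10^12 bytes and a steady sustained upload rate with no protocol overhead or retries) of the minimum bandwidth needed to finish within Google's suggested 7-day window:

```python
# Back-of-the-envelope check: what sustained bandwidth is needed to
# upload 10 TB within 7 days? (Assumes decimal terabytes and ignores
# protocol overhead, retries, and competing traffic.)
TOTAL_BITS = 10 * 10**12 * 8     # 10 TB expressed in bits
SEVEN_DAYS_S = 7 * 24 * 3600     # the 7-day upload window in seconds

min_bps = TOTAL_BITS / SEVEN_DAYS_S
print(f"Minimum sustained bandwidth: {min_bps / 10**6:.0f} Mbps")
# A sustained rate of roughly 132 Mbps is required; anything slower
# and the online upload misses the 7-day window.
```

So the question is really whether the organisation can dedicate ~130+ Mbps of sustained egress for the whole week; if not, the appliance argument gets stronger.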
https://cloud.google.com/solutions/migration-to-google-cloud-transferring-your-large-datasets
The table shows that at 1 Gbps, transferring 10 TB takes about 30 hours. Corporate internet speeds are generally 1 Gbps or more, so I'm inclined to pick A.
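The 30-hour figure can be sanity-checked. A quick sketch of the ideal (zero-overhead) wire-transfer time for 10 TB at a few common link speeds, assuming decimal terabytes; the real-world figures in Google's table run somewhat higher because of protocol overhead:

```python
# Ideal (zero-overhead) transfer times for 10 TB at common link speeds.
# Google's published table quotes ~30 hrs at 1 Gbps, i.e. the same
# order of magnitude once real-world overhead is included.
TOTAL_BITS = 10 * 10**12 * 8     # 10 TB (decimal) in bits

for label, bps in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    hours = TOTAL_BITS / bps / 3600
    print(f"{label:>8}: ~{hours:.1f} hours (~{hours / 24:.1f} days)")
# 100 Mbps -> ~222 hours (~9.3 days); 1 Gbps -> ~22 hours; 10 Gbps -> ~2.2 hours
```

Even allowing for overhead, a 1 Gbps link finishes well inside the 7-day window, while 100 Mbps cuts it close.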
If you transfer 10 TB over the wire, your network will be saturated for the entire transfer; that isn't something a company would be happy to swallow.
Answer should be B. A could also work, but it hits its own limit: Cloud Storage allows a single object of at most 5 TiB, so a 10 TB file cannot be uploaded as one object.
https://cloud.google.com/storage/quotas
I will go with B
Also,
For your scenario with 10 TB of data in Cloud SQL, if you export to Avro without specifying compression, you can expect the resulting Avro file to be around the same size, potentially slightly smaller depending on the data characteristics. The question doesn't mention compression,
so let's not assume the Avro export will be compressed.
And if Cloud Storage itself can't handle an object larger than 5 TiB, there is no point in using gsutil.
https://cloud.google.com/storage-transfer/docs/known-limitations-transfer
Cloud Storage 5 TiB object size limit
Cloud Storage supports a maximum single-object size of 5 tebibytes. If you have objects larger than 5 TiB, the transfer fails for those objects, whether you upload to Cloud Storage directly or use Storage Transfer Service.
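The 5 TiB cap does not necessarily kill the online-upload options, because the export can be sharded into multiple objects (BigQuery load jobs accept wildcard URIs such as gs://bucket/export-*.avro; splitting the database export into multiple files is an assumption here, not something the question states). A sketch of the arithmetic:

```python
import math

# Cloud Storage caps a single object at 5 TiB, so a monolithic 10 TB
# Avro file cannot be uploaded as one object. Sharding the export
# sidesteps the limit; here is the minimum shard count.
EXPORT_BYTES = 10 * 10**12    # 10 TB (decimal) export size
OBJECT_LIMIT = 5 * 2**40      # 5 TiB per-object limit (binary)

shards = math.ceil(EXPORT_BYTES / OBJECT_LIMIT)
print(shards)  # 2: two objects of <= 5 TiB each already fit under the cap
```

In other words, the limit argues against a single-file export, not against gsutil as such.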
While Option A is feasible and could work depending on specific requirements and the security measures implemented, Option D (exporting as Avro, using Storage Transfer Service, and then loading into BigQuery) generally offers a more managed approach for moving sensitive patient records from a relational database into BigQuery. Note that Avro files uploaded to GCS still need to be secured: GCS itself offers security features like IAM policies and access controls, but using a public URL introduces additional security concerns.
IMO "A" is the most suitable option, since it could take 25 days to receive the Transfer Appliance and another 25 days to ship it back and have the data available.
https://cloud.google.com/transfer-appliance/docs/4.0/overview#transfer-speeds
Given the sensitivity of the patient records and the large size of the data, using Google's Transfer Appliance is a secure and efficient method. The Transfer Appliance is a hardware solution provided by Google for transferring large amounts of data. It enables you to securely transfer data without exposing it over the internet.
10 TB is nothing. With a single 10 Gbps interconnect you could transfer the data in about 3 hours, and even at 1 Gbps without an interconnect you could transfer it over a weekend. The Transfer Appliance can take 25 days to arrive and another 25 days before the data is available, which is not "time-efficient" at all. I go with A instead of B.
As per Google's recommendation, for transfers above 1 TB from on-premises or from other cloud storage such as S3, we should use Storage Transfer Service.
Transfer Appliance has an expected turnaround time of 20 days. https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#expected%20turnaround:~:text=The%20expected%20turnaround%20time%20for%20a%20network%20appliance%20to%20be%20shipped%2C%20loaded%20with%20your%20data%2C%20shipped%20back%2C%20and%20rehydrated%20on%20Google%20Cloud%20is%2020%20days.
The best answer would be A.
If gsutil can sustain 100 Mbps, the transfer would take about 12 days, which is still more time-efficient than B.
This is a reasonable assumption.
https://cloud.google.com/static/architecture/images/big-data-transfer-how-to-get-started-transfer-size-and-speed.png
I will go with "A" because of the transit time to ship the Transfer Appliance back to Google, which also depends on the organisation's location. gsutil works anywhere internet is available.
Transfer Appliance is not as time-efficient when you have enough bandwidth. https://cloud.google.com/architecture/migration-to-google-cloud-transferring-your-large-datasets#transfer_appliance_for_larger_transfers