You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?
A. Create a Cloud Dataproc Workflow Template
B. Create an initialization action to execute the jobs
C. Create a Directed Acyclic Graph in Cloud Composer
D. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster
A. Create a Cloud Dataproc Workflow Template
A Dataproc Workflow Template can run jobs both sequentially and concurrently; a Composer DAG is overkill here.
https://cloud.google.com/dataproc/docs/concepts/workflows/use-workflows
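For illustration, here is a minimal sketch (not the question's actual setup; the project, region, template id, class names, and jar path are placeholders) of how a Workflow Template expresses concurrent and sequential Spark jobs through prerequisite step ids, using the google-cloud-dataproc Python client:

```python
# Sketch only: project, region, template id, classes, and jar paths are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

template = {
    "id": "spark-pipeline",  # hypothetical template id
    "placement": {
        "managed_cluster": {
            "cluster_name": "ephemeral-spark",
            "config": {},  # cluster config kept minimal for the sketch
        }
    },
    "jobs": [
        # job-a and job-b have no prerequisites, so Dataproc runs them concurrently.
        {"step_id": "job-a",
         "spark_job": {"main_class": "com.example.JobA",
                       "jar_file_uris": ["gs://my-bucket/jobs.jar"]}},
        {"step_id": "job-b",
         "spark_job": {"main_class": "com.example.JobB",
                       "jar_file_uris": ["gs://my-bucket/jobs.jar"]}},
        # job-c lists both as prerequisites, so it runs only after they finish.
        {"step_id": "job-c",
         "prerequisite_step_ids": ["job-a", "job-b"],
         "spark_job": {"main_class": "com.example.JobC",
                       "jar_file_uris": ["gs://my-bucket/jobs.jar"]}},
    ],
}

parent = f"projects/{PROJECT_ID}/regions/{REGION}"
client.create_workflow_template(parent=parent, template=template)

# Each run still has to be triggered (or scheduled) separately, e.g.:
client.instantiate_workflow_template(
    name=f"{parent}/workflowTemplates/spark-pipeline"
)
```

Note that the template itself only defines the job graph; triggering each run on a schedule is a separate step, which is the crux of the debate below.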
The best option for automating your scheduled Spark jobs on Cloud Dataproc, considering sequential and concurrent execution, is:
C. Create a Directed Acyclic Graph (DAG) in Cloud Composer.
Here's why:
- DAG workflows: Cloud Composer excels at orchestrating complex workflows with dependencies, making it ideal for managing sequential and concurrent execution of your Spark jobs. You can define dependencies between tasks to ensure certain jobs only run after others finish.
- Automation: Cloud Composer lets you schedule workflows to run automatically based on triggers like time intervals or data availability, eliminating the need for manual intervention.
- Integration: Cloud Composer integrates seamlessly with Cloud Dataproc, allowing you to easily launch and manage your Spark clusters within the workflow.
- Scalability: Cloud Composer scales well to handle a large number of jobs and workflows, making it suitable for managing complex data pipelines.
While the other options have some merit, they fall short in certain aspects:
A. Cloud Dataproc Workflow Templates: Workflow templates can define a DAG of jobs with dependencies and run them on a cluster, but they have no built-in scheduler, so they cannot run on a schedule by themselves.
B. Initialization action: An initialization action only runs a script on the cluster's nodes while the cluster is being created; it is not suitable for orchestrating multiple scheduled jobs with dependencies.
D. Bash script: A Bash script might work for simple cases, but it is cumbersome to maintain and lacks the scheduling, dependency-management, and error-handling capabilities of Cloud Composer.
Therefore, utilizing a Cloud Composer DAG offers the most comprehensive and flexible solution for automating your scheduled Spark jobs with sequential and concurrent execution on Cloud Dataproc.
Directed Acyclic Graph (DAG): Cloud Composer is a managed Apache Airflow service that allows you to create and manage workflows as DAGs. You can define a DAG that includes tasks for running Spark jobs in sequence or concurrently.
Scheduling: Cloud Composer provides built-in scheduling capabilities, allowing you to specify when and how often your DAGs should run. You can schedule the execution of your Spark jobs at specific times or intervals.
Dependency Management: In a DAG, you can define dependencies between tasks. This means you can set up tasks to run sequentially or concurrently based on your requirements. For example, you can specify that Job B runs after Job A has completed, or you can schedule jobs to run concurrently when there are no dependencies.
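To make the dependency idea concrete, here is a minimal sketch of such a Composer/Airflow DAG (the project, region, cluster, class names, jar path, and schedule are placeholder assumptions, not from the question), using the DataprocSubmitJobOperator from the Google provider package; Job A and Job B run concurrently, and Job C runs only after both succeed:

```python
# Sketch only: assumes an existing Dataproc cluster; all names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-project"       # placeholder
REGION = "us-central1"          # placeholder
CLUSTER_NAME = "spark-cluster"  # placeholder

def spark_job(main_class):
    # Build a Dataproc Spark job payload for the given (placeholder) main class.
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "spark_job": {"main_class": main_class,
                      "jar_file_uris": ["gs://my-bucket/jobs.jar"]},
    }

with DAG(
    dag_id="dataproc_spark_pipeline",
    schedule_interval="@daily",      # time-based scheduling handled by Composer
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    job_a = DataprocSubmitJobOperator(task_id="job_a", project_id=PROJECT_ID,
                                      region=REGION, job=spark_job("com.example.JobA"))
    job_b = DataprocSubmitJobOperator(task_id="job_b", project_id=PROJECT_ID,
                                      region=REGION, job=spark_job("com.example.JobB"))
    job_c = DataprocSubmitJobOperator(task_id="job_c", project_id=PROJECT_ID,
                                      region=REGION, job=spark_job("com.example.JobC"))

    # job_a and job_b run concurrently; job_c runs only after both succeed.
    [job_a, job_b] >> job_c
```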
I would choose A if there were one more step to schedule the template. As it stands, it is like creating a DAG in Airflow without ever running it.
So only option C is correct here.
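For what that missing step could look like, here is a small sketch (the template id, project, and region are placeholder assumptions), where Composer provides the schedule and simply instantiates an existing Workflow Template:

```python
# Sketch only: assumes a Workflow Template named "spark-pipeline" already exists.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocInstantiateWorkflowTemplateOperator,
)

with DAG(
    dag_id="run_dataproc_workflow_template",
    schedule_interval="@daily",        # the schedule the template lacks on its own
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_template = DataprocInstantiateWorkflowTemplateOperator(
        task_id="run_spark_pipeline",
        template_id="spark-pipeline",  # placeholder template id
        project_id="my-project",       # placeholder
        region="us-central1",          # placeholder
    )
```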
I would have gone for Workflow Templates as well, but they lack scheduling capability, so you would need Cloud Composer (or Cloud Functions or Cloud Scheduler) anyway. Hence C seems to be the better solution.
Please see here:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions
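As one illustration of the Cloud Functions route (a sketch only; the function name, project, region, and template id are placeholder assumptions), an HTTP-triggered function that Cloud Scheduler calls on a cron schedule could instantiate the template:

```python
# Sketch of an HTTP-triggered Cloud Function (Python) that Cloud Scheduler could call
# on a cron schedule to instantiate an existing Dataproc Workflow Template.
import functions_framework
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"       # placeholder
REGION = "us-central1"          # placeholder
TEMPLATE_ID = "spark-pipeline"  # placeholder

@functions_framework.http
def run_workflow(request):
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    name = f"projects/{PROJECT_ID}/regions/{REGION}/workflowTemplates/{TEMPLATE_ID}"
    # instantiate_workflow_template returns a long-running operation; here we just
    # kick the workflow off without blocking on completion.
    client.instantiate_workflow_template(name=name)
    return "workflow started", 200
```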
C is the answer.
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions#cloud_composer
Cloud Composer is a managed Apache Airflow service you can use to create, schedule, monitor, and manage workflows. Advantages:
- Supports time- and event-based scheduling
- Simplified calls to Dataproc using Operators
- Dynamically generate workflows and workflow parameters
- Build data flows that span multiple Google Cloud products
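As a small illustration of the first two points (the bucket, object path, project, region, cluster, and class name are placeholder assumptions), a DAG can combine a time-based schedule with a data-availability gate before submitting a Spark job:

```python
# Sketch only: gate a Spark job on data availability in addition to the time schedule.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="spark_job_when_data_lands",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    wait_for_input = GCSObjectExistenceSensor(
        task_id="wait_for_input",
        bucket="my-bucket",                  # placeholder
        object="input/{{ ds }}/part-00000",  # placeholder, templated by run date
    )

    run_spark = DataprocSubmitJobOperator(
        task_id="run_spark",
        project_id="my-project",             # placeholder
        region="us-central1",                # placeholder
        job={
            "reference": {"project_id": "my-project"},
            "placement": {"cluster_name": "spark-cluster"},  # placeholder
            "spark_job": {"main_class": "com.example.DailyJob",
                          "jar_file_uris": ["gs://my-bucket/jobs.jar"]},
        },
    )

    # The Spark job only runs once the expected input object exists.
    wait_for_input >> run_spark
```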
C.
Composer is a better fit for scheduling Dataproc workflows; check the documentation:
https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions
Also, A is not enough: a Dataproc Workflow Template has no native scheduling option on its own.
You have streaming and batch jobs, so Composer is the choice for me.
Community vote distribution: A (35%), C (25%), B (20%), Other