You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
A. PigLatin, using Pig
B. HiveQL, using Hive
C. Java, using MapReduce
D. Python, using MapReduce
I would go with A.
C and D are similar, so both can be excluded. As for B, Hive is really a data warehouse system. I don't use Apache Pig myself, but since B, C, and D are wrong, A should be correct.
This answer depends on which language you are comfortable with.
Hadoop is your framework, and MapReduce is its native programming model in Java, designed for scale, parallel processing, restarting a pipeline from any checkpoint, and so on. So if you are comfortable with Java, you can customize your checkpointing at a low level in a better way. Otherwise, choose Pig, which is another programming layer that runs on top of Java, but then you need to learn that as well. If not, choose Python, since it can be deployed with Hadoop; Hadoop has been making updates for Python clients regularly.
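For illustration, here is a minimal sketch of what that low-level checkpointing can look like with the Java MapReduce API: a two-stage pipeline whose first stage writes intermediate results to an HDFS directory, so a restarted run can skip straight to stage 2 when that checkpoint already exists. All names here (EtlDriver, CleanMapper, DedupeReducer, AggregateMapper, AggregateReducer, the paths) are hypothetical examples, not something from the question; the mapper and reducer are sketched further below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EtlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path checkpoint = new Path(args[1]); // intermediate HDFS dir doubles as a checkpoint
        Path output = new Path(args[2]);
        FileSystem fs = FileSystem.get(conf);

        // Stage 1 runs only if its checkpoint is missing, so a restarted
        // pipeline resumes from stage 2 instead of reprocessing raw input.
        if (!fs.exists(checkpoint)) {
            Job stage1 = Job.getInstance(conf, "etl-stage-1");
            stage1.setJarByClass(EtlDriver.class);
            stage1.setMapperClass(CleanMapper.class);    // hypothetical, see sketch below
            stage1.setReducerClass(DedupeReducer.class); // hypothetical, see sketch below
            stage1.setOutputKeyClass(Text.class);
            stage1.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(stage1, input);
            FileOutputFormat.setOutputPath(stage1, checkpoint);
            if (!stage1.waitForCompletion(true)) System.exit(1);
        }

        // Stage 2 reads from the checkpoint, not from the raw input.
        Job stage2 = Job.getInstance(conf, "etl-stage-2");
        stage2.setJarByClass(EtlDriver.class);
        stage2.setMapperClass(AggregateMapper.class);    // hypothetical second-stage classes
        stage2.setReducerClass(AggregateReducer.class);  // not shown here
        stage2.setOutputKeyClass(Text.class);
        stage2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(stage2, checkpoint);
        FileOutputFormat.setOutputPath(stage2, output);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}
```

The design point is simply that each stage's output directory is a durable restart point, and splitting the pipeline is a matter of fanning several jobs out from the same checkpoint directory; this is the kind of bookkeeping Pig handles for you and Java makes explicit.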
Option C is the best one.
C. Java using MapReduce or D. Python using MapReduce
Apache Hadoop is a distributed computing framework that allows you to process large datasets using the MapReduce programming model. There are several options for writing ETL pipelines to run on a Hadoop cluster, but the most common are using Java or Python with the MapReduce programming model.
A. PigLatin, using Pig: PigLatin is a high-level data-flow language used to create ETL pipelines. Pig is built on top of Hadoop and lets you write scripts in PigLatin, a SQL-like language for processing data in Hadoop. Pig is a simpler option than MapReduce, but it lacks some capabilities, such as control over low-level data manipulation operations.
B. HiveQL, using Hive: HiveQL is a SQL-like language for querying and managing large datasets stored in Hadoop's distributed file system. Hive is built on top of Hadoop and provides a SQL-like interface to the data stored there. Hive is better suited to querying and managing large datasets than to building ETL pipelines.
Both Java and Python with MapReduce provide low-level control over data manipulation operations, and both allow you to write custom mapper and reducer functions to process data in a Hadoop cluster. The choice between Java and Python will depend on the development team's expertise and preference.
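To make the "custom mapper and reducer functions" concrete, here is a hypothetical cleaning and dedupe stage sketched in Java; the field layout (comma-separated input, keyed on the first column) is an assumption for the example, not part of the question.

```java
// CleanMapper.java -- hypothetical ETL step: parse delimited lines, key by the first field
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleanMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 2) {             // silently drop malformed rows
            outKey.set(fields[0].trim());
            outValue.set(fields[1].trim());
            ctx.write(outKey, outValue);
        }
    }
}
```

```java
// DedupeReducer.java -- hypothetical ETL step: keep one value per key
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        // all values for the same key arrive together; emit only the first
        ctx.write(key, values.iterator().next());
    }
}
```

Compared with PigLatin or HiveQL, you pay for this control in boilerplate, which is why the choice in the original question turns on how much low-level checkpoint and split logic you actually need.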
It has to be C, because while Pig can be used to simplify the writing of complex data transformation tasks and can store intermediate results, it doesn't provide the detailed control over checkpointing and pipeline splitting that those terms typically imply.
Also, while you can write MapReduce jobs in languages other than Java (like Python) using Hadoop Streaming or similar APIs, it may not be as efficient or as seamless as using Java, given Hadoop's JVM-native nature.
Is this really a question that could appear on the Google Cloud Professional Data Engineer exam? What does it have to do with Google Cloud? I would use Dataproc, no?