Exam DP-200 topic 2 question 16 discussion

Actual exam question from Microsoft's DP-200

Question #: 16
Topic #: 2

HOTSPOT -
A company is deploying a service-based data environment. You are developing a solution to process this data.
The solution must meet the following requirements:
✑ Use an Azure HDInsight cluster for data ingestion from a relational database in a different cloud service
✑ Use an Azure Data Lake Storage account to store processed data
✑ Allow users to download processed data
You need to recommend technologies for the solution.
Which technologies should you use? To answer, select the appropriate options in the answer area.
Hot Area:

Suggested Answer:

Box 1: Apache Sqoop -
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP).
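As a rough illustration of the ingestion step, the sketch below assembles a `sqoop import` command line in Python. The flags used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import options; the JDBC URL, user name, password-file path, table name, and Data Lake Storage account are hypothetical placeholders, not values from the question. The command is only printed, not executed.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=4):
    """Return the argument list for a `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the DB password
        "--table", table,                   # relational table to copy
        "--target-dir", target_dir,         # destination directory in storage
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


# All values below are hypothetical placeholders.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com:3306/sales",
    username="ingest_user",
    password_file="/user/ingest/db.password",
    table="orders",
    target_dir="adl://exampleaccount.azuredatalakestore.net/raw/orders",
)

# Print the command as it would be typed on a cluster head node.
print(shlex.join(cmd))
```

On a real cluster the printed command would be run from an SSH session on the head node, assuming the matching JDBC driver is available to Sqoop.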
Incorrect Answers:
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its MapReduce pedigree has endowed it with some quirks in both its semantics and execution.
RevoScaleR is a collection of proprietary functions in Machine Learning Server used for practicing data science at scale. For data scientists, RevoScaleR gives you data-related functions for import, transformation and manipulation, summarization, visualization, and analysis.

Box 2: Apache Kafka -
Apache Kafka is a distributed streaming platform.
A streaming platform has three key capabilities:
✑ Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
✑ Store streams of records in a fault-tolerant durable way.
✑ Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
✑ Building real-time streaming data pipelines that reliably get data between systems or applications
✑ Building real-time streaming applications that transform or react to the streams of data
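The publish, store, and process capabilities listed above can be pictured with a toy in-memory model. This is not the Kafka client API and has none of Kafka's durability or distribution; the class `ToyLog`, its methods, and the group names are hypothetical names chosen for illustration only. It shows an append-only log that keeps every record and lets each consumer group read from its own offset.

```python
from collections import defaultdict


class ToyLog:
    """A single-process stand-in for one topic: an append-only record log."""

    def __init__(self):
        self._records = []               # stored stream of records
        self._offsets = defaultdict(int)  # next unread index per consumer group

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group):
        """Return records this group has not yet seen and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch


log = ToyLog()
log.publish({"order_id": 1, "amount": 20})
log.publish({"order_id": 2, "amount": 35})

# Two independent consumer groups each see the full stream.
totals = [r["amount"] for r in log.poll("billing")]
audit = [r["order_id"] for r in log.poll("audit")]
```

Because records stay in the log after being read, a second group can process the same stream independently, which is the main difference from a plain message queue.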

Box 3: Ambari Hive View -
You can run Hive queries by using Apache Ambari Hive View. The Hive View allows you to author, optimize, and run Hive queries from your web browser.
References:
https://sqoop.apache.org/
https://kafka.apache.org/intro
https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-use-hive-ambari-view

by tucho at April 10, 2021, 11:22 a.m.

Comments

unidigm

3 years, 11 months ago

Apache Sqoop, Apache Hive, Ambari Hive View

upvoted 4 times

Kratik

3 years, 11 months ago

For Process, I think it should be Hive. For Download, the answer seems correct, but instead of 'Ambari Hive View' I think it should be 'Apache Hive View'.

upvoted 1 time

Hassan_Mazhar_Khan

4 years ago

For Process it should be 'Hive', as it provides a full storage mechanism.

upvoted 1 time

tucho

4 years ago

Can't it be "Hive" for the process task?

upvoted 2 times
