Exam Certified Data Engineer Professional All Questions

View all questions & answers for the Certified Data Engineer Professional exam

Exam Certified Data Engineer Professional topic 1 question 110 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 110
Topic #: 1

A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.

Which of the following solutions would you implement to achieve this requirement?

A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
B. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
C. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
E. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.

Suggested Answer: A

by aragorn_brego at Nov. 22, 2023, 2:38 a.m.

Comments

natadatabricksadf

4 months, 2 weeks ago

Selected Answer: B

High Concurrency clusters are deprecated, so B. See https://learn.microsoft.com/en-us/answers/questions/1688410/are-high-concurrency-clusters-deprecated-or-rename

upvoted 3 times

...

temple1305

4 months, 2 weeks ago

Selected Answer: B

High Concurrency clusters are deprecated, so B?

upvoted 3 times

hesamh

3 months, 1 week ago

Doesn't it lead to writing small files (less than 1 GB) in each partition?

upvoted 1 times
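
The small-file concern above can be checked with simple arithmetic. The sketch below is only an illustration: the 240 GiB daily volume is a made-up figure, and `avg_partition_bytes` is a hypothetical helper, not a Databricks API. It divides one table's daily volume evenly across time partitions of different granularity.

```python
GIB = 1024 ** 3

def avg_partition_bytes(daily_bytes: int, partitions_per_day: int) -> float:
    """Average bytes landing in one time partition, assuming an even spread over the day."""
    return daily_bytes / partitions_per_day

# Hypothetical daily volume for a single ingestion table.
daily = 240 * GIB

# Partition granularity expressed as partitions per day.
for label, count in [("daily", 1), ("hourly", 24), ("per-minute", 1440)]:
    size_gib = avg_partition_bytes(daily, count) / GIB
    print(f"{label:>10}: {size_gib:.3f} GiB per partition")
```

Under these assumed numbers, hourly partitions still hold about 10 GiB each, while per-minute partitions drop well below 1 GiB, so whether option B produces small files depends on the table's volume and the chosen granularity.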

...

shaojunni

7 months ago

Selected Answer: A

"hundreds of pipelines with parallel updates of many tables" indicates updating many tables concurrently via many pipelines. A is the best solution for that. B is the answer for updating a few large tables with few partitions.

upvoted 2 times

...

practicioner

8 months ago

Selected Answer: B

"Which of the following solutions": I'm sure this is a multi-select question. Options A and B are both correct.

upvoted 1 times

...

BrianNguyen95

10 months, 2 weeks ago

Selected Answer: B

High-volume, high-velocity data ingestion often becomes a bottleneck due to limited write parallelism. By partitioning ingestion tables based on small time durations (e.g., hourly or even by the minute), you create many smaller partitions. This allows concurrent writes to different partitions, significantly increasing the overall throughput of your data ingestion.

upvoted 2 times
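
As a rough illustration of the time-based partitioning described above, here is a minimal pure-Python sketch (not Spark or Delta Lake code). The names `partition_key` and `group_by_partition` and the sample events are assumptions made up for this example. It floors each event timestamp to a time bucket and groups records by that bucket, so each group could in principle be written independently of the others.

```python
from collections import defaultdict
from datetime import datetime, timezone

def partition_key(ts: datetime, minutes: int = 60) -> str:
    """Floor a timestamp to the start of its time bucket (UTC).

    Assumes `minutes` divides 1440 so buckets align with day boundaries.
    """
    ts = ts.astimezone(timezone.utc)
    total = ts.hour * 60 + ts.minute
    floored = total - total % minutes
    return ts.strftime("%Y-%m-%d") + f"T{floored // 60:02d}:{floored % 60:02d}"

def group_by_partition(events, minutes: int = 60) -> dict:
    """Group (timestamp, payload) pairs by their time-bucket partition key."""
    groups = defaultdict(list)
    for ts, payload in events:
        groups[partition_key(ts, minutes)].append(payload)
    return dict(groups)

# Hypothetical sample events.
events = [
    (datetime(2023, 11, 22, 2, 38, tzinfo=timezone.utc), "a"),
    (datetime(2023, 11, 22, 2, 59, tzinfo=timezone.utc), "b"),
    (datetime(2023, 11, 22, 3, 1, tzinfo=timezone.utc), "c"),
]

hourly = group_by_partition(events, minutes=60)
quarter_hourly = group_by_partition(events, minutes=15)
print(hourly)
print(quarter_hourly)
```

With hourly buckets the three sample events fall into two partitions. With 15-minute buckets they fall into three, which gives more independent write targets but also fewer records per partition.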

...

svik

11 months ago

Selected Answer: A

Since multiple pipelines are being used, a High Concurrency cluster would give maximum resource utilization.

upvoted 1 times

...

Er5

1 year ago

A. B is only useful for improving the performance of large-table ingestion.

upvoted 1 times

...

Curious76

1 year, 1 month ago

Selected Answer: D

Why not D?

upvoted 2 times

...

vctrhugo

1 year, 2 months ago

Both options A and B could be relevant depending on the specific details of the use case. If the emphasis is on optimizing concurrent queries and overall data throughput, option A might be more appropriate. If the primary concern is parallel updates of tables with high-volume, high-velocity data, option B is a more targeted approach.

upvoted 1 times

...

PrincipalJoe

1 year, 2 months ago

Selected Answer: B

The best way to deal with high-volume, high-velocity data is to use partitioning.

upvoted 1 times

...

bacckom

1 year, 3 months ago

Selected Answer: A

Databricks High Concurrency cluster

upvoted 2 times

...

petrv

1 year, 4 months ago

Selected Answer: A

1) Partitioning by Time: Partitioning tables by a small time duration allows for efficient parallelism in data writes. Each time partition can be processed independently, enabling parallel updates to multiple partitions concurrently. 2) Optimizing for Parallelism: By partitioning the tables based on time, data can be ingested and processed in parallel, providing the ability to handle high-volume, high-velocity data effectively. Regarding option A, Databricks High Concurrency clusters are more focused on supporting a large number of concurrent users, which might not directly address the requirement for parallel updates of many tables with extremely high-volume, high-velocity data.

upvoted 1 times

Isio05

10 months, 2 weeks ago

High Concurrency clusters can be beneficial both for multiple users and for the jobs/queries running on them.

upvoted 1 times

Isio05

10 months, 1 week ago

Sorry, after going through this question once more, I'll also go with B. It allows parallelism to be used efficiently.

upvoted 1 times

...

petrv

1 year, 4 months ago

sorry, the selected answer should have been B

upvoted 1 times

...

aragorn_brego

1 year, 4 months ago

Selected Answer: A

High Concurrency clusters in Databricks are designed for multiple concurrent users and workloads. They provide fine-grained sharing of cluster resources and are optimized for operations such as running multiple parallel queries and updates. This would be suitable for a solution that involves many pipelines with parallel updates, especially with high volume and high velocity data.

upvoted 4 times

...
