Exam AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 80 discussion

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?

  • A. Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
  • B. Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
  • C. Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
  • D. Write an AWS Glue Python shell job. Use pandas to transform the data.
Suggested Answer: D
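
For context, here is a minimal sketch of what option D could look like as a Glue Python shell script. The bucket names, object key, and the transform itself are illustrative assumptions, not part of the question; the point is simply that a sub-100 MB CSV fits comfortably in memory with pandas, so no Spark cluster is needed.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-incoming-bucket"   # hypothetical bucket names
TARGET_BUCKET = "example-processed-bucket"


def transform(key: str) -> None:
    # Read one small (<100 MB) CSV object straight into a pandas DataFrame.
    obj = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # Illustrative transform: normalize column names and drop fully empty rows.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")

    # Write the transformed file back to S3 as CSV.
    out = io.StringIO()
    df.to_csv(out, index=False)
    s3.put_object(Bucket=TARGET_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))


if __name__ == "__main__":
    transform("daily/2024-01-01/users.csv")  # example key
```
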

Comments

halogi
Highly Voted 10 months, 2 weeks ago
Selected Answer: C
AWS Glue Python shell jobs are billed at $0.44 per DPU-hour. AWS Glue PySpark jobs are billed at $0.29 per DPU-hour with flexible execution and $0.44 per DPU-hour with standard execution. Source: https://aws.amazon.com/glue/pricing/
upvoted 10 times
GustonMari
6 months, 4 weeks ago
That's true for 1 DPU, but it's not the whole picture: the minimum for a PySpark job is 1 DPU, whereas a Python shell job can run on as little as 0.0625 DPU. So the Python shell job is much cheaper for a small dataset and a quick ETL transformation.
upvoted 4 times
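
Putting the two comments above together, a rough back-of-the-envelope comparison (assuming the $0.44 per DPU-hour standard rate quoted above, the 2-DPU Spark minimum mentioned further down the thread, and a hypothetical 10-minute daily run; billing minimums and granularity are ignored):

```python
# Rough cost sketch using the figures discussed in this thread (assumptions, not official numbers).
RATE_PER_DPU_HOUR = 0.44   # standard execution rate quoted above
RUN_HOURS = 10 / 60        # hypothetical 10-minute daily job

python_shell_cost = 0.0625 * RUN_HOURS * RATE_PER_DPU_HOUR  # Python shell job at 1/16 DPU
pyspark_cost = 2 * RUN_HOURS * RATE_PER_DPU_HOUR            # Spark job at a 2-DPU minimum

print(f"Python shell: ${python_shell_cost:.4f} per run")  # ~$0.0046
print(f"PySpark:      ${pyspark_cost:.4f} per run")       # ~$0.1467
```
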
atu1789
Highly Voted 11 months, 3 weeks ago
Selected Answer: D
Option D (write an AWS Glue Python shell job and use pandas to transform the data) is the most cost-effective solution for the described scenario. AWS Glue Python shell jobs are a good fit for smaller-scale ETL tasks, especially .csv files of less than 100 MB each. pandas, a powerful and efficient data-manipulation library for Python, is well suited to processing and transforming files of this size. This approach avoids the overhead and additional cost of more complex solutions like Amazon EKS or EMR, which are better suited to larger-scale, more complex data processing. Given the requirement to process small daily .csv uploads, this solution provides the necessary functionality with minimal resources, aligning well with the goal of cost-effectiveness.
upvoted 6 times
YUICH
Most Recent 1 week, 2 days ago
Selected Answer: D
It is important not to compare just the “price per DPU hour,” but to consider the total cost by factoring in overhead for job startup, minimum DPU count, execution time, and data volume. For a relatively lightweight workload—such as processing approximately 100 MB of CSV files on a daily basis—option (D), using an AWS Glue Python shell job, is the most cost-effective choice.
upvoted 1 times
LR2023
6 months, 3 weeks ago
Selected Answer: D
Going with D: https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html
upvoted 2 times
pypelyncar
7 months, 4 weeks ago
Selected Answer: D
This is really a toss-up between two options; the Spark and pandas approaches are both plausible here. I would go with pandas, although it could just as well be Spark. 50/50. I hope not to find this question in the exam.
upvoted 6 times
VerRi
8 months, 2 weeks ago
Selected Answer: C
PySpark with Spark (flexible execution): $0.29/hr for 1 DPU. PySpark with Spark (standard execution): $0.44/hr for 1 DPU. Python shell with pandas: $0.44/hr for 1 DPU.
upvoted 3 times
cloudata
9 months ago
Selected Answer: D
Python Shell is cheaper and can handle small to medium tasks. https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html
upvoted 5 times
chakka90
9 months, 1 week ago
D. Even though PySpark is cheaper per DPU-hour, you have to use a minimum of 2 DPUs, which would increase the cost anyway, so I feel that D should be correct.
upvoted 3 times
khchan123
9 months, 1 week ago
Selected Answer: D
D. While AWS Glue PySpark jobs are scalable and suitable for large workloads, C may be overkill for processing small .csv files (less than 100 MB each). The overhead of using Apache Spark may not be cost-effective for this specific use case.
upvoted 3 times
Leo87656789
9 months, 2 weeks ago
Selected Answer: D
Option D: Even though the Python shell job is more expensive on a DPU-hour basis, you can select the "1/16 DPU" option in the job details for a Python shell job, which is definitely cheaper than a PySpark job.
upvoted 3 times
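
For anyone wiring this up programmatically: the "1/16 DPU" choice mentioned above corresponds to the MaxCapacity parameter when the job is defined through the Glue API. A minimal sketch with boto3 (job name, role ARN, and script location are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job definition: a Python shell job capped at 1/16 of a DPU.
glue.create_job(
    Name="daily-csv-etl",                               # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role ARN
    Command={
        "Name": "pythonshell",
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://example-bucket/scripts/daily_csv_etl.py",
    },
    MaxCapacity=0.0625,  # Python shell jobs accept 0.0625 or 1 DPU
)
```
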
lucas_rfsb
10 months, 1 week ago
Selected Answer: C
AWS Glue Python shell jobs are billed at $0.44 per DPU-hour. AWS Glue PySpark jobs are billed at $0.29 per DPU-hour with flexible execution and $0.44 per DPU-hour with standard execution. Source: https://aws.amazon.com/glue/pricing/
upvoted 6 times
[Removed]
10 months, 1 week ago
Selected Answer: D
https://medium.com/@navneetsamarth/reduce-aws-cost-using-glue-python-shell-jobs-70a955d4359f#:~:text=The%20cheapest%20Glue%20Spark%20ETL,1%2F16th%20of%20a%20DPU.&text=This%20can%20result%20in%20massive,just%20a%20better%20design%20overall!
upvoted 4 times
GiorgioGss
10 months, 3 weeks ago
Selected Answer: D
D is cheaper than C. Not as scalable, but cheaper.
upvoted 3 times
rralucard_
1 year ago
Selected Answer: C
AWS Glue is a fully managed ETL service, which means you don't need to manage infrastructure, and it automatically scales to handle your data processing needs. This reduces operational overhead and cost. PySpark, as a part of AWS Glue, is a powerful and widely-used framework for distributed data processing, and it's well-suited for handling data transformations on a large scale.
upvoted 5 times