Discover what AWS EMR is, how it works, the benefits and limitations of the service, and when you should use it as part of your big data strategy.
According to Statista, the total volume of data created, stored, copied, and consumed in 2020 was over 64 zettabytes (ZB), or about 64 trillion gigabytes (GB). This is expected to rise to 181 ZB by 2025.
A large portion of this data is likely to be significant to your business. It can provide you with new insights that help you improve your product, communicate with consumers, and perform risk analysis. However, you’ll need the right tools to extract, sort, process, and analyze it.
That’s where tools like Amazon’s Elastic MapReduce (EMR) come in. In this guide, we’ll discuss what EMR is, how it works, and how it may benefit you. You’ll then be able to decide if it’s worth integrating as part of your big data strategy.
Amazon Elastic MapReduce provides tools and workflows for big data management in the cloud. With Amazon EMR, your data scientists get a web-based big data platform that can process massive amounts of data using a variety of open-source tools such as Presto, Apache Spark, and Apache Hive.
EMR also enables you to more easily build, scale, and optimize your cloud data environment compared to building and maintaining one on-premises. Here’s the thing:
Companies seeking to gain more insight and value from their data often struggle to capture, store, and analyze all of it. As data grows, it comes from more sources and becomes increasingly diverse. Different applications and lines of business also need secure access to it for analysis.
AWS EMR can help solve these issues. EMR is a managed cluster platform that helps organizations run big data frameworks on AWS to analyze and process large sets of data more efficiently.
By using these frameworks along with related open-source projects such as Apache Flink and Apache Pig, you can process and sort data for business intelligence and analytics purposes.
In addition, you can use AWS EMR to transform and move large sets of data into and out of other AWS data stores and databases such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
AWS designed EMR to be an easy-to-use, highly scalable, and reliable big data platform. It does that by enabling certain capabilities, such as:
These features make Amazon EMR ideal for performing big data analytics, building scalable data pipelines, and processing streaming data in real time. Those are only a few highlighted Amazon EMR features, though. There are other ways to use the managed big data platform.
The Amazon EMR architecture comprises several layers. Each layer provides a particular set of features and functions to the cluster:
The storage layer contains the cluster's file systems. Amazon EMR lets you use several file systems with your cluster, such as the Hadoop Distributed File System (HDFS), the EMR File System (EMRFS), and the local file system.
The cluster resource management layer is where cluster resources are managed. The EMR service uses Yet Another Resource Negotiator (YARN) to centrally manage resources for multiple data processing frameworks. The layer also schedules jobs for processing.
The data processing frameworks layer is where the data processing and analysis happen, using a variety of supported frameworks. You can pick a framework based on your processing requirement, such as batch, streaming, interactive, or in-memory. The two main supported frameworks are Hadoop MapReduce and Apache Spark.
The applications layer is where your apps are hosted, including Apache Hive and Pig. The applications let you add capabilities such as building data warehouses, using ML algorithms, and creating stream processing apps.
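To make the data processing layer more concrete, here is a minimal, single-machine sketch of the map, shuffle, and reduce pattern that Hadoop MapReduce distributes across a cluster. It is plain Python, not an EMR or Hadoop API, and the function names are hypothetical ones chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in order, as a framework would across many nodes."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data on EMR", "big clusters process big data"]
    print(word_count(sample))
```

On a real cluster, the framework runs the map and reduce steps on many nodes in parallel and YARN schedules the work, but the shape of the computation is the same.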
As for how the Amazon EMR architecture works in practice, consider Amazon EMR on Amazon Elastic Kubernetes Service (EKS) as an example.
EMR on EKS loosely couples workloads to the infrastructure they run on. Each infrastructure layer supports orchestration for the following layer.
You first set up Amazon EMR on EKS. Then you assign a job to Amazon EMR through a job definition. A job run is a unit of work, such as a SparkSQL query. The job’s definition includes all of the parameters specific to the application. EKS uses these parameters to determine which pods and containers to deploy.
Figure: Amazon EMR at work.
After that, Amazon EKS brings up the required Amazon EC2 and AWS Fargate resources to run the job.
This means:
Figure: How Amazon EMR on the Elastic Kubernetes Service works with other AWS services.
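The job definition described above for EMR on EKS can be sketched as a request payload. The example below only builds the dictionary, shaped like the parameters of the `start_job_run` operation on the boto3 `emr-containers` client, and does not call AWS. The helper name, the release label, the Spark settings, and every ID, ARN, and S3 path are placeholder assumptions, not values from this article.

```python
from typing import Any, Dict, List


def build_job_run_request(
    virtual_cluster_id: str,
    execution_role_arn: str,
    entry_point: str,
    entry_point_arguments: List[str],
    release_label: str = "emr-6.15.0-latest",  # assumed label; check what your account offers
    name: str = "sample-spark-job",
) -> Dict[str, Any]:
    """Build a job run definition for EMR on EKS (hypothetical helper)."""
    return {
        "name": name,
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": release_label,
        "jobDriver": {
            "sparkSubmitJobDriver": {
                # Script to run and the arguments passed to it.
                "entryPoint": entry_point,
                "entryPointArguments": entry_point_arguments,
                # Application-specific parameters that shape the pods EKS deploys.
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.executor.memory=2G "
                    "--conf spark.executor.cores=2"
                ),
            }
        },
    }


if __name__ == "__main__":
    request = build_job_run_request(
        virtual_cluster_id="example-virtual-cluster-id",
        execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
        entry_point="s3://example-bucket/scripts/job.py",
        entry_point_arguments=["s3://example-bucket/output/"],
    )
    print(request)
    # With boto3 installed and credentials configured, the request could be
    # submitted with: boto3.client("emr-containers").start_job_run(**request)
```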
The Amazon EMR service processes your data using Amazon Elastic Compute Cloud (Amazon EC2) instances along with open-source tools such as Apache Spark, Flink, HBase, and Presto.
You get to pull all data into a data lake and analyze it with your choice of open-source distributed processing frameworks such as:
By far, the most popular storage infrastructure for a data lake is Amazon S3. EMR allows you to store data in Amazon S3 and run compute as you need to process that data. EMR clusters can be launched in minutes. You don’t have to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning.
Once the processing is done, you can switch off your clusters. You can also automatically resize clusters to accommodate peaks and scale them down without impacting your Amazon S3 data lake storage.
Additionally, you can run multiple clusters in parallel, allowing them to share the same data set. EMR will monitor your clusters, retry failed tasks, and automatically replace poorly performing instances.
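As a rough illustration of launching a cluster that shuts itself down when its work is done, the sketch below builds a request shaped like the parameters of the boto3 `emr` client's `run_job_flow` operation. It does not call AWS. The helper name, cluster name, release label, S3 path, and node counts are placeholder assumptions. The role names are the AWS default EMR role names and may differ in your account.

```python
from typing import Any, Dict


def build_transient_cluster_request(
    name: str,
    log_uri: str,
    core_node_count: int = 2,
    instance_type: str = "m6a.xlarge",
    release_label: str = "emr-6.15.0",  # assumed release; pick one available to you
) -> Dict[str, Any]:
    """Build a request for a cluster that terminates once its steps finish."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,  # cluster logs are written to Amazon S3
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_node_count,
                },
            ],
            # False means the cluster shuts down when no steps remain,
            # so you stop paying for idle compute.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_transient_cluster_request(
        name="nightly-etl",
        log_uri="s3://example-bucket/emr-logs/",
    )
    print(request)
    # With boto3 installed and credentials configured:
    # boto3.client("emr").run_job_flow(**request)
```

Because the data stays in Amazon S3, terminating the cluster removes the compute but leaves the data lake untouched.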
If you use Amazon CloudWatch along with EMR, you can collect and track metrics, logs, and audits. This approach also allows you to set alarms and automatically react to changes.
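One way to set such an alarm is to watch the `IsIdle` metric that EMR publishes to CloudWatch in the `AWS/ElasticMapReduce` namespace. The sketch below builds parameters shaped like those of the boto3 `cloudwatch` client's `put_metric_alarm` operation without calling AWS. The helper name, alarm name, cluster ID, SNS topic ARN, and the 30-minute window are illustrative assumptions.

```python
from typing import Any, Dict


def build_idle_cluster_alarm(
    cluster_id: str,
    sns_topic_arn: str,
    idle_minutes: int = 30,
) -> Dict[str, Any]:
    """Build an alarm that fires when a cluster has been idle for a while."""
    period_seconds = 300  # evaluate the metric in five-minute periods
    periods = max(1, idle_minutes // (period_seconds // 60))
    return {
        "AlarmName": f"emr-{cluster_id}-idle",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",  # 1 when the cluster is idle, otherwise 0
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        # Minimum >= 1 means the cluster was idle for the whole period.
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # notify, or trigger automation
    }


if __name__ == "__main__":
    alarm = build_idle_cluster_alarm(
        cluster_id="j-EXAMPLE12345",
        sns_topic_arn="arn:aws:sns:us-east-2:111122223333:example-topic",
    )
    print(alarm)
    # With boto3 installed and credentials configured:
    # boto3.client("cloudwatch").put_metric_alarm(**alarm)
```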
Pricing for Amazon EMR is based on several factors, including the duration you use the service, how you deploy the EMR apps, and deployment type.
Figure: How pricing for Amazon EMR on EC2 works.
In terms of duration, Amazon EMR bills per second of use, with a 60-second minimum. Rates are quoted per hour, though.
In terms of how you deploy your EMR apps, you can either run Amazon EMR with EC2 instances or AWS Fargate. That means you pay a separate fee for the underlying EC2 or Fargate servers on top of the EMR rate per hour.
As for deployment type, you can choose from four options:
For Amazon EMR on EC2, pricing is based on AWS Region, instance type, duration, and purchase option (On-Demand vs Reserved Instances vs Spot Instances). For example, running EMR on an m6a.xlarge EC2 instance in the US East (Ohio) Region costs $0.1728/hour for the EC2 instance plus $0.0432/hour for EMR.
For Amazon EMR on EKS, the service charges you based on the memory and vCPU resources you request to run a Pod or a Task (from when the image download begins to when the Pod or Task terminates, rounded to the nearest second). There's a 60-second minimum requirement. For example, pricing in the US East (Ohio) Region is $0.01012/vCPU/hour and $0.00111125/GB/hour.
Amazon EMR on AWS Outposts is priced the same as cloud-based instances of EMR.
For Amazon EMR Serverless, pricing is based on the amount of compute (vCPU and memory) and storage resources your apps consume, aggregated across all your worker nodes. It is also based on the operating system you run them on.
For example, it costs $0.052624/vCPU/hour and $0.0057785/GB/hour for compute and memory, as well as $0.000111/GB/hour for any extra ephemeral storage you add to the default 20 GB.
Of course, you can find the latest pricing updates for Amazon EMR on the relevant AWS pricing pages.
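The example rates above can be turned into a rough estimator. This is a sketch under stated assumptions, not an official calculator. It assumes the two EC2 figures are the instance charge and the EMR charge respectively, that the 20 GB of default storage applies per worker, and that the quoted US East (Ohio) rates are still current. The function and class names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable

SECONDS_PER_HOUR = Decimal(3600)
MINIMUM_BILLED_SECONDS = 60  # minimum stated for EMR on EC2 and EMR on EKS

# Example rates quoted in this article (US East, Ohio), in USD per hour.
EC2_INSTANCE_RATE = Decimal("0.1728")    # m6a.xlarge (assumed to be the EC2 part)
EMR_ON_EC2_RATE = Decimal("0.0432")      # assumed to be the EMR part
EKS_VCPU_RATE = Decimal("0.01012")
EKS_GB_RATE = Decimal("0.00111125")
SERVERLESS_VCPU_RATE = Decimal("0.052624")
SERVERLESS_GB_RATE = Decimal("0.0057785")
SERVERLESS_STORAGE_RATE = Decimal("0.000111")
SERVERLESS_FREE_STORAGE_GB = 20          # assumed to apply per worker


def prorate(hourly_rate: Decimal, seconds: int) -> Decimal:
    """Convert an hourly rate into a charge for a number of seconds."""
    return hourly_rate * Decimal(seconds) / SECONDS_PER_HOUR


def emr_on_ec2_cost(runtime_seconds: int, instance_count: int = 1) -> Decimal:
    """EC2 charge plus EMR charge, billed per second with a minimum."""
    billed = max(runtime_seconds, MINIMUM_BILLED_SECONDS)
    return prorate(EC2_INSTANCE_RATE + EMR_ON_EC2_RATE, billed) * instance_count


def emr_on_eks_charge(vcpus: int, memory_gb: int, runtime_seconds: int) -> Decimal:
    """EMR charge for one Pod; EC2 or Fargate compute is billed separately."""
    billed = max(runtime_seconds, MINIMUM_BILLED_SECONDS)
    return prorate(vcpus * EKS_VCPU_RATE + memory_gb * EKS_GB_RATE, billed)


@dataclass(frozen=True)
class ServerlessWorker:
    vcpus: int
    memory_gb: int
    storage_gb: int
    runtime_seconds: int


def emr_serverless_cost(workers: Iterable[ServerlessWorker]) -> Decimal:
    """Sum compute, memory, and extra storage charges across all workers."""
    total = Decimal(0)
    for w in workers:
        extra_storage = max(w.storage_gb - SERVERLESS_FREE_STORAGE_GB, 0)
        hourly = (
            w.vcpus * SERVERLESS_VCPU_RATE
            + w.memory_gb * SERVERLESS_GB_RATE
            + extra_storage * SERVERLESS_STORAGE_RATE
        )
        total += prorate(hourly, w.runtime_seconds)
    return total


if __name__ == "__main__":
    print("EMR on EC2, 3 nodes for 2 hours:", emr_on_ec2_cost(7200, instance_count=3))
    print("EMR on EKS, 4 vCPU / 16 GB Pod for 1 hour:", emr_on_eks_charge(4, 16, 3600))
    workers = [ServerlessWorker(4, 16, 20, 3600), ServerlessWorker(4, 16, 50, 1800)]
    print("EMR Serverless, two workers:", emr_serverless_cost(workers))
```

`Decimal` is used instead of floats so that small per-second charges do not pick up rounding noise. Swap in the current rates from the AWS pricing pages before relying on any number this produces.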
AWS EMR makes deploying distributed data processing frameworks easy and cost-effective. Furthermore, it decouples compute and storage. This allows both to grow independently, leading to better resource utilization.
In the past, users have found operating conventional data processing frameworks like Apache Spark to be quite challenging — especially when used in conjunction with other frameworks like Hadoop.
It could be complex, expensive, and time-consuming. Organizations were required to purchase and integrate hardware (servers, computers, etc.), then install and manage software. Of course, software and hardware would require constant upgrades, further adding to expenses and complexity.
Various lines of business would often timeshare centralized cluster resources. Consequently, this led to under-utilization during idle periods and missed SLAs during peak periods.
As your data grew, the size of your infrastructure would grow along with it. Because storage and compute were tied together, increasing storage meant scaling expensive compute as well.
With EMR, you pay a per-second rate only for the cluster resources you use. Customer support is available 24/7 through your normal AWS support plan at a fraction of what other commercial distributed processing framework vendors would charge.
With Spot pricing, you can lower your bill by up to 90%. IDC recently found that the return on investment of EMR versus on-premises is 342% over five years.
Amazon EMR is nearly unbeatable, especially when coupled with some of Amazon’s other web-based services. Nevertheless, while its benefits may be self-evident and many, it does have its limitations. In this section of the guide, we’ll summarize some of Amazon EMR’s pros and cons.
Other benefits include fast spin-up times for EC2 instances. EMR can also run inside an Amazon Virtual Private Cloud (VPC), which allows for increased data security.
AWS EMR’s other limitations are service-based. For instance, Amazon EMR Studio is only available in certain regions such as East US, West US, Asia-Pacific, Canada, and EU. You can only set a single Amazon VPC with a maximum of five subnets for an EMR Studio. However, you can create multiple EMR Studios and associate them with different VPCs and subnets.
AWS EMR can help you change your rigid in-house cluster infrastructure and provide you with hassle-free Hadoop management. It can also significantly cut the time of data processing. However, as with most AWS products, its pricing can be a little incomprehensible.
Amazon charges you a per-second rate that is also tied to the number of clusters you are running. In addition, you’ll need to pay for the underlying EC2 instances and Amazon Elastic Block Store (EBS) volumes. If you’re running a large relational database, you’ll need to consider the cost of using the AWS Database Migration Service to move and host your data.
This is just the tip of the iceberg. To get the most out of EMR, you’ll likely need to employ a host of other AWS tools such as CloudWatch and S3 (for logs). Tracking and managing these costs can be quite daunting. It’s different when you use CloudZero.
With CloudZero, however, you gain complete insight into your AWS cloud spend. CloudZero’s cost intelligence platform maps costs to your products, features, services, dev teams, and more. For example, you’ll see your cost per individual customer, per product feature, per service, per environment and more.
CloudZero also automatically detects cost issues in real time. You’ll then receive context-rich alerts via Slack so you can stop the bleeding before it runs for days or weeks. This ensures you catch potential overspending before it hurts your COGS and margins.
With cloud cost intelligence, you’ll be able to drill into cost data from a high level down to the individual components that drive your cloud spend — and see exactly how services drive your cloud costs and why.
That means you’ll know exactly who and what is changing your cloud costs, and why, across AWS, Azure, GCP, Kubernetes, Snowflake, Datadog, and more, right from one platform.
Drift has saved over $3 million using CloudZero. Demandbase cut its annual AWS spend by 36%, justifying $175 million in financing. Here’s your chance to control your Amazon EMR costs. See CloudZero in action for yourself. It’s on us at no risk to you.
Cody Slingerland, a FinOps certified practitioner, is an avid content creator with over 10 years of experience creating content for SaaS and technology companies. Cody collaborates with internal team members and subject matter experts to create expert-written content on the CloudZero blog.