What Is AWS EMR? Here's Everything You Need To Know

Discover what AWS EMR is, how it works, the benefits and limitations of the service, and when you should use it as part of your big data strategy.

According to Statista, the total volume of data created, stored, copied, and consumed in 2020 was over 64 zettabytes (ZB), or about 64 trillion gigabytes (GB). This is expected to rise to 181 ZB by 2025.

A large portion of this data is likely to be significant to your business. It can provide you with new insights that help you improve your product, communicate with consumers, and perform risk analysis. However, you’ll need the right tools to extract, sort, process, and analyze it.

That’s where tools like Amazon’s Elastic MapReduce (EMR) come in. In this guide, we’ll discuss what EMR is, how it works, and how it may benefit you. You’ll then be able to decide if it’s worth integrating as part of your big data strategy.

What Is Amazon EMR?

Amazon Elastic MapReduce provides tools and workflows for big data management in the cloud. With Amazon EMR, your data scientists get a web-based big data platform that can process massive amounts of data using a variety of open-source tools such as Presto, Apache Spark, and Apache Hive.

EMR also enables you to more easily build, scale, and optimize your cloud data environment compared to building and maintaining one on-premises. Here’s the thing:

Companies seeking to gain more insight and value from their data often struggle to capture, store, and analyze all of it. As data grows, it comes from more sources and becomes increasingly diverse. Thus, it needs to be securely accessed to be analyzed by different applications and lines of business.

AWS EMR can help solve these issues. EMR is a managed cluster platform that assists organizations in running Big Data frameworks on AWS to analyze and process large sets of data more efficiently.

[Figure: Amazon EMR diagram]

By using these frameworks along with related open-source projects such as Apache Flink and Apache Pig, you can process and sort data for business intelligence and analytics purposes.

In addition, you can use AWS EMR to transform and move large sets of data into and out of other AWS data stores and databases such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Amazon EMR Features: What Can EMR Do?

AWS designed EMR to be an easy-to-use, highly scalable, and reliable big data platform. It does that by enabling certain capabilities, such as:

  • Managed big data platform - Provision, configure, and launch your clusters in minutes by eliminating a lot of the manual work it would otherwise take.
  • Automated elasticity - Use custom policies to continuously scale your clusters so you can meet your workload requirements.
  • Optimize big data processing costs - Deploy multiple clusters or resize a running one to handle an increase in workload or reduce capacity if there’s less work to do, thereby reducing your costs.
  • Leverage a variety of flexible data stores - Use data stores like the Hadoop Distributed File System (HDFS), Amazon DynamoDB, Amazon Redshift, and Amazon Relational Database Service (Amazon RDS).
  • Take advantage of your favorite big data solutions - Select and use the latest version of your preferred open-source framework, such as Apache Spark or Apache Hadoop.
  • Manage your data with Amazon S3 - Use Apache Hudi to manage incremental data processing and pipeline development.
  • Process large data sets fast - With Apache Spark on EMR, in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) define how the data transformations happen.
  • Secure your data with access controls - Amazon EMR application processes call other AWS services using the EC2 instance profile by default. There are three ways Amazon EMR manages access to Amazon S3 data in multi-tenant clusters: integrating with AWS Lake Formation, integrating natively with Apache Ranger, or using User Role Mapper.

These features make Amazon EMR ideal for performing big data analytics, building scalable data pipelines, and processing streaming data in real time. Those are only a few highlighted Amazon EMR features, though. There are other ways to use the managed big data platform.

How Does The Amazon EMR Architecture Work?

The Amazon EMR architecture comprises several layers. Each layer provides a particular set of features and functions to the cluster:

Storage layer

This is the layer that contains the cluster's file systems. Amazon EMR lets you use several file systems with your cluster, such as:

  • The local file system - Locally attached storage on which data persists only as long as the Amazon EC2 instance is running.
  • Hadoop Distributed File System (HDFS) - An ephemeral, scalable, distributed file system for Hadoop. HDFS distributes the data it stores across the instances in a cluster, keeping multiple copies on different instances as a backup in case any instance fails.
  • EMR File System (EMRFS) - EMRFS extends Hadoop’s ability to access data directly in Amazon S3 as you would in HDFS. Typically, S3 stores the input and output data while HDFS stores intermediate results.

Next is the layer that manages cluster resources.

Cluster resource management layer

This is where cluster resources are managed. The EMR service uses Yet Another Resource Negotiator (YARN) to centrally manage resources for multiple data processing frameworks. The layer also schedules jobs for processing.

Data processing frameworks layer

This is where the data processing and analyses happen using a variety of supported frameworks. So, you can pick a framework based on your processing requirement, such as batch, streaming, interactive, or in-memory. The two main supported frameworks are Hadoop MapReduce and Apache Spark.

App and programs layer

This is where your apps are hosted, including Apache Hive and Pig. These applications let you add capabilities such as building data warehouses, using ML algorithms, and creating stream processing apps.

As for how the Amazon EMR architecture works in practice, consider Amazon EMR on Amazon Elastic Kubernetes Service (EKS), as an example.

EMR on EKS loosely couples workloads to the infrastructure they run on. Each infrastructure layer supports orchestration for the following layer.

You first set up Amazon EMR on EKS. Then you submit a job to Amazon EMR through a job definition. A job run is a unit of work, such as a Spark SQL query. The job’s definition includes all of the parameters specific to the application. Amazon EMR uses these parameters to tell Amazon EKS which pods and containers to deploy.
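To make the idea of a job definition more concrete, here is a minimal sketch in Python. It builds a request shaped like the one the boto3 `emr-containers` client accepts in `start_job_run`. The helper name, virtual cluster ID, role ARN, S3 path, release label, and Spark settings are all placeholders chosen for illustration, not values from this guide. The snippet only builds the dictionary and does not call AWS.

```python
def build_job_run_request(virtual_cluster_id, execution_role_arn, entry_point):
    """Build an EMR on EKS job-run request (hypothetical helper, illustration only)."""
    return {
        "name": "sample-spark-job",                # placeholder job name
        "virtualClusterId": virtual_cluster_id,    # the EMR virtual cluster mapped to an EKS namespace
        "executionRoleArn": execution_role_arn,    # IAM role the job runs as
        "releaseLabel": "emr-6.15.0-latest",       # placeholder release label
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,         # script or JAR to run
                # Application-specific parameters that shape the pods to deploy
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.executor.memory=2G "
                    "--conf spark.executor.cores=1"
                ),
            }
        },
    }


request = build_job_run_request(
    virtual_cluster_id="abc123placeholder",
    execution_role_arn="arn:aws:iam::111122223333:role/placeholder-job-role",
    entry_point="s3://placeholder-bucket/scripts/job.py",
)
print(request["jobDriver"]["sparkSubmitJobDriver"]["entryPoint"])
```

With boto3 installed and AWS credentials configured, a request like this could be passed as `boto3.client("emr-containers").start_job_run(**request)`.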

amazon emr at work diagram

Credit: Amazon EMR at work

After that, Amazon EKS brings up the required Amazon EC2 and AWS Fargate resources to run the job.

This means:

  • You can run multiple isolated jobs concurrently thanks to this loose coupling.
  • You can benchmark the same job against different compute backends.
  • You can spread your job across multiple Availability Zones (AZs) to maximize availability.

Here is an illustration of how Amazon EMR on EKS interacts with other AWS services.

Amazon EMR diagram

Credit: How Amazon EMR on the Elastic Kubernetes Service works with other AWS services.

How Does Amazon EMR Actually Work?

The Amazon EMR service processes your data using Amazon Elastic Compute Cloud (Amazon EC2) instances along with open-source tools such as Apache Spark, Flink, HBase, and Presto.

You get to pull all data into a data lake and analyze it with your choice of open-source distributed processing frameworks such as:

  • Apache Spark
  • Apache Hadoop
  • Apache Storm
  • Presto

By far, the most popular storage infrastructure for a data lake is Amazon S3. EMR allows you to store data in Amazon S3 and run compute as you need to process that data. EMR clusters can be launched in minutes. You don’t have to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning.

Once the processing is done, you can switch off your clusters. You can also automatically resize clusters to accommodate peaks and scale them down without impacting your Amazon S3 data lake storage.
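As a rough illustration of a cluster that shuts itself down when processing is done, the Python sketch below builds a request shaped like the one boto3's EMR client accepts in `run_job_flow`. The helper name, cluster name, release label, instance types, node count, log bucket, and role names are assumptions made for the example. The snippet only builds the dictionary and does not call AWS.

```python
def build_transient_cluster_request(name, core_nodes, log_uri):
    """Build a cluster request that terminates once its steps finish (hypothetical helper)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",              # placeholder EMR release
        "LogUri": log_uri,                         # S3 location for cluster logs
        "Applications": [{"Name": "Spark"}],       # open-source framework to install
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m6a.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m6a.xlarge",
                    "InstanceCount": core_nodes,
                },
            ],
            # False means the cluster terminates when no steps remain
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",      # assumed default instance profile name
        "ServiceRole": "EMR_DefaultRole",          # assumed default service role name
    }


cluster_request = build_transient_cluster_request(
    name="nightly-batch",
    core_nodes=4,
    log_uri="s3://placeholder-bucket/emr-logs/",
)
print(cluster_request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

In practice you would also add a `Steps` list describing the work to run, then pass the request as `boto3.client("emr").run_job_flow(**cluster_request)`. Because the data lives in Amazon S3, terminating the cluster this way does not affect the stored data.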

Additionally, you can run multiple clusters in parallel, allowing them to share the same data set. EMR will monitor your clusters, retry failed tasks, and automatically replace poorly performing instances.

If you use Amazon CloudWatch along with EMR, you can collect and track metrics, logs, and audits. This approach also allows you to set alarms and automatically react to changes.

Amazon EMR Pricing

Pricing for Amazon EMR is based on several factors, including the duration you use the service, how you deploy the EMR apps, and deployment type.

Check this out (we’ll explain):

[Image: Amazon EMR on EC2 pricing table]

This image shows how pricing for Amazon EMR on EC2 works.

In terms of duration, Amazon EMR bills per second of use, with a 60-second minimum. Rates are quoted per hour, though.

In terms of how you deploy your EMR apps, you can run Amazon EMR on either EC2 instances or AWS Fargate. Either way, you pay a separate fee for the underlying EC2 or Fargate compute on top of the hourly EMR rate.

As for deployment type, you can choose from four options:

Pricing for Amazon EMR on EC2 instances

Pricing is based on AWS Region, instance type, duration, and purchase option (On-Demand vs Reserved Instances vs Spot Instances). For example, running EMR on an m6a.xlarge EC2 instance in the US East (Ohio) Region costs $0.1728/hour for the On-Demand EC2 instance plus $0.0432/hour for EMR.
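Here is a small worked example of that arithmetic in Python. It is a sketch under stated assumptions: that $0.1728/hour is the EC2 On-Demand rate and $0.0432/hour is the EMR charge, that both are billed per second, and that the 60-second minimum applies to the combined rate. The function name is made up for illustration, and current rates should be checked on the AWS pricing pages.

```python
EC2_RATE_PER_HOUR = 0.1728   # assumed EC2 On-Demand rate for m6a.xlarge, US East (Ohio)
EMR_RATE_PER_HOUR = 0.0432   # assumed EMR charge for the same instance type
MINIMUM_SECONDS = 60         # per-second billing with a 60-second minimum


def emr_on_ec2_cost(seconds, instances=1):
    """Estimate the combined EC2 + EMR charge for a run (illustrative helper)."""
    billable_seconds = max(seconds, MINIMUM_SECONDS)
    hourly_rate = EC2_RATE_PER_HOUR + EMR_RATE_PER_HOUR
    return billable_seconds / 3600 * hourly_rate * instances


# Ten instances running for two hours
print(round(emr_on_ec2_cost(2 * 3600, instances=10), 4))

# A 30-second run on one instance is billed as 60 seconds
print(round(emr_on_ec2_cost(30), 4))
```

Under these assumptions, ten instances for two hours come to about $4.32, and a 30-second run costs the same as a 60-second one. Storage such as EBS volumes is not included.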

Pricing for Amazon EMR on EKS clusters

The service charges you based on the memory and vCPU resources you request for a Pod or a Task, measured from when the image download begins until the Pod terminates and rounded to the nearest second. There’s a 60-second minimum requirement. For example, pricing in the US East (Ohio) Region is $0.01012/vCPU/hour and $0.00111125/GB/hour.
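The Python sketch below applies those two rates to a single job. The function name and the example job size are hypothetical, and the result covers only the EMR on EKS charge; the underlying EC2 or Fargate compute is billed separately.

```python
VCPU_RATE_PER_HOUR = 0.01012        # per vCPU per hour, US East (Ohio), from the example above
MEMORY_RATE_PER_HOUR = 0.00111125   # per GB per hour, US East (Ohio), from the example above
MINIMUM_SECONDS = 60


def emr_on_eks_charge(vcpus, memory_gb, seconds):
    """Estimate the EMR on EKS charge for requested resources (illustrative helper)."""
    billable_hours = max(seconds, MINIMUM_SECONDS) / 3600
    hourly_rate = vcpus * VCPU_RATE_PER_HOUR + memory_gb * MEMORY_RATE_PER_HOUR
    return hourly_rate * billable_hours


# A job requesting 4 vCPUs and 16 GB of memory for 30 minutes
print(round(emr_on_eks_charge(vcpus=4, memory_gb=16, seconds=1800), 5))
```

For this hypothetical job the EMR charge works out to roughly $0.029, which shows why the underlying compute usually dominates the bill.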

Pricing for Amazon EMR on AWS Outposts

Pricing for Amazon EMR on AWS Outposts is the same as for EMR in the cloud.

Pricing for Amazon EMR Serverless

As a serverless service, pricing is based on the amount of compute (vCPU and memory) and storage resources your apps consume, aggregated across all your worker nodes. It is also based on the operating system you run them on.

For example, it costs $0.052624/vCPU/hour and $0.0057785/GB/hour for compute and memory, as well as $0.000111/GB/hour for any extra ephemeral storage you add to the default 20 GB.
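To see how these three rates combine, here is a short Python sketch. It assumes the 20 GB of default storage applies to each worker and that every worker runs for the same duration. The function name and the worker sizes are hypothetical.

```python
VCPU_RATE_PER_HOUR = 0.052624       # per vCPU per hour, from the example above
MEMORY_RATE_PER_HOUR = 0.0057785    # per GB of memory per hour, from the example above
STORAGE_RATE_PER_HOUR = 0.000111    # per GB of extra ephemeral storage per hour
DEFAULT_STORAGE_GB = 20             # assumed to be included per worker at no extra charge


def emr_serverless_cost(workers, vcpus, memory_gb, storage_gb, hours):
    """Estimate an EMR Serverless charge aggregated across workers (illustrative helper)."""
    extra_storage_gb = max(storage_gb - DEFAULT_STORAGE_GB, 0)
    per_worker_hourly = (
        vcpus * VCPU_RATE_PER_HOUR
        + memory_gb * MEMORY_RATE_PER_HOUR
        + extra_storage_gb * STORAGE_RATE_PER_HOUR
    )
    return workers * per_worker_hourly * hours


# Ten workers, each with 4 vCPUs, 16 GB of memory, and 50 GB of storage, for one hour
print(round(emr_serverless_cost(workers=10, vcpus=4, memory_gb=16, storage_gb=50, hours=1), 5))
```

Under these assumptions the one-hour run costs about $3.06, with the extra storage contributing only a few cents.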

Of course, you can find the latest pricing updates for Amazon EMR on the relevant AWS pricing pages.

When To Use AWS EMR

AWS EMR makes deploying distributed data processing frameworks easy and cost-effective. Furthermore, it decouples compute and storage. This allows both to grow independently, leading to better resource utilization.

In the past, users have found operating conventional data processing frameworks like Apache Spark to be quite challenging — especially when used in conjunction with other frameworks like Hadoop.

It could be complex, expensive, and time-consuming. Organizations were required to purchase and integrate hardware (servers, computers, etc.), then install and manage software. Of course, software and hardware would require constant upgrades, further adding to expenses and complexity.

Various lines of business would often timeshare centralized cluster resources. This led to under-utilization during idle periods and missed SLAs during peak periods.

As your data grew, the size of your infrastructure would grow along with it. Because storage and compute were tied together, increasing storage meant scaling expensive compute as well.

With EMR, you pay a per-second rate only for the cluster resources you use. Customer support is available 24/7 through your normal AWS support plan at a fraction of what other commercial distributed processing framework vendors would charge.

With Spot pricing, you can lower your bill by up to 90%. IDC recently found that the return on investment of EMR versus on-premises is 342% over five years.

What Are The Benefits And Limitations Of Amazon EMR?

Amazon EMR is nearly unbeatable, especially when coupled with some of Amazon’s other web-based services. Nevertheless, while its benefits may be self-evident and many, it does have its limitations. In this section of the guide, we’ll summarize some of Amazon EMR’s pros and cons.

Amazon EMR Pros:

  • Cost reduction of physical infrastructure - EMR eliminates the need for organizations to purchase and maintain physical servers. Instead, Amazon EMR charges you on a per-second basis for the features you use.
  • Time-saving - Because EMR eliminates the need to provision and configure in-house servers for big data computational tasks, it can save time for system administrators. Amazon EMR will handle most of these operational details for you. This means your company will spend less time on manual administrative tasks. Furthermore, because AWS EMR will automatically scale both compute and storage resources for you, you won’t have to spend time manually provisioning these elements.
  • Optimal resource utilization - EMR decouples storage and compute. This allows you to automatically increase and decrease Amazon Elastic Compute Cloud (EC2) instances and clusters when needed. You can then release resources as soon as you're done.
  • Excellent customer support - Amazon EMR includes 24/7 customer service as a standard.

Other benefits include fast spin-up times for EC2 instances. You can also run EMR within an Amazon Virtual Private Cloud (VPC), which allows for increased data security.

Amazon EMR Cons:

  • Complicated interface - This seems to be a recurring complaint with most AWS products. The interface can be incomprehensible for beginners. Organizations will often have to pay for training or hire certified professionals to help migrate their resources and configure Amazon EMR. Online documentation and tutorials are also quite limited. Initially, you may have to spend some time getting acquainted with the service and all its intricacies.
  • Exclusive to Amazon cloud storage - You cannot use Amazon EMR to analyze or mine data stored with other cloud storage platforms. If you are already storing your data with another cloud provider, you’ll have to move it to one of Amazon’s cloud storage or database solutions.

AWS EMR’s other limitations are service-based. For instance, Amazon EMR Studio is only available in certain regions such as East US, West US, Asia-Pacific, Canada, and EU. You can only set a single Amazon VPC with a maximum of five subnets for an EMR Studio. However, you can create multiple EMR Studios and associate them with different VPCs and subnets.

How to Really Understand Amazon EMR Costs

AWS EMR can help you change your rigid in-house cluster infrastructure and provide you with hassle-free Hadoop management. It can also significantly cut the time of data processing. However, as with most AWS products, its pricing can be a little incomprehensible.

Amazon charges you a per-second rate that is also tied to the number of clusters you are running. In addition, you’ll need to pay for the EC2 instances and Amazon Elastic Block Store (EBS) volumes. If you’re running a large relational database, you’ll need to consider the cost of using the AWS Database Migration Service to move and host your data.

This is just the tip of the iceberg. To get the most out of EMR, you’ll likely need to employ a host of other AWS tools such as CloudWatch and S3 (for logs). Tracking and managing these costs can be quite daunting. It’s different when you use CloudZero.

How CloudZero Can Help You

With CloudZero, however, you gain complete insight into your AWS cloud spend. CloudZero’s cost intelligence platform maps costs to your products, features, services, dev teams, and more. For example, you’ll see your cost per individual customer, per product feature, per service, per environment and more.

[Image: CloudZero Dimensions]

CloudZero also automatically detects cost issues in real time. You’ll then receive context-rich alerts via Slack so you can stop the bleeding before it runs for days or weeks. This ensures you catch potential overspending before it hurts your COGS and margins.

With cloud cost intelligence, you’ll be able to drill into cost data from a high level down to the individual components that drive your cloud spend — and see exactly how services drive your cloud costs and why.

That means you’ll know exactly who and what is changing your cloud costs, and why, across AWS, Azure, GCP, Kubernetes, Snowflake, Datadog, and more, all from one platform.

Drift has saved over $3 million using CloudZero. Demandbase cut its annual AWS spend by 36%, justifying $175 million in financing. Here’s your chance to control your Amazon EMR costs. Schedule a demo today to see CloudZero in action for yourself. It’s on us at no risk to you.

Author: Cody Slingerland

Cody Slingerland, a FinOps certified practitioner, is an avid content creator with over 10 years of experience creating content for SaaS and technology companies. Cody collaborates with internal team members and subject matter experts to create expert-written content on the CloudZero blog.
