Amazon EMR Tutorial
Complete guide to Amazon Elastic MapReduce - architecture, deployment options, storage, and workflow.
Overview
Amazon EMR (formerly Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark.
What is EMR?
EMR is a managed service that helps you:
- Set up clusters to process and analyze data within minutes
- Avoid downloading, configuring, and installing big data components manually
- Deploy EMR clusters quickly and get started faster
Where Does EMR Fit?
EMR fits into the processing side of the data analytics pipeline:
Data Ingestion → Data Storage → **Data Processing (EMR)** → Data Analytics → Visualization
Once data has been collected and stored, EMR is the service that handles the processing stage.
Why Use Amazon EMR?
The Challenge with Big Data Frameworks
When you manage big data frameworks such as Hadoop, Spark, and Hive yourself, you face several challenges:
- Multiple frameworks and components to maintain
- Compatibility issues between different versions
- Patching and updates management
- Support and maintenance overhead
Benefits of Amazon EMR
- Reduced Admin Time
  - Less time required to manage or support Hadoop clusters
  - AWS handles infrastructure management
- No Upfront Costs
  - No need to purchase hardware and software upfront
  - Pay only for what you use
- Cost Savings
  - Save on operating costs
  - Eliminate data center costs
  - No power and cooling costs
- Business Value
  - Reduce cost of delays
  - Mitigate risks
  - Faster time to market
- Governance
  - Built-in governance capabilities
  - Security and compliance features
EMR Architecture
Core Concept: Clusters
In EMR, everything revolves around the concept of a cluster.
A cluster is a group of Amazon EC2 instances working together with different roles but functioning as a team.
Cluster Components
An EMR cluster consists of three types of nodes:
1. Primary Node (Master Node)
Role: Management and coordination
Characteristics:
- Minimum requirement: At least one primary node is required
- A cluster can be as small as one node (single-node cluster)
- Acts as the leader and coordinator
Responsibilities:
- Manages all software components on the cluster
- Coordinates the distribution of tasks and the MapReduce logic
- Distributes work, combines results, and delivers the output
- Runs the YARN ResourceManager (Yet Another Resource Negotiator), which manages resources for applications
Behind the scenes: Runs as an EC2 instance
2. Core Nodes
Role: Data storage and computation
Characteristics:
- Follower nodes that work under the primary node
- Have storage associated with them
- Run DataNode daemon for Hadoop Distributed File System (HDFS)
Responsibilities:
- Coordinate data storage as part of HDFS
- Store data in a distributed fashion with multiple copies
- Provide fault tolerance (if one node fails, data is still available)
- Run the YARN NodeManager daemon (known as the TaskTracker in pre-YARN Hadoop)
- Perform parallel computation tasks
- Execute work distributed by the primary node
Storage: Core nodes provide storage for HDFS, where data is stored with replication for reliability.
3. Task Nodes (Optional)
Role: Additional processing power
Characteristics:
- Optional - not mandatory for cluster operation
- Workhorses for computation
- Do NOT store data in HDFS
- Can be added or removed dynamically
Responsibilities:
- Perform calculations and processing
- Receive work broken down into smaller pieces by the primary node
- Execute parallel computations
- Aggregate results and deliver them back
Use Cases:
- When you have huge workloads requiring more processing power
- When you need more processing power but don't need additional storage
- For temporary processing needs
Best Practices:
- Task nodes can join and leave the cluster as needed
- Commonly use Spot Instances for cost optimization
- Can run only for the duration needed, then be terminated
- Primary and core nodes typically run longer (core nodes store data)
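As a hedged sketch of this pattern, the boto3 snippet below adds a Spot-backed task instance group to a running cluster; the cluster ID, group name, instance type, and count are placeholder values, not prescribed settings.

```python
import boto3

emr = boto3.client("emr")

# Add a Spot-backed task instance group to an existing cluster.
# "j-XXXXXXXXXXXXX" is a placeholder cluster (job flow) ID.
response = emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {
            "Name": "spot-task-group",
            "InstanceRole": "TASK",     # task nodes: compute only, no HDFS
            "Market": "SPOT",           # Spot pricing for cost optimization
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
        }
    ],
)
print(response["InstanceGroupIds"])
```

Because task nodes hold no HDFS data, this group can later be resized or removed without risking data loss.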
Node Type Summary
| Node Type | Required | Storage | Processing | Use Case |
|---|---|---|---|---|
| Primary Node | Yes | No | Yes (coordination) | Management & coordination |
| Core Node | Yes | Yes (HDFS) | Yes | Storage & computation |
| Task Node | Optional | No | Yes | Additional processing power |
How EMR Works
- EMR installs different components/daemons on each node type
- Each node gets a specific role in distributed applications like Apache Hadoop
- Primary node distributes work to core and task nodes
- Core nodes store data and perform computation
- Task nodes provide additional processing power
- Results are aggregated and returned
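To make the node roles concrete, here is a minimal boto3 sketch that launches a cluster with one instance group per role. The release label, instance types, counts, log bucket, and IAM role names (EMR's documented defaults) are illustrative assumptions, not recommendations:

```python
import boto3

emr = boto3.client("emr")

# Launch a cluster with one instance group per node role.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",            # placeholder log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",   # coordination
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",        # HDFS + compute
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK",        # compute only
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # stay up until terminated
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # cluster (job flow) ID, e.g. "j-..."
```

The returned job flow ID identifies the cluster in later calls such as adding steps or terminating it.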
EMR Deployment Options
Amazon EMR provides three flexible deployment options:
1. EMR on EC2 Instances
Description: Run EMR workloads on EC2 instances
Characteristics:
- Maximum flexibility - you control everything
- You decide:
  - Instance sizes
  - Number of instances
  - Settings and configurations
- Full control over deployment
Best For:
- Continuously running workloads
- Long-running clusters
- Higher processing power requirements
- When you need full control over infrastructure
Use Case: Production workloads that need to run continuously
2. EMR on EKS (Elastic Kubernetes Service)
Description: Run EMR on containerized platform using Amazon EKS
Characteristics:
- Integrates with Amazon EKS
- Automates provisioning, management, and scaling
- Runs on Kubernetes cluster
- Containerized deployment
Best For:
- Organizations already using Kubernetes
- Containerized applications
- When you want to leverage existing EKS infrastructure
- Microservices architecture
Use Case: When you have containerized applications and want to perform processing on EKS
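For illustration, a minimal boto3 sketch of submitting a Spark job through EMR on EKS follows; the virtual cluster ID, execution role ARN, release label, and script location are placeholders that assume a virtual cluster has already been registered against an EKS namespace:

```python
import boto3

emr_containers = boto3.client("emr-containers")

# Submit a Spark job to a virtual cluster mapped onto an EKS namespace.
response = emr_containers.start_job_run(
    name="spark-on-eks-demo",
    virtualClusterId="abcdef123456",                  # placeholder ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/job.py",  # placeholder
        }
    },
)
print(response["id"])  # job run ID
```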
3. EMR Serverless
Description: Simplest way to run EMR workloads without managing infrastructure
Characteristics:
- Easiest and simplest deployment option
- No infrastructure management required
- AWS handles availability and scaling
- Submit job → Process → Get results
- No need to worry about behind-the-scenes infrastructure
How It Works:
- Submit your job and specify the required components
- EMR Serverless provisions capacity and processes the job
- You receive the results, with no infrastructure to manage
Best For:
- Ad-hoc processing jobs
- When you don't want to manage infrastructure
- Quick processing tasks
- Cost-effective for intermittent workloads
Use Case: One-time jobs or periodic processing without maintaining clusters
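The boto3 sketch below shows this flow under assumed placeholder names, role ARN, and script path: the application is created once, and jobs are then submitted against it.

```python
import boto3

serverless = boto3.client("emr-serverless")

# Create a Spark application once; AWS handles capacity and scaling.
app = serverless.create_application(
    name="demo-app", releaseLabel="emr-6.15.0", type="SPARK"
)

# Submit a job run against the application.
job = serverless.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/job.py"}
    },
)
print(job["jobRunId"])
```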
Deployment Option Comparison
| Option | Control Level | Management | Best For |
|---|---|---|---|
| EMR on EC2 | Full control | You manage | Continuous, long-running workloads |
| EMR on EKS | Container control | Kubernetes managed | Containerized environments |
| EMR Serverless | No infrastructure | Fully managed | Ad-hoc, intermittent jobs |
Transient vs Long-Running EMR Clusters
When deploying EMR clusters, you need to decide whether your cluster should be transient (terminate after job completion) or long-running (stay up continuously). This is a critical decision that affects cost, operations, and architecture.
Transient EMR Clusters
What is a Transient Cluster?
A transient cluster is an EMR cluster that automatically terminates after completing its assigned work (steps).
How It Works:
- Cluster starts and runs bootstrap actions
- Executes specified steps (jobs)
- Automatically terminates when the last step completes
- EC2 instances are terminated automatically
Characteristics:
- ✅ Cost-effective - Pay only for the time needed to process data
- ✅ Automatic cleanup - No manual termination required
- ✅ Ideal for batch processing - Periodic tasks like daily data runs
- ✅ No idle time costs - Cluster doesn't stay running when not needed
- ✅ Default behavior - Clusters launched via EMR API have this enabled by default
Configuration:
Via AWS Console:
- Select "Terminate cluster after last step completes" checkbox
- Located under "Cluster termination" section when creating cluster
Via EMR API:
- Default behavior for clusters launched with EMR API
- Step execution is enabled by default
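A minimal boto3 sketch of a transient cluster follows: KeepJobFlowAliveWhenNoSteps is set to False and a step is supplied at launch, so the cluster terminates itself after the step finishes. Bucket, script, instance, and role values are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster: it runs the step, then terminates itself.
response = emr.run_job_flow(
    Name="transient-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after last step
    },
    Steps=[{
        "Name": "daily-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```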
Use Cases:
- ✅ Daily batch processing - ETL jobs that run on schedule
- ✅ Periodic data transformations - Weekly/monthly data processing
- ✅ One-time data processing - Ad-hoc analysis jobs
- ✅ Cost-sensitive workloads - When minimizing costs is critical
- ✅ Workflows with defined steps - Jobs with clear start and end
Long-Running EMR Clusters
What is a Long-Running Cluster?
A long-running cluster is an EMR cluster that stays active continuously, waiting for jobs to be submitted. It doesn't automatically terminate after completing steps.
Characteristics:
- ✅ Always available - Ready to process jobs immediately
- ✅ Faster job startup - No cluster provisioning delay
- ✅ Interactive workloads - Supports notebooks, ad-hoc queries
- ✅ Shared resources - Multiple users/teams can use the same cluster
- ⚠️ Ongoing costs - Pay for cluster even when idle
- ⚠️ Manual management - Need to manually terminate when done
Use Cases:
- ✅ Interactive analytics - Ad-hoc queries and exploration
- ✅ EMR Notebooks - Jupyter notebook-based development
- ✅ Continuous data processing - Real-time or near-real-time workloads
- ✅ Shared development environment - Multiple data engineers/analysts
- ✅ Low-latency requirements - When job startup time matters
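For example, submitting an additional step to an already-running cluster might look like the boto3 sketch below (the cluster ID and script path are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Submit a step to a long-running cluster without relaunching anything.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",         # placeholder cluster ID
    Steps=[{
        "Name": "ad-hoc-analysis",
        "ActionOnFailure": "CONTINUE",   # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/analysis.py"],
        },
    }],
)
print(response["StepIds"])
```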
Auto-Termination Policy (Idle-Based)
What is Auto-Termination Policy?
An auto-termination policy automatically terminates a cluster after it has been idle for a specified period. This is different from step-based termination and is useful for long-running clusters that may have idle periods.
How It Works:
- The cluster monitors its own activity
- If the cluster is idle for the specified duration, it terminates automatically
- No manual monitoring is required
Cluster is Considered Idle When:
- ✅ No active YARN applications
- ✅ HDFS utilization is below 10%
- ✅ No active EMR Notebook or EMR Studio connections
- ✅ No on-cluster application user interfaces in use
- ✅ No pending steps
Availability:
- Available in EMR versions 5.30.0 and later
- Available in most major AWS regions
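Attaching an idle-based policy to an existing cluster is a single boto3 call, sketched below with a placeholder cluster ID and a one-hour idle timeout:

```python
import boto3

emr = boto3.client("emr")

# Attach an idle-based auto-termination policy to an existing cluster.
# IdleTimeout is in seconds (3600 = 1 hour).
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",                 # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},
)
```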
Transient vs Long-Running Comparison
| Feature | Transient Cluster | Long-Running Cluster |
|---|---|---|
| Termination | Automatic after last step | Manual termination required |
| Cost | ✅ Pay only during processing | ⚠️ Pay even when idle |
| Startup Time | ⚠️ Cluster provisioning delay | ✅ Immediate job execution |
| Use Case | Batch/periodic processing | Interactive/continuous workloads |
| Best For | Scheduled jobs, cost optimization | Interactive analysis, notebooks |
EMR Storage Options
When processing big data workloads with EMR, you need to understand where data is stored before and after processing. EMR provides different storage options, each with specific use cases.
This section focuses on EMR on EC2 storage options.
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is the default file system that comes automatically with Apache Hadoop clusters. When you set up an EMR cluster on EC2 with core nodes, you automatically get HDFS.
How It Works:
- Local storage on core nodes is combined to create HDFS
- Distributed, scalable, and portable file system for Hadoop
- Data is replicated across different nodes for fault tolerance
Advantages of HDFS
- Fast Performance
  - Data is stored within the core nodes themselves
  - Local storage access is faster than network storage
- Data Awareness
  - The primary node is aware of where data is stored
  - Can distribute MapReduce jobs efficiently
  - Jobs can be processed from local storage when possible
  - Enables data locality optimization
- Suitable for Caching
  - Excellent for caching results from intermediate job flow steps
  - Useful when jobs have multiple steps (aggregation, filtering, ETL, etc.)
- Iterative Workloads
  - Perfect for iterative reads on the same dataset
  - Ideal for disk I/O intensive workloads
Disadvantages of HDFS
- Ephemeral Storage
  - Data is reclaimed when the cluster terminates
  - If you store data on HDFS and terminate the cluster, the data is lost
  - Not persistent for long-term storage
2. EMRFS (EMR File System) - S3 Storage
What is EMRFS?
EMRFS (EMR File System) is a connector provided by AWS that allows Hadoop to use Amazon S3 as its storage layer, enabling Hadoop applications to read from and write to S3 directly.
How It Works:
- Uses Amazon S3 as the underlying storage
- S3 is durable storage spread across multiple Availability Zones
- Provides durability and availability
- Decouples storage from compute
Advantages of EMRFS (S3)
- Durability and Availability
  - S3 is durable storage spread across multiple Availability Zones
  - High availability and reliability
- Cost-Effective
  - S3 is considerably cheaper than EC2 instance storage
  - Pay only for the storage you use
- Decoupled Storage and Compute
  - Isolates compute from storage
  - Lets you scale them independently
  - More efficient resource utilization
- Persistent Storage
  - Data persists after cluster shutdown
  - Can retain data for long-term storage
- Multi-Cluster Access
  - Data is available to multiple clusters
  - One cluster can use the data while other clusters access it as well
  - No need to duplicate data across clusters
- Single-Read Workloads
  - Perfect for workloads that read data once per run
  - Ideal for ETL pipelines
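To show EMRFS in use, here is a minimal PySpark sketch that reads input from S3 and writes results back to S3; the bucket, prefixes, and the category column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# On an EMR cluster, s3:// paths are served by EMRFS, so Spark can read
# input from S3 and persist output there. Paths below are placeholders.
spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")    # read once per run
result = df.groupBy("category").count()            # example transformation
result.write.mode("overwrite").parquet("s3://my-bucket/output/")
```

Because the output lands in S3, it survives cluster termination and remains readable by other clusters.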
Storage Option Comparison
| Feature | HDFS | EMRFS (S3) |
|---|---|---|
| Storage Location | Local storage on core nodes | Amazon S3 |
| Persistence | Ephemeral | Persistent |
| Performance | Fast (local) | Slower (network) |
| Cost | Included with EC2 | Cost-effective |
| Multi-Cluster Access | No | Yes |
| Best For | Caching, iterative workloads | Long-term storage, multi-cluster |
Getting Started with EMR: Practical Workflow
This section covers the practical workflow for getting started with Amazon EMR, including planning, cluster creation, and job submission.
EMR Workflow Overview
When starting with EMR workload execution, follow this workflow:
1. Plan Your Workload
   - Determine storage requirements (S3 vs HDFS)
   - Choose a big data framework (Spark, Hive, Pig, etc.)
   - Develop applications and scripts
   - Select hardware configuration
   - Configure networking and security
2. Launch Cluster
   - Create and configure the EMR cluster
   - Set up instance types and counts
   - Configure applications and components
3. Connect to Cluster
   - Use the AWS Console, SSH, or other methods
   - Access cluster resources
4. Submit Work
   - Submit steps/jobs to the cluster
   - Monitor execution
5. View Results
   - Check output in S3 or HDFS
   - Analyze results
6. Optional: Monitor, Troubleshoot, Scale
   - Monitor cluster performance
   - Troubleshoot issues
   - Scale as needed
7. Terminate Cluster
   - Clean up resources
   - Avoid unnecessary costs
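The tail end of this workflow (submit work, wait for results, terminate) can be scripted with boto3 as sketched below; the cluster ID and script path are placeholders carried over from a prior launch:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder: ID returned at launch

# Submit Work: add a step to the cluster.
step_ids = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/wordcount.py"],
        },
    }],
)["StepIds"]

# View Results: block until the step completes (output lands in S3/HDFS).
emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_ids[0])
state = emr.describe_step(
    ClusterId=cluster_id, StepId=step_ids[0]
)["Step"]["Status"]["State"]
print("Step finished with state:", state)

# Terminate Cluster: clean up to avoid unnecessary costs.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```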
Key Configuration Decisions
| Configuration | Options | Recommendation |
|---|---|---|
| Instance Groups | One instance type per node group | Simple workloads |
| Instance Fleets | Multiple instance types per node group | Spot Instances, cost optimization |
| Scaling | EMR Managed, Custom, Manual | EMR Managed for most cases |
| Spot Instances | Task nodes only | Use for cost savings |
| Termination | Auto-terminate after steps | Enable for cost optimization |
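As a sketch of the instance-fleet option, the boto3 snippet below launches a cluster whose task capacity runs entirely on Spot across multiple instance types; all names, types, and capacities are placeholder choices:

```python
import boto3

emr = boto3.client("emr")

# Instance fleets let one node role mix several instance types and blend
# On-Demand with Spot capacity.
response = emr.run_job_flow(
    Name="fleet-demo",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceFleets": [
            {"Name": "primary", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"Name": "core", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"},
                                     {"InstanceType": "m5a.xlarge"}]},
            {"Name": "task", "InstanceFleetType": "TASK",
             "TargetSpotCapacity": 4,   # task capacity entirely on Spot
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"},
                                     {"InstanceType": "r5.xlarge"}]},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Offering several instance types per fleet gives EMR more Spot pools to draw from, which reduces the chance of unfilled capacity.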
Summary
In this tutorial, you learned about Amazon EMR, its architecture, deployment options, storage choices, and practical workflow for getting started.
Key Takeaways
- EMR simplifies big data processing by managing Hadoop/Spark clusters
- Clusters consist of three node types: Primary (required), Core (required), Task (optional)
- Three deployment options: EC2 (flexible), EKS (containerized), Serverless (simplest)
- Use Spot Instances for task nodes to optimize costs
- Choose deployment based on: Control needs, workload type, and infrastructure preferences
- Two cluster lifecycle models:
- Transient clusters: Auto-terminate after last step (cost-effective for batch jobs)
- Long-running clusters: Stay active continuously (for interactive workloads)
- Two main storage options:
- HDFS: Fast, ephemeral, best for caching and iterative workloads
- EMRFS (S3): Persistent, cost-effective, best for long-term storage and multi-cluster access
- Storage strategy: Use HDFS for intermediate results, S3 for input/output and persistence
- Decouple storage from compute using S3 to scale independently and reduce costs
- Transient clusters are default for EMR API launches and ideal for cost optimization
- EMR workflow: Plan → Launch → Connect → Submit → View Results → Terminate
- Instance Groups vs Fleets: Use Groups for simplicity, Fleets for Spot Instances and flexibility
- IAM roles: EMR Service Role (service access) + Instance Profile (node access to S3/AWS)
- Always enable logging to S3 for troubleshooting and optimization
Related AWS Services
- Amazon EC2: Compute instances for EMR clusters
- Amazon S3: Data storage (often used with EMR)
- Amazon EKS: Kubernetes service for containerized EMR
- AWS Glue: Alternative ETL service (serverless option)
- Amazon Athena: Query data processed by EMR
Last Updated: January 2025

