Amazon EMR Tutorial
Complete guide to Amazon Elastic MapReduce - architecture, deployment options, storage, and workflow.
Overview
Amazon EMR (formerly Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark.
What is EMR?
EMR is a managed service that helps you:
- Set up clusters to process and analyze data within minutes
- Avoid downloading, configuring, and installing big data components manually
- Deploy EMR clusters quickly and get started faster
Where Does EMR Fit?
EMR fits into the processing side of the data analytics pipeline:
Data Ingestion → Data Storage → **Data Processing (EMR)** → Data Analytics → Visualization
Once data has been collected and stored, EMR is the service that handles the processing stage.
Why Use Amazon EMR?
The Challenge with Big Data Frameworks
When you manage big data frameworks such as Hadoop, Spark, and Hive yourself, you face several challenges:
- Multiple frameworks and components to maintain
- Compatibility issues between different versions
- Patching and updates management
- Support and maintenance overhead
Benefits of Amazon EMR
- Reduced Admin Time
  - Less time required to manage or support Hadoop clusters
  - AWS handles infrastructure management
- No Upfront Costs
  - No need to purchase hardware and software upfront
  - Pay only for what you use
- Cost Savings
  - Save on operating costs
  - Eliminate data center costs
  - No power and cooling costs
- Business Value
  - Reduce cost of delays
  - Mitigate risks
  - Faster time to market
- Governance
  - Built-in governance capabilities
  - Security and compliance features
EMR Architecture
Core Concept: Clusters
In EMR, everything revolves around the concept of a cluster.
A cluster is a group of Amazon EC2 instances working together with different roles but functioning as a team.
Cluster Components
An EMR cluster consists of three types of nodes:
1. Primary Node (Master Node)
Role: Management and coordination
Characteristics:
- Minimum requirement: At least one primary node is required
- A cluster can be as small as one node (single-node cluster)
- Acts as the leader and coordinator
Responsibilities:
- Manages all software components on the cluster
- Coordinates the distribution of tasks and the MapReduce logic
- Distributes work, combines results, and delivers the output
- Runs the YARN ResourceManager (Yet Another Resource Negotiator), which manages resources for applications
Behind the scenes: Runs as an EC2 instance
2. Core Nodes
Role: Data storage and computation
Characteristics:
- Follower nodes that work under the primary node
- Have storage associated with them
- Run DataNode daemon for Hadoop Distributed File System (HDFS)
Responsibilities:
- Coordinate data storage as part of HDFS
- Store data in a distributed fashion with multiple copies
- Provide fault tolerance (if one node fails, data is still available)
- Run the YARN NodeManager daemon (known as the TaskTracker in pre-YARN Hadoop)
- Perform parallel computation tasks
- Execute work distributed by the primary node
Storage: Core nodes provide storage for HDFS, where data is stored with replication for reliability.
3. Task Nodes (Optional)
Role: Additional processing power
Characteristics:
- Optional - not mandatory for cluster operation
- Workhorses for computation
- Do NOT store data in HDFS
- Can be added or removed dynamically
Responsibilities:
- Perform calculations and processing
- Receive work broken down into smaller pieces by the primary node
- Execute parallel computations
- Aggregate results and deliver them back
Use Cases:
- When you have huge workloads requiring more processing power
- When you need more processing power but don't need additional storage
- For temporary processing needs
Best Practices:
- Task nodes can join and leave the cluster as needed
- Commonly use Spot Instances for cost optimization
- Can run only for the duration needed, then be terminated
- Primary and core nodes typically run longer (core nodes store data)
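As a hedged sketch of this pattern, the boto3 snippet below adds a Spot-backed task instance group to a running cluster; the cluster ID, group name, instance type, and count are placeholder values, not prescribed settings.

```python
import boto3

emr = boto3.client("emr")

# Add a Spot-backed task instance group to an existing cluster.
# "j-XXXXXXXXXXXXX" is a placeholder cluster (job flow) ID.
response = emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {
            "Name": "spot-task-group",
            "InstanceRole": "TASK",     # task nodes: compute only, no HDFS
            "Market": "SPOT",           # Spot pricing for cost optimization
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
        }
    ],
)
print(response["InstanceGroupIds"])
```

Because task nodes hold no HDFS data, this group can later be resized or removed without risking data loss.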
Node Type Summary
| Node Type | Required | Storage | Processing | Use Case |
|---|---|---|---|---|
| Primary Node | Yes | No | Yes (coordination) | Management & coordination |
| Core Node | Yes | Yes (HDFS) | Yes | Storage & computation |
| Task Node | Optional | No | Yes | Additional processing power |
How EMR Works
- EMR installs different components/daemons on each node type
- Each node gets a specific role in distributed applications like Apache Hadoop
- Primary node distributes work to core and task nodes
- Core nodes store data and perform computation
- Task nodes provide additional processing power
- Results are aggregated and returned
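To make the node roles concrete, here is a minimal boto3 sketch that launches a cluster with one instance group per role. The release label, instance types, counts, log bucket, and IAM role names (EMR's documented defaults) are illustrative assumptions, not recommendations:

```python
import boto3

emr = boto3.client("emr")

# Launch a cluster with one instance group per node role.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",            # placeholder log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",   # coordination
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",        # HDFS + compute
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK",        # compute only
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # stay up until terminated
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # cluster (job flow) ID, e.g. "j-..."
```

The returned job flow ID identifies the cluster in later calls such as adding steps or terminating it.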
EMR Deployment Options
Amazon EMR provides three flexible deployment options:
1. EMR on EC2 Instances
Description: Run EMR workloads on EC2 instances
Characteristics:
- Maximum flexibility - you control everything
- You decide:
  - Instance sizes
  - Number of instances
  - Settings and configurations
- Full control over deployment
Best For:
- Continuously running workloads
- Long-running clusters
- Higher processing power requirements
- When you need full control over infrastructure
Use Case: Production workloads that need to run continuously
2. EMR on EKS (Elastic Kubernetes Service)
Description: Run EMR on containerized platform using Amazon EKS
Characteristics:
- Integrates with Amazon EKS
- Automates provisioning, management, and scaling
- Runs on Kubernetes cluster
- Containerized deployment
Best For:
- Organizations already using Kubernetes
- Containerized applications
- When you want to leverage existing EKS infrastructure
- Microservices architecture
Use Case: When you have containerized applications and want to perform processing on EKS
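For illustration, a minimal boto3 sketch of submitting a Spark job through EMR on EKS follows; the virtual cluster ID, execution role ARN, release label, and script location are placeholders that assume a virtual cluster has already been registered against an EKS namespace:

```python
import boto3

emr_containers = boto3.client("emr-containers")

# Submit a Spark job to a virtual cluster mapped onto an EKS namespace.
response = emr_containers.start_job_run(
    name="spark-on-eks-demo",
    virtualClusterId="abcdef123456",                  # placeholder ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/job.py",  # placeholder
        }
    },
)
print(response["id"])  # job run ID
```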
3. EMR Serverless
Description: Simplest way to run EMR workloads without managing infrastructure
Characteristics:
- Easiest and simplest deployment option
- No infrastructure management required
- AWS handles availability and scaling
- Submit job → Process → Get results
- No need to worry about behind-the-scenes infrastructure
How It Works:
- Submit your job and specify the required components
- EMR Serverless provisions capacity and processes the job
- You receive the results, with no infrastructure to manage
Best For:
- Ad-hoc processing jobs
- When you don't want to manage infrastructure
- Quick processing tasks
- Cost-effective for intermittent workloads
Use Case: One-time jobs or periodic processing without maintaining clusters
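The boto3 sketch below shows this flow under assumed placeholder names, role ARN, and script path: the application is created once, and jobs are then submitted against it.

```python
import boto3

serverless = boto3.client("emr-serverless")

# Create a Spark application once; AWS handles capacity and scaling.
app = serverless.create_application(
    name="demo-app", releaseLabel="emr-6.15.0", type="SPARK"
)

# Submit a job run against the application.
job = serverless.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {"entryPoint": "s3://my-bucket/scripts/job.py"}
    },
)
print(job["jobRunId"])
```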
Deployment Option Comparison
| Option | Control Level | Management | Best For |
|---|---|---|---|
| EMR on EC2 | Full control | You manage | Continuous, long-running workloads |
| EMR on EKS | Container control | Kubernetes managed | Containerized environments |
| EMR Serverless | No infrastructure | Fully managed | Ad-hoc, intermittent jobs |
Transient vs Long-Running EMR Clusters
When deploying EMR clusters, you need to decide whether your cluster should be transient (terminate after job completion) or long-running (stay up continuously). This is a critical decision that affects cost, operations, and architecture.
Transient EMR Clusters
What is a Transient Cluster?
A transient cluster is an EMR cluster that automatically terminates after completing its assigned work (steps).
How It Works:
- Cluster starts and runs bootstrap actions
- Executes specified steps (jobs)
- Automatically terminates when the last step completes
- EC2 instances are terminated automatically
Characteristics:
- ✅ Cost-effective - Pay only for the time needed to process data
- ✅ Automatic cleanup - No manual termination required
- ✅ Ideal for batch processing - Periodic tasks like daily data runs
- ✅ No idle time costs - Cluster doesn't stay running when not needed
- ✅ Default behavior - Clusters launched via EMR API have this enabled by default
Configuration:
Via AWS Console:
- Select "Terminate cluster after last step completes" checkbox
- Located under "Cluster termination" section when creating cluster
Via EMR API:
- Default behavior for clusters launched with EMR API
- Step execution is enabled by default
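A minimal boto3 sketch of a transient cluster follows: KeepJobFlowAliveWhenNoSteps is set to False and a step is supplied at launch, so the cluster terminates itself after the step finishes. Bucket, script, instance, and role values are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster: it runs the step, then terminates itself.
response = emr.run_job_flow(
    Name="transient-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after last step
    },
    Steps=[{
        "Name": "daily-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```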
Use Cases:
- ✅ Daily batch processing - ETL jobs that run on schedule
- ✅ Periodic data transformations - Weekly/monthly data processing
- ✅ One-time data processing - Ad-hoc analysis jobs
- ✅ Cost-sensitive workloads - When minimizing costs is critical
- ✅ Workflows with defined steps - Jobs with clear start and end
Long-Running EMR Clusters
What is a Long-Running Cluster?
A long-running cluster is an EMR cluster that stays active continuously, waiting for jobs to be submitted. It doesn't automatically terminate after completing steps.
Characteristics:
- ✅ Always available - Ready to process jobs immediately
- ✅ Faster job startup - No cluster provisioning delay
- ✅ Interactive workloads - Supports notebooks, ad-hoc queries
- ✅ Shared resources - Multiple users/teams can use the same cluster
- ⚠️ Ongoing costs - Pay for cluster even when idle
- ⚠️ Manual management - Need to manually terminate when done
Use Cases:
- ✅ Interactive analytics - Ad-hoc queries and exploration
- ✅ EMR Notebooks - Jupyter notebook-based development
- ✅ Continuous data processing - Real-time or near-real-time workloads
- ✅ Shared development environment - Multiple data engineers/analysts
- ✅ Low-latency requirements - When job startup time matters
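For example, submitting an additional step to an already-running cluster might look like the boto3 sketch below (the cluster ID and script path are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Submit a step to a long-running cluster without relaunching anything.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",         # placeholder cluster ID
    Steps=[{
        "Name": "ad-hoc-analysis",
        "ActionOnFailure": "CONTINUE",   # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/analysis.py"],
        },
    }],
)
print(response["StepIds"])
```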
Auto-Termination Policy (Idle-Based)
What is Auto-Termination Policy?
An auto-termination policy automatically terminates a cluster after it has been idle for a specified period. This is different from step-based termination and is useful for long-running clusters that may have idle periods.
How It Works:
- The cluster monitors its own activity
- If the cluster is idle for the specified duration, it terminates automatically
- No manual monitoring is required
Cluster is Considered Idle When:
- ✅ No active YARN applications
- ✅ HDFS utilization is below 10%
- ✅ No active EMR Notebook or EMR Studio connections
- ✅ No on-cluster application user interfaces in use
- ✅ No pending steps
Availability:
- Available in EMR versions 5.30.0 and later
- Available in most major AWS regions
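Attaching an idle-based policy to an existing cluster is a single boto3 call, sketched below with a placeholder cluster ID and a one-hour idle timeout:

```python
import boto3

emr = boto3.client("emr")

# Attach an idle-based auto-termination policy to an existing cluster.
# IdleTimeout is in seconds (3600 = 1 hour).
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",                 # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},
)
```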
Transient vs Long-Running Comparison
| Feature | Transient Cluster | Long-Running Cluster |
|---|---|---|
| Termination | Automatic after last step | Manual termination required |
| Cost | ✅ Pay only during processing | ⚠️ Pay even when idle |
| Startup Time | ⚠️ Cluster provisioning delay | ✅ Immediate job execution |
| Use Case | Batch/periodic processing | Interactive/continuous workloads |
| Best For | Scheduled jobs, cost optimization | Interactive analysis, notebooks |
EMR Storage Options
When processing big data workloads with EMR, you need to understand where data is stored before and after processing. EMR provides different storage options, each with specific use cases.
This section focuses on EMR on EC2 storage options.
1. HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS is the default file system that comes automatically with Apache Hadoop clusters. When you set up an EMR cluster on EC2 with core nodes, you automatically get HDFS.
How It Works:
- Local storage on core nodes is combined to create HDFS
- Distributed, scalable, and portable file system for Hadoop
- Data is replicated across different nodes for fault tolerance
Advantages of HDFS
- Fast Performance
  - Data is stored within the core nodes themselves
  - Local storage access is faster than network storage
- Data Awareness
  - The primary node is aware of where data is stored
  - Can distribute MapReduce jobs efficiently
  - Jobs can be processed from local storage when possible
  - Enables data locality optimization
- Suitable for Caching
  - Excellent for caching results from intermediate job flow steps
  - Useful when jobs have multiple steps (aggregation, filtering, ETL, etc.)
- Iterative Workloads
  - Perfect for iterative reads on the same dataset
  - Ideal for disk I/O intensive workloads
Disadvantages of HDFS
- Ephemeral Storage
  - Data is reclaimed when the cluster terminates
  - If you store data on HDFS and terminate the cluster, the data is lost
  - Not persistent for long-term storage
2. EMRFS (EMR File System) - S3 Storage
What is EMRFS?
EMRFS (EMR File System) is a connector provided by AWS that allows Hadoop to use Amazon S3 as its storage layer, enabling Hadoop applications to read from and write to S3 directly.
How It Works:
- Uses Amazon S3 as the underlying storage
- S3 is durable storage spread across multiple Availability Zones
- Provides durability and availability
- Decouples storage from compute
Advantages of EMRFS (S3)
- Durability and Availability
  - S3 is durable storage spread across multiple Availability Zones
  - High availability and reliability
- Cost-Effective
  - S3 is considerably cheaper than EC2 instance storage
  - Pay only for the storage you use
- Decoupled Storage and Compute
  - Isolates compute from storage
  - Lets you scale them independently
  - More efficient resource utilization
- Persistent Storage
  - Data persists after cluster shutdown
  - Can retain data for long-term storage
- Multi-Cluster Access
  - Data is available to multiple clusters
  - One cluster can use the data while other clusters access it as well
  - No need to duplicate data across clusters
- Single-Read Workloads
  - Perfect for workloads that read data once per run
  - Ideal for ETL pipelines
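To show EMRFS in use, here is a minimal PySpark sketch that reads input from S3 and writes results back to S3; the bucket, prefixes, and the category column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# On an EMR cluster, s3:// paths are served by EMRFS, so Spark can read
# input from S3 and persist output there. Paths below are placeholders.
spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

df = spark.read.parquet("s3://my-bucket/input/")    # read once per run
result = df.groupBy("category").count()            # example transformation
result.write.mode("overwrite").parquet("s3://my-bucket/output/")
```

Because the output lands in S3, it survives cluster termination and remains readable by other clusters.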
Storage Option Comparison
| Feature | HDFS | EMRFS (S3) |
|---|---|---|
| Storage Location | Local storage on core nodes | Amazon S3 |
| Persistence | Ephemeral | Persistent |
| Performance | Fast (local) | Slower (network) |
| Cost | Included with EC2 | Cost-effective |
| Multi-Cluster Access | No | Yes |
| Best For | Caching, iterative workloads | Long-term storage, multi-cluster |
Getting Started with EMR: Practical Workflow
This section covers the practical workflow for getting started with Amazon EMR, including planning, cluster creation, and job submission.
EMR Workflow Overview
When starting with EMR workload execution, follow this workflow:
1. Plan Your Workload
   - Determine storage requirements (S3 vs HDFS)
   - Choose a big data framework (Spark, Hive, Pig, etc.)
   - Develop applications and scripts
   - Select hardware configuration
   - Configure networking and security
2. Launch Cluster
   - Create and configure the EMR cluster
   - Set up instance types and counts
   - Configure applications and components
3. Connect to Cluster
   - Use the AWS Console, SSH, or other methods
   - Access cluster resources
4. Submit Work
   - Submit steps/jobs to the cluster
   - Monitor execution
5. View Results
   - Check output in S3 or HDFS
   - Analyze results
6. Optional: Monitor, Troubleshoot, Scale
   - Monitor cluster performance
   - Troubleshoot issues
   - Scale as needed
7. Terminate Cluster
   - Clean up resources
   - Avoid unnecessary costs
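The tail end of this workflow (submit work, wait for results, terminate) can be scripted with boto3 as sketched below; the cluster ID and script path are placeholders carried over from a prior launch:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder: ID returned at launch

# Submit Work: add a step to the cluster.
step_ids = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/wordcount.py"],
        },
    }],
)["StepIds"]

# View Results: block until the step completes (output lands in S3/HDFS).
emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_ids[0])
state = emr.describe_step(
    ClusterId=cluster_id, StepId=step_ids[0]
)["Step"]["Status"]["State"]
print("Step finished with state:", state)

# Terminate Cluster: clean up to avoid unnecessary costs.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```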
Key Configuration Decisions
| Configuration | Options | Recommendation |
|---|---|---|
| Instance Groups | One instance type per node group | Simple workloads |
| Instance Fleets | Multiple instance types per node group | Spot Instances, cost optimization |
| Scaling | EMR Managed, Custom, Manual | EMR Managed for most cases |
| Spot Instances | Task nodes only | Use for cost savings |
| Termination | Auto-terminate after steps | Enable for cost optimization |
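As a sketch of the instance-fleet option, the boto3 snippet below launches a cluster whose task capacity runs entirely on Spot across multiple instance types; all names, types, and capacities are placeholder choices:

```python
import boto3

emr = boto3.client("emr")

# Instance fleets let one node role mix several instance types and blend
# On-Demand with Spot capacity.
response = emr.run_job_flow(
    Name="fleet-demo",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceFleets": [
            {"Name": "primary", "InstanceFleetType": "MASTER",
             "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"Name": "core", "InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"},
                                     {"InstanceType": "m5a.xlarge"}]},
            {"Name": "task", "InstanceFleetType": "TASK",
             "TargetSpotCapacity": 4,   # task capacity entirely on Spot
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"},
                                     {"InstanceType": "r5.xlarge"}]},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Offering several instance types per fleet gives EMR more Spot pools to draw from, which reduces the chance of unfilled capacity.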
Summary
In this tutorial, you learned about Amazon EMR, its architecture, deployment options, storage choices, and practical workflow for getting started.
Key Takeaways
- EMR simplifies big data processing by managing Hadoop/Spark clusters
- Clusters consist of three node types: Primary (required), Core (required), Task (optional)
- Three deployment options: EC2 (flexible), EKS (containerized), Serverless (simplest)
- Use Spot Instances for task nodes to optimize costs
- Choose deployment based on: Control needs, workload type, and infrastructure preferences
- Two cluster lifecycle models:
- Transient clusters: Auto-terminate after last step (cost-effective for batch jobs)
- Long-running clusters: Stay active continuously (for interactive workloads)
- Two main storage options:
- HDFS: Fast, ephemeral, best for caching and iterative workloads
- EMRFS (S3): Persistent, cost-effective, best for long-term storage and multi-cluster access
- Storage strategy: Use HDFS for intermediate results, S3 for input/output and persistence
- Decouple storage from compute using S3 to scale independently and reduce costs
- Transient clusters are default for EMR API launches and ideal for cost optimization
- EMR workflow: Plan → Launch → Connect → Submit → View Results → Terminate
- Instance Groups vs Fleets: Use Groups for simplicity, Fleets for Spot Instances and flexibility
- IAM roles: EMR Service Role (service access) + Instance Profile (node access to S3/AWS)
- Always enable logging to S3 for troubleshooting and optimization
Related AWS Services
- Amazon EC2: Compute instances for EMR clusters
- Amazon S3: Data storage (often used with EMR)
- Amazon EKS: Kubernetes service for containerized EMR
- AWS Glue: Alternative ETL service (serverless option)
- Amazon Athena: Query data processed by EMR
Last Updated: January 2025

