Amazon EMR Tutorial

Complete guide to Amazon Elastic MapReduce - architecture, deployment options, storage, and workflow.

Overview

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks such as Hadoop or Apache Spark.

What is EMR?

EMR is a managed service that helps you:

  • Set up clusters to process and analyze data within minutes
  • Avoid downloading, configuring, and installing big data components manually
  • Deploy EMR clusters quickly and get started faster

Where Does EMR Fit?

EMR fits into the processing side of the data analytics pipeline:

Data Ingestion → Data Storage → **Data Processing (EMR)** → Data Analytics → Visualization

Once data has been collected and stored, EMR is the service that handles the processing stage.


Why Use Amazon EMR?

The Challenge with Big Data Frameworks

When working with big data frameworks like Hadoop, Spark, Hive, etc., you face challenges:

  • Multiple frameworks and components to maintain
  • Compatibility issues between different versions
  • Patching and updates management
  • Support and maintenance overhead

Benefits of Amazon EMR

  1. Reduced Admin Time
    • Less time required to manage or support Hadoop clusters
    • AWS handles infrastructure management
  2. No Upfront Costs
    • No need to purchase hardware and software upfront
    • Pay only for what you use
  3. Cost Savings
    • Save on operating costs
    • Eliminate data center costs
    • No power and cooling costs
  4. Business Value
    • Reduce cost of delays
    • Mitigate risks
    • Faster time to market
  5. Governance
    • Built-in governance capabilities
    • Security and compliance features

EMR Architecture

Core Concept: Clusters

In EMR, everything revolves around the concept of a cluster.

A cluster is a group of Amazon EC2 instances working together with different roles but functioning as a team.

Cluster Components

An EMR cluster consists of three types of nodes:

1. Primary Node (Master Node)

Role: Management and coordination

Characteristics:

  • Minimum requirement: At least one primary node is required
  • A cluster can be as small as one node (single-node cluster)
  • Acts as the leader and coordinator

Responsibilities:

  • Manages all software components
  • Coordinates the distribution of tasks
  • Runs MapReduce logic
  • Distributes work, combines results, and delivers output
  • Runs YARN Resource Manager (Yet Another Resource Negotiator)
  • Manages resources for applications

Behind the scenes: Runs as an EC2 instance

2. Core Nodes

Role: Data storage and computation

Characteristics:

  • Follower nodes that work under the primary node
  • Have storage associated with them
  • Run DataNode daemon for Hadoop Distributed File System (HDFS)

Responsibilities:

  • Coordinate data storage as part of HDFS
  • Store data in a distributed fashion with multiple copies
  • Provide fault tolerance (if one node fails, data is still available)
  • Run the YARN NodeManager daemon (the TaskTracker in older Hadoop 1 clusters)
  • Perform parallel computation tasks
  • Execute work distributed by the primary node

Storage: Core nodes provide storage for HDFS, where data is stored with replication for reliability.

3. Task Nodes (Optional)

Role: Additional processing power

Characteristics:

  • Optional - not mandatory for cluster operation
  • Workhorses for computation
  • Do NOT store data in HDFS
  • Can be added or removed dynamically

Responsibilities:

  • Perform calculations and processing
  • Receive work broken down into smaller pieces from primary node
  • Execute parallel computations
  • Aggregate results and deliver them back

Use Cases:

  • When you have huge workloads requiring more processing power
  • When you need more processing power but don't need additional storage
  • For temporary processing needs

Best Practices:

  • Task nodes can join and leave the cluster as needed
  • Commonly use Spot Instances for cost optimization
  • Can run only for the duration needed, then be terminated
  • Primary and core nodes typically run longer (core nodes store data)

Node Type Summary

| Node Type | Required | Storage | Processing | Use Case |
|-----------|----------|---------|------------|----------|
| Primary Node | Yes | No | Yes (coordination) | Management & coordination |
| Core Node | Yes | Yes (HDFS) | Yes | Storage & computation |
| Task Node | Optional | No | Yes | Additional processing power |

How EMR Works

  1. EMR installs different components/daemons on each node type
  2. Each node gets a specific role in distributed applications like Apache Hadoop
  3. Primary node distributes work to core and task nodes
  4. Core nodes store data and perform computation
  5. Task nodes provide additional processing power
  6. Results are aggregated and returned
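The distribute-aggregate flow above can be sketched as a toy, in-process word count — no EMR involved, just the MapReduce idea of map, shuffle, and reduce phases (the input strings and chunking are made up for illustration):

```python
from collections import defaultdict

def map_phase(chunk):
    # Each core/task node emits (word, 1) pairs for its chunk of the input
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # The framework groups intermediate pairs by key before reducing
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducers aggregate values per key; results flow back to the client
    return {key: sum(values) for key, values in grouped.items()}

# The "primary node" splits the input and hands chunks to workers
chunks = ["big data on emr", "emr runs big data"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(intermediate))
print(counts["big"])   # 2
print(counts["emr"])   # 2
```

On a real cluster the map and reduce calls run in parallel on core and task nodes, and the shuffle moves data over the network, but the data flow is the same.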
💡 Tip: Use Spot Instances for task nodes because they are expected to join and leave as needed, can be terminated when work is finished, and provide significant cost savings. Perfect for transient workloads.

EMR Deployment Options

Amazon EMR provides three flexible deployment options:

1. EMR on EC2 Instances

Description: Run EMR workloads on EC2 instances

Characteristics:

  • Maximum flexibility - you control everything
  • You decide:
    • Instance sizes
    • Number of instances
    • Settings and configurations
  • Full control over deployment

Best For:

  • Continuously running workloads
  • Long-running clusters
  • Higher processing power requirements
  • When you need full control over infrastructure

Use Case: Production workloads that need to run continuously

2. EMR on EKS (Elastic Kubernetes Service)

Description: Run EMR on containerized platform using Amazon EKS

Characteristics:

  • Integrates with Amazon EKS
  • Automates provisioning, management, and scaling
  • Runs on Kubernetes cluster
  • Containerized deployment

Best For:

  • Organizations already using Kubernetes
  • Containerized applications
  • When you want to leverage existing EKS infrastructure
  • Microservices architecture

Use Case: When you have containerized applications and want to perform processing on EKS

3. EMR Serverless

Description: Simplest way to run EMR workloads without managing infrastructure

Characteristics:

  • Easiest and simplest deployment option
  • No infrastructure management required
  • AWS handles availability and scaling
  • Submit job → Process → Get results
  • No need to worry about behind-the-scenes infrastructure

How It Works:

  1. Submit your job
  2. Specify required components
  3. EMR processes the job
  4. Get results
  5. No infrastructure concerns
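As a sketch of what step 1 looks like in practice, these are request parameters for the boto3 `emr-serverless` client's `start_job_run` call. The application ID, role ARN, bucket, and script path are placeholders:

```python
# Hypothetical application ID, role ARN, and S3 script path for illustration
job_run_request = {
    "applicationId": "00example-app-id",
    "executionRoleArn": "arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    "jobDriver": {
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket/scripts/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
}
# With credentials configured, you would submit it with:
#   boto3.client("emr-serverless").start_job_run(**job_run_request)
```

Note there is no instance type or node count anywhere in the request — that is the point of the serverless option.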

Best For:

  • Ad-hoc processing jobs
  • When you don't want to manage infrastructure
  • Quick processing tasks
  • Cost-effective for intermittent workloads

Use Case: One-time jobs or periodic processing without maintaining clusters

Deployment Option Comparison

| Option | Control Level | Management | Best For |
|--------|---------------|------------|----------|
| EMR on EC2 | Full control | You manage | Continuous, long-running workloads |
| EMR on EKS | Container control | Kubernetes managed | Containerized environments |
| EMR Serverless | No infrastructure | Fully managed | Ad-hoc, intermittent jobs |

Transient vs Long-Running EMR Clusters

When deploying EMR clusters, you need to decide whether your cluster should be transient (terminate after job completion) or long-running (stay up continuously). This is a critical decision that affects cost, operations, and architecture.


Transient EMR Clusters

What is a Transient Cluster?

A transient cluster is an EMR cluster that automatically terminates after completing its assigned work (steps). The cluster starts, runs bootstrap actions, executes specified steps, and then automatically shuts down when the last step completes.

How It Works:

  1. Cluster starts and runs bootstrap actions
  2. Executes specified steps (jobs)
  3. Automatically terminates when the last step completes
  4. EC2 instances are terminated automatically

Characteristics:

  • Cost-effective - Pay only for the time needed to process data
  • Automatic cleanup - No manual termination required
  • Ideal for batch processing - Periodic tasks like daily data runs
  • No idle time costs - Cluster doesn't stay running when not needed
  • Default behavior - Clusters launched via EMR API have this enabled by default

Configuration:

Via AWS Console:

  • Select "Terminate cluster after last step completes" checkbox
  • Located under "Cluster termination" section when creating cluster

Via EMR API:

  • Default behavior for clusters launched with EMR API
  • Step execution is enabled by default
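A minimal sketch of launching a transient cluster through the API, as parameters for boto3's `run_job_flow` call. The cluster name, bucket, and script path are hypothetical; the key setting is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster terminate once the last step completes:

```python
transient_cluster = {
    "Name": "nightly-etl",                 # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",          # pick a release available in your region
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        # False = transient: shut down after the last step finishes
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "run-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/scripts/etl_job.py"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# boto3.client("emr").run_job_flow(**transient_cluster) would launch it
```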

Use Cases:

  • Daily batch processing - ETL jobs that run on schedule
  • Periodic data transformations - Weekly/monthly data processing
  • One-time data processing - Ad-hoc analysis jobs
  • Cost-sensitive workloads - When minimizing costs is critical
  • Workflows with defined steps - Jobs with clear start and end

Long-Running EMR Clusters

What is a Long-Running Cluster?

A long-running cluster is an EMR cluster that stays active continuously, waiting for jobs to be submitted. It doesn't automatically terminate after completing steps.

Characteristics:

  • Always available - Ready to process jobs immediately
  • Faster job startup - No cluster provisioning delay
  • Interactive workloads - Supports notebooks, ad-hoc queries
  • Shared resources - Multiple users/teams can use the same cluster
  • ⚠️ Ongoing costs - Pay for cluster even when idle
  • ⚠️ Manual management - Need to manually terminate when done

Use Cases:

  • Interactive analytics - Ad-hoc queries and exploration
  • EMR Notebooks - Jupyter notebook-based development
  • Continuous data processing - Real-time or near-real-time workloads
  • Shared development environment - Multiple data engineers/analysts
  • Low-latency requirements - When job startup time matters

Auto-Termination Policy (Idle-Based)

What is Auto-Termination Policy?

An auto-termination policy automatically terminates a cluster after it has been idle for a specified period. This is different from step-based termination and is useful for long-running clusters that may have idle periods.

How It Works:

  • Cluster monitors its activity
  • If cluster is idle for specified duration, it automatically terminates
  • No manual monitoring required

Cluster is Considered Idle When:

  • No active YARN applications
  • HDFS utilization is below 10%
  • No active EMR Notebook or EMR Studio connections
  • No on-cluster application user interfaces in use
  • No pending steps

Availability:

  • Available in EMR versions 5.30.0 and later
  • Available in most major AWS regions
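A sketch of the corresponding API request, using the boto3 `emr` client's `put_auto_termination_policy` call. The cluster ID is a placeholder; `IdleTimeout` is in seconds:

```python
# Terminate the cluster after one hour of idleness
auto_termination_request = {
    "ClusterId": "j-EXAMPLECLUSTERID",   # hypothetical cluster ID
    "AutoTerminationPolicy": {"IdleTimeout": 3600},
}
# boto3.client("emr").put_auto_termination_policy(**auto_termination_request)
# would attach the policy to a running cluster
```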

Transient vs Long-Running Comparison

| Feature | Transient Cluster | Long-Running Cluster |
|---------|-------------------|----------------------|
| Termination | Automatic after last step | Manual termination required |
| Cost | Pay only during processing | ⚠️ Pay even when idle |
| Startup Time | ⚠️ Cluster provisioning delay | Immediate job execution |
| Use Case | Batch/periodic processing | Interactive/continuous workloads |
| Best For | Scheduled jobs, cost optimization | Interactive analysis, notebooks |
💡 Best Practice: Use transient clusters for batch processing and auto-termination policy for long-running clusters with idle periods to optimize costs.

EMR Storage Options

When processing big data workloads with EMR, you need to understand where data is stored before and after processing. EMR provides different storage options, each with specific use cases.

📌 Key Consideration: Data needs to be stored before processing and output needs to be stored after processing. Storage choice impacts performance, cost, and persistence.

This section focuses on EMR on EC2 storage options.

1. HDFS (Hadoop Distributed File System)

What is HDFS?

HDFS is the default file system that comes automatically with Apache Hadoop clusters. When you set up an EMR cluster on EC2 with core nodes, you automatically get HDFS.

How It Works:

  • Local storage on core nodes is combined to create HDFS
  • Distributed, scalable, and portable file system for Hadoop
  • Data is replicated across different nodes for fault tolerance

Advantages of HDFS

  1. Fast Performance
    • Data is stored within the core nodes themselves
    • Local storage access is faster than network storage
  2. Data Awareness
    • Primary node is aware of where data is stored
    • Can distribute MapReduce jobs efficiently
    • Jobs can be processed from local storage when possible
    • Enables data locality optimization
  3. Suitable for Caching
    • Excellent for caching results from intermediate job flow steps
    • When jobs have multiple steps (aggregation, filtering, ETL, etc.)
  4. Iterative Workloads
    • Perfect for iterative reads on the same dataset
    • Ideal for disk I/O intensive workloads

Disadvantages of HDFS

  1. Ephemeral Storage
    • Data is reclaimed when the cluster terminates
    • If you store data on HDFS and terminate the cluster, data is lost
    • Not persistent for long-term storage

2. EMRFS (EMR File System) - S3 Storage

What is EMRFS?

EMRFS (EMR File System) is a connector provided by AWS that lets Hadoop applications use Amazon S3 as a storage layer, reading from and writing to S3 directly.

How It Works:

  • Uses Amazon S3 as the underlying storage
  • S3 is durable storage spread across multiple Availability Zones
  • Provides durability and availability
  • Decouples storage from compute

Advantages of EMRFS (S3)

  1. Durability and Availability
    • S3 is durable storage across multiple Availability Zones
    • High availability and reliability
  2. Cost-Effective
    • S3 is comparatively very cheap compared to EC2 storage
    • Pay only for storage used
  3. Decoupled Storage and Compute
    • Isolate compute from storage
    • Can scale them independently
    • More efficient resource utilization
  4. Persistent Storage
    • Data persists after cluster shutdown
    • Can retain data for long-term storage
  5. Multi-Cluster Access
    • Data available to multiple clusters
    • One cluster can use the data, and other clusters can also access it
    • No need to duplicate data across clusters
  6. Single-Read Workloads
    • Perfect for workloads that read data once per run
    • Ideal for ETL pipelines

Storage Option Comparison

| Feature | HDFS | EMRFS (S3) |
|---------|------|------------|
| Storage Location | Local storage on core nodes | Amazon S3 |
| Persistence | Ephemeral | Persistent |
| Performance | Fast (local) | Slower (network) |
| Cost | Included with EC2 | Cost-effective |
| Multi-Cluster Access | No | Yes |
| Best For | Caching, iterative workloads | Long-term storage, multi-cluster |
💡 Best Practice: Use a hybrid approach - Use HDFS for intermediate results and caching, and use S3 (EMRFS) for input data and final output. This gives you the best of both worlds: performance + persistence.
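The hybrid approach shows up directly in the URIs a job uses. Only the `s3://` (EMRFS) and `hdfs://` (HDFS) schemes are real here; the bucket, paths, and script are hypothetical:

```python
# Durable input read once per run from S3 (EMRFS)
input_uri = "s3://example-bucket/raw/events/"
# Fast but ephemeral scratch space for intermediate results (HDFS)
intermediate_uri = "hdfs:///tmp/events-cleaned/"
# Durable final output back to S3, so it survives cluster termination
output_uri = "s3://example-bucket/curated/events/"

# Passed to a Spark job, e.g. as a step's spark-submit arguments
spark_args = [
    "spark-submit", "s3://example-bucket/scripts/pipeline.py",
    input_uri, intermediate_uri, output_uri,
]
```

Anything written under `hdfs://` disappears with the cluster, so only restartable intermediates belong there.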

Getting Started with EMR: Practical Workflow

This section covers the practical workflow for getting started with Amazon EMR, including planning, cluster creation, and job submission.

EMR Workflow Overview

When starting with EMR workload execution, follow this workflow:

  1. Plan Your Workload
    • Determine storage requirements (S3 vs HDFS)
    • Choose big data framework (Spark, Hive, Pig, etc.)
    • Develop applications and scripts
    • Select hardware configuration
    • Configure networking and security
  2. Launch Cluster
    • Create and configure EMR cluster
    • Set up instance types and counts
    • Configure applications and components
  3. Connect to Cluster
    • Use AWS Console, SSH, or other methods
    • Access cluster resources
  4. Submit Work
    • Submit steps/jobs to the cluster
    • Monitor execution
  5. View Results
    • Check output in S3 or HDFS
    • Analyze results
  6. Optional: Monitor, Troubleshoot, Scale
    • Monitor cluster performance
    • Troubleshoot issues
    • Scale as needed
  7. Terminate Cluster
    • Clean up resources
    • Avoid unnecessary costs
📌 Note: Step 6 is optional. Steps 1-5 and 7 are essential.
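The essential workflow steps map onto EMR API calls roughly as follows. The call names are real methods on the boto3 `emr` client; the mapping itself is just an orientation aid:

```python
# Workflow stage -> boto3 "emr" client call that drives it
workflow = [
    ("launch",    "run_job_flow"),        # create and configure the cluster
    ("connect",   "describe_cluster"),    # check status and endpoints
    ("submit",    "add_job_flow_steps"),  # queue jobs (steps) on the cluster
    ("results",   "describe_step"),       # poll step state; output lands in S3/HDFS
    ("terminate", "terminate_job_flows"), # clean up to stop charges
]
```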

Key Configuration Decisions

| Configuration | Options | Recommendation |
|---------------|---------|----------------|
| Instance Groups | One instance type per node group | Simple workloads |
| Instance Fleets | Multiple instance types per node group | Spot Instances, cost optimization |
| Scaling | EMR Managed, Custom, Manual | EMR Managed for most cases |
| Spot Instances | Task nodes only | Use for cost savings |
| Termination | Auto-terminate after steps | Enable for cost optimization |
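A sketch of an instance fleet configuration that sources task capacity entirely from Spot, using the fleet structure accepted by `run_job_flow` under `Instances.InstanceFleets` (instance types and capacity numbers are illustrative):

```python
# Task fleet requesting 4 units of capacity, all from the Spot market
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 4,
    "InstanceTypeConfigs": [
        # Offering several instance types improves the odds of
        # actually obtaining Spot capacity at a good price
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
    ],
}
```

Because task nodes hold no HDFS data, losing a Spot instance here costs only recomputation, not data.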

Summary

In this tutorial, you learned about Amazon EMR, its architecture, deployment options, storage choices, and practical workflow for getting started.

Key Takeaways

  1. EMR simplifies big data processing by managing Hadoop/Spark clusters
  2. Clusters consist of three node types: Primary (required), Core (required), Task (optional)
  3. Three deployment options: EC2 (flexible), EKS (containerized), Serverless (simplest)
  4. Use Spot Instances for task nodes to optimize costs
  5. Choose deployment based on: Control needs, workload type, and infrastructure preferences
  6. Two cluster lifecycle models:
    • Transient clusters: Auto-terminate after last step (cost-effective for batch jobs)
    • Long-running clusters: Stay active continuously (for interactive workloads)
  7. Two main storage options:
    • HDFS: Fast, ephemeral, best for caching and iterative workloads
    • EMRFS (S3): Persistent, cost-effective, best for long-term storage and multi-cluster access
  8. Storage strategy: Use HDFS for intermediate results, S3 for input/output and persistence
  9. Decouple storage from compute using S3 to scale independently and reduce costs
  10. Transient clusters are default for EMR API launches and ideal for cost optimization
  11. EMR workflow: Plan → Launch → Connect → Submit → View Results → Terminate
  12. Instance Groups vs Fleets: Use Groups for simplicity, Fleets for Spot Instances and flexibility
  13. IAM roles: EMR Service Role (service access) + Instance Profile (node access to S3/AWS)
  14. Always enable logging to S3 for troubleshooting and optimization
🚀 Next Steps: Now that you understand EMR fundamentals, you can create your first cluster, submit jobs, monitor performance, and explore advanced features like bootstrap actions and custom configurations.

Related AWS Services

  • Amazon EC2: Compute instances for EMR clusters
  • Amazon S3: Data storage (often used with EMR)
  • Amazon EKS: Kubernetes service for containerized EMR
  • AWS Glue: Alternative ETL service (serverless option)
  • Amazon Athena: Query data processed by EMR

Last Updated: January 2025
