Amazon Kinesis Tutorial

Overview

Amazon Kinesis is a fully managed platform for real-time data streaming on AWS. It lets you capture, process, and analyze streaming data such as application logs, IoT telemetry, clickstreams, and events.

  • Ingest events at scale with low latency
  • Process streams in real time or near-real time
  • Deliver data to durable storage and analytics services
Use cases: Real-time analytics dashboards, anomaly detection, log ingestion, clickstream analysis, IoT telemetry, ETL pipelines.

Core Services

Kinesis Data Streams (KDS)

  • Durable, ordered event stream
  • Producers write records; consumers read and process
  • Provisioned or on-demand capacity
  • Retention configurable (24 hours by default, extendable up to 365 days)

Kinesis Data Firehose

  • Fully managed delivery to destinations
  • Buffering and batching for efficient writes
  • Transforms via Lambda, record format conversion
  • Destinations: S3, Redshift, OpenSearch Service, Splunk, HTTP endpoints

Kinesis Data Analytics

  • Real-time analytics using SQL or Apache Flink (the Flink offering is now named Amazon Managed Service for Apache Flink)
  • Windows, aggregations, joins, pattern detection
  • Inputs from KDS or Firehose; outputs to KDS, Firehose, S3, others

Architecture

Typical architecture: producers send records to KDS → optional transformations or analytics → delivery to storage/analytics via Firehose or consumer applications.

Producers → Kinesis Data Streams → Consumers / Kinesis Data Analytics → Firehose → S3/Redshift/OpenSearch
  • Records are appended to shards with sequence numbers
  • Partition keys determine routing to shards
  • Consumers track progress via checkpoints
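The read path above can be sketched with the low-level SDK: obtain a shard iterator, read a batch of records, and carry the last sequence number forward as a checkpoint. The stream, shard, and region names below are placeholders, and the checkpoint is kept in memory rather than in DynamoDB as the KCL would do.

```python
def advance_checkpoint(records, checkpoint):
    # Pure helper: the checkpoint advances to the last sequence number read.
    return records[-1]["SequenceNumber"] if records else checkpoint

def read_batch(stream_name, shard_id, checkpoint=None, region="us-east-1"):
    """Read one batch from a shard, resuming after `checkpoint` if given."""
    import boto3  # imported here so advance_checkpoint works without the SDK
    kinesis = boto3.client("kinesis", region_name=region)
    if checkpoint:
        iterator_args = {"ShardIteratorType": "AFTER_SEQUENCE_NUMBER",
                         "StartingSequenceNumber": checkpoint}
    else:
        iterator_args = {"ShardIteratorType": "TRIM_HORIZON"}  # oldest available
    it = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id, **iterator_args
    )["ShardIterator"]
    resp = kinesis.get_records(ShardIterator=it, Limit=100)
    return resp["Records"], advance_checkpoint(resp["Records"], checkpoint)
```

A real consumer would loop with the returned `NextShardIterator` and persist the checkpoint durably between runs.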

Producers & Consumers

Producers

  • SDK PutRecord/PutRecords APIs
  • Kinesis Producer Library (KPL) for high-throughput batching
  • Kinesis Agent for log files
  • Amazon EventBridge, CloudWatch Logs subscription, IoT, custom apps
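As a sketch of the plain SDK path, the snippet below shapes JSON events into PutRecords entries. The stream name, region, and the `userId` partition-key field are assumptions; a production producer must also retry the per-record failures that PutRecords reports, since the call is not all-or-nothing.

```python
import json

def build_records(events, key_field="userId"):
    """Pure helper: shape events into PutRecords entries. The partition key
    (here the assumed `userId` field) controls shard routing."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e[key_field])}
        for e in events
    ]

def put_events(stream_name, events, region="us-east-1"):
    import boto3  # imported here so build_records stays usable without the SDK
    kinesis = boto3.client("kinesis", region_name=region)
    resp = kinesis.put_records(StreamName=stream_name, Records=build_records(events))
    # Check this count and retry the individual entries that failed.
    return resp["FailedRecordCount"]
```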

Consumers

  • Kinesis Client Library (KCL) for checkpointing and scaling
  • Enhanced fan-out consumers for dedicated 2 MB/s per consumer per shard
  • Lambda event source mapping for serverless processing
  • Custom applications (Spark, Flink, Python, Java)
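For the Lambda option, a minimal handler might look like the sketch below. Record payloads arrive base64-encoded under `record["kinesis"]["data"]`, which is the standard shape of the Kinesis event delivered by the event source mapping.

```python
import base64
import json

def handler(event, context):
    """Minimal Lambda consumer sketch for a Kinesis event source mapping."""
    processed = []
    for record in event["Records"]:
        # Payloads are base64-encoded; decode, then parse the JSON body.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        processed.append(payload)
        # A real handler would write to a database, emit metrics, etc.
    return {"batchSize": len(processed)}
```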

Shards & Partition Keys

  • A stream consists of shards; each shard provides write and read capacity
  • Partition key is hashed to route records to a shard
  • Avoid hot keys; distribute keys to spread load across shards
  Capacity metric    Per shard (typical)          Notes
  Write throughput   ~1 MB/s or 1,000 records/s   Aggregate limit across all producers
  Read throughput    ~2 MB/s                      Shared by all consumers unless using enhanced fan-out
  Enhanced fan-out   2 MB/s per consumer          Dedicated throughput per registered consumer
Tip: Use partition keys that reflect natural distribution (userId, deviceId) and consider salting for hotspots.
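To see how keys map to shards, the sketch below approximates Kinesis routing (MD5 of the partition key mapped onto evenly split hash-key ranges) and shows one simple salting scheme. The bucket count is illustrative, and consumers that need every record for a salted key must read all of its variants.

```python
import hashlib
import random

def shard_for_key(partition_key, num_shards):
    """Approximate Kinesis routing: the 128-bit MD5 of the key, mapped
    onto `num_shards` evenly split hash-key ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2**128

def salted_key(base_key, salt_buckets=8):
    """Spread a hot key across several effective keys by appending a
    random bucket suffix (e.g. 'user-1#3')."""
    return f"{base_key}#{random.randrange(salt_buckets)}"
```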

Scaling & Throughput

  • On-demand mode scales automatically based on traffic
  • Provisioned mode lets you control shard count
  • Reshard operations: split to increase capacity, merge to reduce
  • Monitor with CloudWatch metrics (IncomingBytes, WriteProvisionedThroughputExceeded, IteratorAge)
Pattern: Start with on-demand for unknown workloads; switch to provisioned when you need predictable shard-level control.
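In provisioned mode, a rough shard count can be derived from the per-shard write limits and applied with UpdateShardCount (uniform scaling can at most double or halve the count in a single call). The stream name and region are placeholders.

```python
import math

def shards_needed(write_mb_per_s, records_per_s):
    # Back-of-envelope sizing from the ~1 MB/s and 1,000 records/s
    # per-shard write limits; always at least one shard.
    return max(math.ceil(write_mb_per_s), math.ceil(records_per_s / 1000), 1)

def resize_stream(stream_name, write_mb_per_s, records_per_s, region="us-east-1"):
    import boto3  # imported here so shards_needed works without the SDK
    boto3.client("kinesis", region_name=region).update_shard_count(
        StreamName=stream_name,
        TargetShardCount=shards_needed(write_mb_per_s, records_per_s),
        ScalingType="UNIFORM_SCALING",
    )
```

Leave headroom above the computed minimum so traffic spikes don't immediately trigger WriteProvisionedThroughputExceeded.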

Security & IAM

  • Server-side encryption with AWS KMS
  • Fine-grained IAM policies for producers and consumers
  • VPC endpoints for private access
  • Data protection: client-side encryption, data masking, PII handling
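As a sketch of the first two points, the helper below builds a minimal write-only IAM policy document for a single stream, and `enable_sse` turns on server-side encryption with a KMS key. The stream ARN and key ID are placeholders.

```python
def producer_policy(stream_arn):
    """Minimal IAM policy document granting write-only access to one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
            "Resource": stream_arn,
        }],
    }

def enable_sse(stream_name, kms_key_id, region="us-east-1"):
    """Enable server-side encryption with the given KMS key."""
    import boto3  # imported here so producer_policy works without the SDK
    boto3.client("kinesis", region_name=region).start_stream_encryption(
        StreamName=stream_name, EncryptionType="KMS", KeyId=kms_key_id)
```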

Delivery Destinations

  • S3 for durable data lake storage
  • Redshift via Firehose
  • OpenSearch Service for log/search analytics
  • Splunk integration
  • HTTP endpoints for custom destinations

Firehose supports buffering by size and time, optional Lambda transformations, and record format conversion (for example, JSON input converted to Parquet or ORC).
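A minimal direct-PUT producer sketch: because Firehose concatenates record payloads when it buffers them to the destination, newline-delimiting JSON keeps the resulting S3 objects parseable line by line. The delivery stream name and region are placeholders.

```python
import json

def to_firehose_record(obj):
    """Pure helper: newline-delimit JSON so concatenated payloads in the
    destination file stay parseable one object per line."""
    return {"Data": (json.dumps(obj) + "\n").encode("utf-8")}

def deliver(delivery_stream, obj, region="us-east-1"):
    import boto3  # imported here so to_firehose_record works without the SDK
    boto3.client("firehose", region_name=region).put_record(
        DeliveryStreamName=delivery_stream, Record=to_firehose_record(obj))
```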


Kinesis Data Analytics

  • SQL applications for windowed aggregations, filtering, joins
  • Apache Flink for advanced event-time processing, stateful operators
  • Common outputs: KDS, Firehose, S3, Lambda, OpenSearch
  Feature                    SQL        Flink
  Complex event processing   Basic      Advanced
  State management           Limited    Rich keyed state
  Ease of use                High       Moderate
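The kind of windowed aggregation such an application runs continuously can be illustrated in plain Python: a tumbling-window count that buckets events by (window start, key). This is a conceptual sketch of the semantics, not Flink or Kinesis SQL code.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Count events per key in fixed, non-overlapping (tumbling) windows.
    Each event is a (timestamp_seconds, key) pair."""
    counts = defaultdict(int)
    for ts, key in events:
        # Align each event to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)
```

A streaming engine produces the same result incrementally, emitting each window's counts when the window closes rather than after scanning a finished list.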

Hands-On Workflow

  1. Plan
    • Choose KDS or Firehose based on destination needs
    • Define partition key strategy
    • Select capacity mode (on-demand vs provisioned)
  2. Create Stream/Delivery Stream
    • KDS: create stream and configure capacity
    • Firehose: choose destination, buffering, IAM roles
  3. Produce Data
    • Use SDK/KPL to put records
    • Validate throughput and partition distribution
  4. Process
    • Consumers via KCL/Lambda
    • Optional Kinesis Data Analytics application
  5. Deliver
    • Firehose writes to destinations
    • Configure retry and S3 backup
  6. Monitor & Scale
    • CloudWatch metrics and alarms
    • Reshard or switch capacity mode
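Steps 1-2 can be sketched with the SDK: validate the stream name (letters, digits, underscores, periods, and hyphens, up to 128 characters) and create an on-demand stream. The name and region below are placeholders.

```python
import re

def valid_stream_name(name):
    """Stream names: 1-128 chars drawn from letters, digits, _ . and -."""
    return bool(re.fullmatch(r"[a-zA-Z0-9_.-]{1,128}", name))

def create_on_demand_stream(name, region="us-east-1"):
    import boto3  # imported here so valid_stream_name works without the SDK
    if not valid_stream_name(name):
        raise ValueError(f"invalid stream name: {name}")
    boto3.client("kinesis", region_name=region).create_stream(
        StreamName=name,
        StreamModeDetails={"StreamMode": "ON_DEMAND"},
    )
```

Creation is asynchronous; wait for the stream to reach ACTIVE (for example with the `stream_exists` waiter) before producing.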

Pricing & Limits

  • KDS: pay per shard hour (provisioned) or per stream hour plus data ingested and retrieved (on-demand)
  • Enhanced fan-out billed per consumer-shard hour and data
  • Firehose: pay per data ingested, transformations, and destination-specific costs
  • Data Analytics: pay for application runtime and resources
Optimization: Prefer Firehose for delivery to S3/Redshift/OpenSearch; use KDS when custom processing or ordered replay is required.

Summary

  1. Kinesis Data Streams for durable ordered streaming
  2. Firehose for managed delivery and transformations
  3. Data Analytics for real-time insights with SQL or Flink
  4. Choose capacity mode based on predictability and control
  5. Design partition keys to avoid hotspots and meet throughput

Last Updated: January 2025