Amazon Kinesis Tutorial
Overview
Amazon Kinesis is a fully managed platform for real-time data streaming on AWS. It captures, processes, and analyzes streaming data such as application logs, IoT telemetry, clickstreams, and events.
- Ingest events at scale with low latency
- Process streams in real time or near-real time
- Deliver data to durable storage and analytics services
Use cases: Real-time analytics dashboards, anomaly detection, log ingestion, clickstream analysis, IoT telemetry, ETL pipelines.
Core Services
Kinesis Data Streams (KDS)
- Durable, ordered event stream
- Producers write records; consumers read and process
- Provisioned or on-demand capacity
- Retention configurable (24 hours by default, extendable up to 365 days)
Kinesis Data Firehose (now Amazon Data Firehose)
- Fully managed delivery to destinations
- Buffering and batching for efficient writes
- Transforms via Lambda, record format conversion
- Destinations: S3, Redshift, OpenSearch Service, Splunk, HTTP endpoints
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink)
- Real-time analytics using SQL or Apache Flink
- Windows, aggregations, joins, pattern detection
- Inputs from KDS or Firehose; outputs to KDS, Firehose, S3, others
Architecture
Typical architecture: producers send records to KDS → optional transformations or analytics → delivery to storage/analytics via Firehose or consumer applications.
Producers → Kinesis Data Streams → Consumers / Kinesis Data Analytics → Firehose → S3/Redshift/OpenSearch
- Records are appended to shards with sequence numbers
- Partition keys determine routing to shards
- Consumers track progress via checkpoints
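The partition-key routing above can be sketched locally: Kinesis hashes each partition key with MD5 into a 128-bit hash-key space that is divided among the shards. A minimal sketch, assuming the shards split the key space evenly (as CreateStream and uniform resharding do); `shard_for_key` is an illustrative helper, not an SDK call:

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Predict which shard a partition key routes to.

    Kinesis interprets the MD5 hash of the UTF-8 partition key as a
    128-bit integer and routes the record to the shard whose hash-key
    range contains it. Assumes an evenly split key space.
    """
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(hash_value // range_size, shard_count - 1)
```

Because the hash is deterministic, the same key always lands on the same shard, which is what preserves per-key ordering.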
Producers & Consumers
Producers
- SDK PutRecord/PutRecords APIs
- Kinesis Producer Library (KPL) for high-throughput batching
- Kinesis Agent for log files
- Amazon EventBridge, CloudWatch Logs subscription, IoT, custom apps
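The batching that the KPL automates can be sketched with a helper that respects the PutRecords API limits: at most 500 records and 5 MiB per call (counting data plus partition keys), with each record's data up to 1 MiB. `batch_records` is a hypothetical helper; in a real producer each batch would be sent with boto3's `put_records`, retrying any entries reported via `FailedRecordCount`:

```python
def batch_records(records, max_count=500, max_bytes=5 * 1024 * 1024):
    """Group (partition_key, data) pairs into PutRecords-sized batches.

    A batch is flushed when adding the next record would exceed either
    the 500-record count limit or the 5 MiB size limit.
    """
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        size = len(data) + len(key.encode("utf-8"))
        if current and (len(current) >= max_count or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append((key, data))
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```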
Consumers
- Kinesis Client Library (KCL) for checkpointing and scaling
- Enhanced fan-out consumers get a dedicated 2 MB/s of read throughput per shard for each consumer
- Lambda event source mapping for serverless processing
- Custom applications (Spark, Flink, Python, Java)
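The checkpoint-and-resume loop the KCL manages can be illustrated with an in-memory stand-in for a shard; in a real consumer the batch would come from GetRecords and the checkpoint would be persisted in the KCL's DynamoDB lease table. `ShardReader` is illustrative only:

```python
class ShardReader:
    """Minimal sketch of the consumer loop the KCL automates: read a
    batch, process it, then checkpoint the last sequence number so a
    restarted reader resumes where the previous one left off."""

    def __init__(self, shard, checkpoint=None):
        self.shard = shard            # list of (sequence_number, data)
        self.checkpoint = checkpoint  # last processed sequence number

    def poll(self, limit=100):
        # Resume just past the checkpoint, like an
        # AFTER_SEQUENCE_NUMBER shard iterator.
        start = 0
        if self.checkpoint is not None:
            start = next(i for i, (seq, _) in enumerate(self.shard)
                         if seq == self.checkpoint) + 1
        batch = self.shard[start:start + limit]
        if batch:
            self.checkpoint = batch[-1][0]  # checkpoint after processing
        return [data for _, data in batch]
```

A crash between processing and checkpointing means the batch is re-read on restart, which is why Kinesis consumers should be idempotent.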
Scaling & Throughput
- On-demand mode scales automatically based on traffic
- Provisioned mode lets you control shard count
- Reshard operations: split to increase capacity, merge to reduce
- Monitor with CloudWatch metrics (IncomingBytes, WriteProvisionedThroughputExceeded, IteratorAge)
Pattern: Start with on-demand for unknown workloads; switch to provisioned when you need predictable shard-level control.
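For provisioned mode, a back-of-the-envelope sizing helper follows from the documented per-shard write limits of 1 MB/s and 1,000 records/s. `required_shards` is a hypothetical helper; it ignores read-side limits and hot partition keys, so real deployments should add headroom:

```python
import math

def required_shards(peak_write_mb_per_s: float, peak_records_per_s: float) -> int:
    """Estimate a provisioned shard count from peak ingest, assuming
    per-shard write limits of 1 MB/s and 1,000 records/s. A hot key
    can still overload one shard even when the aggregate fits."""
    return max(
        math.ceil(peak_write_mb_per_s / 1.0),
        math.ceil(peak_records_per_s / 1000.0),
        1,
    )
```

The resulting count could then be applied with boto3's `update_shard_count` using `UNIFORM_SCALING`.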
Security & IAM
- Server-side encryption with AWS KMS
- Fine-grained IAM policies for producers and consumers
- VPC endpoints for private access
- Data protection: client-side encryption, data masking, PII handling
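As an illustration of a fine-grained producer policy, a minimal write-only IAM policy might look like the sketch below (the account ID, region, and stream name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProducerWriteOnly",
      "Effect": "Allow",
      "Action": [
        "kinesis:PutRecord",
        "kinesis:PutRecords",
        "kinesis:DescribeStreamSummary"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"
    }
  ]
}
```

Consumers would get a separate policy scoped to read actions such as `kinesis:GetRecords` and `kinesis:GetShardIterator`.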
Delivery Destinations
- S3 for durable data lake storage
- Redshift via Firehose
- OpenSearch Service for log/search analytics
- Splunk integration
- HTTP endpoints for custom destinations
Firehose supports buffering by size and time, and optional Lambda transformations and format conversion (CSV/JSON/Parquet).
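That buffering rule is a simple either/or: a flush fires when the size hint or the time hint is reached, whichever comes first. A sketch of the decision, with defaults mirroring common S3 settings (5 MB / 300 s); actual valid ranges are destination-specific:

```python
def should_flush(buffered_bytes, seconds_since_flush,
                 size_mb=5, interval_s=300):
    """Firehose flushes its buffer when EITHER hint is reached:
    accumulated size (SizeInMBs) or elapsed time (IntervalInSeconds)."""
    return (buffered_bytes >= size_mb * 1024 * 1024
            or seconds_since_flush >= interval_s)
```

Larger buffers produce fewer, bigger objects in S3 (cheaper and friendlier to downstream queries) at the cost of higher end-to-end latency.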
Kinesis Data Analytics
- SQL applications for windowed aggregations, filtering, joins
- Apache Flink for advanced event-time processing, stateful operators
- Common outputs: KDS, Firehose, S3, Lambda, OpenSearch
| Feature | SQL | Flink |
|---|---|---|
| Complex event processing | Basic | Advanced |
| State management | Limited | Rich keyed state |
| Ease of use | High | Moderate |
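The windowed aggregations both runtimes offer can be illustrated with a tumbling-window count in plain Python. `tumbling_counts` is illustrative only: it buckets (timestamp, key) pairs by fixed windows and skips Flink's event-time and watermark machinery:

```python
from collections import Counter

def tumbling_counts(events, window_s=60):
    """Count events per key in fixed-size, non-overlapping (tumbling)
    windows. Each (epoch_seconds, key) event lands in exactly one
    window, identified by the window's start time."""
    windows = {}
    for ts, key in events:
        window_start = ts - (ts % window_s)
        windows.setdefault(window_start, Counter())[key] += 1
    return windows
```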
Hands-On Workflow
- Plan
- Choose KDS or Firehose based on destination needs
- Define partition key strategy
- Select capacity mode (on-demand vs provisioned)
- Create Stream/Delivery Stream
- KDS: create stream and configure capacity
- Firehose: choose destination, buffering, IAM roles
- Produce Data
- Use SDK/KPL to put records
- Validate throughput and partition distribution
- Process
- Consumers via KCL/Lambda
- Optional Kinesis Data Analytics application
- Deliver
- Firehose writes to destinations
- Configure retry and S3 backup
- Monitor & Scale
- CloudWatch metrics and alarms
- Reshard or switch capacity mode
Pricing & Limits
- KDS: pay per shard hour (provisioned) or per stream hour plus per GB of data written and read (on-demand)
- Enhanced fan-out billed per consumer-shard hour and data
- Firehose: pay per data ingested, transformations, and destination-specific costs
- Data Analytics: pay for application runtime and resources
Optimization: Prefer Firehose for delivery to S3/Redshift/OpenSearch; use KDS when custom processing or ordered replay is required.
Summary
- Kinesis Data Streams for durable ordered streaming
- Firehose for managed delivery and transformations
- Data Analytics for real-time insights with SQL or Flink
- Choose capacity mode based on predictability and control
- Design partition keys to avoid hotspots and meet throughput
Last Updated: January 2025
