AWS Glue Tutorial
Overview
AWS Glue is a serverless data integration service for discovering, preparing, and transforming data for analytics, machine learning, and application development.
- Build ETL pipelines using Python or Spark
- Centralize metadata with Glue Data Catalog
- Automate schema discovery using Glue Crawlers
- Orchestrate multi-step jobs with Glue Workflows
- Design visually with Glue Studio
Hands-on: Build ETL pipelines, transform data, catalog schemas.
Core Components
- Jobs: Serverless Spark or Python shell workloads
- Data Catalog: Databases, tables, partitions, and schema versions
- Crawlers: Scan data sources and infer schemas
- Workflows: Coordinate jobs and triggers
- Studio: Visual authoring and monitoring
- Data Quality: Rules, profiles, and metrics for datasets
- Schema Registry: Manage schemas for streaming platforms like Kafka
Glue ETL Jobs
Job Types
- Spark: Distributed ETL with DynamicFrame and DataFrame APIs
- Python Shell: Lightweight scripts for small tasks
- Ray: Parallel compute for ML/feature engineering
Key Concepts
- DynamicFrame: Schema-flexible abstraction with transforms
- Bookmarks: Track processed data for incremental loads
- Connections: VPC, JDBC, and marketplace connectors
- Job Parameters: Runtime configuration and environment variables
- Formats: Parquet, ORC, JSON, CSV, Avro, Iceberg
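The idea behind bookmarks is simple: remember what a previous run already processed so the next run only handles new arrivals. The sketch below illustrates that logic in plain Python; the function names and file list are illustrative, not Glue APIs (Glue tracks bookmark state internally per job).

```python
def incremental_batch(all_files, processed):
    """Return only the input files not seen by a previous run."""
    return sorted(set(all_files) - set(processed))

# State a previous run would have recorded
processed = {"s3://my-bucket/raw/events/part-000.json"}

# Everything currently present at the source
arrived = [
    "s3://my-bucket/raw/events/part-000.json",
    "s3://my-bucket/raw/events/part-001.json",
]

# Only part-001 is new, so only it would be read this run
new_files = incremental_batch(arrived, processed)
```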
Sample Spark Job
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read raw JSON events from S3 into a schema-flexible DynamicFrame
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/events/"]},
    format="json"
)

# Rename a field, then drop test events using the DataFrame API
transformed = source.rename_field("userId", "user_id")
df = transformed.toDF()
df = df.filter(df["event_type"] != "test")

# Convert back to a DynamicFrame and write curated Parquet to S3
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "df"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},
    format="parquet"
)
job.commit()
Glue Data Catalog
- Persistent metadata store for databases, tables, and partitions
- Exposes Hive Metastore-compatible APIs
- Integrates with Athena, EMR, Redshift Spectrum, and Lake Formation
- Supports schema versions and column-level metadata
| Entity | Description | Common Use |
|---|---|---|
| Database | Logical grouping of tables | Organize by domain or environment |
| Table | Schema definition for datasets | Query via Athena and ETL jobs |
| Partition | Subsets of data by keys | Improve query performance and cost |
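Partitions in the Catalog map to Hive-style key=value folders in S3. A minimal sketch of how such a path is laid out (bucket and key names are placeholders):

```python
def partition_path(base, **keys):
    """Build a Hive-style partition path like base/year=2024/month=01/."""
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{base.rstrip('/')}/{parts}/"

# Queries filtering on year/month can then skip all other folders
p = partition_path("s3://my-bucket/curated/events", year="2024", month="01")
```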
Glue Crawlers
- Automatically discover datasets in sources like S3, JDBC, DynamoDB
- Infer schema, data types, partitions, and update the Data Catalog
- Can run on a schedule for continuous updates; custom classifiers handle nonstandard formats
Glue Workflows
- Define orchestrated pipelines of crawlers, triggers, and jobs
- Support conditional branching and dependencies
- Track runs, status, and lineage
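A workflow is essentially a dependency graph: each node (crawler or job) runs once its predecessors finish. The standard-library sketch below shows a valid execution order for a simple crawler → job → quality-check chain; the node names are illustrative.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must finish before it starts
deps = {
    "crawl_raw": set(),
    "etl_job": {"crawl_raw"},
    "quality_check": {"etl_job"},
}

# static_order() yields a run order that respects every dependency
order = list(TopologicalSorter(deps).static_order())
```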
Glue Studio
- Visual job authoring with drag-and-drop nodes
- Monitors runs, logs, and metrics
- Supports data previews and autogenerated code
Streaming ETL
- Process streaming data from Kinesis Data Streams and Kafka
- Stateful aggregations and windowing with Spark Structured Streaming
- Write outputs to S3, Redshift, or other sinks
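A tumbling window assigns each event to exactly one fixed-size time bucket and aggregates per bucket. Spark Structured Streaming does this over unbounded streams; the pure-Python sketch below shows only the windowing arithmetic on a tiny batch.

```python
from collections import defaultdict

def tumbling_counts(events, window_sec):
    """Count events per (window_start, key); each event is (epoch_ts, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_sec  # floor to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Two clicks land in the [0, 10) window, one view in [10, 20)
agg = tumbling_counts([(0, "click"), (5, "click"), (12, "view")], window_sec=10)
```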
Data Quality
- Define rules to validate datasets
- Generate profiles to understand data distributions
- Produce metrics and alerts for pipeline health
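Conceptually, a data quality rule is a predicate evaluated over rows, with failures surfaced as metrics. The sketch below is a minimal stand-in for that idea in plain Python; Glue Data Quality itself uses a declarative rule language (DQDL), not callables.

```python
def check_rules(rows, rules):
    """Evaluate per-row rules; return {rule_name: failing_row_count}."""
    failures = {}
    for name, predicate in rules.items():
        bad = [r for r in rows if not predicate(r)]
        if bad:
            failures[name] = len(bad)
    return failures

rows = [{"user_id": "u1", "amount": 10}, {"user_id": None, "amount": -5}]
rules = {
    "user_id_not_null": lambda r: r["user_id"] is not None,
    "amount_non_negative": lambda r: r["amount"] >= 0,
}
# The second row violates both rules
result = check_rules(rows, rules)
```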
Schema Registry
- Central registry for streaming schemas (Kafka producers/consumers)
- Supports schema evolution and compatibility checks
- Integrates with Glue Data Catalog and ETL jobs
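A compatibility check answers one question: can consumers on the old schema still read data produced under the new one? A simplified backward-compatibility rule, sketched below, is that the new schema may add optional fields but must keep every existing field (real registries also compare types and defaults; the schema shape here is an illustrative {name: "required"|"optional"} map).

```python
def backward_compatible(old_fields, new_fields):
    """True if new_fields keeps every old field and only adds optional ones."""
    kept = all(name in new_fields for name in old_fields)
    added_ok = all(
        mode == "optional"
        for name, mode in new_fields.items() if name not in old_fields
    )
    return kept and added_ok

old = {"user_id": "required", "event": "required"}
ok = backward_compatible(old, {**old, "device": "optional"})     # compatible
bad = backward_compatible(old, {**old, "session": "required"})   # breaking
```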
Best Practices
- Use Parquet with partitioning for query efficiency
- Optimize Spark with predicate pushdown and coalesce/repartition strategies
- Prefer Bookmarks for incremental loads
- Separate raw, staged, and curated S3 zones
- Store scripts in S3 and version with Git
Security & Governance
- Use Lake Formation for fine-grained table and column permissions
- IAM roles for jobs, crawlers, and Catalog access
- KMS encryption for data at rest; TLS for data in transit
- VPC endpoints and private subnets for secure connectivity
- Enable logging to CloudWatch and S3
Hands-On Workflow
- Plan
- Define source and target formats
- Identify partitioning keys
- Choose job type (Spark, Python Shell, Ray)
- Create Catalog
- Create database and run Crawler on S3
- Verify table schema and partitions
- Develop ETL
- Author in Glue Studio or script in S3
- Use DynamicFrame transforms
- Write to curated S3 in Parquet
- Orchestrate
- Create Workflow with crawler → job → quality checks
- Schedule with triggers
- Validate
- Run Data Quality rules
- Query with Athena or load to Redshift
- Monitor
- CloudWatch metrics and logs
- Optimize job parameters and parallelism
Pricing & Limits
- Jobs billed per Data Processing Unit (DPU) hour
- Data Catalog charges per request and storage beyond free tier
- Crawler runtime billed per DPU hour
- Data Quality and Schema Registry incur additional usage-based costs
Optimization: Use Parquet, partitioning, and bookmarks to reduce runtime and cost.
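Since job cost scales with both DPUs and runtime, halving runtime through the optimizations above halves cost directly. The arithmetic is simply DPUs × hours × per-DPU-hour rate; the $0.44 rate below is an assumption for illustration, as rates vary by region and job type (check current AWS pricing).

```python
def glue_job_cost(dpus, hours, rate_per_dpu_hour):
    """Cost = DPUs x runtime hours x per-DPU-hour rate."""
    return dpus * hours * rate_per_dpu_hour

# e.g. 10 DPUs running for 30 minutes at an assumed $0.44/DPU-hour
cost = round(glue_job_cost(10, 0.5, 0.44), 2)
```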
Summary
- Glue ETL Jobs with Python/Spark for scalable transformations
- Glue Data Catalog centralizes metadata for analytics
- Glue Crawlers automate schema discovery
- Glue Workflows orchestrate pipelines
- Glue Studio simplifies visual authoring
- Data Quality and Schema Registry improve reliability
Last Updated: January 2025
