AWS Glue Scenario-Based Questions for AWS Data Engineer Certification
50 scenario-based Q&As covering Glue ETL Jobs, Data Catalog, Crawlers, Workflows, Studio, Streaming ETL, Data Quality, Schema Registry, and governance. Each item states the correct answer, followed by an explanation and the key concepts tested.
1) Nightly ETL Job Cost Optimization
ETL Jobs
Scenario: A nightly Glue Spark job transforms 2 TB of S3 data and writes Parquet outputs. It runs for 2 hours at 1 AM daily. The team wants lowest cost without sacrificing reliability.
Correct Answer: B
Explanation:
- B uses serverless Glue Spark with pay-per-runtime DPUs and supports scheduled triggers and bookmarks for incremental loads.
- A incurs idle costs and is unnecessary.
- C is not suitable for large-scale distributed ETL.
- D moves the workload to EMR, with 24/7 cluster cost risk.
Key Concepts: Glue Spark jobs, DPUs, bookmarks, scheduled triggers
2) Centralized Metadata for Multiple Services
Data Catalog
Scenario: You need a single metadata store for Athena, EMR, and Redshift Spectrum to query the same S3 datasets.
Correct Answer: B
Explanation:
Glue Data Catalog is the centralized Hive-compatible metastore integrated with Athena, EMR, and Spectrum.
Key Concepts: Glue Data Catalog, cross-service metadata, Hive metastore
3) Automating Schema Discovery
Crawlers
Scenario: New datasets arrive daily in S3 under date-partitioned folders. You want schemas and partitions to be maintained automatically.
Correct Answer: B
Explanation:
Crawlers infer schemas and partitions and keep the Data Catalog synced with arriving data.
Key Concepts: Crawlers, partition inference, scheduled catalog updates
4) Orchestrating Multi-Step ETL
Workflows
Scenario: You have a pipeline: crawl raw → job transform → job quality checks → publish. Steps have dependencies and must run in order.
Correct Answer: A
Explanation:
Glue Workflows model job and crawler dependencies through triggers and track run state.
Key Concepts: Workflow DAG, triggers, orchestration
5) Visual Authoring Requirement
Glue Studio
Scenario: Team wants drag-and-drop ETL authoring with monitoring and code generation.
Correct Answer: A
Explanation:
Glue Studio provides visual ETL nodes, job monitoring, and generated code.
Key Concepts: Visual ETL, authoring, monitoring
6) Incremental Loads with Bookmarks
ETL Jobs
Scenario: Your job must process only newly arrived S3 objects each run.
Correct Answer: B
Explanation:
Bookmarks track processed objects and enable incremental reading.
Key Concepts: Bookmarks, incremental ETL, S3 inputs
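Example (a minimal sketch): with bookmarks enabled on the job, the transformation_ctx string names the bookmark state for the source, and job.commit() persists it. The database and table names are hypothetical.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required so bookmark state is tracked

# With bookmarks enabled, only S3 objects not yet processed are read;
# transformation_ctx ties this source to its bookmark entry
events = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",       # hypothetical
    table_name="events",     # hypothetical
    transformation_ctx="events_source",
)

# ... transforms and writes ...

job.commit()  # persists the bookmark so the next run starts after this one
```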
7) Choosing DynamicFrame vs DataFrame
ETL Jobs
Scenario: You need schema-flexible transforms with semi-structured JSON, then advanced Spark operations.
Correct Answer: B
Explanation:
DynamicFrame handles schema drift; convert to DataFrame for Spark SQL and advanced ops.
Key Concepts: DynamicFrame, DataFrame, schema drift
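Example (sketch): read drifting JSON as a DynamicFrame, resolve ambiguous types, drop to a DataFrame for Spark SQL, then convert back for Glue sinks. The path and column names are hypothetical.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# DynamicFrame tolerates schema drift in semi-structured JSON
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/json/"]},  # hypothetical
    format="json",
)

# Resolve ambiguous (choice) types before converting
dyf = dyf.resolveChoice(specs=[("amount", "cast:double")])

# DataFrame for Spark SQL / window functions
dyf.toDF().createOrReplaceTempView("orders")
latest = spark.sql(
    "SELECT * FROM (SELECT *, row_number() OVER "
    "(PARTITION BY customer_id ORDER BY ts DESC) AS rn FROM orders) WHERE rn = 1"
)

# Back to a DynamicFrame for Glue writers
dyf_out = DynamicFrame.fromDF(latest, glue_context, "dyf_out")
```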
8) Output Format for Analytics
Best Practices
Scenario: You want efficient Athena queries and low storage cost for curated zone.
Correct Answer: B
Explanation:
Columnar Parquet plus partitioning reduces scan size, lowering cost and improving query performance.
Key Concepts: Parquet, partitioning, columnar storage
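Example (sketch): a partitioned Parquet write to the curated zone. The database, table, bucket, and partition keys are hypothetical.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="curated_db", table_name="sales_staging"  # hypothetical
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/curated/sales/",     # hypothetical
        "partitionKeys": ["year", "month", "day"],   # enables partition pruning
    },
    format="parquet",  # columnar; Snappy compression is the default
)
```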
9) Reading from Private RDS
Connections
Scenario: You must read data from an RDS instance in a private subnet.
Correct Answer: B
Explanation:
Glue connections configure subnet, security groups, and VPC access to private RDS.
Key Concepts: Glue connections, VPC, private networking
10) Column-Level Access Control
Governance
Scenario: Analysts should not see PII columns in the curated tables.
Correct Answer: C
Explanation:
Lake Formation enforces fine-grained permissions on Data Catalog tables and columns.
Key Concepts: Lake Formation, column permissions, governance
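Example (sketch): a Lake Formation grant that exposes a table while excluding the PII columns. The account ID, role, and column names are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on all columns EXCEPT the PII ones
lf.grant_permissions(
    Principal={"DataLakePrincipalArn": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["ssn", "email"]},
        }
    },
    Permissions=["SELECT"],
)
```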
11) Real-Time Stream Processing
Streaming ETL
Scenario: Ingest IoT telemetry from Kinesis Data Streams and aggregate in near real time.
Correct Answer: B
Explanation:
Glue Spark streaming integrates with Kinesis/Kafka for windowed aggregations and low-latency ETL.
Key Concepts: Structured Streaming, Kinesis source, windowing
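Example (a sketch of a Glue streaming job): read a Kinesis stream into a streaming DataFrame and aggregate per micro-batch; the checkpoint location also makes restarts resume from the last committed batch. The stream ARN, paths, and option values are hypothetical.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Streaming DataFrame over a Kinesis stream
stream_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/iot-telemetry",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)

def process_batch(batch_df, batch_id):
    # Per-micro-batch aggregation, e.g. device-level averages
    agg = batch_df.groupBy("device_id").avg("temperature")
    agg.write.mode("append").parquet("s3://my-bucket/agg/")  # hypothetical

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",  # micro-batch trigger interval
        "checkpointLocation": "s3://my-bucket/checkpoints/iot/",
    },
)
```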
12) Managing Streaming Schemas
Schema Registry
Scenario: Kafka producers evolve message schemas; consumers must remain compatible.
Correct Answer: A
Explanation:
Glue Schema Registry enforces compatibility checks, manages schema versions, and integrates with streaming pipelines.
Key Concepts: Schema Registry, compatibility, evolution
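Example (sketch): registering a new Avro schema version with boto3; if the new version violates the registry's configured compatibility mode (e.g. BACKWARD), registration fails. The registry, schema, and field names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# New version adds an optional field with a default, which is
# backward-compatible for existing consumers
new_schema = """{
  "type": "record", "name": "Telemetry",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "firmware", "type": ["null", "string"], "default": null}
  ]
}"""

resp = glue.register_schema_version(
    SchemaId={"RegistryName": "iot-registry", "SchemaName": "telemetry"},
    SchemaDefinition=new_schema,
)
print(resp["VersionNumber"], resp["Status"])
```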
13) Ensuring Dataset Quality
Data Quality
Scenario: You must validate incoming datasets for nulls, ranges, and referential integrity before loading.
Correct Answer: A
Explanation:
Glue Data Quality provides declarative rules, profiling, and metrics for validation.
Key Concepts: Data Quality, rules, profiling
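Example (sketch): a DQDL ruleset evaluated with the EvaluateDataQuality transform that ships inside Glue jobs; the table, rule thresholds, and context name are hypothetical.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality  # Glue Data Quality library

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"  # hypothetical
)

# DQDL: declarative rules evaluated over the frame
ruleset = """Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "quantity" between 1 and 10000
]"""

results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "orders_check"},
)
results.toDF().show()  # one row per rule with outcome and metrics
```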
14) Right-Sizing DPU Capacity
Performance
Scenario: A job is slow; metrics show high shuffle and skew.
Correct Answer: A
Explanation:
More executors and better partitioning alleviate skew and shuffle bottlenecks.
Key Concepts: DPU sizing, skew, partitioning
15) Designing Lake Zones
Architecture
Scenario: You need a maintainable data lake structure with clear lifecycle.
Correct Answer: B
Explanation:
Zoning supports governance, lineage, and progressive refinement.
Key Concepts: Lake zones, lifecycle, governance
16) Supporting Updates and Deletes
Formats
Scenario: You need ACID semantics for lake tables with upserts.
Correct Answer: B
Explanation:
Iceberg adds table abstraction and ACID operations for lakehouse patterns.
Key Concepts: Iceberg, ACID, lakehouse
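Example (sketch): an upsert with Spark SQL MERGE, assuming the job is configured for Iceberg (e.g. --datalake-formats iceberg plus a glue_catalog Spark catalog); the catalog, table, and key names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `updates` is assumed to be a staged DataFrame registered as a temp view
spark.sql("""
    MERGE INTO glue_catalog.curated_db.customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```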
17) Parallel Feature Engineering
Ray
Scenario: You need Python-native parallelism for ML feature generation, not full Spark.
Correct Answer: A
Explanation:
Glue for Ray provides distributed Python compute for ML workloads.
Key Concepts: Ray, Python parallelism, ML
18) Custom File Format Discovery
Crawlers
Scenario: Files are custom delimited; default inference fails.
Correct Answer: B
Explanation:
Custom classifiers parse nonstandard formats for schema inference.
Key Concepts: Crawler classifiers, schema inference
19) Handling Transient Failures
Reliability
Scenario: Job occasionally fails due to transient network issues.
Correct Answer: B
Explanation:
Glue supports automatic retries on job failure; make writes idempotent so retried runs do not produce duplicate outputs.
Key Concepts: Retry policy, idempotency
20) Reprocessing After Logic Fix
Bookmarks
Scenario: You changed transformation logic and must reprocess past data.
Correct Answer: A
Explanation:
Bookmark reset enables controlled reprocessing of historical inputs.
Key Concepts: Bookmark reset, reprocessing
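Example (sketch): clearing bookmark state with boto3 so the next run re-reads historical input; the job name is hypothetical, and the same action is available via `aws glue reset-job-bookmark`.

```python
import boto3

glue = boto3.client("glue")

# Clears bookmark state; the next run reprocesses from the beginning
glue.reset_job_bookmark(JobName="nightly-transform")  # hypothetical name
```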
21) Tracking Schema Changes
Data Catalog
Scenario: Columns are added over time; downstream queries must adapt.
Correct Answer: A
Explanation:
Catalog versions document schema evolution for governed analytics.
Key Concepts: Catalog versions, evolution
22) Troubleshooting Slow Jobs
Monitoring
Scenario: You need runtime metrics and logs to analyze performance.
Correct Answer: A
Explanation:
Glue integrates with CloudWatch Logs and the Glue Studio monitoring UI for metrics and traces.
Key Concepts: CloudWatch Logs, metrics, Studio
23) Scheduling Pipelines
Triggers
Scenario: Pipeline must run hourly and on-demand on arrival events.
Correct Answer: A
Explanation:
Glue Triggers support time-based and event-based execution tied to workflows.
Key Concepts: Triggers, schedule, events
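Example (sketch): an hourly scheduled trigger attached to a workflow; for arrival events, a trigger of Type="EVENT" started by an EventBridge rule covers the on-demand path. Names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="hourly-kickoff",          # hypothetical
    WorkflowName="sales-pipeline",  # hypothetical
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",   # top of every hour
    Actions=[{"JobName": "transform-sales"}],
    StartOnCreation=True,
)
```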
24) Missing Partitions in Athena
Catalog
Scenario: New S3 partitions aren't visible for querying.
Correct Answer: A
Explanation:
Crawlers or MSCK commands add new partition metadata to the Catalog.
Key Concepts: Partitions, Catalog sync, MSCK
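Example (sketch): running MSCK REPAIR TABLE through Athena with boto3; for a single known partition, ALTER TABLE ... ADD PARTITION is cheaper. The database, table, and output location are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Reconciles Catalog partitions with the prefixes present in S3
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sales",
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```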
25) Handling S3 Eventual Consistency
Reliability
Scenario: Immediately after writes, queries sometimes miss files.
Correct Answer: A
Explanation:
S3 now provides strong read-after-write consistency, but a multi-file job output still becomes visible file by file; manifest- or committer-based writes publish a complete output set atomically.
Key Concepts: S3 consistency, manifests
26) Reducing Read Volume
Performance
Scenario: Your job reads excessive data from S3.
Correct Answer: A
Explanation:
Partition keys and pushdown minimize IO and runtime.
Key Concepts: Pushdown, pruning, IO reduction
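Example (sketch): a pushdown predicate prunes partitions before any data is read, so only matching S3 prefixes are listed and scanned. The database, table, and partition values are hypothetical.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="curated_db",
    table_name="sales",
    push_down_predicate="year = '2024' AND month = '06'",  # pruned at listing time
)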
27) Cross-Account Access to Catalog Tables
Governance
Scenario: Another account needs read-only access to curated tables.
Correct Answer: A
Explanation:
Lake Formation supports cross-account grants for governed sharing.
Key Concepts: Cross-account, Lake Formation sharing
28) No-Code Data Preparation
DataBrew
Scenario: Business users need simple transformations and profiling without code.
Correct Answer: A
Explanation:
DataBrew provides visual, no-code preparation integrated with Glue.
Key Concepts: DataBrew, visual prep
29) Preventing Partial Outputs
Reliability
Scenario: Incomplete files appear when jobs fail mid-write.
Correct Answer: A
Explanation:
Committer pattern ensures atomic publish of successful outputs.
Key Concepts: Output committers, atomic writes
30) Parameterizing Jobs per Environment
ETL Jobs
Scenario: Same script should run in dev/stage/prod with different paths.
Correct Answer: A
Explanation:
Parameterization enables environment-specific configuration without code changes.
Key Concepts: Parameters, config, environments
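Example (sketch): resolving per-environment arguments passed as job parameters; --input_path and --output_path are hypothetical parameter names set differently in dev/stage/prod job definitions.

```python
import sys
from awsglue.utils import getResolvedOptions

# Supplied at run time, e.g. --input_path s3://dev-bucket/raw/
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

source = args["input_path"]   # environment-specific source prefix
target = args["output_path"]  # environment-specific target prefix
```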
31) Encrypting Data at Rest
Security
Scenario: Regulatory requirements mandate encryption of outputs in S3.
Correct Answer: A
Explanation:
KMS-backed default encryption plus job-level encryption meet at-rest controls.
Key Concepts: KMS, encryption, compliance
32) Running Multiple Jobs Concurrently
Capacity
Scenario: Several workflows need to run at the same time.
Correct Answer: A
Explanation:
Plan concurrency against DPU and job run quotas; request limit increases as needed.
Key Concepts: DPU quotas, concurrency
33) Capturing Bad Records
Quality
Scenario: Some records fail parsing; you must isolate them.
Correct Answer: A
Explanation:
Dead-letter handling preserves observability and enables remediation.
Key Concepts: Dead-letter, observability
34) From Visual to Code
Studio
Scenario: You prototype in Studio and need version-controlled scripts.
Correct Answer: A
Explanation:
Studio generates runnable scripts; storing them in version control and running CI/CD enables automation.
Key Concepts: Code generation, versioning
35) Avoiding Crawl/Job Conflicts
Operations
Scenario: Crawler updates tables while job writes outputs; queries break.
Correct Answer: A
Explanation:
Orchestrate the crawl and write order (e.g., crawl only after the job finishes) to avoid metadata inconsistency during writes.
Key Concepts: Orchestration, metadata consistency
36) Optimizing Skewed Joins
Performance
Scenario: Large-to-small table join suffers from skew.
Correct Answer: A
Explanation:
Broadcast joins reduce shuffle; repartition distributes load.
Key Concepts: Broadcast join, repartition
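Example (sketch): broadcasting the small dimension side ships it to every executor and avoids shuffling the large, skewed fact side. Paths and the join key are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.parquet("s3://my-bucket/curated/facts/")     # large, skewed
dims = spark.read.parquet("s3://my-bucket/curated/dim_small/")  # small

# No shuffle of `facts`: each executor receives a full copy of `dims`
joined = facts.join(broadcast(dims), "dim_key")
```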
37) Recovering Streaming State
Streaming ETL
Scenario: Streaming ETL must recover after failure.
Correct Answer: A
Explanation:
Checkpoints persist progress for fault-tolerant streaming.
Key Concepts: Checkpoints, recovery
38) Retrying Failed Pipeline Step
Workflows
Scenario: The transform job failed; rerun only that step.
Correct Answer: A
Explanation:
Workflows track state and allow targeted retries.
Key Concepts: Workflow state, retries
39) Speeding Up Athena Queries
Analytics
Scenario: Athena scans too much data.
Correct Answer: A
Explanation:
Partitioning and columnar formats reduce scanned bytes and cost.
Key Concepts: Partitioning, Parquet, scan reduction
40) Selecting Glue Version
ETL Jobs
Scenario: Library requires a newer Spark/Python runtime.
Correct Answer: A
Explanation:
Runtime versions control Spark/Python compatibility for jobs.
Key Concepts: Glue version, compatibility
41) Renaming Tables Safely
Catalog
Scenario: You must rename a Catalog table with minimal disruption.
Correct Answer: A
Explanation:
The Data Catalog has no in-place rename; create the new table, repoint dependent queries and jobs, then retire the old table for a controlled migration.
Key Concepts: Catalog change management
42) Choosing Compression
Storage
Scenario: Balance performance and storage for Parquet outputs.
Correct Answer: A
Explanation:
Snappy-compressed Parquet remains splittable because compression is applied per block inside the file, and it balances speed and size well for analytics.
Key Concepts: Compression, Parquet
43) Prevent Unauthorized Table Access
Security
Scenario: A user accessed a restricted table.
Correct Answer: A
Explanation:
Lake Formation centralizes governed access; IAM should not bypass table-level controls.
Key Concepts: Governance, IAM, Lake Formation
44) Finalizing Job Successfully
ETL Jobs
Scenario: Ensure job status reflects success for Workflow chaining.
Correct Answer: A
Explanation:
job.commit() finalizes the run and enables dependent triggers.
Key Concepts: Job lifecycle, commit
45) Syncing Large Partition Trees
Operations
Scenario: Millions of partitions need periodic reconciliation.
Correct Answer: A
Explanation:
S3 Inventory delivers scheduled object listings, scaling partition reconciliation where repeated LIST calls over millions of prefixes are impractical.
Key Concepts: S3 Inventory, partition reconciliation
46) Private Access to S3
Networking
Scenario: Glue jobs must access S3 without traversing public internet.
Correct Answer: A
Explanation:
Gateway endpoints enable private S3 access from VPC subnets.
Key Concepts: VPC endpoints, private S3
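Example (sketch): creating an S3 gateway endpoint so VPC subnets reach S3 without traversing the public internet; the VPC and route table IDs are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Adds S3 routes to the given route tables via AWS's private network
ec2.create_vpc_endpoint(
    VpcId="vpc-0abc1234",                         # hypothetical
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0def5678"],               # hypothetical
)
```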
47) Enforcing Referential Integrity
Data Quality
Scenario: Fact dataset must reference valid dimension keys.
Correct Answer: A
Explanation:
Data Quality rules validate keys against the reference dataset and produce metrics and violation records.
Key Concepts: Referential checks, quality rules
48) Unpredictable Workload Capacity
Scaling
Scenario: Workload traffic fluctuates significantly.
Correct Answer: A
Explanation:
On-demand capacity adjusts resources to variable demand.
Key Concepts: On-demand capacity, autoscaling
49) Avoiding Duplicate Streaming Outputs
Streaming ETL
Scenario: Failures cause duplicate writes in streaming ETL.
Correct Answer: A
Explanation:
Idempotent writes and transactional sinks mitigate duplicates.
Key Concepts: Idempotency, transactional sinks
50) Auditing Historical Table State
Lakehouse
Scenario: You need to audit a table as-of last week.
Correct Answer: A
Explanation:
Iceberg maintains snapshots enabling consistent historical reads.
Key Concepts: Time travel, snapshots
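Example (sketch): Iceberg time travel with Spark SQL, assuming an Iceberg-enabled Glue job (Spark 3.3+); the catalog, table, and timestamp are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reads the table as of the given point in time from its snapshot history
last_week = spark.sql("""
    SELECT * FROM glue_catalog.curated_db.customers
    TIMESTAMP AS OF '2024-06-01 00:00:00'
""")
last_week.show()
```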

