AWS Glue Scenario-Based Questions


AWS Glue Scenario-Based Questions for AWS Data Engineer Certification

50 scenario-based Q&As covering Glue ETL Jobs, Data Catalog, Crawlers, Workflows, Studio, Streaming ETL, Data Quality, Schema Registry, and governance. Each question lists the answer choices, followed by the correct answer and a short explanation.

1) Nightly ETL Job Cost Optimization (ETL Jobs)
Scenario: A nightly Glue Spark job transforms 2 TB of S3 data and writes Parquet outputs. It runs for 2 hours at 1 AM daily. The team wants lowest cost without sacrificing reliability.
Choose the BEST approach:
A. Run a continuous Glue job with long timeout
B. Schedule a Glue Spark job with appropriate DPUs and job bookmarks
C. Use a Python shell job looping all night
D. Run EMR long-running cluster and submit steps
Correct Answer: B
Explanation:
- B uses serverless Glue Spark with pay-per-runtime DPUs, supports scheduling via triggers, and enables incremental loads with job bookmarks.
- A incurs idle costs and is unnecessary.
- C is not suitable for large-scale distributed ETL.
- D shifts the workload to EMR and risks 24/7 cluster costs.
Key Concepts: Glue Spark jobs, DPUs, bookmarks, scheduled triggers
2) Centralized Metadata for Multiple Services (Data Catalog)
Scenario: You need a single metadata store for Athena, EMR, and Redshift Spectrum to query the same S3 datasets.
What should you use?
A. DynamoDB
B. Glue Data Catalog
C. S3 object tags
D. RDS PostgreSQL
Correct Answer: B
Explanation:
Glue Data Catalog is the centralized Hive-compatible metastore integrated with Athena, EMR, and Spectrum.
Key Concepts: Glue Data Catalog, cross-service metadata, Hive metastore
3) Automating Schema Discovery (Crawlers)
Scenario: New datasets arrive daily in S3 under date-partitioned folders. You want schemas and partitions to be maintained automatically.
Best solution?
A. Manual CREATE TABLE statements for each date
B. Glue Crawler scheduled to update the Catalog
C. Lambda to write Hive DDL
D. Upload CSV headers to S3 object tags
Correct Answer: B
Explanation:
Crawlers infer schema, partitions, and keep the Data Catalog synced with arriving data.
Key Concepts: Crawlers, partition inference, scheduled catalog updates
4) Orchestrating Multi-Step ETL (Workflows)
Scenario: You have a pipeline: crawl raw data → transform job → quality-check job → publish. Steps have dependencies and must run in order.
Choose the service:
A. Glue Workflows
B. CloudWatch Events only
C. EC2 cron
D. S3 event notifications
Correct Answer: A
Explanation:
Glue Workflows model job/crawler dependencies, triggers, and track run state.
Key Concepts: Workflow DAG, triggers, orchestration
5) Visual Authoring Requirement (Glue Studio)
Scenario: Team wants drag-and-drop ETL authoring with monitoring and code generation.
Use:
A. Glue Studio
B. EMR Notebooks
C. Cloud9 IDE
D. Lambda console editor
Correct Answer: A
Explanation:
Glue Studio provides visual ETL nodes, job monitoring, and generated code.
Key Concepts: Visual ETL, authoring, monitoring
6) Incremental Loads with Bookmarks (ETL Jobs)
Scenario: Your job must process only newly arrived S3 objects each run.
What feature helps?
A. Triggers
B. Job bookmarks
C. DataBrew profiles
D. Schema Registry
Correct Answer: B
Explanation:
Bookmarks track processed objects and enable incremental reading.
Key Concepts: Bookmarks, incremental ETL, S3 inputs
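Example (illustrative sketch): a minimal Glue PySpark read that carries a transformation_ctx, which is what bookmarks use to track already-processed S3 objects; the job must also be created or started with --job-bookmark-option job-bookmark-enable. Database, table, and context names below are hypothetical.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx identifies this source in the bookmark state,
    # so only objects not seen by a previous successful run are read
    orders = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
        transformation_ctx="orders_src",
    )

    # ... transform and write ...

    job.commit()  # a successful commit advances the bookmark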
7) Choosing DynamicFrame vs DataFrame (ETL Jobs)
Scenario: You need schema-flexible transforms with semi-structured JSON, then advanced Spark operations.
Best pattern?
A. Use DataFrame only
B. Use DynamicFrame transforms then convert to DataFrame
C. Use RDDs only
D. Use Python lists
Correct Answer: B
Explanation:
DynamicFrame handles schema drift; convert to DataFrame for Spark SQL and advanced ops.
Key Concepts: DynamicFrame, DataFrame, schema drift
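Example (illustrative sketch, assuming the standard Glue job preamble that creates glueContext): read drifting JSON as a DynamicFrame, resolve an ambiguous type, then convert with toDF() for Spark SQL and back with fromDF(). Paths and column names are hypothetical.
    from awsglue.dynamicframe import DynamicFrame

    # DynamicFrame tolerates schema drift across incoming files
    events = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/raw/events/"]},
        format="json",
    )
    events = events.resolveChoice(specs=[("amount", "cast:double")])

    # Switch to a DataFrame for Spark SQL / advanced operations, then back
    df = events.toDF()
    df.createOrReplaceTempView("events")
    daily = glueContext.spark_session.sql(
        "SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id"
    )
    result = DynamicFrame.fromDF(daily, glueContext, "result")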
8) Output Format for Analytics (Best Practices)
Scenario: You want efficient Athena queries and low storage cost for curated zone.
Choose:
A. JSON without partitioning
B. Parquet with partitioning
C. CSV with gzip
D. Avro without partitioning
Correct Answer: B
Explanation:
Columnar Parquet plus partitioning reduces scan size and cost and improves performance.
Key Concepts: Parquet, partitioning, columnar storage
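Example (illustrative sketch, assuming glueContext and a curated DynamicFrame already exist): write partitioned, Snappy-compressed Parquet so Athena can prune partitions. The bucket path and partition keys are hypothetical.
    glueContext.write_dynamic_frame.from_options(
        frame=curated,
        connection_type="s3",
        connection_options={
            "path": "s3://example-bucket/curated/sales/",
            "partitionKeys": ["year", "month", "day"],
        },
        format="parquet",
        format_options={"compression": "snappy"},
    )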
9) Reading from Private RDS (Connections)
Scenario: You must read data from an RDS instance in a private subnet.
What is required?
A. Public RDS endpoint
B. Glue connection with VPC configuration and security groups
C. S3 VPC endpoint
D. NAT Gateway only
Correct Answer: B
Explanation:
Glue connections configure subnet, security groups, and VPC access to private RDS.
Key Concepts: Glue connections, VPC, private networking
10) Column-Level Access Control (Governance)
Scenario: Analysts should not see PII columns in the curated tables.
Use:
A. S3 bucket ACLs only
B. Glue job IAM policy
C. Lake Formation grants with column-level permissions
D. CloudWatch Logs
Correct Answer: C
Explanation:
Lake Formation enforces fine-grained permissions on Data Catalog tables and columns.
Key Concepts: Lake Formation, column permissions, governance
11) Real-Time Stream Processing (Streaming ETL)
Scenario: Ingest IoT telemetry from Kinesis Data Streams and aggregate in near real time.
Choose:
A. Glue Python Shell job
B. Glue Spark Streaming ETL job with structured streaming
C. EMR batch job
D. Redshift COPY
Correct Answer: B
Explanation:
Glue Spark streaming integrates with Kinesis/Kafka for windowed aggregations and low-latency ETL.
Key Concepts: Structured Streaming, Kinesis source, windowing
12) Managing Streaming Schemas (Schema Registry)
Scenario: Kafka producers evolve message schemas; consumers must remain compatible.
Use:
A. Glue Schema Registry with compatibility rules
B. Hardcode JSON versions in code
C. Save schema in S3 README files
D. Use DynamoDB only
Correct Answer: A
Explanation:
Glue Schema Registry enforces compatibility, versioning, and integrates with streaming pipelines.
Key Concepts: Schema Registry, compatibility, evolution
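Example (illustrative sketch using boto3; registry, schema, and field names are hypothetical): create a schema with BACKWARD compatibility so incompatible producer changes are rejected at registration time.
    import boto3

    glue = boto3.client("glue")

    glue.create_schema(
        RegistryId={"RegistryName": "telemetry-registry"},
        SchemaName="device-events",
        DataFormat="AVRO",
        Compatibility="BACKWARD",
        SchemaDefinition='{"type":"record","name":"DeviceEvent","fields":['
                         '{"name":"device_id","type":"string"}]}',
    )

    # A compatible evolution adds an optional field; an incompatible change would be rejected
    glue.register_schema_version(
        SchemaId={"RegistryName": "telemetry-registry", "SchemaName": "device-events"},
        SchemaDefinition='{"type":"record","name":"DeviceEvent","fields":['
                         '{"name":"device_id","type":"string"},'
                         '{"name":"temperature","type":["null","double"],"default":null}]}',
    )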
13) Ensuring Dataset Quality (Data Quality)
Scenario: You must validate incoming datasets for nulls, ranges, and referential integrity before loading.
Choose:
A. Glue Data Quality rules and profiles
B. CloudWatch alarms only
C. EMR Ganglia
D. S3 event notifications
Correct Answer: A
Explanation:
Glue Data Quality provides declarative rules, profiling, and metrics for validation.
Key Concepts: Data Quality, rules, profiling
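Example (illustrative sketch using boto3 and DQDL; database, table, and column names are hypothetical): a ruleset with completeness, uniqueness, and range checks attached to a Catalog table.
    import boto3

    glue = boto3.client("glue")

    # DQDL rules evaluated against the target table
    ruleset = """
    Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "quantity" > 0,
        ColumnValues "status" in ["NEW", "SHIPPED", "CANCELLED"]
    ]
    """

    glue.create_data_quality_ruleset(
        Name="orders-quality-checks",
        Ruleset=ruleset,
        TargetTable={"DatabaseName": "sales_db", "TableName": "raw_orders"},
    )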
14) Right-Sizing DPU Capacity (Performance)
Scenario: A job is slow; metrics show high shuffle and skew.
Best actions?
A. Increase DPUs and optimize partitioning
B. Decrease DPUs
C. Switch to Python shell
D. Disable Spark shuffle
Correct Answer: A
Explanation:
More executors and better partitioning alleviate skew and shuffle bottlenecks.
Key Concepts: DPU sizing, skew, partitioning
15) Designing Lake Zones (Architecture)
Scenario: You need a maintainable data lake structure with clear lifecycle.
Best approach?
A. Single S3 bucket for all data
B. Separate raw, staged, curated zones
C. Use EFS for lake storage
D. Store data in Lambda environment
Correct Answer: B
Explanation:
Zoning supports governance, lineage, and progressive refinement.
Key Concepts: Lake zones, lifecycle, governance
16) Supporting Updates and Deletes (Formats)
Scenario: You need ACID semantics for lake tables with upserts.
Choose:
A. Plain Parquet files only
B. Iceberg tables in Glue/Athena
C. CSV files
D. JSON objects
Correct Answer: B
Explanation:
Iceberg adds table abstraction and ACID operations for lakehouse patterns.
Key Concepts: Iceberg, ACID, lakehouse
17) Parallel Feature Engineering (Ray)
Scenario: You need Python-native parallelism for ML feature generation, not full Spark.
Use:
A. Glue for Ray
B. Python shell
C. EMR Hive
D. Lambda
Correct Answer: A
Explanation:
Glue for Ray provides distributed Python compute for ML workloads.
Key Concepts: Ray, Python parallelism, ML
18) Custom File Format Discovery (Crawlers)
Scenario: Files are custom delimited; default inference fails.
Fix:
A. Write manual DDL
B. Add custom classifier to the Crawler
C. Rename files
D. Use zip archives
Correct Answer: B
Explanation:
Custom classifiers parse nonstandard formats for schema inference.
Key Concepts: Crawler classifiers, schema inference
19) Handling Transient Failures (Reliability)
Scenario: Job occasionally fails due to transient network issues.
Best action?
A. Disable retries
B. Configure job retry and backoff; idempotent writes to target
C. Switch to EMR
D. Ignore errors
Correct Answer: B
Explanation:
Glue supports retry; ensure idempotency to avoid duplicate outputs.
Key Concepts: Retry policy, idempotency
20) Reprocessing After Logic Fix (Bookmarks)
Scenario: You changed transformation logic and must reprocess past data.
How?
A. Reset job bookmarks and rerun for the range
B. Delete outputs only
C. Update IAM
D. Wait for new data
Correct Answer: A
Explanation:
Bookmark reset enables controlled reprocessing of historical inputs.
Key Concepts: Bookmark reset, reprocessing
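Example (illustrative sketch using boto3; the job name is hypothetical): reset the bookmark, then rerun so previously processed input is read again.
    import boto3

    glue = boto3.client("glue")

    glue.reset_job_bookmark(JobName="nightly-orders-transform")  # clear processed-object state
    glue.start_job_run(JobName="nightly-orders-transform")       # rerun reprocesses historical input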
21) Tracking Schema Changes (Data Catalog)
Scenario: Columns are added over time; downstream queries must adapt.
Use:
A. Glue Data Catalog schema versioning
B. Rename S3 prefixes
C. CSV always
D. Hardcode column order
Correct Answer: A
Explanation:
Catalog versions document schema evolution for governed analytics.
Key Concepts: Catalog versions, evolution
22) Troubleshooting Slow Jobs (Monitoring)
Scenario: You need runtime metrics and logs to analyze performance.
Use:
A. Glue Studio job run details and CloudWatch Logs
B. S3 object tags
C. IAM policy simulator
D. SNS email only
Correct Answer: A
Explanation:
Glue integrates with CloudWatch Logs and Studio UI for metrics and traces.
Key Concepts: CloudWatch Logs, metrics, Studio
23) Scheduling Pipelines (Triggers)
Scenario: Pipeline must run hourly and on-demand on arrival events.
Implement:
A. Glue Triggers (schedule and event-based)
B. EC2 cron only
C. Manual clicks
D. Redshift schedule
Correct Answer: A
Explanation:
Glue Triggers support time-based and event-based execution tied to workflows.
Key Concepts: Triggers, schedule, events
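Example (illustrative sketch using boto3; workflow, trigger, and job names are hypothetical): an hourly scheduled trigger attached to a workflow, plus an explicit workflow run for on-demand or event-driven starts.
    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="hourly-ingest-trigger",
        WorkflowName="orders-pipeline",
        Type="SCHEDULED",
        Schedule="cron(0 * * * ? *)",      # top of every hour (UTC)
        Actions=[{"JobName": "ingest-raw-orders"}],
        StartOnCreation=True,
    )

    # For arrival events, an EventBridge rule (or this explicit call) starts the workflow on demand
    glue.start_workflow_run(Name="orders-pipeline")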
24) Missing Partitions in Athena (Catalog)
Scenario: New S3 partitions aren't visible for querying.
Fix:
A. Run Crawler or MSCK REPAIR TABLE
B. Re-upload files
C. Change IAM role
D. Use CSV instead
Correct Answer: A
Explanation:
Crawlers or MSCK commands add new partition metadata to the Catalog.
Key Concepts: Partitions, Catalog sync, MSCK
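Example (illustrative sketch using boto3 and Athena; database, table, and results bucket are hypothetical): run MSCK REPAIR TABLE so newly written S3 prefixes are registered as partitions.
    import boto3

    athena = boto3.client("athena")

    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE web_logs",
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )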
25) Handling S3 Eventual Consistency (Reliability)
Scenario: Immediately after writes, queries sometimes miss files.
Approach:
A. Add appropriate delays and use manifest writes
B. Assume immediate consistency
C. Disable partitioning
D. Use local disks
Correct Answer: A
Explanation:
Although S3 now provides strong read-after-write consistency, multi-file job output can still appear incomplete while a write is in progress; manifest/committer-style writes publish results atomically.
Key Concepts: S3 consistency, manifests
26) Reducing Read Volume (Performance)
Scenario: Your job reads excessive data from S3.
Fix:
A. Use predicate pushdown and partition pruning
B. Read entire dataset then filter
C. Convert to CSV
D. Disable compression
Correct Answer: A
Explanation:
Partition keys and pushdown minimize IO and runtime.
Key Concepts: Pushdown, pruning, IO reduction
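Example (illustrative sketch; database, table, and partition values are hypothetical): pass a push_down_predicate so only matching partitions are listed and read from S3.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    recent = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="web_logs",
        push_down_predicate="year = '2024' AND month = '06'",  # partitions pruned before reading
    )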
27) Cross-Account Access to Catalog Tables (Governance)
Scenario: Another account needs read-only access to curated tables.
Approach:
A. Share via Lake Formation resource sharing
B. Copy data to their S3
C. Email CSVs
D. Public S3 bucket
Correct Answer: A
Explanation:
Lake Formation supports cross-account grants for governed sharing.
Key Concepts: Cross-account, Lake Formation sharing
28) No-Code Data Preparation (DataBrew)
Scenario: Business users need simple transformations and profiling without code.
Use:
A. AWS Glue DataBrew
B. Glue Spark jobs
C. EMR cluster
D. Lambda
Correct Answer: A
Explanation:
DataBrew provides visual, no-code preparation integrated with Glue.
Key Concepts: DataBrew, visual prep
29) Preventing Partial Outputs (Reliability)
Scenario: Incomplete files appear when jobs fail mid-write.
Best solution:
A. Use S3 committer/manifest-based writes
B. Write uncommitted files directly
C. Disable retries
D. Store on local disk only
Correct Answer: A
Explanation:
Committer pattern ensures atomic publish of successful outputs.
Key Concepts: Output committers, atomic writes
30) Parameterizing Jobs per Environment (ETL Jobs)
Scenario: Same script should run in dev/stage/prod with different paths.
Implement:
A. Job parameters and environment variables
B. Copy-paste scripts per env
C. Hardcode values
D. Rename buckets
Correct Answer: A
Explanation:
Parameterization enables environment-specific configuration without code changes.
Key Concepts: Parameters, config, environments
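Example (illustrative sketch; parameter names are hypothetical): resolve --SOURCE_PATH, --TARGET_PATH, and --ENV job parameters so the same script runs unchanged across environments.
    import sys
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_PATH", "TARGET_PATH", "ENV"])

    source_path = args["SOURCE_PATH"]   # e.g. s3://example-dev-bucket/raw/
    target_path = args["TARGET_PATH"]   # e.g. s3://example-dev-bucket/curated/
    print(f"Environment {args['ENV']}: {source_path} -> {target_path}")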
31) Encrypting Data at Rest (Security)
Scenario: Regulatory requirements mandate encryption of outputs in S3.
Do:
A. Use S3 bucket default encryption with KMS and Glue job encryption
B. Rely on object ACLs
C. Use plain text files
D. Store in public bucket
Correct Answer: A
Explanation:
KMS-backed default encryption plus job-level encryption meet at-rest controls.
Key Concepts: KMS, encryption, compliance
32) Running Multiple Jobs Concurrently (Capacity)
Scenario: Several workflows need to run at the same time.
Consider:
A. Account DPU limits and concurrent run quotas
B. Ignore limits
C. Single-thread everything
D. Use EC2 instead
Correct Answer: A
Explanation:
Plan concurrency against DPU and job run quotas; request limit increases as needed.
Key Concepts: DPU quotas, concurrency
33) Capturing Bad Records (Quality)
Scenario: Some records fail parsing; you must isolate them.
Pattern:
A. Write invalid rows to a dead-letter S3 prefix
B. Drop silently
C. Stop the pipeline
D. Email CSV
Correct Answer: A
Explanation:
Dead-letter handling preserves observability and enables remediation.
Key Concepts: Dead-letter, observability
34) From Visual to Code (Studio)
Scenario: You prototype in Studio and need version-controlled scripts.
Action:
A. Export the generated code and store it in Git and S3
B. Screenshot the UI
C. Rebuild manually
D. Use CSVs
Correct Answer: A
Explanation:
Studio generates runnable scripts; storing them in version control and deploying through CI/CD enables automation.
Key Concepts: Code generation, versioning
35) Avoiding Crawl/Job Conflicts (Operations)
Scenario: Crawler updates tables while job writes outputs; queries break.
Fix:
A. Sequence with Workflow dependencies
B. Run both anytime
C. Disable partitions
D. Use public bucket
Correct Answer: A
Explanation:
Orchestrate order to avoid metadata inconsistency during writes.
Key Concepts: Orchestration, metadata consistency
36) Optimizing Skewed Joins (Performance)
Scenario: Large-to-small table join suffers from skew.
Approach:
A. Broadcast small table and repartition large
B. Full shuffle both
C. Use RDD map-only
D. Convert to JSON
Correct Answer: A
Explanation:
Broadcast joins reduce shuffle; repartition distributes load.
Key Concepts: Broadcast join, repartition
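Example (illustrative PySpark sketch; paths and the join key are hypothetical): broadcast the small dimension table and repartition the large fact table on the join key.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    facts = spark.read.parquet("s3://example-bucket/curated/fact_sales/")
    dims = spark.read.parquet("s3://example-bucket/curated/dim_product/")

    # Broadcasting avoids shuffling the small side; repartitioning spreads the large side evenly
    joined = facts.repartition(200, "product_id").join(broadcast(dims), "product_id")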
37) Recovering Streaming State (Streaming ETL)
Scenario: Streaming ETL must recover after failure.
Use:
A. Structured streaming checkpoints in durable storage
B. No checkpoints
C. Local temp dir
D. Email alerts only
Correct Answer: A
Explanation:
Checkpoints persist progress for fault-tolerant streaming.
Key Concepts: Checkpoints, recovery
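Example (illustrative sketch, assuming a streaming DataFrame events_df built upstream; paths are hypothetical): write with a durable S3 checkpoint location so a restarted query resumes from its recorded offsets.
    query = (
        events_df.writeStream
        .format("parquet")
        .option("path", "s3://example-bucket/curated/telemetry/")
        .option("checkpointLocation", "s3://example-bucket/checkpoints/telemetry-job/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()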
38) Retrying Failed Pipeline Step (Workflows)
Scenario: The transform job failed; rerun only that step.
Do:
A. Rerun failed node in Workflow respecting dependencies
B. Rerun entire pipeline always
C. Delete outputs
D. Ignore failures
Correct Answer: A
Explanation:
Workflows track state and allow targeted retries.
Key Concepts: Workflow state, retries
39) Speeding Up Athena Queries (Analytics)
Scenario: Athena scans too much data.
Best practices:
A. Partition by common predicates; use Parquet
B. Use CSV and no partitions
C. Read raw zone only
D. Disable compression
Correct Answer: A
Explanation:
Partitioning and columnar formats reduce scanned bytes and cost.
Key Concepts: Partitioning, Parquet, scan reduction
40) Selecting Glue Version (ETL Jobs)
Scenario: Library requires a newer Spark/Python runtime.
Action:
A. Choose appropriate Glue version runtime supporting the library
B. Use any default
C. Switch to CSV
D. Disable libraries
Correct Answer: A
Explanation:
Runtime versions control Spark/Python compatibility for jobs.
Key Concepts: Glue version, compatibility
41) Renaming Tables Safely (Catalog)
Scenario: You must rename a Catalog table with minimal disruption.
Approach:
A. Create new table and update consumers; deprecate old
B. Delete old without notice
C. Change S3 path only
D. Use random names
Correct Answer: A
Explanation:
Controlled migration keeps dependent queries stable.
Key Concepts: Catalog change management
42) Choosing Compression (Storage)
Scenario: Balance performance and storage for Parquet outputs.
Prefer:
A. Snappy for Parquet
B. Gzip for Parquet
C. No compression
D. Zip archives
Correct Answer: A
Explanation:
Parquet with Snappy compression remains splittable (compression is applied per column chunk) and balances speed and storage for analytics.
Key Concepts: Compression, Parquet
43) Prevent Unauthorized Table Access (Security)
Scenario: A user accessed a restricted table.
Fix:
A. Enforce Lake Formation grants and remove broad IAM permissions
B. Public bucket
C. CSV email
D. Remove all users
Correct Answer: A
Explanation:
Lake Formation centralizes governed access; IAM should not bypass table-level controls.
Key Concepts: Governance, IAM, Lake Formation
44) Finalizing Job Successfully (ETL Jobs)
Scenario: Ensure job status reflects success for Workflow chaining.
Do:
A. Call job.commit() after successful writes
B. Skip commit
C. Throw generic exceptions
D. Loop forever
Correct Answer: A
Explanation:
job.commit() finalizes the run and enables dependent triggers.
Key Concepts: Job lifecycle, commit
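Example (illustrative skeleton of a Glue job script): init at the start, commit at the end; a run that reaches job.commit() is marked successful, advancing bookmarks and firing dependent triggers.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # ... reads, transforms, and writes ...

    job.commit()  # finalize the run so downstream workflow triggers can fire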
45) Syncing Large Partition Trees (Operations)
Scenario: Millions of partitions need periodic reconciliation.
Approach:
A. Use S3 Inventory + batch update to Catalog
B. Manual listing every run
C. Disable partitions
D. Store on EBS
Correct Answer: A
Explanation:
Inventory scales discovery for very large catalogs.
Key Concepts: S3 Inventory, partition reconciliation
46) Private Access to S3 (Networking)
Scenario: Glue jobs must access S3 without traversing public internet.
Use:
A. S3 Gateway VPC endpoint
B. Public internet only
C. NAT always
D. Local disks
Correct Answer: A
Explanation:
Gateway endpoints enable private S3 access from VPC subnets.
Key Concepts: VPC endpoints, private S3
47) Enforcing Referential Integrity (Data Quality)
Scenario: Fact dataset must reference valid dimension keys.
Implement:
A. Data Quality rules with join checks
B. Ignore mismatches
C. Drop dimensions
D. Email admins
Correct Answer: A
Explanation:
Rules validate keys and produce metrics and violations.
Key Concepts: Referential checks, quality rules
48) Unpredictable Workload Capacity (Scaling)
Scenario: Workload traffic fluctuates significantly.
Choose:
A. On-demand job capacity mode
B. Fixed DPUs only
C. CSV output
D. No scaling
Correct Answer: A
Explanation:
On-demand capacity adjusts resources to variable demand.
Key Concepts: On-demand capacity, autoscaling
49) Avoiding Duplicate Streaming Outputs (Streaming ETL)
Scenario: Failures cause duplicate writes in streaming ETL.
Approach:
A. Idempotent sinks and transactional writers where available
B. Blind append
C. Disable retries
D. Local files
Correct Answer: A
Explanation:
Idempotent writes and transactional sinks mitigate duplicates.
Key Concepts: Idempotency, transactional sinks
50) Auditing Historical Table State (Lakehouse)
Scenario: You need to audit a table as-of last week.
Use:
A. Iceberg time travel queries
B. Rewrite all data
C. CSV snapshots
D. Manually filter dates
Correct Answer: A
Explanation:
Iceberg maintains snapshots enabling consistent historical reads.
Key Concepts: Time travel, snapshots
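Example (illustrative Spark SQL sketch, assuming the session is configured with an Iceberg Glue catalog named glue_catalog; database, table, and timestamp are hypothetical): query the table as of a past point in time. Athena engine v3 offers similar FOR TIMESTAMP AS OF syntax.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # Iceberg catalog settings assumed configured on the job

    last_week = spark.sql("""
        SELECT *
        FROM glue_catalog.sales_db.orders
        TIMESTAMP AS OF '2024-06-01 00:00:00'
    """)
    last_week.show()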