AWS Glue Scenario-Based Questions


AWS Glue Scenario-Based Questions for AWS Data Engineer Certification

50 scenario-based Q&As covering Glue ETL Jobs, Data Catalog, Crawlers, Workflows, Studio, Streaming ETL, Data Quality, Schema Registry, and governance. Each question lists the answer choices, followed by the correct answer and a short explanation.

1) Nightly ETL Job Cost Optimization (ETL Jobs)
Scenario: A nightly Glue Spark job transforms 2 TB of S3 data and writes Parquet outputs. It runs for 2 hours at 1 AM daily. The team wants lowest cost without sacrificing reliability.
Choose the BEST approach:
A. Run a continuous Glue job with long timeout
B. Schedule a Glue Spark job with appropriate DPUs and job bookmarks
C. Use a Python shell job looping all night
D. Run EMR long-running cluster and submit steps
Correct Answer: B
Explanation:
- B uses serverless Glue Spark with pay-per-runtime DPUs, supports scheduling via triggers, and enables incremental loads with job bookmarks.
- A incurs idle costs and is unnecessary.
- C is not suitable for large-scale distributed ETL.
- D shifts the workload to EMR and risks 24/7 cluster costs.
Key Concepts: Glue Spark jobs, DPUs, bookmarks, scheduled triggers
2) Centralized Metadata for Multiple Services (Data Catalog)
Scenario: You need a single metadata store for Athena, EMR, and Redshift Spectrum to query the same S3 datasets.
What should you use?
A. DynamoDB
B. Glue Data Catalog
C. S3 object tags
D. RDS PostgreSQL
Correct Answer: B
Explanation:
Glue Data Catalog is the centralized Hive-compatible metastore integrated with Athena, EMR, and Spectrum.
Key Concepts: Glue Data Catalog, cross-service metadata, Hive metastore
3) Automating Schema Discovery (Crawlers)
Scenario: New datasets arrive daily in S3 under date-partitioned folders. You want schemas and partitions to be maintained automatically.
Best solution?
A. Manual CREATE TABLE statements for each date
B. Glue Crawler scheduled to update the Catalog
C. Lambda to write Hive DDL
D. Upload CSV headers to S3 object tags
Correct Answer: B
Explanation:
Crawlers infer schema, partitions, and keep the Data Catalog synced with arriving data.
Key Concepts: Crawlers, partition inference, scheduled catalog updates
4) Orchestrating Multi-Step ETL (Workflows)
Scenario: You have a pipeline: crawl raw data → transform job → quality-check job → publish. Steps have dependencies and must run in order.
Choose the service:
A. Glue Workflows
B. CloudWatch Events only
C. EC2 cron
D. S3 event notifications
Correct Answer: A
Explanation:
Glue Workflows model job/crawler dependencies, triggers, and track run state.
Key Concepts: Workflow DAG, triggers, orchestration
5) Visual Authoring Requirement (Glue Studio)
Scenario: Team wants drag-and-drop ETL authoring with monitoring and code generation.
Use:
A. Glue Studio
B. EMR Notebooks
C. Cloud9 IDE
D. Lambda console editor
Correct Answer: A
Explanation:
Glue Studio provides visual ETL nodes, job monitoring, and generated code.
Key Concepts: Visual ETL, authoring, monitoring
6) Incremental Loads with Bookmarks (ETL Jobs)
Scenario: Your job must process only newly arrived S3 objects each run.
What feature helps?
A. Triggers
B. Job bookmarks
C. DataBrew profiles
D. Schema Registry
Correct Answer: B
Explanation:
Bookmarks track processed objects and enable incremental reading.
Key Concepts: Bookmarks, incremental ETL, S3 inputs
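Example (illustrative sketch): a minimal Glue PySpark read that carries a transformation_ctx, which is what bookmarks use to track already-processed S3 objects; the job must also be created or started with --job-bookmark-option job-bookmark-enable. Database, table, and context names below are hypothetical.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx identifies this source in the bookmark state,
    # so only objects not seen by a previous successful run are read
    orders = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
        transformation_ctx="orders_src",
    )

    # ... transform and write ...

    job.commit()  # a successful commit advances the bookmark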
7) Choosing DynamicFrame vs DataFrame (ETL Jobs)
Scenario: You need schema-flexible transforms with semi-structured JSON, then advanced Spark operations.
Best pattern?
A. Use DataFrame only
B. Use DynamicFrame transforms then convert to DataFrame
C. Use RDDs only
D. Use Python lists
Correct Answer: B
Explanation:
DynamicFrame handles schema drift; convert to DataFrame for Spark SQL and advanced ops.
Key Concepts: DynamicFrame, DataFrame, schema drift
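Example (illustrative sketch, assuming the standard Glue job preamble that creates glueContext): read drifting JSON as a DynamicFrame, resolve an ambiguous type, then convert with toDF() for Spark SQL and back with fromDF(). Paths and column names are hypothetical.
    from awsglue.dynamicframe import DynamicFrame

    # DynamicFrame tolerates schema drift across incoming files
    events = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/raw/events/"]},
        format="json",
    )
    events = events.resolveChoice(specs=[("amount", "cast:double")])

    # Switch to a DataFrame for Spark SQL / advanced operations, then back
    df = events.toDF()
    df.createOrReplaceTempView("events")
    daily = glueContext.spark_session.sql(
        "SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id"
    )
    result = DynamicFrame.fromDF(daily, glueContext, "result")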
8) Output Format for Analytics (Best Practices)
Scenario: You want efficient Athena queries and low storage cost for curated zone.
Choose:
A. JSON without partitioning
B. Parquet with partitioning
C. CSV with gzip
D. Avro without partitioning
Correct Answer: B
Explanation:
Columnar Parquet plus partitioning reduces scan size and cost and improves performance.
Key Concepts: Parquet, partitioning, columnar storage
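Example (illustrative sketch, assuming glueContext and a curated DynamicFrame already exist): write partitioned, Snappy-compressed Parquet so Athena can prune partitions. The bucket path and partition keys are hypothetical.
    glueContext.write_dynamic_frame.from_options(
        frame=curated,
        connection_type="s3",
        connection_options={
            "path": "s3://example-bucket/curated/sales/",
            "partitionKeys": ["year", "month", "day"],
        },
        format="parquet",
        format_options={"compression": "snappy"},
    )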
9) Reading from Private RDS (Connections)
Scenario: You must read data from an RDS instance in a private subnet.
What is required?
A. Public RDS endpoint
B. Glue connection with VPC configuration and security groups
C. S3 VPC endpoint
D. NAT Gateway only
Correct Answer: B
Explanation:
Glue connections configure subnet, security groups, and VPC access to private RDS.
Key Concepts: Glue connections, VPC, private networking
10) Column-Level Access Control (Governance)
Scenario: Analysts should not see PII columns in the curated tables.
Use:
A. S3 bucket ACLs only
B. Glue job IAM policy
C. Lake Formation grants with column-level permissions
D. CloudWatch Logs
Correct Answer: C
Explanation:
Lake Formation enforces fine-grained permissions on Data Catalog tables and columns.
Key Concepts: Lake Formation, column permissions, governance
11) Real-Time Stream Processing (Streaming ETL)
Scenario: Ingest IoT telemetry from Kinesis Data Streams and aggregate in near real time.
Choose:
A. Glue Python Shell job
B. Glue Spark Streaming ETL job with structured streaming
C. EMR batch job
D. Redshift COPY
Correct Answer: B
Explanation:
Glue Spark streaming integrates with Kinesis/Kafka for windowed aggregations and low-latency ETL.
Key Concepts: Structured Streaming, Kinesis source, windowing
12) Managing Streaming Schemas (Schema Registry)
Scenario: Kafka producers evolve message schemas; consumers must remain compatible.
Use:
A. Glue Schema Registry with compatibility rules
B. Hardcode JSON versions in code
C. Save schema in S3 README files
D. Use DynamoDB only
Correct Answer: A
Explanation:
Glue Schema Registry enforces compatibility, versioning, and integrates with streaming pipelines.
Key Concepts: Schema Registry, compatibility, evolution
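Example (illustrative sketch using boto3; registry, schema, and field names are hypothetical): create a schema with BACKWARD compatibility so incompatible producer changes are rejected at registration time.
    import boto3

    glue = boto3.client("glue")

    glue.create_schema(
        RegistryId={"RegistryName": "telemetry-registry"},
        SchemaName="device-events",
        DataFormat="AVRO",
        Compatibility="BACKWARD",
        SchemaDefinition='{"type":"record","name":"DeviceEvent","fields":['
                         '{"name":"device_id","type":"string"}]}',
    )

    # A compatible evolution adds an optional field; an incompatible change would be rejected
    glue.register_schema_version(
        SchemaId={"RegistryName": "telemetry-registry", "SchemaName": "device-events"},
        SchemaDefinition='{"type":"record","name":"DeviceEvent","fields":['
                         '{"name":"device_id","type":"string"},'
                         '{"name":"temperature","type":["null","double"],"default":null}]}',
    )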
13) Ensuring Dataset Quality (Data Quality)
Scenario: You must validate incoming datasets for nulls, ranges, and referential integrity before loading.
Choose:
A. Glue Data Quality rules and profiles
B. CloudWatch alarms only
C. EMR Ganglia
D. S3 event notifications
Correct Answer: A
Explanation:
Glue Data Quality provides declarative rules, profiling, and metrics for validation.
Key Concepts: Data Quality, rules, profiling
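Example (illustrative sketch using boto3 and DQDL; database, table, and column names are hypothetical): a ruleset with completeness, uniqueness, and range checks attached to a Catalog table.
    import boto3

    glue = boto3.client("glue")

    # DQDL rules evaluated against the target table
    ruleset = """
    Rules = [
        IsComplete "order_id",
        IsUnique "order_id",
        ColumnValues "quantity" > 0,
        ColumnValues "status" in ["NEW", "SHIPPED", "CANCELLED"]
    ]
    """

    glue.create_data_quality_ruleset(
        Name="orders-quality-checks",
        Ruleset=ruleset,
        TargetTable={"DatabaseName": "sales_db", "TableName": "raw_orders"},
    )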
14) Right-Sizing DPU Capacity (Performance)
Scenario: A job is slow; metrics show high shuffle and skew.
Best actions?
A. Increase DPUs and optimize partitioning
B. Decrease DPUs
C. Switch to Python shell
D. Disable Spark shuffle
Correct Answer: A
Explanation:
More executors and better partitioning alleviate skew and shuffle bottlenecks.
Key Concepts: DPU sizing, skew, partitioning
15) Designing Lake Zones (Architecture)
Scenario: You need a maintainable data lake structure with clear lifecycle.
Best approach?
A. Single S3 bucket for all data
B. Separate raw, staged, curated zones
C. Use EFS for lake storage
D. Store data in Lambda environment
Correct Answer: B
Explanation:
Zoning supports governance, lineage, and progressive refinement.
Key Concepts: Lake zones, lifecycle, governance
16) Supporting Updates and Deletes (Formats)
Scenario: You need ACID semantics for lake tables with upserts.
Choose:
A. Plain Parquet files only
B. Iceberg tables in Glue/Athena
C. CSV files
D. JSON objects
Correct Answer: B
Explanation:
Iceberg adds table abstraction and ACID operations for lakehouse patterns.
Key Concepts: Iceberg, ACID, lakehouse
17) Parallel Feature Engineering (Ray)
Scenario: You need Python-native parallelism for ML feature generation, not full Spark.
Use:
A. Glue for Ray
B. Python shell
C. EMR Hive
D. Lambda
Correct Answer: A
Explanation:
Glue for Ray provides distributed Python compute for ML workloads.
Key Concepts: Ray, Python parallelism, ML
18) Custom File Format Discovery (Crawlers)
Scenario: Files are custom delimited; default inference fails.
Fix:
A. Write manual DDL
B. Add custom classifier to the Crawler
C. Rename files
D. Use zip archives
Correct Answer: B
Explanation:
Custom classifiers parse nonstandard formats for schema inference.
Key Concepts: Crawler classifiers, schema inference
19) Handling Transient Failures (Reliability)
Scenario: Job occasionally fails due to transient network issues.
Best action?
A. Disable retries
B. Configure job retry and backoff; idempotent writes to target
C. Switch to EMR
D. Ignore errors
Correct Answer: B
Explanation:
Glue supports retry; ensure idempotency to avoid duplicate outputs.
Key Concepts: Retry policy, idempotency
20) Reprocessing After Logic Fix (Bookmarks)
Scenario: You changed transformation logic and must reprocess past data.
How?
A. Reset job bookmarks and rerun for the range
B. Delete outputs only
C. Update IAM
D. Wait for new data
Correct Answer: A
Explanation:
Bookmark reset enables controlled reprocessing of historical inputs.
Key Concepts: Bookmark reset, reprocessing
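Example (illustrative sketch using boto3; the job name is hypothetical): reset the bookmark, then rerun so previously processed input is read again.
    import boto3

    glue = boto3.client("glue")

    glue.reset_job_bookmark(JobName="nightly-orders-transform")  # clear processed-object state
    glue.start_job_run(JobName="nightly-orders-transform")       # rerun reprocesses historical input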
21) Tracking Schema Changes (Data Catalog)
Scenario: Columns are added over time; downstream queries must adapt.
Use:
A. Glue Data Catalog schema versioning
B. Rename S3 prefixes
C. CSV always
D. Hardcode column order
Correct Answer: A
Explanation:
Catalog versions document schema evolution for governed analytics.
Key Concepts: Catalog versions, evolution
22) Troubleshooting Slow Jobs (Monitoring)
Scenario: You need runtime metrics and logs to analyze performance.
Use:
A. Glue Studio job run details and CloudWatch Logs
B. S3 object tags
C. IAM policy simulator
D. SNS email only
Correct Answer: A
Explanation:
Glue integrates with CloudWatch Logs and Studio UI for metrics and traces.
Key Concepts: CloudWatch Logs, metrics, Studio
23) Scheduling Pipelines (Triggers)
Scenario: Pipeline must run hourly and on-demand on arrival events.
Implement:
A. Glue Triggers (schedule and event-based)
B. EC2 cron only
C. Manual clicks
D. Redshift schedule
Correct Answer: A
Explanation:
Glue Triggers support time-based and event-based execution tied to workflows.
Key Concepts: Triggers, schedule, events
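Example (illustrative sketch using boto3; workflow, trigger, and job names are hypothetical): an hourly scheduled trigger attached to a workflow, plus an explicit workflow run for on-demand or event-driven starts.
    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="hourly-ingest-trigger",
        WorkflowName="orders-pipeline",
        Type="SCHEDULED",
        Schedule="cron(0 * * * ? *)",      # top of every hour (UTC)
        Actions=[{"JobName": "ingest-raw-orders"}],
        StartOnCreation=True,
    )

    # For arrival events, an EventBridge rule (or this explicit call) starts the workflow on demand
    glue.start_workflow_run(Name="orders-pipeline")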
24) Missing Partitions in Athena (Catalog)
Scenario: New S3 partitions aren't visible for querying.
Fix:
A. Run Crawler or MSCK REPAIR TABLE
B. Re-upload files
C. Change IAM role
D. Use CSV instead
Correct Answer: A
Explanation:
Crawlers or MSCK commands add new partition metadata to the Catalog.
Key Concepts: Partitions, Catalog sync, MSCK
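Example (illustrative sketch using boto3 and Athena; database, table, and results bucket are hypothetical): run MSCK REPAIR TABLE so newly written S3 prefixes are registered as partitions.
    import boto3

    athena = boto3.client("athena")

    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE web_logs",
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )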
25) Handling S3 Eventual Consistency (Reliability)
Scenario: Immediately after writes, queries sometimes miss files.
Approach:
A. Add appropriate delays and use manifest writes
B. Assume immediate consistency
C. Disable partitioning
D. Use local disks
Correct Answer: A
Explanation:
Although S3 now provides strong read-after-write consistency, multi-file job output can still appear incomplete while a write is in progress; manifest/committer-style writes publish results atomically.
Key Concepts: S3 consistency, manifests
26) Reducing Read Volume (Performance)
Scenario: Your job reads excessive data from S3.
Fix:
A. Use predicate pushdown and partition pruning
B. Read entire dataset then filter
C. Convert to CSV
D. Disable compression
Correct Answer: A
Explanation:
Partition keys and pushdown minimize IO and runtime.
Key Concepts: Pushdown, pruning, IO reduction
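Example (illustrative sketch; database, table, and partition values are hypothetical): pass a push_down_predicate so only matching partitions are listed and read from S3.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    recent = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="web_logs",
        push_down_predicate="year = '2024' AND month = '06'",  # partitions pruned before reading
    )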
27) Cross-Account Access to Catalog Tables (Governance)
Scenario: Another account needs read-only access to curated tables.
Approach:
A. Share via Lake Formation resource sharing
B. Copy data to their S3
C. Email CSVs
D. Public S3 bucket
Correct Answer: A
Explanation:
Lake Formation supports cross-account grants for governed sharing.
Key Concepts: Cross-account, Lake Formation sharing
28) No-Code Data Preparation (DataBrew)
Scenario: Business users need simple transformations and profiling without code.
Use:
A. AWS Glue DataBrew
B. Glue Spark jobs
C. EMR cluster
D. Lambda
Correct Answer: A
Explanation:
DataBrew provides visual, no-code preparation integrated with Glue.
Key Concepts: DataBrew, visual prep
29) Preventing Partial Outputs (Reliability)
Scenario: Incomplete files appear when jobs fail mid-write.
Best solution:
A. Use S3 committer/manifest-based writes
B. Write uncommitted files directly
C. Disable retries
D. Store on local disk only
Correct Answer: A
Explanation:
Committer pattern ensures atomic publish of successful outputs.
Key Concepts: Output committers, atomic writes
30) Parameterizing Jobs per Environment (ETL Jobs)
Scenario: Same script should run in dev/stage/prod with different paths.
Implement:
A. Job parameters and environment variables
B. Copy-paste scripts per env
C. Hardcode values
D. Rename buckets
Correct Answer: A
Explanation:
Parameterization enables environment-specific configuration without code changes.
Key Concepts: Parameters, config, environments
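Example (illustrative sketch; parameter names are hypothetical): resolve --SOURCE_PATH, --TARGET_PATH, and --ENV job parameters so the same script runs unchanged across environments.
    import sys
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_PATH", "TARGET_PATH", "ENV"])

    source_path = args["SOURCE_PATH"]   # e.g. s3://example-dev-bucket/raw/
    target_path = args["TARGET_PATH"]   # e.g. s3://example-dev-bucket/curated/
    print(f"Environment {args['ENV']}: {source_path} -> {target_path}")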
31) Encrypting Data at Rest (Security)
Scenario: Regulatory requirements mandate encryption of outputs in S3.
Do:
A. Use S3 bucket default encryption with KMS and Glue job encryption
B. Rely on object ACLs
C. Use plain text files
D. Store in public bucket
Correct Answer: A
Explanation:
KMS-backed default encryption plus job-level encryption meet at-rest controls.
Key Concepts: KMS, encryption, compliance
32) Running Multiple Jobs Concurrently (Capacity)
Scenario: Several workflows need to run at the same time.
Consider:
A. Account DPU limits and concurrent run quotas
B. Ignore limits
C. Single-thread everything
D. Use EC2 instead
Correct Answer: A
Explanation:
Plan concurrency against DPU and job run quotas; request limit increases as needed.
Key Concepts: DPU quotas, concurrency
33) Capturing Bad Records (Quality)
Scenario: Some records fail parsing; you must isolate them.
Pattern:
A. Write invalid rows to a dead-letter S3 prefix
B. Drop silently
C. Stop the pipeline
D. Email CSV
Correct Answer: A
Explanation:
Dead-letter handling preserves observability and enables remediation.
Key Concepts: Dead-letter, observability
34) From Visual to Code (Studio)
Scenario: You prototype in Studio and need version-controlled scripts.
Action:
A. Export the generated code and store it in Git and S3
B. Screenshot the UI
C. Rebuild manually
D. Use CSVs
Correct Answer: A
Explanation:
Studio generates runnable scripts; storing them in version control and deploying through CI/CD enables automation.
Key Concepts: Code generation, versioning
35) Avoiding Crawl/Job Conflicts (Operations)
Scenario: Crawler updates tables while job writes outputs; queries break.
Fix:
A. Sequence with Workflow dependencies
B. Run both anytime
C. Disable partitions
D. Use public bucket
Correct Answer: A
Explanation:
Orchestrate order to avoid metadata inconsistency during writes.
Key Concepts: Orchestration, metadata consistency
36) Optimizing Skewed Joins (Performance)
Scenario: Large-to-small table join suffers from skew.
Approach:
A. Broadcast small table and repartition large
B. Full shuffle both
C. Use RDD map-only
D. Convert to JSON
Correct Answer: A
Explanation:
Broadcast joins reduce shuffle; repartition distributes load.
Key Concepts: Broadcast join, repartition
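Example (illustrative PySpark sketch; paths and the join key are hypothetical): broadcast the small dimension table and repartition the large fact table on the join key.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    facts = spark.read.parquet("s3://example-bucket/curated/fact_sales/")
    dims = spark.read.parquet("s3://example-bucket/curated/dim_product/")

    # Broadcasting avoids shuffling the small side; repartitioning spreads the large side evenly
    joined = facts.repartition(200, "product_id").join(broadcast(dims), "product_id")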
37) Recovering Streaming State (Streaming ETL)
Scenario: Streaming ETL must recover after failure.
Use:
A. Structured streaming checkpoints in durable storage
B. No checkpoints
C. Local temp dir
D. Email alerts only
Correct Answer: A
Explanation:
Checkpoints persist progress for fault-tolerant streaming.
Key Concepts: Checkpoints, recovery
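Example (illustrative sketch, assuming a streaming DataFrame events_df built upstream; paths are hypothetical): write with a durable S3 checkpoint location so a restarted query resumes from its recorded offsets.
    query = (
        events_df.writeStream
        .format("parquet")
        .option("path", "s3://example-bucket/curated/telemetry/")
        .option("checkpointLocation", "s3://example-bucket/checkpoints/telemetry-job/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()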
38) Retrying Failed Pipeline Step (Workflows)
Scenario: The transform job failed; rerun only that step.
Do:
A. Rerun failed node in Workflow respecting dependencies
B. Rerun entire pipeline always
C. Delete outputs
D. Ignore failures
Correct Answer: A
Explanation:
Workflows track state and allow targeted retries.
Key Concepts: Workflow state, retries
39) Speeding Up Athena Queries (Analytics)
Scenario: Athena scans too much data.
Best practices:
A. Partition by common predicates; use Parquet
B. Use CSV and no partitions
C. Read raw zone only
D. Disable compression
Correct Answer: A
Explanation:
Partitioning and columnar formats reduce scanned bytes and cost.
Key Concepts: Partitioning, Parquet, scan reduction
40) Selecting Glue Version (ETL Jobs)
Scenario: Library requires a newer Spark/Python runtime.
Action:
A. Choose appropriate Glue version runtime supporting the library
B. Use any default
C. Switch to CSV
D. Disable libraries
Correct Answer: A
Explanation:
Runtime versions control Spark/Python compatibility for jobs.
Key Concepts: Glue version, compatibility
41) Renaming Tables Safely (Catalog)
Scenario: You must rename a Catalog table with minimal disruption.
Approach:
A. Create new table and update consumers; deprecate old
B. Delete old without notice
C. Change S3 path only
D. Use random names
Correct Answer: A
Explanation:
Controlled migration keeps dependent queries stable.
Key Concepts: Catalog change management
42) Choosing Compression (Storage)
Scenario: Balance performance and storage for Parquet outputs.
Prefer:
A. Snappy for Parquet
B. Gzip for Parquet
C. No compression
D. Zip archives
Correct Answer: A
Explanation:
Parquet with Snappy compression remains splittable (compression is applied per column chunk) and balances speed and storage for analytics.
Key Concepts: Compression, Parquet
43) Prevent Unauthorized Table Access (Security)
Scenario: A user accessed a restricted table.
Fix:
A. Enforce Lake Formation grants and remove broad IAM permissions
B. Public bucket
C. CSV email
D. Remove all users
Correct Answer: A
Explanation:
Lake Formation centralizes governed access; IAM should not bypass table-level controls.
Key Concepts: Governance, IAM, Lake Formation
44) Finalizing Job Successfully (ETL Jobs)
Scenario: Ensure job status reflects success for Workflow chaining.
Do:
A. Call job.commit() after successful writes
B. Skip commit
C. Throw generic exceptions
D. Loop forever
Correct Answer: A
Explanation:
job.commit() finalizes the run and enables dependent triggers.
Key Concepts: Job lifecycle, commit
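Example (illustrative skeleton of a Glue job script): init at the start, commit at the end; a run that reaches job.commit() is marked successful, advancing bookmarks and firing dependent triggers.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # ... reads, transforms, and writes ...

    job.commit()  # finalize the run so downstream workflow triggers can fire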
45) Syncing Large Partition Trees (Operations)
Scenario: Millions of partitions need periodic reconciliation.
Approach:
A. Use S3 Inventory + batch update to Catalog
B. Manual listing every run
C. Disable partitions
D. Store on EBS
Correct Answer: A
Explanation:
Inventory scales discovery for very large catalogs.
Key Concepts: S3 Inventory, partition reconciliation
46) Private Access to S3 (Networking)
Scenario: Glue jobs must access S3 without traversing public internet.
Use:
A. S3 Gateway VPC endpoint
B. Public internet only
C. NAT always
D. Local disks
Correct Answer: A
Explanation:
Gateway endpoints enable private S3 access from VPC subnets.
Key Concepts: VPC endpoints, private S3
47) Enforcing Referential Integrity (Data Quality)
Scenario: Fact dataset must reference valid dimension keys.
Implement:
A. Data Quality rules with join checks
B. Ignore mismatches
C. Drop dimensions
D. Email admins
Correct Answer: A
Explanation:
Rules validate keys and produce metrics and violations.
Key Concepts: Referential checks, quality rules
48) Unpredictable Workload Capacity (Scaling)
Scenario: Workload traffic fluctuates significantly.
Choose:
A. On-demand job capacity mode
B. Fixed DPUs only
C. CSV output
D. No scaling
Correct Answer: A
Explanation:
On-demand capacity adjusts resources to variable demand.
Key Concepts: On-demand capacity, autoscaling
49) Avoiding Duplicate Streaming Outputs (Streaming ETL)
Scenario: Failures cause duplicate writes in streaming ETL.
Approach:
A. Idempotent sinks and transactional writers where available
B. Blind append
C. Disable retries
D. Local files
Correct Answer: A
Explanation:
Idempotent writes and transactional sinks mitigate duplicates.
Key Concepts: Idempotency, transactional sinks
50) Auditing Historical Table State (Lakehouse)
Scenario: You need to audit a table as-of last week.
Use:
A. Iceberg time travel queries
B. Rewrite all data
C. CSV snapshots
D. Manually filter dates
Correct Answer: A
Explanation:
Iceberg maintains snapshots enabling consistent historical reads.
Key Concepts: Time travel, snapshots
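Example (illustrative Spark SQL sketch, assuming the session is configured with an Iceberg Glue catalog named glue_catalog; database, table, and timestamp are hypothetical): query the table as of a past point in time. Athena engine v3 offers similar FOR TIMESTAMP AS OF syntax.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # Iceberg catalog settings assumed configured on the job

    last_week = spark.sql("""
        SELECT *
        FROM glue_catalog.sales_db.orders
        TIMESTAMP AS OF '2024-06-01 00:00:00'
    """)
    last_week.show()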