Apache Spark Interview Questions (Basic ➜ Intermediate)
30 frequently asked questions and answers, from basic to intermediate.
1) What is Apache Spark?
Basic
Apache Spark is an open-source, distributed computing system for large-scale data processing and analytics. It offers high-level APIs (Scala, Java, Python, R) and libraries for SQL, machine learning, graph processing, and streaming.
2) What are the main features of Spark?
Basic
- In-memory computation for speed
- Fault tolerance
- APIs in multiple languages
- Rich libraries (SQL, MLlib, GraphX, Streaming)
- Integrates with Hadoop and cloud storage
3) Explain RDD in Spark.
Basic
RDD (Resilient Distributed Dataset) is Spark’s fundamental abstraction: an immutable, partitioned collection with lineage-based fault tolerance. It supports transformations (lazy) and actions (eager).
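A minimal PySpark sketch of these ideas (names and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An immutable, partitioned collection (4 partitions here)
numbers = sc.parallelize(range(10), 4)

# Transformations yield new RDDs; numbers itself is never mutated
doubled = numbers.map(lambda x: x * 2)

print(doubled.collect())            # [0, 2, 4, ..., 18]
print(numbers.getNumPartitions())   # 4
```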
4) What are transformations and actions in RDD?
Basic
Transformations (e.g., map, filter) build a new RDD lazily. Actions (e.g., count, collect, saveAsTextFile) trigger DAG execution and return a result or write output.
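A short PySpark illustration, assuming a local SparkSession and an illustrative output path:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: these lines return immediately,
# no data is processed yet
evens = sc.parallelize(range(1_000_000)).filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Actions trigger execution of the whole DAG
print(squared.count())                   # 500000
squared.saveAsTextFile("/tmp/squares")   # another action: writes output
```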
5) What is lazy evaluation in Spark?
Basic
Spark delays computation until an action is invoked. It first builds a Directed Acyclic Graph (DAG) of transformations, then optimizes and executes them.
6) What are Spark DataFrames?
Basic
A DataFrame is a distributed table with named columns. It enables SQL-like operations and benefits from the Catalyst optimizer for efficient execution.
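A minimal sketch with made-up data, showing both the DataFrame API and plain SQL:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# A distributed table with named columns
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)], ["name", "age"])

# SQL-like operations, optimized by Catalyst
df.filter(df.age > 40).select("name").show()

# Or register it and query with SQL directly
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
```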
7) Difference between RDD and DataFrame?
Basic
RDD: low-level, unstructured, maximal control. DataFrame: high-level, schema-aware, optimized, supports SQL; generally preferred for analytics.
8) What is Dataset in Spark?
Basic
Dataset is a strongly typed, distributed collection combining RDD’s type-safety with DataFrame’s optimizations. Available in Scala/Java (not in PySpark).
9) What is the Catalyst Optimizer?
Intermediate
Catalyst is Spark SQL’s query optimizer. It applies rule-based and cost-based optimizations to produce efficient physical plans from logical queries.
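One way to see Catalyst at work is explain(True), which prints the parsed, analyzed, and optimized logical plans plus the chosen physical plan. A small sketch:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).withColumnRenamed("id", "n")

# Extended explain shows each planning phase Catalyst goes through
df.filter("n > 990").select("n").explain(True)
```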
10) What is Tungsten in Spark?
Intermediate
Tungsten is a set of execution engine optimizations: efficient memory management (incl. off-heap), cache-friendly formats, and runtime code generation for tight CPU loops.
11) Explain narrow vs wide transformations.
Intermediate
Narrow: each child partition depends on a single parent partition (e.g., map, filter). Wide: requires a shuffle; child partitions aggregate data from many parents (e.g., groupByKey, repartition).
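A sketch with illustrative data:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 4)

# Narrow: each output partition depends on exactly one input partition
mapped = rdd.mapValues(lambda v: v * 10)

# Wide: all values for a key must land in the same partition,
# so Spark shuffles, creating a stage boundary
grouped = rdd.groupByKey().mapValues(list)
print(grouped.collect())   # e.g. [('a', [1, 3]), ('b', [2])]
```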
12) What is a job, stage, and task?
Basic
Job: triggered by an action. Stage: set of tasks separated by shuffle boundaries. Task: unit of work on a partition.
13) What is SparkContext?
Basic
Entry point for the RDD API. It connects to the cluster manager, allocates resources, and coordinates job execution.
14) What is SparkSession?
Basic
Unified entry point for the DataFrame/Dataset and SQL APIs (since Spark 2.0). It wraps SparkContext and removes the need to use SQLContext/HiveContext directly.
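A minimal sketch (the JSON path is hypothetical):

```python
from pyspark.sql import SparkSession

# Unified entry point (Spark 2.0+)
spark = (SparkSession.builder
         .appName("my-app")
         .getOrCreate())

# Older entry points remain reachable through it
sc = spark.sparkContext                      # the underlying SparkContext
df = spark.read.json("/path/to/data.json")   # DataFrame/SQL API
```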
15) What is DAG in Spark?
Basic
A Directed Acyclic Graph representing the logical execution plan derived from transformations, which the scheduler optimizes into stages and tasks.
16) What is shuffling in Spark?
Intermediate
Shuffle redistributes data across partitions/nodes (e.g., for groupBy or joins). It is expensive due to serialization, network transfer, and disk I/O.
17) What is caching in Spark?
Intermediate
Caching keeps frequently reused DataFrames/RDDs in memory for faster access. Use cache() for the default storage level, or persist(StorageLevel) to pick one explicitly.
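A small sketch of the typical pattern:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).filter("id % 7 = 0")

df.cache()       # default storage level
df.count()       # first action materializes the cache
df.count()       # now served from cache, not recomputed
df.unpersist()   # release memory when done

# persist() makes the trade-off explicit
df.persist(StorageLevel.MEMORY_AND_DISK)
```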
18) What is checkpointing in Spark?
Intermediate
Checkpointing saves RDD/DataFrame state to reliable storage (e.g., HDFS/S3), truncating lineage to reduce recomputation on failures.
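A sketch with a local checkpoint directory (use an HDFS/S3 path in production):

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Reliable storage for checkpoint data
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(100)).map(lambda x: x * x)
rdd.checkpoint()   # marks the RDD; written on the next action
rdd.count()        # materializes the checkpoint and truncates lineage
```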
19) Difference between cache() and persist()?
Basic
For RDDs, cache() is shorthand for persist(MEMORY_ONLY); for DataFrames/Datasets the default level is MEMORY_AND_DISK. persist() lets you choose storage levels (e.g., MEMORY_AND_DISK, serialized variants) to suit memory and recomputation trade-offs.
20) What is a broadcast variable?
Intermediate
A read-only variable distributed to all executors, avoiding repeated task-side shipping. Ideal for small lookup tables used in joins or enrichments.
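A small sketch of map-side enrichment with a broadcast dictionary (data is illustrative):

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Ship a small lookup table to every executor once
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})

orders = sc.parallelize([("o1", "US"), ("o2", "DE")])
enriched = orders.map(lambda o: (o[0], country_names.value.get(o[1])))
print(enriched.collect())  # [('o1', 'United States'), ('o2', 'Germany')]
```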
21) What is an accumulator?
Intermediate
A shared variable that executors can only add to (e.g., counters, sums); only the driver can read the final value. Updates must be associative and commutative.
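A sketch counting malformed records on the side (data is illustrative):

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)

def check(line):
    if not line.strip().isdigit():
        bad_records.add(1)   # executors can only add

sc.parallelize(["1", "2", "oops", "4"]).foreach(check)
print(bad_records.value)     # only the driver reads the result: 1
```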
22) Spark vs. Hadoop MapReduce?
Basic
Spark: in-memory, faster, concise APIs, multi-workload (SQL/ML/Graph/Streaming). MapReduce: disk-heavy, slower, verbose APIs; good for very large batch I/O.
23) What cluster managers does Spark support?
Basic
Standalone, Hadoop YARN, Apache Mesos, and Kubernetes. They allocate resources (CPU/memory) and manage executors for Spark applications.
24) How does Spark achieve fault tolerance?
Intermediate
Via RDD lineage: lost partitions are recomputed from their deterministic transformation history. Checkpointing can further reduce recomputation cost.
25) What is Spark Streaming?
Intermediate
A micro-batch streaming library using DStreams to process data from sources like Kafka, Flume, and sockets in near real-time.
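A classic word-count sketch with the legacy DStream API (assumes a text source on localhost:9999; pyspark.streaming is available only on Spark releases that still ship DStreams):

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=5)  # 5s micro-batches

# Each DStream is a sequence of RDDs, one per micro-batch
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```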
26) Spark Streaming vs Structured Streaming?
Intermediate
Spark Streaming: DStreams, micro-batch only. Structured Streaming: DataFrame/Dataset API, event-time semantics, watermarking, exactly-once sinks, and advanced optimizations.
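A Structured Streaming sketch using the built-in rate source, which emits (timestamp, value) rows for testing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# A streaming DataFrame looks like any other DataFrame
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Event-time window with a watermark bounding how late data may arrive
counts = (events
          .withWatermark("timestamp", "1 minute")
          .groupBy(window("timestamp", "30 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```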
27) What is MLlib?
Basic
Spark’s machine learning library with algorithms for classification, regression, clustering, recommendation, plus feature engineering and pipeline utilities.
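A toy pipeline sketch with made-up data, showing the feature-engineering and estimator stages chained together:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1)],
    ["f1", "f2", "label"])

# Assemble raw columns into a feature vector, then fit a classifier
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()
```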
28) What is GraphX?
Basic
A graph processing API on Spark for graph-parallel computations such as PageRank, connected components, and shortest paths.
29) What are partitions in Spark?
Basic
Partitions are units of parallelism—subsets of data processed independently by tasks. Good partitioning improves cluster utilization and reduces skew.
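A quick sketch of inspecting and adjusting partitioning:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), 8)
print(rdd.getNumPartitions())   # 8 -- one task per partition per stage

wider = rdd.repartition(16)     # full shuffle to increase parallelism
narrower = wider.coalesce(4)    # merges partitions, avoiding a shuffle
```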
30) How do you optimize Spark jobs?
Intermediate
- Prefer DataFrames/Datasets for Catalyst & Tungsten benefits
- Use reduceByKey/mapPartitions instead of wide groupByKey
- Broadcast small dimension tables for joins (see the sketch after this list)
- Persist strategic intermediates; unpersist when done
- Optimize partitions (avoid tiny files; coalesce/repartition as needed)
- Prune columns/filter early; push down predicates; cache only hot data
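A sketch of the broadcast-join item above (table names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).withColumnRenamed("id", "key")
dims = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "name"])

# Hint: copy the small table to every executor instead of shuffling
# the large one; the physical plan shows a BroadcastHashJoin
joined = facts.join(broadcast(dims), "key")
joined.explain()
```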

