Spark Basic Interview Questions

Apache Spark Interview Questions (Basic ➜ Intermediate)

30 frequently asked questions and answers, covering the basics through intermediate topics.

Q1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system for large-scale data processing and analytics. It offers high-level APIs (Scala, Java, Python, R) and libraries for SQL, machine learning, graph processing, and streaming.
Q2. What are the key features of Apache Spark?
  • In-memory computation for speed
  • Fault tolerance
  • APIs in multiple languages
  • Rich libraries (SQL, MLlib, GraphX, Streaming)
  • Integrates with Hadoop and cloud storage
Q3. What is an RDD?
An RDD (Resilient Distributed Dataset) is Spark's fundamental abstraction: an immutable, partitioned collection with lineage-based fault tolerance. It supports transformations (lazy) and actions (eager).
Q4. What is the difference between transformations and actions?
Transformations (e.g., map, filter) build a new RDD lazily. Actions (e.g., count, collect, saveAsTextFile) trigger DAG execution and return a result or write output.
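A minimal PySpark sketch of this split, assuming a local session (names and values are illustrative): transformations only describe the computation, and the first action runs it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))          # create an RDD from a local range
squares = numbers.map(lambda x: x * x)          # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: still lazy

print(evens.count())                            # action: triggers DAG execution
print(evens.collect())                          # action: returns results to the driver

spark.stop()
```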
Q5. What is lazy evaluation in Spark?
Spark delays computation until an action is invoked. It first builds a Directed Acyclic Graph (DAG) of transformations, then optimizes and executes them.
Q6. What is a DataFrame?
A DataFrame is a distributed table with named columns. It enables SQL-like operations and benefits from the Catalyst optimizer for efficient execution.
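A short PySpark example (column names and values are made up) showing the same query expressed through the DataFrame API and through SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Build a small DataFrame with named columns
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# SQL-like operations; Catalyst optimizes the plan before execution
df.filter(F.col("age") > 30).select("name").show()

# The same query through the SQL interface
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```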
Q7. How does an RDD differ from a DataFrame?
RDD: low-level, unstructured, maximal control. DataFrame: high-level, schema-aware, optimized, supports SQL; generally preferred for analytics.
Q8. What is a Dataset?
A Dataset is a strongly typed, distributed collection combining the RDD's type safety with the DataFrame's optimizations. It is available in Scala and Java (not in PySpark).
Q9. What is the Catalyst optimizer?
Catalyst is Spark SQL's query optimizer. It applies rule-based and cost-based optimizations to produce efficient physical plans from logical queries.
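You can inspect what Catalyst produced with explain(). A small sketch, assuming a local session; the exact plan text varies by Spark version:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-explain").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# explain(True) prints the parsed, analyzed, optimized logical, and physical plans,
# showing how Catalyst rewrote the query before execution.
df.filter(F.col("bucket") == 3).select("id").explain(True)
```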
Q10. What is Tungsten?
Tungsten is a set of execution engine optimizations: efficient memory management (including off-heap), cache-friendly formats, and runtime code generation for tight CPU loops.
Q11. What is the difference between narrow and wide transformations?
Narrow: each child partition depends on a single parent partition (e.g., map, filter). Wide: requires a shuffle; each child partition depends on many parent partitions (e.g., groupByKey, repartition).
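A brief sketch contrasting the two with a pair RDD (values are illustrative): mapValues stays within partitions, while reduceByKey forces a shuffle.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], numSlices=4)

# Narrow: each output partition depends on exactly one input partition
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: values for the same key must be brought together, forcing a shuffle
totals = doubled.reduceByKey(lambda a, b: a + b)

print(totals.collect())
```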
Q12. What are jobs, stages, and tasks?
Job: triggered by an action. Stage: a set of tasks separated by shuffle boundaries. Task: the unit of work on a single partition.
Q13. What is SparkContext?
SparkContext is the entry point for the RDD API. It connects to the cluster manager, allocates resources, and coordinates job execution.
Q14. What is SparkSession?
SparkSession is the unified entry point for the DataFrame/Dataset and SQL APIs (since Spark 2.0). It wraps SparkContext and replaces the older SQLContext and HiveContext.
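A minimal sketch of the two entry points, assuming a local run (the master URL is only for the example):

```python
from pyspark.sql import SparkSession

# Since Spark 2.0, SparkSession is the single entry point; the underlying
# SparkContext is still reachable for RDD work.
spark = (SparkSession.builder
         .appName("entry-points")
         .master("local[*]")        # assumption: running locally for the example
         .getOrCreate())

sc = spark.sparkContext            # RDD API entry point
print(sc.applicationId, spark.version)

spark.stop()
```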
Q15. What is a DAG in Spark?
A Directed Acyclic Graph representing the logical execution plan derived from transformations, which the scheduler optimizes into stages and tasks.
Q16. What is a shuffle?
A shuffle redistributes data across partitions/nodes (e.g., for groupBy or joins). It is expensive due to serialization, network transfer, and disk I/O.
Q17. What is caching in Spark?
Caching keeps frequently reused DataFrames/RDDs in memory for faster access. Use cache() for the default storage level or persist(StorageLevel) for a custom level.
Q18. What is checkpointing?
Checkpointing saves RDD/DataFrame state to reliable storage (e.g., HDFS/S3), truncating lineage to reduce recomputation on failures.
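A small sketch; a local temporary directory stands in for reliable storage such as HDFS or S3:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

# Assumption: /tmp/spark-checkpoints is a placeholder for a reliable path (HDFS/S3)
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(100)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
rdd.checkpoint()     # marks the RDD; the write happens on the next action
rdd.count()          # action triggers both the computation and the checkpoint write
```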
Q19. What is the difference between cache() and persist()?
cache() is shorthand for the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets). persist() lets you choose a storage level (e.g., MEMORY_AND_DISK, serialized variants) to suit memory and recomputation trade-offs.
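A short example of both calls (sizes are arbitrary); an action is needed to actually materialize the cached data:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
rdd.cache()                                   # RDD default: MEMORY_ONLY
rdd.count()                                   # action materializes the cache

df = spark.range(1_000_000).selectExpr("id * 2 AS doubled")
df.persist(StorageLevel.MEMORY_AND_DISK)      # explicit level: spill to disk if needed
df.count()

rdd.unpersist()                               # release when no longer needed
df.unpersist()
```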
Q20. What is a broadcast variable?
A read-only variable cached on every executor, avoiding repeated shipping with each task. Ideal for small lookup tables used in joins or enrichments.
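A minimal sketch of task-side lookup through a broadcast variable (the lookup map and values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small lookup table shipped once per executor instead of once per task
country_names = sc.broadcast({"IN": "India", "US": "United States", "DE": "Germany"})

orders = sc.parallelize([("IN", 120.0), ("US", 90.5), ("DE", 40.0)])
enriched = orders.map(lambda kv: (country_names.value.get(kv[0], "Unknown"), kv[1]))
print(enriched.collect())
```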
Q21. What is an accumulator?
A write-only (from the executors' perspective) variable used for aggregation (e.g., counters, sums). Only the driver can read the final value; updates must be associative and commutative.
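A small sketch using a numeric accumulator to count malformed records (input values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # numeric accumulator, updated on executors

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)        # executors can only write to it
        return 0

data = sc.parallelize(["1", "2", "oops", "4"])
total = data.map(parse).sum()     # the action runs the tasks and applies the updates
print(total, bad_records.value)   # only the driver reads the final value
```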
Q22. How does Spark compare to Hadoop MapReduce?
Spark: in-memory, faster, concise APIs, multi-workload (SQL/ML/Graph/Streaming). MapReduce: disk-heavy, slower, verbose APIs; still adequate for very large batch I/O.
Q23. Which cluster managers does Spark support?
Standalone, Hadoop YARN, Apache Mesos, and Kubernetes. They allocate resources (CPU/memory) and manage executors for Spark applications.
Q24. How does Spark achieve fault tolerance?
Via RDD lineage: lost partitions are recomputed from their deterministic transformation history. Checkpointing can further reduce recomputation cost.
Q25. What is Spark Streaming?
Spark Streaming is a micro-batch streaming library that uses DStreams to process data from sources like Kafka, Flume, and sockets in near real time.
Q26. How does Spark Streaming differ from Structured Streaming?
Spark Streaming: DStreams, micro-batch only. Structured Streaming: DataFrame/Dataset API, event-time semantics, watermarking, exactly-once sinks, and advanced optimizations.
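A minimal Structured Streaming sketch, assuming a socket source on localhost:9999 (for example, fed by `nc -lk 9999`) purely for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Streaming source: lines of text arriving on a local socket
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Same DataFrame API as batch: split lines into words and count them
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```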
Q27. What is MLlib?
Spark's machine learning library with algorithms for classification, regression, clustering, and recommendation, plus feature engineering and pipeline utilities.
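A short MLlib pipeline sketch with a tiny made-up dataset (feature and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline-sketch").getOrCreate()

# Tiny illustrative dataset: two numeric features and a binary label
train = spark.createDataFrame(
    [(1.0, 0.1, 0.0), (2.0, 1.2, 0.0), (5.0, 4.8, 1.0), (6.5, 5.9, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("f1", "f2", "prediction").show()
```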
Q28. What is GraphX?
A graph processing API on Spark for graph-parallel computations such as PageRank, connected components, and shortest paths.
Q29. What are partitions in Spark?
Partitions are units of parallelism: subsets of data processed independently by tasks. Good partitioning improves cluster utilization and reduces skew.
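A quick sketch of inspecting and changing partition counts (the numbers are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())       # current parallelism

wider = df.repartition(8)              # full shuffle into 8 partitions
narrower = wider.coalesce(2)           # merge partitions without a full shuffle

print(wider.rdd.getNumPartitions(), narrower.rdd.getNumPartitions())
```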
Q30. What are common Spark performance tuning practices?
  • Prefer DataFrames/Datasets for Catalyst & Tungsten benefits
  • Use reduceByKey/mapPartitions instead of wide groupByKey
  • Broadcast small dimension tables for joins (see the sketch below)
  • Persist strategic intermediates; unpersist when done
  • Optimize partitions (avoid tiny files; coalesce/repartition as needed)
  • Prune columns/filter early; push down predicates; cache only hot data
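A sketch combining several of these tips; the Parquet paths and column names are placeholders, not real datasets:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Assumption: the paths below stand in for a large fact table and a small dimension table
orders = spark.read.parquet("/data/orders")
countries = spark.read.parquet("/data/countries")

result = (orders
          .select("order_id", "country_code", "amount")    # prune columns early
          .filter(F.col("amount") > 0)                     # filter early / push down
          .join(broadcast(countries), "country_code")      # broadcast the small side
          .groupBy("country_name")
          .agg(F.sum("amount").alias("revenue")))

result.write.mode("overwrite").parquet("/data/out/revenue_by_country")
```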
