Spark Basic Interview Questions

Apache Spark Interview Questions (Basic ➜ Intermediate)

30 frequently asked questions and answers, covering the basics through intermediate topics.

Q1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system for large-scale data processing and analytics. It offers high-level APIs (Scala, Java, Python, R) and libraries for SQL, machine learning, graph processing, and streaming.
Q2. What are the key features of Apache Spark?
  • In-memory computation for speed
  • Fault tolerance
  • APIs in multiple languages
  • Rich libraries (SQL, MLlib, GraphX, Streaming)
  • Integrates with Hadoop and cloud storage
Q3. What is an RDD?
An RDD (Resilient Distributed Dataset) is Spark's fundamental abstraction: an immutable, partitioned collection with lineage-based fault tolerance. It supports transformations (lazy) and actions (eager).
Q4. What is the difference between transformations and actions?
Transformations (e.g., map, filter) build a new RDD lazily. Actions (e.g., count, collect, saveAsTextFile) trigger DAG execution and return a result or write output.
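A minimal PySpark sketch of this split, assuming a local session (names and values are illustrative): transformations only describe the computation, and the first action runs it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))          # create an RDD from a local range
squares = numbers.map(lambda x: x * x)          # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: still lazy

print(evens.count())                            # action: triggers DAG execution
print(evens.collect())                          # action: returns results to the driver

spark.stop()
```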
Q5. What is lazy evaluation in Spark?
Spark delays computation until an action is invoked. It first builds a Directed Acyclic Graph (DAG) of transformations, then optimizes and executes them.
Q6. What is a DataFrame?
A DataFrame is a distributed table with named columns. It enables SQL-like operations and benefits from the Catalyst optimizer for efficient execution.
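A short PySpark example (column names and values are made up) showing the same query expressed through the DataFrame API and through SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Build a small DataFrame with named columns
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# SQL-like operations; Catalyst optimizes the plan before execution
df.filter(F.col("age") > 30).select("name").show()

# The same query through the SQL interface
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```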
Q7. How does an RDD differ from a DataFrame?
RDD: low-level, unstructured, maximal control. DataFrame: high-level, schema-aware, optimized, supports SQL; generally preferred for analytics.
Q8. What is a Dataset?
A Dataset is a strongly typed, distributed collection combining the RDD's type safety with the DataFrame's optimizations. It is available in Scala and Java (not in PySpark).
Q9. What is the Catalyst optimizer?
Catalyst is Spark SQL's query optimizer. It applies rule-based and cost-based optimizations to produce efficient physical plans from logical queries.
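You can inspect what Catalyst produced with explain(). A small sketch, assuming a local session; the exact plan text varies by Spark version:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-explain").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# explain(True) prints the parsed, analyzed, optimized logical, and physical plans,
# showing how Catalyst rewrote the query before execution.
df.filter(F.col("bucket") == 3).select("id").explain(True)
```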
Q10. What is Tungsten?
Tungsten is a set of execution engine optimizations: efficient memory management (including off-heap), cache-friendly formats, and runtime code generation for tight CPU loops.
Q11. What is the difference between narrow and wide transformations?
Narrow: each child partition depends on a single parent partition (e.g., map, filter). Wide: requires a shuffle; each child partition depends on many parent partitions (e.g., groupByKey, repartition).
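A brief sketch contrasting the two with a pair RDD (values are illustrative): mapValues stays within partitions, while reduceByKey forces a shuffle.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], numSlices=4)

# Narrow: each output partition depends on exactly one input partition
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: values for the same key must be brought together, forcing a shuffle
totals = doubled.reduceByKey(lambda a, b: a + b)

print(totals.collect())
```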
Q12. What are jobs, stages, and tasks?
Job: triggered by an action. Stage: a set of tasks separated by shuffle boundaries. Task: the unit of work on a single partition.
Q13. What is SparkContext?
SparkContext is the entry point for the RDD API. It connects to the cluster manager, allocates resources, and coordinates job execution.
Q14. What is SparkSession?
SparkSession is the unified entry point for the DataFrame/Dataset and SQL APIs (since Spark 2.0). It wraps SparkContext and replaces the older SQLContext and HiveContext.
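A minimal sketch of the two entry points, assuming a local run (the master URL is only for the example):

```python
from pyspark.sql import SparkSession

# Since Spark 2.0, SparkSession is the single entry point; the underlying
# SparkContext is still reachable for RDD work.
spark = (SparkSession.builder
         .appName("entry-points")
         .master("local[*]")        # assumption: running locally for the example
         .getOrCreate())

sc = spark.sparkContext            # RDD API entry point
print(sc.applicationId, spark.version)

spark.stop()
```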
Q15. What is a DAG in Spark?
A Directed Acyclic Graph representing the logical execution plan derived from transformations, which the scheduler optimizes into stages and tasks.
Q16. What is a shuffle?
A shuffle redistributes data across partitions/nodes (e.g., for groupBy or joins). It is expensive due to serialization, network transfer, and disk I/O.
Q17. What is caching in Spark?
Caching keeps frequently reused DataFrames/RDDs in memory for faster access. Use cache() for the default storage level or persist(StorageLevel) for a custom level.
Q18. What is checkpointing?
Checkpointing saves RDD/DataFrame state to reliable storage (e.g., HDFS/S3), truncating lineage to reduce recomputation on failures.
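A small sketch; a local temporary directory stands in for reliable storage such as HDFS or S3:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

# Assumption: /tmp/spark-checkpoints is a placeholder for a reliable path (HDFS/S3)
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(100)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
rdd.checkpoint()     # marks the RDD; the write happens on the next action
rdd.count()          # action triggers both the computation and the checkpoint write
```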
Q19. What is the difference between cache() and persist()?
cache() is shorthand for the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets). persist() lets you choose a storage level (e.g., MEMORY_AND_DISK, serialized variants) to suit memory and recomputation trade-offs.
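A short example of both calls (sizes are arbitrary); an action is needed to actually materialize the cached data:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
rdd.cache()                                   # RDD default: MEMORY_ONLY
rdd.count()                                   # action materializes the cache

df = spark.range(1_000_000).selectExpr("id * 2 AS doubled")
df.persist(StorageLevel.MEMORY_AND_DISK)      # explicit level: spill to disk if needed
df.count()

rdd.unpersist()                               # release when no longer needed
df.unpersist()
```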
Q20. What is a broadcast variable?
A read-only variable cached on every executor, avoiding repeated shipping with each task. Ideal for small lookup tables used in joins or enrichments.
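A minimal sketch of task-side lookup through a broadcast variable (the lookup map and values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small lookup table shipped once per executor instead of once per task
country_names = sc.broadcast({"IN": "India", "US": "United States", "DE": "Germany"})

orders = sc.parallelize([("IN", 120.0), ("US", 90.5), ("DE", 40.0)])
enriched = orders.map(lambda kv: (country_names.value.get(kv[0], "Unknown"), kv[1]))
print(enriched.collect())
```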
Q21. What is an accumulator?
A write-only (from the executors' perspective) variable used for aggregation (e.g., counters, sums). Only the driver can read the final value; updates must be associative and commutative.
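A small sketch using a numeric accumulator to count malformed records (input values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # numeric accumulator, updated on executors

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)        # executors can only write to it
        return 0

data = sc.parallelize(["1", "2", "oops", "4"])
total = data.map(parse).sum()     # the action runs the tasks and applies the updates
print(total, bad_records.value)   # only the driver reads the final value
```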
Q22. How does Spark compare to Hadoop MapReduce?
Spark: in-memory, faster, concise APIs, multi-workload (SQL/ML/Graph/Streaming). MapReduce: disk-heavy, slower, verbose APIs; still adequate for very large batch I/O.
Q23. Which cluster managers does Spark support?
Standalone, Hadoop YARN, Apache Mesos, and Kubernetes. They allocate resources (CPU/memory) and manage executors for Spark applications.
Q24. How does Spark achieve fault tolerance?
Via RDD lineage: lost partitions are recomputed from their deterministic transformation history. Checkpointing can further reduce recomputation cost.
Q25. What is Spark Streaming?
Spark Streaming is a micro-batch streaming library that uses DStreams to process data from sources like Kafka, Flume, and sockets in near real time.
Q26. How does Spark Streaming differ from Structured Streaming?
Spark Streaming: DStreams, micro-batch only. Structured Streaming: DataFrame/Dataset API, event-time semantics, watermarking, exactly-once sinks, and advanced optimizations.
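A minimal Structured Streaming sketch, assuming a socket source on localhost:9999 (for example, fed by `nc -lk 9999`) purely for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Streaming source: lines of text arriving on a local socket
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Same DataFrame API as batch: split lines into words and count them
words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```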
Q27. What is MLlib?
Spark's machine learning library with algorithms for classification, regression, clustering, and recommendation, plus feature engineering and pipeline utilities.
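A short MLlib pipeline sketch with a tiny made-up dataset (feature and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline-sketch").getOrCreate()

# Tiny illustrative dataset: two numeric features and a binary label
train = spark.createDataFrame(
    [(1.0, 0.1, 0.0), (2.0, 1.2, 0.0), (5.0, 4.8, 1.0), (6.5, 5.9, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("f1", "f2", "prediction").show()
```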
Q28. What is GraphX?
A graph processing API on Spark for graph-parallel computations such as PageRank, connected components, and shortest paths.
Q29. What are partitions in Spark?
Partitions are units of parallelism: subsets of data processed independently by tasks. Good partitioning improves cluster utilization and reduces skew.
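A quick sketch of inspecting and changing partition counts (the numbers are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())       # current parallelism

wider = df.repartition(8)              # full shuffle into 8 partitions
narrower = wider.coalesce(2)           # merge partitions without a full shuffle

print(wider.rdd.getNumPartitions(), narrower.rdd.getNumPartitions())
```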
Q30. What are common Spark performance tuning practices?
  • Prefer DataFrames/Datasets for Catalyst & Tungsten benefits
  • Use reduceByKey/mapPartitions instead of wide groupByKey
  • Broadcast small dimension tables for joins (see the sketch below)
  • Persist strategic intermediates; unpersist when done
  • Optimize partitions (avoid tiny files; coalesce/repartition as needed)
  • Prune columns/filter early; push down predicates; cache only hot data
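A sketch combining several of these tips; the Parquet paths and column names are placeholders, not real datasets:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Assumption: the paths below stand in for a large fact table and a small dimension table
orders = spark.read.parquet("/data/orders")
countries = spark.read.parquet("/data/countries")

result = (orders
          .select("order_id", "country_code", "amount")    # prune columns early
          .filter(F.col("amount") > 0)                     # filter early / push down
          .join(broadcast(countries), "country_code")      # broadcast the small side
          .groupBy("country_name")
          .agg(F.sum("amount").alias("revenue")))

result.write.mode("overwrite").parquet("/data/out/revenue_by_country")
```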
