🔥 Apache Spark Configurations Explained
A complete guide to understanding key Spark configuration parameters
**Application Settings**

| Configuration | Meaning | Example |
|---|---|---|
| spark.app.name | Name of your Spark application. | spark.app.name=MySparkJob |
| spark.master | Cluster manager to connect to (local, yarn, k8s, etc.). | spark.master=local[*] |
| spark.submit.deployMode | Deploy mode: client or cluster. | spark.submit.deployMode=cluster |
| spark.home | Location of the Spark installation. | spark.home=/usr/local/spark |
**Memory & CPU**

| Configuration | Meaning | Example |
|---|---|---|
| spark.executor.memory | Memory per executor process. | 4g |
| spark.driver.memory | Memory for the driver process. | 2g |
| spark.executor.cores | Number of CPU cores per executor. | 4 |
| spark.memory.fraction | Fraction of the JVM heap used for execution and storage. | 0.6 |
**Parallelism & Shuffle**

| Configuration | Meaning | Example |
|---|---|---|
| spark.default.parallelism | Default number of partitions for RDD operations. | 8 |
| spark.sql.shuffle.partitions | Partitions used during shuffles. | 200 |
| spark.shuffle.compress | Compress shuffle output files. | true |
**Serialization & Compression**

| Configuration | Meaning | Example |
|---|---|---|
| spark.serializer | Serializer to use (Kryo or Java). | org.apache.spark.serializer.KryoSerializer |
| spark.io.compression.codec | Compression codec for internal data. | snappy |
| spark.rdd.compress | Compress serialized RDD partitions. | true |
**Spark SQL**

| Configuration | Meaning | Example |
|---|---|---|
| spark.sql.shuffle.partitions | Number of shuffle partitions for joins/aggregations. | 200 |
| spark.sql.autoBroadcastJoinThreshold | Broadcast join threshold in bytes. | 10485760 |
| spark.sql.warehouse.dir | Location of the Spark SQL warehouse. | /user/hive/warehouse |
**Dynamic Allocation**

| Configuration | Meaning | Example |
|---|---|---|
| spark.dynamicAllocation.enabled | Enable dynamic allocation. | true |
| spark.dynamicAllocation.minExecutors | Minimum executors allowed. | 2 |
| spark.dynamicAllocation.maxExecutors | Maximum executors allowed. | 10 |
| spark.dynamicAllocation.initialExecutors | Executors at application start. | 4 |
In PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()
```
Using spark-submit:

```bash
spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  myjob.jar
```
💎 Advanced Spark Tuning & Best Practices (AWS + Kubernetes)
Optimize your Spark jobs for performance, reliability, and resource efficiency on Kubernetes (EKS) infrastructure.
1. Memory & CPU Configuration
Misconfigured memory and CPU settings are the most common reason Spark jobs fail with OOM errors or suffer excessive GC pauses.
- spark.executor.memory: Allocate enough memory per executor for your dataset size. Increase it if executors hit OOM; decrease it if memory sits unused.
- spark.driver.memory: Increase it if the driver OOMs on large datasets.
- spark.executor.cores: 2–5 cores per executor works well for pods on AWS; too many cores per executor can cause long GC pauses.
- spark.task.cpus: Usually 1; increase it for CPU-intensive tasks.

Tip: Match executor memory and CPU requests to the Kubernetes pod limits, as in the sketch below.
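A minimal PySpark sketch of these settings; the app name and sizing values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# A sketch only: memory/CPU values are illustrative and should be sized to
# your dataset and to the Kubernetes pod limits of your executors.
spark = (
    SparkSession.builder
    .appName("memory-cpu-tuning-example")           # hypothetical app name
    .config("spark.executor.memory", "8g")          # raise if executors hit OOM
    .config("spark.executor.memoryOverhead", "1g")  # off-heap headroom counted against the pod limit
    .config("spark.executor.cores", "4")            # 2-5 cores per executor is a common sweet spot
    .config("spark.task.cpus", "1")                 # >1 only for genuinely CPU-heavy tasks
    # spark.driver.memory must be set before the driver JVM starts
    # (e.g. via spark-submit --driver-memory 4g), so it is not set here.
    .getOrCreate()
)
```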
2. Parallelism & Shuffle Tuning
- spark.default.parallelism: Set to roughly 2–3× the total executor cores. Too low → idle cores and slow stages; too high → scheduling overhead.
- spark.sql.shuffle.partitions: Adjust based on shuffle data size.
- Use spark.reducer.maxSizeInFlight and spark.shuffle.compress to improve shuffle efficiency.

Fix for shuffle errors: Increase the number of shuffle partitions or the executor memory (see the sketch below).
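A sketch of these shuffle settings, assuming a hypothetical cluster of about 40 total executor cores; the numbers are starting points to tune, not fixed recommendations:

```python
from pyspark.sql import SparkSession

# A sketch assuming a hypothetical cluster of ~40 total executor cores.
total_cores = 40

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-example")                           # hypothetical app name
    .config("spark.default.parallelism", str(total_cores * 3))   # RDD default partitions, ~2-3x total cores
    .config("spark.sql.shuffle.partitions", "200")               # DataFrame shuffles; tune to shuffle volume
    .config("spark.reducer.maxSizeInFlight", "96m")              # reduce-side fetch buffer (default 48m)
    .config("spark.shuffle.compress", "true")                    # compress shuffle map output
    .getOrCreate()
)
```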
3. Serialization & Compression
- spark.serializer: Use KryoSerializer for faster serialization.
- spark.io.compression.codec: Use snappy for low CPU overhead.
- Compress serialized RDD partitions (spark.rdd.compress=true) to save memory.

Fix for serialization errors: Register custom classes with Kryo to avoid ClassNotFoundException (see the sketch below).
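A sketch of Kryo plus compression settings; com.example.MyEvent and the buffer size are placeholders for your own classes and workload:

```python
from pyspark.sql import SparkSession

# A sketch of Kryo plus compression; com.example.MyEvent is a placeholder
# for a custom JVM class you would register with Kryo.
spark = (
    SparkSession.builder
    .appName("serialization-example")                                # hypothetical app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.classesToRegister", "com.example.MyEvent")   # register custom classes up front
    .config("spark.kryoserializer.buffer.max", "256m")               # raise on Kryo buffer-overflow errors
    .config("spark.io.compression.codec", "snappy")                  # low CPU overhead
    .config("spark.rdd.compress", "true")                            # trade CPU for memory on serialized RDDs
    .getOrCreate()
)
```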
4. SQL / DataFrame Optimizations
- spark.sql.autoBroadcastJoinThreshold: Reduce it (or set it to -1) if broadcast joins fail.
- spark.sql.execution.arrow.pyspark.enabled=true (spark.sql.execution.arrow.enabled before Spark 3.0) speeds up Pandas conversions in PySpark.
- Repartition large datasets on the join key before heavy joins, as in the sketch below.
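A sketch combining these settings with a pre-join repartition; the S3 paths and the customer_id join key are hypothetical:

```python
from pyspark.sql import SparkSession

# A sketch of the SQL settings plus a pre-join repartition; the S3 paths
# and the customer_id join key are hypothetical.
spark = (
    SparkSession.builder
    .appName("sql-tuning-example")                                           # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))   # 10 MB; -1 disables broadcast joins
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")             # Arrow-accelerated toPandas()/createDataFrame()
    .getOrCreate()
)

orders = spark.read.parquet("s3://my-bucket/orders/")        # hypothetical input
customers = spark.read.parquet("s3://my-bucket/customers/")  # hypothetical input

# Repartition the large side on the join key so the heavy join shuffles evenly.
joined = orders.repartition(200, "customer_id").join(customers, "customer_id")
```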
5. Dynamic Allocation
- Enable spark.dynamicAllocation.enabled=true to auto-scale executors.
- Set minExecutors and maxExecutors carefully to avoid resource contention.
- Works best with Kubernetes HPA and proper pod limits.

Fix for resource issues: Too few executors → slow jobs; too many → pod contention. A sketch follows.
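A sketch of dynamic allocation using the limits from the table above; note that on Kubernetes there is no external shuffle service, so Spark 3.x normally pairs dynamic allocation with shuffle tracking:

```python
from pyspark.sql import SparkSession

# A sketch of dynamic allocation using the limits from the table above.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")                              # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed without an external shuffle service
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "4")
    .getOrCreate()
)
```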
6. Kubernetes + AWS (EKS) Recommendations
- Dedicated namespaces for Spark jobs
- Set resource requests/limits in pods carefully
- Enable dynamic allocation + autoscaling
- Use ephemeral storage for shuffle
- Monitor Spark UI + K8s dashboard + CloudWatch
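A sketch of submitting PySpark straight to a Kubernetes API server with these recommendations applied; the API endpoint, namespace, image, and service account are placeholders for your EKS setup:

```python
from pyspark.sql import SparkSession

# A sketch of running PySpark directly against a Kubernetes API server.
# The endpoint, namespace, image, and service account are placeholders
# for your EKS cluster; CPU/memory values should match your pod limits.
spark = (
    SparkSession.builder
    .appName("eks-example")                                       # hypothetical app name
    .master("k8s://https://<EKS_API_SERVER>:443")                 # placeholder endpoint
    .config("spark.kubernetes.namespace", "spark")                # dedicated Spark namespace
    .config("spark.kubernetes.container.image", "spark:3.5.0")    # placeholder image
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.executor.request.cores", "4")       # pod CPU request
    .config("spark.kubernetes.executor.limit.cores", "4")         # pod CPU limit
    .config("spark.executor.memory", "8g")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```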
7. Sample SparkApplication YAML (Scala)
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-job-sample
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  imagePullPolicy: IfNotPresent
  mainClass: com.example.SparkJob
  mainApplicationFile: local:///opt/spark/jars/my-spark-job_2.12-1.0.jar
  sparkVersion: 3.5.0
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 2
    memory: 4g
    labels:
      version: 3.5.0
    serviceAccount: spark
  executor:
    cores: 4
    instances: 4
    memory: 8g
    labels:
      version: 3.5.0
```
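With the Kubernetes Operator for Apache Spark installed, a manifest like this (saved, for example, as spark-job-sample.yaml) is typically submitted with `kubectl apply -f spark-job-sample.yaml` and its status checked with `kubectl get sparkapplications -n spark`.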