🔥 Apache Spark Configurations Explained
A complete guide to understanding key Spark configuration parameters
**Application basics**

| Configuration | Meaning | Example |
|---|---|---|
| spark.app.name | Name of your Spark application. | MySparkJob |
| spark.master | Cluster manager to connect to (local, yarn, etc.). | local[*] |
| spark.submit.deployMode | Where the driver runs: client or cluster. | cluster |
| spark.home | Location of the Spark installation. | /usr/local/spark |
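As a quick sketch, the first two properties can be set when the session is built and read back from the SparkContext; deploy mode and the Spark home are normally supplied at launch time (via spark-submit or the environment) rather than from inside the application:

```python
from pyspark.sql import SparkSession

# Minimal sketch: set the basic application properties at build time
# and read them back. Assumes a local Spark installation.
spark = (
    SparkSession.builder
    .appName("MySparkJob")        # spark.app.name
    .master("local[*]")           # spark.master: use all local cores
    .getOrCreate()
)

print(spark.sparkContext.appName)  # MySparkJob
print(spark.sparkContext.master)   # local[*]
```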
**Memory and CPU**

| Configuration | Meaning | Example |
|---|---|---|
| spark.executor.memory | Heap memory per executor process. | 4g |
| spark.driver.memory | Heap memory for the driver process. | 2g |
| spark.executor.cores | Number of CPU cores per executor. | 4 |
| spark.memory.fraction | Fraction of the heap (minus 300 MB of reserved memory) shared by execution and storage. | 0.6 |
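A minimal sketch of these sizing knobs, with illustrative values rather than recommendations; note that spark.driver.memory generally has to be set before the driver JVM starts (e.g. via spark-submit), so setting it in code only works when the builder launches a fresh JVM:

```python
from pyspark.sql import SparkSession

# Illustrative sizing; these take effect only when the session is first created.
spark = (
    SparkSession.builder
    .appName("SizedApp")                       # hypothetical app name
    .config("spark.executor.memory", "4g")     # heap per executor
    .config("spark.driver.memory", "2g")       # heap for the driver
    .config("spark.executor.cores", "4")       # concurrent task slots per executor
    .config("spark.memory.fraction", "0.6")    # execution + storage share of heap
    .getOrCreate()
)
```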
**Parallelism and shuffle**

| Configuration | Meaning | Example |
|---|---|---|
| spark.default.parallelism | Default number of partitions for RDD operations. | 8 |
| spark.sql.shuffle.partitions | Number of partitions DataFrames use when shuffling. | 200 |
| spark.shuffle.compress | Compress shuffle output files. | true |
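spark.sql.shuffle.partitions is runtime-mutable, so a sketch like the following can tune it per job; on Spark 3.x, adaptive query execution may coalesce the resulting partitions to fewer than the configured value:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Runtime-mutable: applies to the next shuffle, no session restart needed.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(1_000_000)
agg = df.groupBy((df.id % 10).alias("bucket")).count()

# Typically 8 after the shuffle (AQE may coalesce this in Spark 3.x).
print(agg.rdd.getNumPartitions())
```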
**Serialization and compression**

| Configuration | Meaning | Example |
|---|---|---|
| spark.serializer | Serializer for shuffled and cached data (Kryo or Java). | org.apache.spark.serializer.KryoSerializer |
| spark.io.compression.codec | Codec for compressing internal data such as shuffle and broadcast blocks. | snappy |
| spark.rdd.compress | Compress serialized RDD partitions. | true |
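A minimal sketch enabling Kryo and compression. These settings are fixed at session creation and mainly affect RDD code paths; DataFrames use Spark's internal binary format for most data movement:

```python
from pyspark.sql import SparkSession

# "KryoApp" is an illustrative name; the Kryo class path is the real one.
spark = (
    SparkSession.builder
    .appName("KryoApp")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.io.compression.codec", "snappy")  # lz4 is the default codec
    .config("spark.rdd.compress", "true")
    .getOrCreate()
)
```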
**Spark SQL**

| Configuration | Meaning | Example |
|---|---|---|
| spark.sql.shuffle.partitions | Number of shuffle partitions for joins and aggregations. | 200 |
| spark.sql.autoBroadcastJoinThreshold | Maximum table size, in bytes, for automatic broadcast joins. | 10485760 (10 MB) |
| spark.sql.warehouse.dir | Location of the Spark SQL warehouse. | /user/hive/warehouse |
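A sketch of the broadcast threshold in action: tables smaller than the threshold are broadcast automatically, broadcast() forces the hint regardless of size, and setting the threshold to -1 disables automatic broadcasting entirely:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

# 10 MB threshold (the default); runtime-mutable like other spark.sql.* options.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

big = spark.range(1_000_000).withColumnRenamed("id", "k")
small = spark.range(100).withColumnRenamed("id", "k")

joined = big.join(broadcast(small), "k")  # explicit broadcast hint
joined.explain()                          # plan should show BroadcastHashJoin
```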
**Dynamic allocation**

| Configuration | Meaning | Example |
|---|---|---|
| spark.dynamicAllocation.enabled | Enable dynamic allocation of executors. | true |
| spark.dynamicAllocation.minExecutors | Minimum executors allowed. | 2 |
| spark.dynamicAllocation.maxExecutors | Maximum executors allowed. | 10 |
| spark.dynamicAllocation.initialExecutors | Executors at application start. | 4 |
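A minimal sketch, assuming a YARN cluster: dynamic allocation also needs a way to preserve shuffle data when executors are removed, typically the external shuffle service (spark.shuffle.service.enabled) or, on Spark 3.x, shuffle tracking as shown here:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "4")
    # Spark 3.x alternative to the external shuffle service:
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```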
In PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()
```
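Note that getOrCreate() returns any already-running session unchanged, so resource settings such as spark.executor.memory only apply when the session is first created; runtime-mutable options (most spark.sql.* settings) can still be changed later with spark.conf.set.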
Using spark-submit:

```bash
spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  myjob.jar
```
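Properties set directly in code take the highest precedence, followed by flags and --conf entries passed to spark-submit, then values from spark-defaults.conf.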

