Full Spark Configurations + Advanced Tuning

🔥 Apache Spark Configurations Explained

A complete guide to understanding key Spark configuration parameters

Application settings:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.app.name | Name of your Spark application. | spark.app.name=MySparkJob |
| spark.master | Defines the cluster manager (local, yarn, etc.). | spark.master=local[*] |
| spark.submit.deployMode | Deploy mode: client or cluster. | spark.submit.deployMode=cluster |
| spark.home | Location of the Spark installation. | spark.home=/usr/local/spark |

Resource allocation:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.executor.memory | Memory per executor process. | 4g |
| spark.driver.memory | Memory for the driver process. | 2g |
| spark.executor.cores | Number of CPU cores per executor. | 4 |
| spark.memory.fraction | Fraction of JVM heap used for execution/storage. | 0.6 |

Parallelism & shuffle:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.default.parallelism | Default number of partitions. | 8 |
| spark.sql.shuffle.partitions | Partitions used during shuffles. | 200 |
| spark.shuffle.compress | Compress shuffle output files. | true |

Serialization & compression:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.serializer | Defines the serializer (Kryo or Java). | org.apache.spark.serializer.KryoSerializer |
| spark.io.compression.codec | Compression codec used. | snappy |
| spark.rdd.compress | Compress serialized RDD partitions. | true |

Spark SQL:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.sql.shuffle.partitions | Number of shuffle partitions for joins/aggregations. | 200 |
| spark.sql.autoBroadcastJoinThreshold | Broadcast join threshold in bytes. | 10485760 |
| spark.sql.warehouse.dir | Location of the Spark SQL warehouse. | /user/hive/warehouse |

Dynamic allocation:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.dynamicAllocation.enabled | Enable dynamic allocation. | true |
| spark.dynamicAllocation.minExecutors | Minimum number of executors allowed. | 2 |
| spark.dynamicAllocation.maxExecutors | Maximum number of executors allowed. | 10 |
| spark.dynamicAllocation.initialExecutors | Executors at application start. | 4 |

In PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

Using spark-submit:

spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  myjob.jar

💎 Advanced Spark Tuning & Best Practices (AWS + Kubernetes)

Optimize your Spark jobs for performance, reliability, and resource efficiency on Kubernetes (EKS) infrastructure.

1. Memory & CPU Configuration

Misconfigured memory and CPU settings are the most common reason Spark jobs fail with out-of-memory (OOM) errors or stall in long GC pauses.

  • spark.executor.memory: Allocate enough memory per executor based on dataset size. Increase for OOM; decrease if underutilized.
  • spark.driver.memory: Increase if driver OOMs occur during large datasets.
  • spark.executor.cores: 2–5 cores per executor for AWS pods; too many cores can cause GC delays.
  • spark.task.cpus: Usually 1; increase for CPU-intensive tasks.
Tip: Match executor memory & CPU requests to Kubernetes pod limits.
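
For example, a minimal PySpark sketch of these settings (the app name and sizes are illustrative, not recommendations; size them to your data and pod limits):

from pyspark.sql import SparkSession

# Illustrative values only -- tune to your dataset and Kubernetes pod limits.
spark = (
    SparkSession.builder
    .appName("memory-tuning-example")                 # hypothetical app name
    .config("spark.executor.memory", "4g")            # executor JVM heap
    .config("spark.executor.memoryOverhead", "1g")    # off-heap headroom; counts toward the pod limit
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)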

2. Parallelism & Shuffle Tuning

  • spark.default.parallelism: 2–3× total executor cores. Too low → slow; too high → overhead.
  • spark.sql.shuffle.partitions: Adjust based on shuffle size.
  • Use spark.reducer.maxSizeInFlight and spark.shuffle.compress for shuffle efficiency.
Fix for shuffle errors: Increase shuffle partitions or adjust executor memory.
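
As a sketch, the shuffle knobs above can be set when building the session; the partition counts below assume, say, 4 executors with 4 cores each and are illustrative:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-example")                 # hypothetical app name
    .config("spark.default.parallelism", "48")         # ~2-3x total executor cores
    .config("spark.sql.shuffle.partitions", "200")     # raise for large shuffles, lower for small ones
    .config("spark.shuffle.compress", "true")
    .config("spark.reducer.maxSizeInFlight", "96m")    # default is 48m; larger buffers mean fewer fetch requests
    .getOrCreate()
)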

3. Serialization & Compression

  • spark.serializer: Use KryoSerializer for faster serialization.
  • spark.io.compression.codec: Use snappy for low CPU overhead.
  • Compress RDDs (spark.rdd.compress=true) to save memory.
Fix for serialization errors: Register custom classes with Kryo and make sure their jars are on the executor classpath; otherwise you may hit "Class is not registered" or ClassNotFoundException errors.
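
A sketch of a Kryo setup from PySpark; com.example.MyEvent is a hypothetical JVM class, shown only to illustrate registration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kryo-example")                             # hypothetical app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.classesToRegister", "com.example.MyEvent")  # hypothetical class
    .config("spark.kryo.registrationRequired", "false")  # set true to surface unregistered classes early
    .config("spark.rdd.compress", "true")
    .config("spark.io.compression.codec", "snappy")
    .getOrCreate()
)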

4. SQL / DataFrame Optimizations

  • spark.sql.autoBroadcastJoinThreshold: Reduce if broadcast joins fail.
  • spark.sql.execution.arrow.pyspark.enabled=true speeds up Pandas conversions in Python (the older spark.sql.execution.arrow.enabled name is deprecated).
  • Repartition large datasets before heavy joins.
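
A hedged sketch of that join pattern; the DataFrames and the key column are placeholders built from spark.range so the example is self-contained:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("sql-tuning-example")                     # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))  # 10 MB; -1 disables broadcast joins
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Placeholder data -- replace with your own DataFrames and join key.
df_large = spark.range(1_000_000).withColumnRenamed("id", "key")
df_small = spark.range(1_000).withColumnRenamed("id", "key")

# Repartition the large side on the join key before the heavy join,
# and broadcast the small side explicitly.
result = df_large.repartition(200, "key").join(broadcast(df_small), "key")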

5. Dynamic Allocation

  • Enable spark.dynamicAllocation.enabled=true for auto-scaling executors.
  • Set minExecutors & maxExecutors carefully to avoid contention.
  • Works best with Kubernetes HPA & proper pod limits.
Fix for resource issues: Too few executors → slow jobs; too many → pod contention.
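
A sketch of these settings for Kubernetes; note that K8s has no external shuffle service, so dynamic allocation relies on shuffle tracking (values are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")             # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed on K8s (no external shuffle service)
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "4")
    .getOrCreate()
)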

6. Kubernetes + AWS (EKS) Recommendations

  • Dedicated namespaces for Spark jobs
  • Set resource requests/limits in pods carefully
  • Enable dynamic allocation + autoscaling
  • Use ephemeral storage for shuffle
  • Monitor Spark UI + K8s dashboard + CloudWatch
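
The pod-level knobs map to ordinary confs; a hedged PySpark sketch for client mode against EKS (the API endpoint, image, namespace, and service account are placeholders for your cluster):

from pyspark.sql import SparkSession

# Placeholders: substitute your EKS API endpoint, image, and service account.
spark = (
    SparkSession.builder
    .master("k8s://https://EKS_API_SERVER:443")
    .appName("spark-on-eks-example")                   # hypothetical app name
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "spark:3.5.0")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .config("spark.kubernetes.executor.request.cores", "2")  # pod CPU request
    .config("spark.kubernetes.executor.limit.cores", "4")    # pod CPU limit
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)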

7. Sample SparkApplication YAML (Scala)

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-job-sample
  namespace: spark                       # run in a dedicated Spark namespace
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  imagePullPolicy: IfNotPresent
  mainClass: com.example.SparkJob
  mainApplicationFile: local:///opt/spark/jars/my-spark-job_2.12-1.0.jar
  sparkVersion: 3.5.0
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3                  # retries after a runtime failure
    retryInterval: 10                    # seconds between retries
    onSubmissionFailureRetries: 5        # retries after a failed submission
    submissionFailureRetryInterval: 20   # seconds between submission retries
  driver:
    cores: 2
    memory: 4g
    labels:
      version: 3.5.0
    serviceAccount: spark                # needs RBAC to create executor pods
  executor:
    cores: 4
    instances: 4
    memory: 8g
    labels:
      version: 3.5.0
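
Assuming the Spark Operator is installed in the cluster, the manifest above can be submitted with kubectl apply -f spark-job-sample.yaml and watched with kubectl get sparkapplications -n spark.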
