Full Spark Configurations + Advanced Tuning

🔥 Apache Spark Configurations Explained

A complete guide to understanding key Spark configuration parameters

Application settings:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.app.name | Name of your Spark application. | spark.app.name=MySparkJob |
| spark.master | Defines the cluster manager (local, yarn, etc.). | spark.master=local[*] |
| spark.submit.deployMode | Deploy mode: client or cluster. | spark.submit.deployMode=cluster |
| spark.home | Location of the Spark installation. | spark.home=/usr/local/spark |

Resource allocation:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.executor.memory | Memory per executor process. | 4g |
| spark.driver.memory | Memory for the driver process. | 2g |
| spark.executor.cores | Number of CPU cores per executor. | 4 |
| spark.memory.fraction | Fraction of JVM heap used for execution/storage. | 0.6 |

Parallelism & shuffle:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.default.parallelism | Default number of partitions. | 8 |
| spark.sql.shuffle.partitions | Partitions used during shuffles. | 200 |
| spark.shuffle.compress | Compress shuffle output files. | true |

Serialization & compression:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.serializer | Defines the serializer (Kryo or Java). | org.apache.spark.serializer.KryoSerializer |
| spark.io.compression.codec | Compression codec used. | snappy |
| spark.rdd.compress | Compress serialized RDD partitions. | true |

Spark SQL:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.sql.shuffle.partitions | Number of shuffle partitions for joins/aggregations. | 200 |
| spark.sql.autoBroadcastJoinThreshold | Broadcast join threshold in bytes. | 10485760 |
| spark.sql.warehouse.dir | Location of the Spark SQL warehouse. | /user/hive/warehouse |

Dynamic allocation:

| Configuration | Meaning | Example |
| --- | --- | --- |
| spark.dynamicAllocation.enabled | Enable dynamic allocation. | true |
| spark.dynamicAllocation.minExecutors | Minimum number of executors allowed. | 2 |
| spark.dynamicAllocation.maxExecutors | Maximum number of executors allowed. | 10 |
| spark.dynamicAllocation.initialExecutors | Executors at application start. | 4 |

In PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

Using spark-submit:

spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  myjob.jar

💎 Advanced Spark Tuning & Best Practices (AWS + Kubernetes)

Optimize your Spark jobs for performance, reliability, and resource efficiency on Kubernetes (EKS) infrastructure.

1. Memory & CPU Configuration

Misconfigured memory and CPU settings are the most common reason Spark jobs fail with out-of-memory (OOM) errors or stall in long GC pauses.

  • spark.executor.memory: Allocate enough memory per executor based on dataset size. Increase for OOM; decrease if underutilized.
  • spark.driver.memory: Increase if driver OOMs occur during large datasets.
  • spark.executor.cores: 2–5 cores per executor for AWS pods; too many cores can cause GC delays.
  • spark.task.cpus: Usually 1; increase for CPU-intensive tasks.
Tip: Match executor memory & CPU requests to Kubernetes pod limits.
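
For example, a minimal PySpark sketch of these settings (the app name and sizes are illustrative, not recommendations; size them to your data and pod limits):

from pyspark.sql import SparkSession

# Illustrative values only -- tune to your dataset and Kubernetes pod limits.
spark = (
    SparkSession.builder
    .appName("memory-tuning-example")                 # hypothetical app name
    .config("spark.executor.memory", "4g")            # executor JVM heap
    .config("spark.executor.memoryOverhead", "1g")    # off-heap headroom; counts toward the pod limit
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)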

2. Parallelism & Shuffle Tuning

  • spark.default.parallelism: 2–3× total executor cores. Too low → slow; too high → overhead.
  • spark.sql.shuffle.partitions: Adjust based on shuffle size.
  • Use spark.reducer.maxSizeInFlight and spark.shuffle.compress for shuffle efficiency.
Fix for shuffle errors: Increase shuffle partitions or adjust executor memory.
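
As a sketch, the shuffle knobs above can be set when building the session; the partition counts below assume, say, 4 executors with 4 cores each and are illustrative:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-example")                 # hypothetical app name
    .config("spark.default.parallelism", "48")         # ~2-3x total executor cores
    .config("spark.sql.shuffle.partitions", "200")     # raise for large shuffles, lower for small ones
    .config("spark.shuffle.compress", "true")
    .config("spark.reducer.maxSizeInFlight", "96m")    # default is 48m; larger buffers mean fewer fetch requests
    .getOrCreate()
)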

3. Serialization & Compression

  • spark.serializer: Use KryoSerializer for faster serialization.
  • spark.io.compression.codec: Use snappy for low CPU overhead.
  • Compress RDDs (spark.rdd.compress=true) to save memory.
Fix for serialization errors: Register custom classes with Kryo and make sure their jars are on the executor classpath; otherwise you may hit "Class is not registered" or ClassNotFoundException errors.
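
A sketch of a Kryo setup from PySpark; com.example.MyEvent is a hypothetical JVM class, shown only to illustrate registration:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kryo-example")                             # hypothetical app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.classesToRegister", "com.example.MyEvent")  # hypothetical class
    .config("spark.kryo.registrationRequired", "false")  # set true to surface unregistered classes early
    .config("spark.rdd.compress", "true")
    .config("spark.io.compression.codec", "snappy")
    .getOrCreate()
)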

4. SQL / DataFrame Optimizations

  • spark.sql.autoBroadcastJoinThreshold: Reduce if broadcast joins fail.
  • spark.sql.execution.arrow.pyspark.enabled=true speeds up Pandas conversions in Python (the older spark.sql.execution.arrow.enabled name is deprecated).
  • Repartition large datasets before heavy joins.
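
A hedged sketch of that join pattern; the DataFrames and the key column are placeholders built from spark.range so the example is self-contained:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("sql-tuning-example")                     # hypothetical app name
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))  # 10 MB; -1 disables broadcast joins
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

# Placeholder data -- replace with your own DataFrames and join key.
df_large = spark.range(1_000_000).withColumnRenamed("id", "key")
df_small = spark.range(1_000).withColumnRenamed("id", "key")

# Repartition the large side on the join key before the heavy join,
# and broadcast the small side explicitly.
result = df_large.repartition(200, "key").join(broadcast(df_small), "key")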

5. Dynamic Allocation

  • Enable spark.dynamicAllocation.enabled=true for auto-scaling executors.
  • Set minExecutors & maxExecutors carefully to avoid contention.
  • Works best with Kubernetes HPA & proper pod limits.
Fix for resource issues: Too few executors → slow jobs; too many → pod contention.
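
A sketch of these settings for Kubernetes; note that K8s has no external shuffle service, so dynamic allocation relies on shuffle tracking (values are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")             # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # needed on K8s (no external shuffle service)
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "4")
    .getOrCreate()
)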

6. Kubernetes + AWS (EKS) Recommendations

  • Dedicated namespaces for Spark jobs
  • Set resource requests/limits in pods carefully
  • Enable dynamic allocation + autoscaling
  • Use ephemeral storage for shuffle
  • Monitor Spark UI + K8s dashboard + CloudWatch
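
The pod-level knobs map to ordinary confs; a hedged PySpark sketch for client mode against EKS (the API endpoint, image, namespace, and service account are placeholders for your cluster):

from pyspark.sql import SparkSession

# Placeholders: substitute your EKS API endpoint, image, and service account.
spark = (
    SparkSession.builder
    .master("k8s://https://EKS_API_SERVER:443")
    .appName("spark-on-eks-example")                   # hypothetical app name
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "spark:3.5.0")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .config("spark.kubernetes.executor.request.cores", "2")  # pod CPU request
    .config("spark.kubernetes.executor.limit.cores", "4")    # pod CPU limit
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)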

7. Sample SparkApplication YAML (Scala)

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-job-sample
  namespace: spark                       # run in a dedicated Spark namespace
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  imagePullPolicy: IfNotPresent
  mainClass: com.example.SparkJob
  mainApplicationFile: local:///opt/spark/jars/my-spark-job_2.12-1.0.jar
  sparkVersion: 3.5.0
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3                  # retries after a runtime failure
    retryInterval: 10                    # seconds between retries
    onSubmissionFailureRetries: 5        # retries after a failed submission
    submissionFailureRetryInterval: 20   # seconds between submission retries
  driver:
    cores: 2
    memory: 4g
    labels:
      version: 3.5.0
    serviceAccount: spark                # needs RBAC to create executor pods
  executor:
    cores: 4
    instances: 4
    memory: 8g
    labels:
      version: 3.5.0
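
Assuming the Spark Operator is installed in the cluster, the manifest above can be submitted with kubectl apply -f spark-job-sample.yaml and watched with kubectl get sparkapplications -n spark.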
