Resource allocation configurations for Spark on Yarn
Env:
Spark 1.3.1
Hadoop 2.5.1
Spark on Yarn mode
Goal:
This article explains the resource allocation configurations for Spark on Yarn with examples. It covers the two deploy modes: yarn-client and yarn-cluster.
Solution:
Spark can request 2 resources in YARN: CPU and memory.
Note: Spark configurations for resource allocation are set in spark-defaults.conf with names like spark.xx.xx. Some of them have a corresponding flag for the client tools spark-submit/spark-shell/pyspark, with names like --xx-xx.
In the following tables, if a configuration has a corresponding flag for the client tools, we put the flag after the configuration in parentheses "()". For example:
spark.driver.cores (--driver-cores)
1. yarn-client VS yarn-cluster mode
Per the Spark documentation:
- In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
- In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
2. Application Master (AM)
a. yarn-client
| Configuration | Description | Default Value |
| --- | --- | --- |
| spark.yarn.am.cores | Number of CPU cores used by AM | 1 |
| spark.yarn.am.memory | JAVA heap size of AM | 512m |
| spark.yarn.am.memoryOverhead | The amount of off-heap memory (in megabytes) to be allocated in AM | AM memory * 0.07, with minimum of 384 |
Take the settings below for example:
[root@h1 conf]# cat spark-defaults.conf | grep am
spark.yarn.am.cores 4
spark.yarn.am.memory 777m
This means that if we set spark.yarn.am.memory to 777m, the actual AM container size will be 2G: 777 + max(384, 777 * 0.07) = 777 + 384 = 1161, and since YARN rounds the request up to a multiple of yarn.scheduler.minimum-allocation-mb (1024 by default), a 2GB container is allocated for the AM.
As a result, a (2G, 4 Cores) AM container with JAVA heap size -Xmx777M is allocated:
Assigned container container_1432752481069_0129_01_000001 of capacity <memory:2048, vCores:4, disks:0.0>
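The sizing rule above can be sketched as follows. This is a minimal illustration of the quoted formulas, not Spark's actual code; `container_size_mb` is our own helper name, and the constants mirror the defaults discussed in this article:

```python
import math

MIN_ALLOCATION_MB = 1024   # yarn.scheduler.minimum-allocation-mb default
OVERHEAD_FACTOR = 0.07     # memoryOverhead factor in Spark 1.3.x
OVERHEAD_MIN_MB = 384      # minimum memoryOverhead in megabytes

def container_size_mb(heap_mb):
    # Overhead is the larger of 384 MB and 7% of the heap size.
    overhead = max(OVERHEAD_MIN_MB, int(heap_mb * OVERHEAD_FACTOR))
    requested = heap_mb + overhead
    # YARN rounds the request up to a multiple of the minimum allocation.
    return math.ceil(requested / MIN_ALLOCATION_MB) * MIN_ALLOCATION_MB

print(container_size_mb(777))    # 777 + 384 = 1161  -> 2048 (the 2G AM container)
```

The same arithmetic applies to the yarn-cluster AM and to executor containers later in this article.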
b. yarn-cluster
Since the Spark driver runs inside the YARN AM in yarn-cluster mode, the driver-related configurations below also control the resource allocation for the AM.
| Configuration | Description | Default Value |
| --- | --- | --- |
| spark.driver.cores (--driver-cores) | Number of CPU cores used by AM | 1 |
| spark.driver.memory (--driver-memory) | JAVA heap size of AM | 512m |
| spark.yarn.driver.memoryOverhead | The amount of off-heap memory (in megabytes) to be allocated in AM | AM memory * 0.07, with minimum of 384 |
Take the settings below for example:
MASTER=yarn-cluster /opt/mapr/spark/spark-1.3.1/bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --driver-memory 1665m \
  --driver-cores 2 \
  /opt/mapr/spark/spark-1.3.1/lib/spark-examples*.jar 1000
As a result, a (3G, 2 Cores) AM container with JAVA heap size -Xmx1665M is allocated:
Assigned container container_1432752481069_0135_02_000001 of capacity <memory:3072, vCores:2, disks:0.0>
3. Containers for Spark executors
For Spark executors' resources, yarn-client and yarn-cluster modes use the same configurations.
| Configuration | Description | Default Value |
| --- | --- | --- |
| spark.executor.instances (--num-executors) | The number of executors | 2 |
| spark.executor.cores (--executor-cores) | Number of CPU cores used by each executor | 1 |
| spark.executor.memory (--executor-memory) | JAVA heap size of each executor | 512m |
| spark.yarn.executor.memoryOverhead | The amount of off-heap memory (in megabytes) to be allocated per executor | executorMemory * 0.07, with minimum of 384 |
Out of the box in spark-defaults.conf, spark.executor.memory is set to 2g.
Spark will start 2 (3G, 1 Core) executor containers with JAVA heap size -Xmx2048M:
Assigned container container_1432752481069_0140_01_000002 of capacity <memory:3072, vCores:1, disks:0.0>
Assigned container container_1432752481069_0140_01_000003 of capacity <memory:3072, vCores:1, disks:0.0>
However, 1 core per executor means only 1 task can be running at any time in one executor.
In the case of broadcast join, the memory can be shared by multiple running tasks in the same executor if we increase the number of cores per executor.
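The effect of cores on concurrency can be illustrated with simple arithmetic (our own helper, not a Spark API): the application can run one task per core per executor at any moment.

```python
def concurrent_task_slots(num_executors, executor_cores):
    # One task per core per executor can run at the same time.
    return num_executors * executor_cores

print(concurrent_task_slots(2, 1))   # 2: the defaults above allow only 2 concurrent tasks
print(concurrent_task_slots(2, 4))   # 8: raising --executor-cores to 4 quadruples concurrency
```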
Note: if dynamic resource allocation is enabled by setting spark.dynamicAllocation.enabled to true, Spark can scale the number of executors registered with this application up and down based on the workload. In this case, you do not need to specify spark.executor.instances manually.
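As a minimal spark-defaults.conf sketch of this setup (the min/max values here are illustrative; note that dynamic allocation also requires the external shuffle service to be enabled):

```
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
```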
Key takeaways:
- Spark driver resource related configurations also control the YARN application master resource in yarn-cluster mode.
- Be aware of the max(7%, 384m) overhead off heap memory when calculating the memory for executors.
- The number of CPU cores per executor controls the number of concurrent tasks per executor.