Spark on Yarn job fails when launching container

Symptom:

When running a Spark job on Yarn:
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples*.jar 10
the job fails with the following error message in the resource manager log:
INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1417043690539_0013 failed 2 times due to AM Container for appattempt_1417043690539_0013_000002 exited with  exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:295)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:314)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

main : command provided 1
main : user is root
main : requested yarn user is root

Container exited with a non-zero exit code 1
.Failing this attempt.. Failing the application.

Root Cause:

To find the root cause, we can follow the troubleshooting path below. This is especially useful when Yarn log aggregation is not enabled (by default, yarn.log-aggregation-enable=false).
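If enabling log aggregation is an option, it makes this kind of debugging much easier: once it is on, `yarn logs -applicationId <application_id>` collects the logs of all containers of an application in one place, instead of requiring a log hunt across nodes. The property is set in yarn-site.xml on all nodes (a Yarn restart is required afterwards):

```xml
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
```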

1. From the resource manager log, identify which node manager reported the failure.

By default, both the resource manager log and the node manager log are located at /opt/mapr/hadoop/hadoop-2.4.1/logs .
In this case, the attempt ran on node "yarn-fcs-2":
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application attempt appattempt_1417043690539_0013_000002 released container 
container_1417043690539_0013_02_000001 on node: host: yarn-fcs-2:42846
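Locating the node can be done with a simple grep for the attempt ID. A minimal sketch, using the log line above as sample data (on a real cluster, grep the actual resource manager log under /opt/mapr/hadoop/hadoop-2.4.1/logs instead):

```shell
# Write the sample ResourceManager log line from this article to a temp file.
RM_LOG=$(mktemp)
cat > "$RM_LOG" <<'EOF'
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1417043690539_0013_000002 released container container_1417043690539_0013_02_000001 on node: host: yarn-fcs-2:42846
EOF

# Grep for the failed attempt ID and extract the node manager host name.
node=$(grep "appattempt_1417043690539_0013_000002" "$RM_LOG" \
  | sed 's/.*host: \([^:]*\):.*/\1/')
echo "$node"
# prints yarn-fcs-2
```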

2. From the node manager log on that node, identify the failing container and its error message.

In this case, the container is container_1417043690539_0013_02_000001, and the error message in the node manager log is:
WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: 
Exception from container-launch with container ID: container_1417043690539_0013_02_000001 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:295)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:314)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
The above error message tells us the failure happened while launching the container on this node.
However, it does not tell us the reason.
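The same grep approach works on the node manager side: search its log for the container ID and extract the exit code. Again sketched against the sample log line from this article:

```shell
# Write the sample NodeManager log line from this article to a temp file.
NM_LOG=$(mktemp)
cat > "$NM_LOG" <<'EOF'
WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_1417043690539_0013_02_000001 and exit code: 1
EOF

# Extract the exit code reported for the failed container.
code=$(grep "container_1417043690539_0013_02_000001" "$NM_LOG" \
  | sed 's/.*exit code: \([0-9]*\).*/\1/')
echo "$code"
# prints 1
```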

3. Check the container's stdout and stderr for the cause of the failure.

The container log location is determined by the parameter "yarn.nodemanager.log-dirs" in yarn-site.xml.
By default, it is set to ${yarn.log.dir}/userlogs, which resolves to $HADOOP_YARN_HOME/logs/userlogs.
In this case, the container log is located here:
/opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs/application_1417043690539_0013/container_1417043690539_0013_02_000001
The reason for the failure is:
# cat stderr
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
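Container logs follow a predictable layout: <log-dir>/<application_id>/<container_id>/{stdout,stderr}. A sketch of building that path, with the directory layout simulated in a temp directory for illustration (on the failing node, LOG_DIRS would be /opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs, the default in this article):

```shell
# Container logs live at <log-dirs>/<application_id>/<container_id>/{stdout,stderr}.
APP=application_1417043690539_0013
CONTAINER=container_1417043690539_0013_02_000001

# Simulated layout; on a real node, set LOG_DIRS to the yarn.nodemanager.log-dirs value.
LOG_DIRS=$(mktemp -d)
mkdir -p "$LOG_DIRS/$APP/$CONTAINER"
echo "Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" \
  > "$LOG_DIRS/$APP/$CONTAINER/stderr"

# Read the container's stderr to see why the launch failed.
cat "$LOG_DIRS/$APP/$CONTAINER/stderr"
```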

4. Find out which jar file contains the missing class.

Searching the Spark installation shows the class is packaged in:
/opt/mapr/spark/spark-1.1.0-bin-2.4.1-mapr-1408/lib/spark-assembly-1.1.0-hadoop2.4.1-mapr-1408.jar
This jar file is essential for running Spark on Yarn: it must be on the container's classpath.
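Mapping a missing class name to the jar that should contain it is mechanical: convert the dots to slashes, append .class, and scan jar listings for that entry. A sketch (the scan loop is commented out because it needs the Spark lib directory from this article's layout on the local machine):

```shell
# Convert the missing class name into its entry path inside a jar.
CLASS=org.apache.spark.deploy.yarn.ExecutorLauncher
ENTRY="$(echo "$CLASS" | tr '.' '/').class"
echo "$ENTRY"
# prints org/apache/spark/deploy/yarn/ExecutorLauncher.class

# On a node with the Spark install (path from this article), scan each jar:
# for jar in /opt/mapr/spark/spark-1.1.0-bin-2.4.1-mapr-1408/lib/*.jar; do
#   unzip -l "$jar" 2>/dev/null | grep -q "$ENTRY" && echo "$jar"
# done
```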

Solution:

1. Add spark-assembly-<version>.jar to the property "yarn.application.classpath" in yarn-site.xml.
<property>
    <name>yarn.application.classpath</name>
    <value>/opt/mapr/spark/spark-1.1.0-bin-2.4.1-mapr-1408/lib/spark-assembly-1.1.0-hadoop2.4.1-mapr-1408.jar,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/lib/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/*,/contrib/capacity-scheduler/*.jar,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/*,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib/*
    </value>
</property>
2. Restart the resource manager and all node managers.
After that, the job runs successfully.
