Apache Spark Streaming: Logging Driver and Executor Logs the Right Way

In my experience, setting up logging properly is one of the most important things to do first when starting Spark Streaming development, especially when you are running on a cluster with multiple worker machines.
The reason is simple: a streaming job is a continuously running process, and an exception or error may only surface after many hours or days, and it may originate in the driver or in an executor. The root cause is hard to debug because driver logs go to the console and cannot be seen after the application shuts down, while executor logs go to stdout/stderr files (I am using Mesos as the cluster manager), which are tedious to download and inspect. In my case, an out-of-memory issue appeared after 2 days of running and the application went down; I had to determine whether the driver or an executor was the actual culprit where the issue first occurred. So I set up this logging configuration properly before debugging the issue.
Logging also lets us control how many days of logs we retain for the driver and executors, so that disk space is not eaten up by an ever-running application; for example, keeping 10 rolled files of 500 MB each caps each log at about 5 GB per machine. And if we are running multiple Spark Streaming applications on the same cluster, we can direct logging to separate log files for different executors, even when multiple executors happen to run on the same worker machine.
Logging Configuration Steps:
I am using the standard Apache log4j library for logging, with appropriate log levels in the code. A default Spark log4j properties template can be found in the Spark conf directory; in my case it is at /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j.properties.template
1. Create separate log4j configuration files for the driver and the executor and place them in the conf/ directory. Each should be configured with a different log file name and rolling properties as your use case requires. If you run multiple applications, create a separate pair of files for each application. For example, I have 2 streaming applications named Request and Event, so I created 4 files:

log4j_RequestLogDriver.properties
log4j_RequestLogExecutor.properties
log4j_EventLogDriver.properties
log4j_EventLogExecutor.properties

Contents of a log4j file (here, the Request driver's; the log file name is illustrative, and the temp/ log dir and rolling values match step 2 below) should look like:
log4j.rootCategory=INFO, FILE
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=temp/RequestLogDriver.log
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
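The matching executor file (log4j_RequestLogExecutor.properties) would differ only in the log file it writes to, so that driver and executor logs stay separate even on the same machine. An illustrative fragment (the file name here is an example, not from the original post):

```properties
log4j.appender.FILE.File=temp/RequestLogExecutor.log
```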
2. Copy the above files to the same conf/ directory on each worker machine, and create the log directory on each of them. I have specified the log dir as temp/ for now, which already exists. As shown above, I keep only the 10 most recent log files of max size 500 MB each for my driver and executor logs on each worker machine, which suffices for 4-5 days of log retention for my specific use case.
3. Include the log4j configuration for the driver and executor in the spark-submit command, as below:
      spark-submit --driver-java-options "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties" --master mesos://sparkclusterip:5050 --jars ……so on

Troubleshooting a logging issue:
Ideally, as per the Spark documentation, this configuration is enough to log the driver and executors to separate log files.
However, for some unknown reason, in my case the driver logs were being written to file properly, but the executor logs were not going into their file and were still appearing in the Mesos cluster's standard stdout/stderr files.
Setting the same driver and executor Java options on the SparkConf object in code, via the application.conf properties file, did the trick for me. This is another way of setting Spark properties if you don't want to include them in the spark-submit command.
application.conf:
request-driver-options = "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
request-executor-options = "-XX:+UseG1GC -Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties"
Code:
sparkConf.set("spark.driver.extraJavaOptions", conf.getString("application.request-driver-options"))
sparkConf.set("spark.executor.extraJavaOptions", conf.getString("application.request-executor-options"))

After doing the above, I can now see all logs being written only to the log files. On the Mesos UI, the stderr/stdout files are now empty, whereas they used to contain the executor logs.
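Putting the pieces together, here is a minimal sketch of how the option strings stored in application.conf could be built and applied; the object and method names are illustrative (not from this post), while the paths and property keys match the configuration above. The key point is that these must be set on the SparkConf before the StreamingContext is created.

```scala
// Illustrative sketch: building the extraJavaOptions strings that the post
// stores in application.conf. Object/method names here are hypothetical;
// the conf directory and properties file names follow the post.
object Log4jOptions {
  val ConfDir = "/usr/local/spark-1.5.1-bin-hadoop2.6/conf"

  // Build a -Dlog4j.configuration JVM option pointing at a properties file.
  def option(propsFile: String): String =
    s"-Dlog4j.configuration=file:$ConfDir/$propsFile"

  def main(args: Array[String]): Unit = {
    val driverOpts   = option("log4j_RequestLogDriver.properties")
    val executorOpts = "-XX:+UseG1GC " + option("log4j_RequestLogExecutor.properties")

    // In the real application these are set on the SparkConf BEFORE the
    // StreamingContext is created, exactly as in the snippet above:
    //   sparkConf.set("spark.driver.extraJavaOptions", driverOpts)
    //   sparkConf.set("spark.executor.extraJavaOptions", executorOpts)
    println(driverOpts)
    println(executorOpts)
  }
}
```

Loading the strings from application.conf (as in the snippet above) keeps the spark-submit command short and makes it easy to vary the log4j files per application.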

Although this post describes logging configuration based on my experience with Spark Streaming on a Mesos cluster, it will be similar for batch Spark jobs and for other cluster managers such as standalone and YARN. Also, I am using Spark 1.5.1 right now, so behaviour may change in future releases as Spark evolves further, e.g. in cases like graceful shutdown; see my last post.

