Spark Streaming performance tuning

Spark Streaming provides an efficient and convenient way to process streams, but in some scenarios the default configuration is not optimal, and may not even keep up with data arriving from external sources in real time. In those cases we need to change the relevant defaults. Because real workloads and data volumes differ, no single configuration fits everyone (otherwise the Spark Streaming developers would not expose so many parameters and would simply hard-code them). We have to choose settings according to our own data volume and scenario. What follows is only advice: these tunings will not necessarily suit your program, and a good configuration is found by gradual experimentation.
1, set a reasonable batch interval (batchDuration).
When building a StreamingContext, we pass in a parameter that sets the Spark Streaming batch interval. Spark submits one job every batchDuration. If a job takes longer than batchDuration to run, the next job cannot be submitted on time. Over time more and more jobs fall behind, until the whole Streaming application is blocked and data can no longer be processed in real time, which is definitely not what we want.
In addition, although batchDuration can be set at millisecond granularity, experience shows that too small a value leads to frequent job submission, which burdens the whole Streaming application, so try not to set it below 500 ms. In many cases 500 ms already performs well.
So how do you find a good value? Start with a fairly large value (such as 10 s). If the jobs finish quickly, reduce the value step by step until the Streaming job is only just able to finish each batch before the next one is due. That value is the one we want.
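To see why a processing time longer than the batch interval is a problem, here is a minimal plain-Python sketch (not Spark code). It assumes batches are processed strictly one at a time and that every batch takes the same time to process; the function name and the numbers are illustrative only.

```python
def scheduling_delays(batch_interval, processing_time, num_batches):
    """Return how long each batch waits before its processing starts.

    Assumption: batch i becomes ready at (i + 1) * batch_interval and
    batches are processed one at a time, in order.
    """
    delays = []
    free_at = 0.0  # time at which the previous batch finishes
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, free_at)  # wait if the previous batch is still running
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


keeping_up = scheduling_delays(1.0, 0.8, 5)
falling_behind = scheduling_delays(1.0, 1.5, 5)
print(keeping_up)      # [0.0, 0.0, 0.0, 0.0, 0.0]
print(falling_behind)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

With a 1 s interval and 1.5 s of processing per batch, each batch starts 0.5 s later than the one before it, so the delay grows without bound.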

2, increase job parallelism
We need to make full use of the cluster by spreading tasks across as many nodes as possible. This uses the cluster's resources fully and also lets data be processed promptly. For example, when using Streaming to receive data from Kafka, we can set up one receiver per Kafka partition, which balances the load and keeps processing timely (on how to use Streaming to read data from Kafka, ...).
As another example, functions such as reduceByKey() and join() accept a parameter that sets their parallelism.
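The effect of a parallelism parameter can be pictured with a small plain-Python model. This is not Spark's implementation, only a sketch of the idea: keys are hashed into a chosen number of partitions, and each partition can then be reduced independently, one task per partition. The helper name and the sample records are made up for illustration.

```python
from collections import defaultdict
from zlib import crc32


def reduce_by_key(records, func, num_partitions):
    """Hash keys into partitions, then reduce each partition on its own."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        idx = crc32(key.encode("utf-8")) % num_partitions
        partitions[idx][key].append(value)

    results = []
    for part in partitions:  # in a cluster, each iteration would be a separate task
        reduced = {}
        for key, values in part.items():
            acc = values[0]
            for v in values[1:]:
                acc = func(acc, v)
            reduced[key] = acc
        results.append(reduced)
    return results


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = reduce_by_key(records, lambda x, y: x + y, num_partitions=4)
merged = {k: v for part in parts for k, v in part.items()}
print(merged)
```

Each key lands in exactly one partition, so the per-partition results can be merged without conflicts. More partitions mean more independent units of work that can be spread across nodes.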

3, use Kryo serialization.
By default Spark uses Java's built-in serialization. It can serialize any class that implements java.io.Serializable, but its performance is poor. If this becomes a bottleneck, switch to Kryo serialization. Storing data in serialized form can also improve GC behavior.

4, cache frequently used data
For data that is used frequently, we can explicitly call rdd.cache() to cache it. This speeds up processing, at the cost of more memory.

5, clear data that is no longer needed
As time passes, some data is no longer needed, yet it stays cached in memory and consumes valuable memory. We can set spark.cleaner.ttl to a reasonable value, but not too small a one: if data is cleared while a later computation still needs it, that causes unnecessary trouble. We can also set spark.streaming.unpersist to true (the default) so that RDDs are unpersisted more intelligently. With this setting the system works out which RDDs no longer need to be kept and unpersists them. This reduces the memory used by Spark RDDs and may also improve garbage collection behavior.

6, configure GC sensibly
GC is one of the hardest parts of a program to get right, and poor GC behavior can affect the program badly. In a cluster environment we can use the concurrent mark-sweep (CMS) garbage collector. It consumes more resources, but we still recommend enabling it. It can be configured as follows:

  spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC

For more information on GC behavior configuration, please refer to the Java garbage collection article. This is not described in detail here.
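The settings named in sections 3, 5 and 6 can be passed together on the spark-submit command line, one --conf flag per property. The sketch below only assembles such a command. The spark.serializer class name is Spark's standard Kryo serializer, while the application file name streaming_app.py is a placeholder and the values are starting points rather than recommendations.

```python
import shlex

# Properties discussed in sections 3, 5 and 6 of this article.
STREAMING_CONF = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.streaming.unpersist": "true",
    "spark.executor.extraJavaOptions": "-XX:+UseConcMarkSweepGC",
}


def spark_submit_args(conf, app="streaming_app.py"):
    """Build a spark-submit argument list with one --conf flag per property.

    `app` is a hypothetical application file used only for illustration.
    """
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# shlex.join quotes arguments where needed so the line can be pasted into a shell
print(shlex.join(spark_submit_args(STREAMING_CONF)))
```

Keeping the properties in one dictionary makes it easy to change a single value between test runs, which matches the trial-and-error approach this article recommends.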

7, allocate enough CPU resources
In many cases a Streaming program does not need much memory but does need a lot of CPU. In a Streaming program, CPU usage falls into two categories:
(1) receiving data;
(2) processing data.
We need to allocate enough CPU resources to cover both receiving and processing, so that data is handled promptly and efficiently.
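As a rough rule, each receiver occupies one core for as long as the application runs, so the total number of cores must be larger than the number of receivers or nothing is left for processing. A minimal sketch of that arithmetic follows. The function name and the default of one processing core are assumptions for illustration; real workloads usually need more.

```python
def min_total_cores(num_receivers, processing_cores=1):
    """Smallest core count that leaves `processing_cores` free for batch
    processing once every receiver has taken one core."""
    if num_receivers < 0 or processing_cores < 1:
        raise ValueError("need num_receivers >= 0 and processing_cores >= 1")
    return num_receivers + processing_cores


print(min_total_cores(2))     # 3
print(min_total_cores(3, 4))  # 7
```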

