Spark Streaming performance tuning
Spark Streaming provides an efficient and convenient way to process data streams, but in some scenarios the default configuration is not optimal and may even fail to keep up with external data in real time; in those cases we need to change the relevant defaults. Because real-world scenarios and data volumes vary, there is no single configuration that fits everyone (otherwise the Spark Streaming developers would not expose so many parameters; they would simply hard-code the values). We have to choose settings according to the data volume and the scenario. The advice here is only a starting point: these tunings will not necessarily suit your program, and a good configuration has to be found through gradual experimentation.
1. Set a reasonable batch interval (batchDuration)

When constructing a StreamingContext, we pass in a parameter that sets the batch interval for Spark Streaming. Spark submits a job once every batchDuration. If a job takes longer than the batchDuration to process, subsequent jobs cannot be submitted on time; over time more and more jobs are delayed, and eventually the whole Streaming application blocks, which indirectly means the data can no longer be processed in real time. That is definitely not what we want.
In addition, although batchDuration can be set down to the millisecond level, experience tells us that too small a value burdens the whole Streaming pipeline with overly frequent job submission, so try not to set it below 500 ms. In many cases a setting of 500 ms performs very well.
So how do you set a good value?
We can first set the value fairly large (for example, 10 s). If we find that the jobs are submitted and completed quickly, we can gradually reduce the value until the Streaming job can just finish processing each batch before the next one arrives. That value is then the optimum we are looking for.
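As a minimal sketch (the application name and the 10-second interval are assumptions for illustration), the batch interval is fixed when the StreamingContext is created:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingApp")  // assumed app name
// Start with a generous interval and shrink it while batches still finish on time;
// Milliseconds(500) is a sensible lower bound, per the advice above.
val ssc = new StreamingContext(conf, Seconds(10))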
2. Increase job parallelism

We need to make full use of the cluster's resources and spread tasks across different nodes as much as possible. On one hand this fully utilizes the cluster; on the other, it lets the data be processed promptly.
For example, when using Streaming to receive data from Kafka, we can set up one receiver for each Kafka partition, which balances the load and keeps the data processed in time (how to use Streaming to read data from Kafka is covered elsewhere). Another example: functions such as reduceByKey() and join() take an explicit parallelism parameter.
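A sketch under assumptions (the ZooKeeper quorum, consumer group, topic name, and partition count are all placeholders): one receiver-based input stream is created per Kafka partition, the streams are unioned, and the shuffle width of reduceByKey() is set explicitly.

import org.apache.spark.streaming.kafka.KafkaUtils

val numPartitions = 4  // assumed number of Kafka partitions
val kafkaStreams = (1 to numPartitions).map { _ =>
  // One receiver-based input stream per partition (classic receiver API)
  KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
}
val unified = ssc.union(kafkaStreams)

// reduceByKey() accepts a numPartitions argument to control parallelism
val counts = unified.map { case (_, msg) => (msg, 1) }.reduceByKey(_ + _, 32)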
3. Use Kryo serialization

By default, Spark uses Java's built-in serialization. Although it can serialize any class that implements java.io.Serializable, its performance is poor, and if serialization becomes a performance bottleneck, you can switch to the Kryo serializer. Serializing cached data can also noticeably improve GC behavior.
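A sketch of enabling Kryo (MyRecord is a hypothetical class standing in for whatever your job serializes most often; registering classes lets Kryo avoid writing out full class names):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// MyRecord is hypothetical; register your own frequently serialized classes
conf.registerKryoClasses(Array(classOf[MyRecord]))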
4. Cache frequently used data

For data that is used repeatedly, we can explicitly call rdd.cache() to keep it in memory. This speeds up processing, but at the cost of extra memory.
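A minimal sketch, assuming `lines` is an input DStream: caching pays off when more than one action runs over the same data in each batch.

val words = lines.flatMap(_.split(" "))
words.cache()                                // persist each batch's RDDs for reuse
words.count().print()                        // first job reads the cached RDDs
words.filter(_.startsWith("ERROR")).print()  // second job reuses the cache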
5. Clear data that is no longer needed

As time passes, some data is no longer needed, yet it stays cached in memory and consumes our precious memory resources. We can set spark.cleaner.ttl to a reasonable value; however, this value must not be too small, because clearing data that later computations still need will cause unnecessary trouble.
Moreover, we can set the option spark.streaming.unpersist to true (the default is true) so that RDDs are unpersisted more intelligently. With this configuration the system works out which RDDs no longer need to be kept and unpersists them, which reduces the memory usage of Spark RDDs and may also improve garbage-collection behavior.
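A sketch of both settings (the one-hour TTL is an assumption; spark.cleaner.ttl is given in seconds, and spark.streaming.unpersist already defaults to true):

val conf = new SparkConf()
  .set("spark.cleaner.ttl", "3600")          // assumed TTL of one hour, in seconds
  .set("spark.streaming.unpersist", "true")  // default is already true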
6. Configure GC sensibly

GC is one of the hardest parts of a program to tune, and unreasonable GC behavior has a big impact on it. In a cluster environment we can use the concurrent mark-sweep (CMS) garbage collector; although it consumes more resources, we still recommend enabling it. It can be configured as follows:
spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC
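The same option can also be set programmatically on the SparkConf (a sketch; passing it to spark-submit with --conf works equally well):

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")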
For more on configuring GC behavior, refer to articles on Java garbage collection; it is not covered in detail here.
7. Allocate a reasonable number of CPU cores

In many cases a Streaming program does not need much memory, but it does need plenty of CPU. In a Streaming program, CPU resources are used in two ways:
(1) for receiving data;
(2) for processing data.
We need to allocate enough CPU cores so that there are enough for both receiving and processing data; only then can the data be handled promptly and efficiently.
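A sketch, assuming 4 receivers: each running receiver occupies one core, so the total core count must exceed the number of receivers or no cores are left for processing.

val conf = new SparkConf()
  .setMaster("local[8]")       // local example: 8 threads, enough for 4 receivers + processing
  .setAppName("StreamingApp")  // assumed app name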
Source: http://dataunion.org/16314.html