In this post we’d like to help you get started with the latest Spark release, 1.2.0, in minutes – using Docker. Though we released the container and pushed it to the official Docker repository over the holidays, we still owed you this post. Here are the details …
Docker and Spark are two technologies that are very hyped these days. At SequenceIQ we use both quite a lot, so we have put together a Docker container and are sharing it with the community.
The container’s code is available in our GitHub repository.
Pulling the image from the Docker repository
We suggest always pulling the container from the official Docker repository, as it is maintained and supported by us.
docker pull sequenceiq/spark:1.2.0
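Once the pull completes, a quick sanity check is to list the local images; this is plain Docker, nothing specific to our container:

# should list the sequenceiq/spark repository with the 1.2.0 tag
docker images sequenceiq/spark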
Building the image
Alternatively, you can always build the container yourself based on our Dockerfile.
docker build --rm -t sequenceiq/spark:1.2.0 .
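A minimal sketch of the full build flow, assuming the code lives in the docker-spark repository under our GitHub organization (the build must run from the directory containing the Dockerfile):

# clone the repository and build the image from its root
git clone https://github.com/sequenceiq/docker-spark.git
cd docker-spark
docker build --rm -t sequenceiq/spark:1.2.0 .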
Running the image
Once you have pulled or built the container, you are ready to start using Spark.
docker run -i -t -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -bash
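For reference: -i and -t give you an interactive terminal, -h sandbox sets the container’s hostname, and /etc/bootstrap.sh -bash starts the Hadoop daemons and drops you into a bash shell. If you would rather keep the container running in the background, here is a hedged variant; the -d flag passed to bootstrap.sh is an assumption, not confirmed here:

# detached variant; bootstrap.sh's -d flag is assumed to run it as a daemon
docker run -d -h sandbox sequenceiq/spark:1.2.0 /etc/bootstrap.sh -d
# attach a shell to the running container when needed (Docker 1.3+)
docker exec -it <container-id> bash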
Testing
To check whether everything is OK, you can run one of the stock examples that ship with Spark. Check our previous blog posts and examples about Spark here and here.
cd /usr/local/spark
# run the spark shell
./bin/spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1
# execute the following command, which should return 1000
scala> sc.parallelize(1 to 1000).count()
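As a slightly more end-to-end smoke test you can run a tiny word count in the same shell; the input data below is made up for illustration:

# still inside spark-shell; count word occurrences in two short lines
scala> val words = sc.parallelize(Seq("hello spark", "hello docker")).flatMap(_.split(" "))
scala> words.map(w => (w, 1)).reduceByKey(_ + _).collect()
# should return something like Array((docker,1), (hello,2), (spark,1))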
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs inside
an application master process which is managed by YARN on the cluster,
and the client can go away after initiating the application. In
yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Estimating Pi (yarn-cluster mode):
cd /usr/local/spark
# execute the following command, which should write "Pi is roughly 3.1418" into the logs
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar
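In yarn-cluster mode the driver output ends up in the YARN container logs rather than on your console. One way to fetch it afterwards, assuming YARN log aggregation is enabled in the container (spark-submit prints the application id while it runs):

# fetch the aggregated logs for the finished application and keep the result line;
# replace <application-id> with the id printed by spark-submit
yarn logs -applicationId <application-id> | grep "Pi is roughly"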
Estimating Pi (yarn-client mode):
cd /usr/local/spark
# execute the following command, which should print "Pi is roughly 3.1418" to the screen
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar
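The result line is printed to the console in yarn-client mode, but it is interleaved with Spark’s INFO logging; as a small convenience (not part of the stock example) you can filter the output:

# same submit as above, keeping only the result line
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 1 ./lib/spark-examples-1.2.0-hadoop2.4.0.jar 2>&1 | grep "Pi is roughly"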