Articles

Showing posts from May, 2017

Using BulkLoad on Spark to quickly import massive data into HBase

In "BulkLoad quickly imports massive data into HBase [Hadoop articles]" we introduced a fast way to import large amounts of data into HBase through BulkLoad. This article shows how to quickly import data into HBase using Scala, in two ways: the first writes records one by one with the ordinary Put method; the second uses the Bulk Load API. Why Bulk Load is needed is not covered here; see "BulkLoad quickly imports massive data into HBase [Hadoop articles]". If you want to keep up with Spark, Hadoop or HBase related articles, please follow the WeChat public account: iteblog_hadoop. Article directory: 1 Use org.apache.hadoop.hbase.client.Put to write data; 2 Bulk load data into HBase; 2.1 Bulk import HFiles into HBase; 2.2 Direct Bulk Load of data into HBase; 3 Other. Use org.apache.hadoop.hbase.client.Put to write data: using org.apache.hadoop.hbase.client.Put writes data one by one…
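The excerpt above mentions two approaches; a minimal Scala sketch of the first one (writing records one by one with the ordinary HBase Put client API from a Spark job) might look like the following. The table name "iteblog", column family "f", qualifier "col" and the sample data are assumptions for illustration only, not taken from the article.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object PutToHBase {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PutToHBase"))
    // Hypothetical sample data: (rowKey, value) pairs
    val rdd = sc.parallelize(Seq(("row1", "a"), ("row2", "b")))

    rdd.foreachPartition { partition =>
      // Create one HBase connection per partition, not per record
      val hbaseConf = HBaseConfiguration.create()
      val connection = ConnectionFactory.createConnection(hbaseConf)
      val table = connection.getTable(TableName.valueOf("iteblog")) // assumed table name
      partition.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put) // one write per record: simple, but slower than Bulk Load
      }
      table.close()
      connection.close()
    }
    sc.stop()
  }
}
```

The Bulk Load route described in the article instead writes HFiles and hands them to HBase, which avoids this per-record write path entirely.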

Spark reads data from HBase

If you want to keep up with Spark, Hadoop or HBase related articles, please follow the WeChat public account: iteblog_hadoop. Those familiar with Spark know two common ways of getting data into an RDD: (1) call the parallelize function directly on a collection; the Java version is JavaRDD<Integer> myRDD = sc.parallelize(Arrays.asList(1,2,3)); and the Scala version is val myRDD = sc.parallelize(List(1,2,3)). This is a simple and easy way to turn a collection into the initial contents of an RDD. (2) More often, the data is read into the RDD from a file, which can be a plain text file or a sequence file, and can be stored locally (file://), on HDFS (hdfs://), or on S3. In fact, for files Spark supports all the file types and storage locations that Hadoop supports. The Java version is as follows: …
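As a quick illustration of the two read paths described in the excerpt, here is a minimal Scala sketch; the file paths are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadIntoRDD {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadIntoRDD"))

    // (1) Build an RDD directly from an in-memory collection
    val myRDD = sc.parallelize(List(1, 2, 3))
    println(myRDD.sum())

    // (2) Build an RDD from external storage; the URI scheme selects the source:
    //     file:// for local files, hdfs:// for HDFS, s3a:// (or s3n://) for S3
    val localLines = sc.textFile("file:///tmp/input.txt")          // placeholder path
    val hdfsLines  = sc.textFile("hdfs:///user/iteblog/input.txt") // placeholder path
    println(localLines.count() + hdfsLines.count())

    sc.stop()
  }
}
```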

Use Spark to write data to HBase

In the "Spark read the data in the Hbase" article I introduced how to read the data in the Spark in the Spark , and provides two versions of Java and Scala implementation, this article will be described later how to calculate through the Spark Of the data stored in the Hbase . Spark built-in provides two methods can be written to the data Hbase: (1), saveAsHadoopDataset; (2), saveAsNewAPIHadoopDataset, their official introduction are as follows: saveAsHadoopDataset : Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for that storage system. The JobConf should set an OutputFormat and any output paths required (eg a table name to write to) in the same way as it would Be configured for a Hadoop MapReduce job. saveAsNewAPIHadoopDataset : Output the RDD to any Hadoop-supported storage system with new Hadoop API, using a Hadoop Configuration object for that storage system. The Conf should set an OutputFormat and any out

object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable)

When using Spark to operate on HBase, the returned data type is RDD[(ImmutableBytesWritable, Result)]. We may want to perform further operations on this result, such as a join, but because org.apache.hadoop.hbase.io.ImmutableBytesWritable and org.apache.hadoop.hbase.client.Result do not implement the java.io.Serializable interface, the program may throw the following exception at runtime: Serialization stack: - object not serializable (class: org.apache.hadoop.hbase.io.ImmutableBytesWritable, value: 30 30 30 30 30 30 32 34 32 30 32 37 37 32 31) - field (class: scala.Tuple2, name: _1, type: class java.lang.Object) - object (class scala.Tuple2, (30 30 30 30 30 30 32 34 32 30 32 37 37 32 31,keyvalues={00000011020Winz59XojM111/f:iteblog/1470844800000/Put/vlen=2/mvcc=0})) - element of array (index: 0) - array (class [Lscala.Tuple2;, size 10); not retrying 17/03/16 16:07:48 ERROR ApplicationMaster: User class threw exception: org.apache.spark.Spark…
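One common workaround for this exception, sketched below, is to map the (ImmutableBytesWritable, Result) pairs into plain serializable Scala types before any shuffle operation such as join; the table name is an assumption, while the column family f and qualifier iteblog mirror the key-values shown in the stack trace above.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseJoin"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "iteblog") // assumed table name

    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Convert the non-serializable (ImmutableBytesWritable, Result) pairs into plain
    // Scala types *before* any shuffle (join, groupByKey, ...), so the shuffled RDD
    // no longer contains HBase classes that cannot be Java-serialized.
    val rows = hbaseRDD.map { case (key, result) =>
      val rowKey = Bytes.toString(key.copyBytes())
      val value  = Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("iteblog")))
      (rowKey, value)
    }

    val other = sc.parallelize(Seq(("row1", 1), ("row2", 2))) // hypothetical second dataset
    rows.join(other).collect().foreach(println)
    sc.stop()
  }
}
```

Registering a Kryo serializer for the HBase classes is another option, but converting to plain types early keeps the shuffled data small and avoids the problem altogether.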

Spark performance optimization: shuffle tuning

Article directory: 1 Shuffle tuning; 1.1 Tuning summary; 1.2 Overview of ShuffleManager development; 1.3 How HashShuffleManager works; 1.3.1 Unoptimized HashShuffleManager; 1.3.2 Optimized HashShuffleManager; 1.4 How SortShuffleManager works; 1.4.1 Normal operating mechanism; 1.4.2 Bypass operating mechanism; 1.5 Tuning shuffle-related parameters; 1.5.1 spark.shuffle.file.buffer; 1.5.2 spark.reducer.maxSizeInFlight; 1.5.3 spark.shuffle.io.maxRetries; 1.5.4 spark.shuffle.io.retryWait; 1.5.5 spark.shuffle.memoryFraction; 1.5.6 spark.shuffle.manager; 1.5.7 spark.shuffle.sort.bypassMergeThreshold; 1.5.8 spark.shuffle.consolidateFiles; 2 Closing words. Shuffle tuning summary: most of the performance cost of Spark jobs is spent in the shuffle stage, because this stage involves a large amount of disk IO, serialization, network data transfer and other operations…
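For reference, the shuffle parameters listed in the article directory are ordinary Spark configuration keys and can be set on a SparkConf; the values below are purely illustrative starting points (defaults noted in comments), not recommendations from the article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuning {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; tune them against your own workload and cluster
    val conf = new SparkConf()
      .setAppName("ShuffleTuning")
      .set("spark.shuffle.file.buffer", "64k")                // default 32k: map-side write buffer
      .set("spark.reducer.maxSizeInFlight", "96m")            // default 48m: reduce-side fetch buffer
      .set("spark.shuffle.io.maxRetries", "10")               // default 3: retries for failed fetches
      .set("spark.shuffle.io.retryWait", "10s")               // default 5s: wait between retries
      .set("spark.shuffle.memoryFraction", "0.3")             // default 0.2 (legacy memory model)
      .set("spark.shuffle.manager", "sort")                   // sort is the default since Spark 1.2
      .set("spark.shuffle.sort.bypassMergeThreshold", "400")  // default 200
      .set("spark.shuffle.consolidateFiles", "true")          // only relevant to HashShuffleManager

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```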