Using Spark to Quickly Bulk Load Massive Amounts of Data into HBase

In "Using BulkLoad to Quickly Import Massive Data into HBase [Hadoop]" we introduced a quick way to import massive amounts of data into HBase with bulk load. This article shows how to do the same thing from Spark using Scala. Two approaches are covered: the first writes records one at a time with the ordinary Put API; the second uses the Bulk Load API. This article does not go into why you would want to use Bulk Load; for that, see "Using BulkLoad to Quickly Import Massive Data into HBase [Hadoop]".


Use org.apache.hadoop.hbase.client.Put to write data

org.apache.hadoop.hbase.client.Put writes data to HBase one record at a time. It is much slower than bulk loading and is shown here only as a point of comparison.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
  
val conf = HBaseConfiguration.create()
val tableName = "/iteblog"
conf.set(TableInputFormat.INPUT_TABLE, tableName)
  
val myTable = new HTable(conf, tableName)
// Write a single row with the ordinary Put API
val p = new Put(Bytes.toBytes("row999"))
p.add(Bytes.toBytes("cf"), Bytes.toBytes("column_name"), Bytes.toBytes("value999"))
myTable.put(p)
myTable.flushCommits()
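
The snippet above writes a single row from the driver. When writing an entire RDD this way, the table connection is normally created inside each partition so that the Puts are sent from the executors; here is a minimal sketch with hypothetical sample data:

// A minimal sketch (hypothetical sample data): write an RDD with the Put API,
// opening the table once per partition on the executors.
val data = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
data.foreachPartition(partition => {
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, tableName)
    partition.foreach { case (rowKey, value) =>
      val p = new Put(Bytes.toBytes(rowKey))
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("column_name"), Bytes.toBytes(value))
      table.put(p)
    }
    table.flushCommits()
    table.close()
})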

Bulk loading data into HBase

Bulk loading data into HBase can be done in two ways: (1) generate HFiles first and then bulk load them;
(2) bulk load the data directly into HBase.

Bulk loading HFiles into HBase

Let's first look at how to bulk load data into HBase via HFiles. This takes two steps:
(1) generate the HFiles;
(2) use org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the generated HFiles into HBase.
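Note that HFileOutputFormat.configureIncrementalLoad reads the target table's region layout, so the examples below assume that an iteblog table with a cf column family already exists. If it does not, it can be created from the HBase shell first:
hbase(main):001:0> create 'iteblog', 'cf'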
The implementation code is as follows:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
  
val conf = HBaseConfiguration.create()
val tableName = "iteblog"
val table = new HTable(conf, tableName)
  
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
// Configure the job for an incremental (bulk) load into the target table
HFileOutputFormat.configureIncrementalLoad(job, table)
  
// Generate 10 sample data:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x=>{
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(), "c1".getBytes(), "value_xxx".getBytes() )
    (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
  
// Save Hfiles on HDFS
rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf)
  
//Bulk load Hfiles to Hbase
val bulkLoader = new LoadIncrementalHFiles(conf)
bulkLoader.doBulkLoad(new Path("/tmp/iteblog"), table)
After running the above code, we can see that the iteblog table in HBase now contains 10 rows, as follows:
hbase(main):020:0> scan 'iteblog'
ROW                                                 COLUMN+CELL
 \x00\x00\x00\x01                                   column=cf:c1, timestamp=1425128075586, value=value_xxx
 \x00\x00\x00\x02                                   column=cf:c1, timestamp=1425128075586, value=value_xxx
 \x00\x00\x00\x03                                   column=cf:c1, timestamp=1425128075586, value=value_xxx
 \x00\x00\x00\x04                                   column=cf:c1, timestamp=1425128075586, value=value_xxx
 \x00\x00\x00\x05                                   column=cf:c1, timestamp=1425128075586, value=value_xxx
 \x00\x00\x00\x06                                   column=cf:c1, timestamp=1425128075675, value=value_xxx
 \x00\x00\x00\x07                                   column=cf:c1, timestamp=1425128075675, value=value_xxx
 \x00\x00\x00\x08                                   column=cf:c1, timestamp=1425128075675, value=value_xxx
 \x00\x00\x00\x09                                   column=cf:c1, timestamp=1425128075675, value=value_xxx
 \x00\x00\x00\x0A                                   column=cf:c1, timestamp=1425128075675, value=value_xxx
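
One thing to keep in mind when generating HFiles this way: HFileOutputFormat requires each output file to be written in ascending row-key order, otherwise the save fails with an error along the lines of "Added a key not lexically larger than previous". The ten sample rows above already come out in order; for real data you would normally sort by row key before writing, for example (a minimal sketch based on the sample above):

// Minimal sketch: sort the source records by row key before building the KeyValues,
// so every partition writes its HFile in ascending key order. For non-negative
// integers, numeric order matches the byte order produced by Bytes.toBytes.
val sortedRdd = num.sortBy(x => x).map(x => {
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(), "c1".getBytes(), "value_xxx".getBytes())
    (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
sortedRdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf)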

Bulk loading data directly into HBase

This method does not require generating HFiles on HDFS in advance; the data is bulk loaded directly into HBase. Compared with the previous example there is only one small difference: change
rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf)
to:
rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
The complete implementation is as follows:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
  
val conf = HBaseConfiguration.create()
val tableName = "iteblog"
val table = new HTable(conf, tableName)
  
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad(job, table)
  
// Generate 10 sample data:
val num = sc.parallelize(1 to 10)
val rdd = num.map(x=>{
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(), "c1".getBytes(), "value_xxx".getBytes() )
    (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)
})
  
// Directly bulk load to Hbase/MapRDB tables.
rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())

Other notes

In the examples above we used the saveAsNewAPIHadoopFile API to write data to HBase. We can also use the saveAsNewAPIHadoopDataset API to achieve the same goal; since that API does not take an output path argument, the output directory has to be supplied through the job configuration instead. We only need to change
rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
to
job.getConfiguration.set("mapred.output.dir", "/tmp/iteblog")
rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
The rest is exactly the same as before.
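
If you prefer not to set the output directory through the old-style mapred.output.dir property, the new-API FileOutputFormat helper (already imported in the listing above) can be used instead; a small equivalent sketch:

// Equivalent sketch: set the output directory on the job via FileOutputFormat
FileOutputFormat.setOutputPath(job, new Path("/tmp/iteblog"))
rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)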
