Spark through the BulkLoad quickly on the massive data into the Hbase

We introduced a quick way to import massive amounts of data into Hbase by introducing bulk data into Hbase [Hadoop articles] through BulkLoad . This article will show you how to use Scala to quickly import data into Springs Methods. Here are two ways: the first to use Put ordinary method to count down; the second use Bulk Load API. For more information on why you need to use Bulk Load This article is not introduced , see "BulkLoad quickly into the massive data into Hbase [Hadoop articles]" .

If you want to keep abreast of Spark , Hadoop or Hbase related articles, please pay attention to WeChat public account: iteblog_hadoop

Article directory

Use org.apache.hadoop.hbase.client.Put to write data

Use org.apache.hadoop.hbase.client.Put to write data one by one to the Hbase, but it is org.apache.hadoop.hbase.client.Put than Bulk loading, just as a contrast.

import org.apache.spark._

import org.apache.spark.rdd.NewHadoopRDD

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

import org.apache.hadoop.hbase.client.HBaseAdmin

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.hbase.HColumnDescriptor

import org.apache.hadoop.hbase.util.Bytes

import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.client.HTable;

val conf = HBaseConfiguration.create()

val tableName = "/iteblog"

conf.set(TableInputFormat.INPUT_TABLE, tableName)

val myTable = new HTable(conf, tableName);

var p = new Put();

p = new Put(new String("row999").getBytes());

p.add("cf".getBytes(), "column_name".getBytes(), new String("value999").getBytes());

myTable.put(p);

myTable.flushCommits();

Batch guide to Hbase

Batch guide data to the Hbase can be divided into two kinds: (1), generate Hfiles, and then batch guide data;
(2), direct import of data into the Hbase.

Bulk imports Hfiles into Hbase

Now we have to introduce how to batch data into the Hbase, mainly divided into two steps:
(1), Mr. Hfiles;
(2), use org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to import Hfiles into Hbase in advance.
The implementation code is as follows:

import org.apache.spark._

import org.apache.spark.rdd.NewHadoopRDD

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

import org.apache.hadoop.hbase.client.HBaseAdmin

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.hbase.HColumnDescriptor

import org.apache.hadoop.hbase.util.Bytes

import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.client.HTable;

import org.apache.hadoop.hbase.mapred.TableOutputFormat

import org.apache.hadoop.mapred.JobConf

import org.apache.hadoop.hbase.io.ImmutableBytesWritable

import org.apache.hadoop.mapreduce.Job

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

import org.apache.hadoop.hbase.KeyValue

import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat

import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()

val tableName = "iteblog"

val table = new HTable(conf, tableName) 

conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val job = Job.getInstance(conf)

job.setMapOutputKeyClass (classOf[ImmutableBytesWritable])

job.setMapOutputValueClass (classOf[KeyValue])

HFileOutputFormat.configureIncrementalLoad (job, table)

// Generate 10 sample data:

val num = sc.parallelize(1 to 10)

val rdd = num.map(x=>{

    val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(), "c1".getBytes(), "value_xxx".getBytes() )

    (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)

})

// Save Hfiles on HDFS 

rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf)

//Bulk load Hfiles to Hbase

val bulkLoader = new LoadIncrementalHFiles(conf)

bulkLoader.doBulkLoad(new Path("/tmp/iteblog"), table)

After running the above code, we can see that the itblog table in the Hbase has generated 10 data, as follows:

hbase(main):020:0> scan 'iteblog'

ROW                                                 COLUMN+CELL

 \x00\x00\x00\x01                                   column=cf:c1, timestamp=1425128075586, value=value_xxx

 \x00\x00\x00\x02                                   column=cf:c1, timestamp=1425128075586, value=value_xxx

 \x00\x00\x00\x03                                   column=cf:c1, timestamp=1425128075586, value=value_xxx

 \x00\x00\x00\x04                                   column=cf:c1, timestamp=1425128075586, value=value_xxx

 \x00\x00\x00\x05                                   column=cf:c1, timestamp=1425128075586, value=value_xxx

 \x00\x00\x00\x06                                   column=cf:c1, timestamp=1425128075675, value=value_xxx

 \x00\x00\x00\x07                                   column=cf:c1, timestamp=1425128075675, value=value_xxx

 \x00\x00\x00\x08                                   column=cf:c1, timestamp=1425128075675, value=value_xxx

 \x00\x00\x00\x09                                   column=cf:c1, timestamp=1425128075675, value=value_xxx

 \x00\x00\x00\x0A                                   column=cf:c1, timestamp=1425128075675, value=value_xxx

Direct Bulk Load data to Hbase

This method does not need to generate Hfiles on HDFS in advance, but directly to the bulk of the data into the Hbase. There are only minor differences compared to the above example, as follows:
will

rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf)  

changed to:

rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())

The complete implementation is as follows:

import org.apache.spark._

import org.apache.spark.rdd.NewHadoopRDD

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

import org.apache.hadoop.hbase.client.HBaseAdmin

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.hbase.HColumnDescriptor

import org.apache.hadoop.hbase.util.Bytes

import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.client.HTable;

import org.apache.hadoop.hbase.mapred.TableOutputFormat

import org.apache.hadoop.mapred.JobConf

import org.apache.hadoop.hbase.io.ImmutableBytesWritable

import org.apache.hadoop.mapreduce.Job

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

import org.apache.hadoop.hbase.KeyValue

import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat

import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val conf = HBaseConfiguration.create()

val tableName = "iteblog"

val table = new HTable(conf, tableName) 

conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val job = Job.getInstance(conf)

job.setMapOutputKeyClass (classOf[ImmutableBytesWritable])

job.setMapOutputValueClass (classOf[KeyValue])

HFileOutputFormat.configureIncrementalLoad (job, table)

// Generate 10 sample data:

val num = sc.parallelize(1 to 10)

val rdd = num.map(x=>{

    val kv: KeyValue = new KeyValue(Bytes.toBytes(x), "cf".getBytes(), "c1".getBytes(), "value_xxx".getBytes() )

    (new ImmutableBytesWritable(Bytes.toBytes(x)), kv)

})

// Directly bulk load to Hbase/MapRDB tables.

rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())

other

In the above example, we used the saveAsNewAPIHadoopFile API to write data to HBase; in fact, we can also use the saveAsNewAPIHadoopDataset API to achieve the same goal, we only need to the saveAsNewAPIHadoopDataset code

rdd.saveAsNewAPIHadoopFile("/tmp/iteblog", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration()) 

changed to

job.getConfiguration.set("mapred.output.dir", "/tmp/iteblog")

rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)

The rest is exactly the same as before.

Rechercher dans ce blog

Big data