Integrate Kafka, Spark Streaming and HBase (Kafka -> Spark Streaming -> HBase)

Apache Kafka is a publish-subscribe messaging system. It is a distributed, partitioned, replicated commit log service.
Spark Streaming is a sub-project of Apache Spark. Spark is a batch-processing platform similar to Apache Hadoop, and Spark Streaming is a real-time processing tool that runs on top of the Spark engine.
Create a POJO class as below:
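The original post omits the class body here; a minimal sketch of such a POJO, assuming the fields id, name and age (the field names are not given in the source, so adjust them to your JSON schema):

import java.io.Serializable;

// Simple POJO that Gson can deserialize JSON into; it must be Serializable
// so Spark can ship instances between the driver and executors.
public class Person implements Serializable {
    private String id;
    private String name;
    private int age;

    public String getId() { return id; }
    public String getName() { return name; }
    public int getAge() { return age; }
}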
The StreamingToHbase program receives four parameters as input: <zkQuorum> <group> <topics> <numThreads>
zkQuorum: a list of one or more ZooKeeper servers that make up the quorum
group: the name of the Kafka consumer group
topics: a list of one or more Kafka topics to consume from
numThreads: the number of threads the Kafka consumer should use
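Assuming the standard positional layout above, the parameters can be read in main() like this:

// Read the four command-line parameters described above.
String zkQuorum = args[0];
String group = args[1];
String topics = args[2];
int numThreads = Integer.parseInt(args[3]);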
First, we create a JavaStreamingContext object, which is the main entry point for all streaming functionality. Here we create a local StreamingContext with two execution threads and a batch interval of 5 seconds.
SparkConf sparkConf = new SparkConf().setAppName("spark streaming to HBase").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000));
final JavaSparkContext context = jssc.sparkContext();
When you create an HBaseConfiguration, it reads in whatever you've set in your hbase-site.xml and hbase-default.xml, as long as these can be found on the CLASSPATH.
Configuration config = HBaseConfiguration.create();
JavaHBaseContext hBaseContext = new JavaHBaseContext(context, config);
If Spark Streaming receives data from multiple Kafka topics, split the topic list on a delimiter; here we use "," as the delimiter.
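For example, the comma-separated topic string can be turned into the topic-to-thread-count map that the Kafka receiver expects:

import java.util.HashMap;
import java.util.Map;

// Map each topic name to the number of consumer threads to use for it.
Map<String, Integer> topicMap = new HashMap<>();
for (String topic : topics.split(",")) {
    topicMap.put(topic, numThreads);
}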
Import KafkaUtils and create an input DStream as follows:
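With the receiver-based connector (spark-streaming-kafka for Kafka 0.8, which matches the zkQuorum/group/topicMap parameters used here), this looks like:

import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Each record is a (key, value) pair; the message payload is the value.
JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);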
This DStream represents the stream of data that will be received from Kafka; each record in the stream is a line of text. Here the data is in JSON format, so we parse each record into a Person object with Gson's fromJson().
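A sketch of that conversion, reusing the assumed Person POJO from above:

import com.google.gson.Gson;
import org.apache.spark.streaming.api.java.JavaDStream;

// Parse the JSON value of each Kafka record into a Person object.
// The Gson instance is created inside the lambda to avoid closure
// serialization problems.
JavaDStream<Person> persons = messages.map(
        tuple -> new Gson().fromJson(tuple._2(), Person.class));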
Each batch of input data arrives as an RDD; push these records into HBase. Here we created pushRawDataToHBase(), which receives hBaseContext and personRDD as arguments. Create the table "person" in HBase with column family "details" beforehand (from the HBase shell: create 'person', 'details').
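The original post does not show the body of pushRawDataToHBase(); a sketch using JavaHBaseContext.bulkPut() from the hbase-spark module (with the assumed Person fields from above) might look like this:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;

// Write every Person in the RDD as one row of the "person" table,
// keyed by id, with the remaining fields in the "details" column family.
public static void pushRawDataToHBase(JavaHBaseContext hBaseContext,
                                      JavaRDD<Person> personRDD) {
    hBaseContext.bulkPut(personRDD, TableName.valueOf("person"), person -> {
        Put put = new Put(Bytes.toBytes(person.getId()));
        put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("name"),
                Bytes.toBytes(person.getName()));
        put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("age"),
                Bytes.toBytes(person.getAge()));
        return put;
    });
}

It is wired into the stream with foreachRDD, after which the streaming context is started:

// Push every micro-batch to HBase, then start the streaming job.
persons.foreachRDD(rdd -> pushRawDataToHBase(hBaseContext, rdd));
jssc.start();
jssc.awaitTermination();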
Now records are inserted into the HBase table. You can view the data with scan 'person' from the HBase shell.
