Reading Data from HBase table using Spark

HBase is a data store, modeled after Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
HBase is a column-oriented database built on top of HDFS. It is open-source and horizontally scalable, and is used to serve very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
Suppose we have an HBase table named “student_info” with the column family “details” and the column qualifiers “sid”, “firstName”, “lastName”, “branch”, and “emailId”. Create a POJO class as below:
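A minimal sketch of such a POJO, assuming String-typed fields matching the column qualifiers listed above (the class name Student is an assumption; only a few getters and setters are shown, the rest follow the same pattern):

```java
import java.io.Serializable;

// POJO mirroring the "details" column family of the "student_info" table.
// It implements Serializable so Spark can ship instances between executors.
public class Student implements Serializable {
    private String sid;
    private String firstName;
    private String lastName;
    private String branch;
    private String emailId;

    public String getSid() { return sid; }
    public void setSid(String sid) { this.sid = sid; }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public String getBranch() { return branch; }
    public void setBranch(String branch) { this.branch = branch; }

    public String getEmailId() { return emailId; }
    public void setEmailId(String emailId) { this.emailId = emailId; }
}
```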
Create a JavaSparkContext object using a SparkConf object:
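A minimal sketch, assuming the Java RDD API; the application name and the local master URL are illustrative and should be adjusted for your cluster:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// App name and master are placeholders; on a real cluster the master is
// usually supplied via spark-submit rather than hard-coded here.
SparkConf conf = new SparkConf()
        .setAppName("ReadHBaseStudentInfo")
        .setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
```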
Read the data from the HBase table, providing the table name (prefixed with its namespace, if any), using the code below:
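A sketch of the read, assuming the classic HBase `TableInputFormat` integration and the JavaSparkContext `sc` from the previous step; the ZooKeeper quorum setting is an assumption to fill in for your environment:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "student_info");
// hbaseConf.set("hbase.zookeeper.quorum", "zk-host"); // adjust for your cluster

// Each record comes back as a (row key, Result) pair.
JavaPairRDD<ImmutableBytesWritable, Result> hbaseRDD =
        sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

// Map each HBase Result into the Student POJO, reading every qualifier
// from the "details" column family.
JavaRDD<Student> studentRDD = hbaseRDD.map(tuple -> {
    Result result = tuple._2();
    byte[] cf = Bytes.toBytes("details");
    Student s = new Student();
    s.setSid(Bytes.toString(result.getValue(cf, Bytes.toBytes("sid"))));
    s.setFirstName(Bytes.toString(result.getValue(cf, Bytes.toBytes("firstName"))));
    s.setLastName(Bytes.toString(result.getValue(cf, Bytes.toBytes("lastName"))));
    s.setBranch(Bytes.toString(result.getValue(cf, Bytes.toBytes("branch"))));
    s.setEmailId(Bytes.toString(result.getValue(cf, Bytes.toBytes("emailId"))));
    return s;
});
```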

Now studentRDD holds all the records from the table as a Spark RDD of Student objects, so we can apply any aggregation or other Spark operation on top of it.
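For instance, a couple of illustrative operations on studentRDD (the branch value "CSE" is hypothetical):

```java
import org.apache.spark.api.java.JavaRDD;

// Count all rows read from the table.
long total = studentRDD.count();

// Keep only students of one branch and project their ids.
JavaRDD<String> cseIds = studentRDD
        .filter(s -> "CSE".equals(s.getBranch()))
        .map(Student::getSid);
```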
