Data Partitioning – An Optimization Technique in Apache Hive

1. Objective

This tutorial explains data partitioning in Hive: why partitioning is needed and how it improves query performance. Partitioning is an optimization technique in Hive that can improve performance significantly. Apache Hive is a data warehouse built on top of Hadoop that enables ad-hoc analysis over structured and semi-structured data; to learn more about Hive, follow this introductory guide.

2. Hive Partitioning

As we know, huge amounts of data, in the range of petabytes, are now stored in HDFS, which makes it very difficult for Hadoop users to query that data directly. Hive was developed to reduce this burden: it converts SQL queries into MapReduce jobs and submits them to the Hadoop cluster. When we submit a SQL query, Hive reads the entire data set, so running MapReduce jobs over a large table becomes inefficient. This problem can be overcome by creating partitions in tables. Hive makes implementing partitions easy: you declare the partitioning scheme when you create the table, and Hive manages the partitions automatically.
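As a minimal sketch of this automatic partition management (the table names logs and staging_logs and the column names are illustrative, not from the original), you declare the partition column at table creation and can then let Hive create partitions on insert via dynamic partitioning:

```sql
-- Declare the partition column when creating the table
CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING);

-- Let Hive create partitions automatically on insert
-- (dynamic partitioning must be enabled first)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
SELECT msg, dt FROM staging_logs;
```

Each distinct value of dt in staging_logs becomes its own partition of logs, without the partitions having to be created by hand.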
In Hive’s data partitioning method, all the data in a table is divided into multiple partitions. Each partition corresponds to a specific value (or combination of values) of the partition column(s) and is kept as a sub-directory inside the table’s directory in HDFS. When a table is queried with a filter on the partition column, only the partitions that match the filter are read. This decreases the I/O required by the query and therefore speeds it up.
Unlike the logical partitioning in an RDBMS, partitioning in Hive is physical: a separate directory is created for each partition. If the partition column is specified in a query, only the data of that partition is processed. If an unpartitioned table contains 10 billion records, all 10 billion records must be processed for every ad-hoc requirement. If, on the other hand, the table is partitioned by day, a report over a specific day’s data only needs to process that day’s partition, which may contain just a few thousand records.
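As an illustrative sketch (the table name sales and the default warehouse path are assumptions), each partition of a table partitioned by day maps to its own HDFS directory, and a query filtering on day reads only the matching directory:

```sql
-- HDFS layout: one directory per partition
--   /user/hive/warehouse/sales/day=2016-01-01/
--   /user/hive/warehouse/sales/day=2016-01-02/

-- Only the day=2016-01-02 directory is scanned
SELECT * FROM sales WHERE day = '2016-01-02';
```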

3. Partitioning Example

Let’s understand this with an example. Consider a table named record that contains students’ data such as rollno, name, class, and batch. Suppose you need to retrieve the details of all the students who left the school in the year 2011. Without partitioning, the whole table is scanned for every query. However, if you partition the students’ table by year, a lot of query processing time is saved when retrieving this information. The example below shows how to partition a file and its data:
The file named file1 contains the student data table.
record/studentdata/file1
rollno, name, class, batch
1, monika, 12, 2010
2, yash, 12, 2010
3, shilpa, 11, 2011
4, parv, 11, 2011
Now we will partition the above data into two files by year.
record/studentdata/2010/file2
1, monika, 12, 2010
2, yash, 12, 2010
record/studentdata/2011/file3
3, shilpa, 11, 2011
4, parv, 11, 2011
While retrieving the data from the table, only the data of specified partition will be queried. Create a partitioned table:
CREATE TABLE studentTab (rollno INT, name STRING, class INT, batch INT) PARTITIONED BY (year STRING);
LOAD DATA LOCAL INPATH 'record/studentdata/2010/file2' OVERWRITE INTO TABLE studentTab PARTITION (year='2010');
LOAD DATA LOCAL INPATH 'record/studentdata/2011/file3' OVERWRITE INTO TABLE studentTab PARTITION (year='2011');
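Once the table is loaded, you can list its partitions and query a single one; only the matching partition’s directory is read (a sketch using the studentTab table defined above):

```sql
-- List the partitions Hive knows about
SHOW PARTITIONS studentTab;

-- Only the year=2011 partition directory is scanned
SELECT rollno, name FROM studentTab WHERE year = '2011';
```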
