Data Partitioning in Big Data Applications with Apache Hive

In this post, big data consulting professionals introduce the concept of partitioning in big data applications. Read the post completely to understand how to partition data in such an application using Apache Hive.

Introduction to Partitioning in Big Data Applications

Apache Hadoop is a framework that supports the distributed storage and processing of big data.

In big data applications, data partitioning is very important: it divides a huge data set into many related sub-sets based on the values of one or more columns, such as collection date, country, or city. With partitioning, the data is better organized for querying and analysis, and performance improves because queries only need to scan the relevant partitions.
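For example, when a query filters on the partition column, Hive scans only the matching partition directories instead of the whole table. The sketch below uses a hypothetical sales table (for illustration only, not part of this tutorial):

-- Hypothetical table partitioned by country (illustration only).
CREATE TABLE sales (id string, amount string)
PARTITIONED BY (country string);

-- Hive prunes partitions: only the country=US directory is scanned.
SELECT id, amount FROM sales WHERE country = 'US';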


In this post, I will introduce how to partition data in Hive, focusing on an example of partitioning data daily in a big data application, so that each day's processed data is always loaded into the partitioned table.

Environment

Java: JDK 1.7

Cloudera version: CDH 4.6.0

Initial steps

1. First, we have to prepare an input data file. Open a new file in a Linux terminal:

vi file1

Enter some input data in the format id;name;ticketPrice;collectdate:

1;Join;2000;20160730000000

2;Jean;3000;20160731000000

3;Jang;5000;20160730000000

4;Jane;1000;20160731000000

5;Jin;6000;20160730000000

2. We need to put the local file into the Hadoop Distributed File System (HDFS) with these commands:

hadoop fs -mkdir -p /data/mysample/mappingTable

hadoop fs -put file1 /data/mysample/mappingTable/

Code walkthrough

We will create a Hive script that loads the data into an external mapping table and then moves it into a partitioned output table, which supports querying the data for business analytics.

Create database

create database cloudera_dataPartitionExample;

Use database

use cloudera_dataPartitionExample;

Drop and recreate partitionTable, which will contain the real data after partitioning. Fields are delimited by ';' and the table is stored at the location '/data/mysample/partitionTable':

DROP TABLE IF EXISTS partitionTable;

CREATE EXTERNAL TABLE partitionTable

(

id string,

name string,

ticketPrice string

)

PARTITIONED BY (collectdate string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'

STORED AS SEQUENCEFILE

LOCATION '/data/mysample/partitionTable';
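Once the table is created, you can double-check that collectdate is registered as a partition column rather than a regular data column (the exact output layout varies by Hive version):

-- collectdate should be listed under the partition information
describe partitionTable;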

Drop and recreate mappingTable, which maps to the raw data that has not yet been partitioned. Fields are delimited by ';' and the table is stored at the location '/data/mysample/mappingTable', where we put file1 earlier. Because file1 is plain text, this table is stored as TEXTFILE:

DROP TABLE IF EXISTS mappingTable;

CREATE EXTERNAL TABLE mappingTable

(

id string,

name string,

ticketPrice string,

collectdate string

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'

STORED AS TEXTFILE

LOCATION '/data/mysample/mappingTable';
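At this point, a quick sanity check (not part of the original script) confirms that the mapping table can read the raw rows from file1:

-- should return the five rows we put on HDFS earlier
select * from mappingTable limit 5;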

 

Enable dynamic partitioning, so that Hive creates the partitions automatically from the values of the collectdate column, and raise the partition limits to be safe with large data sets:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

SET hive.exec.max.dynamic.partitions=100000;

SET hive.exec.max.dynamic.partitions.pernode=100000;

Move the data from mappingTable to partitionTable. This query always moves the data into partitionTable, dynamically partitioned by the collectdate column (dynamic partitioning requires the partition column to come last in the SELECT list, which SELECT * satisfies here because collectdate is the last column of mappingTable):

INSERT INTO TABLE partitionTable PARTITION (collectdate) SELECT * FROM mappingTable;
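As an aside, if the partition value is known up front, the same load can be done with a static partition instead of a dynamic one; this sketch assumes the same two tables and one fixed date from our sample data:

-- Static partition: the partition value is fixed in the statement,
-- so the partition column is excluded from the SELECT list.
INSERT INTO TABLE partitionTable PARTITION (collectdate='20160730000000')
SELECT id, name, ticketPrice FROM mappingTable
WHERE collectdate = '20160730000000';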

Load the partition metadata for this folder into the Hive metastore, so that Hive knows about any partition directories added directly on HDFS:

MSCK REPAIR TABLE partitionTable;
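We can then list the partitions that Hive has registered; with our sample data it should show the two collectdate values:

show partitions partitionTable;

-- collectdate=20160730000000
-- collectdate=20160731000000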

Drop the mapping table after partitioning. Since mappingTable is external, this removes only the table metadata; the raw file stays on HDFS:

DROP TABLE IF EXISTS mappingTable;

Verify the result

After running the script above, we will check HDFS and the Hive table to make sure the data was partitioned correctly.

1. Use this command to list the partition folders on HDFS:

hadoop fs -ls /data/mysample/partitionTable/

We will see two partition folders:

/data/mysample/partitionTable/collectdate=20160730000000

/data/mysample/partitionTable/collectdate=20160731000000

2. View the data in each partition folder:

hadoop fs -text /data/mysample/partitionTable/collectdate=20160731000000/* | head -n 10

We will see that this folder has two records, because they were collected on 20160731:

2;Jean;3000

4;Jane;1000

hadoop fs -text /data/mysample/partitionTable/collectdate=20160730000000/* | head -n 10

We will see that this folder has three records, because they were collected on 20160730:

3;Jang;5000

1;Join;2000

5;Jin;6000

3. We can open the Hive client terminal and query the table partitionTable in the database cloudera_dataPartitionExample:

select * from cloudera_dataPartitionExample.partitionTable;

show create table cloudera_dataPartitionExample.partitionTable;

 

We can see that the collectdate column no longer appears in the data files themselves, because we have partitioned the data by that column; Hive stores its value in the partition directory name instead. Now we can query the data very easily, partition by partition, based on the collection date column.
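For example, restricting a query to one collection date makes Hive read only that partition's folder:

-- Partition pruning: only collectdate=20160730000000 is scanned.
select id, name, ticketPrice
from cloudera_dataPartitionExample.partitionTable
where collectdate = '20160730000000';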

I hope this post has shown how important partitioning is in a big data application and how to implement it with Apache Hive. For doubts and queries, you may contact a good big data consulting company and get related information from experts.
