Bulk Loader for Trafodion
HBase has a feature to produce HFiles from bulk data and to integrate them into an existing table. Khaled has designed and implemented a bulk loader that is based on this feature.
1. What is HBase bulk load:
HBase bulk load is the process of preparing HFiles, HBase's own storage format, and loading them directly into the region servers, bypassing the normal write path and the cost associated with it. The write path begins at a client, moves to a region server, and ends when the data is eventually written to an HBase data file called an HFile.
2. The HBase bulk load process takes place in two phases:
2.1. Preparation phase
2.1.1. This step is usually done as a MapReduce job that extracts the data from the source files and generates the KeyValue pairs that will populate the HFiles. The KeyValue pairs need to be sorted, and the MapReduce job should be configured to do the sorting on the generated KeyValues.
2.1.2. In the case of Trafodion one more thing needs to be done: the data needs to be encoded.
2.2. Loading the files into HBase phase
2.2.1. This step uses the HBase LoadIncrementalHFiles tool.
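For reference, this loading step corresponds to HBase's standalone completebulkload tool, which can also be invoked from the command line (the staging path and table name below are placeholders):

```shell
# Move completed HFiles from an HDFS staging folder into the regions of an
# existing HBase table, bypassing the normal write path.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /bulkload/staged_hfiles TRAFODION.SCH.T1
```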
3. Integration of Trafodion and HBase bulk load
Two approaches were studied:
3.1. Approach 1: in this approach, the data preparation is done by Trafodion. Trafodion reads the data from the source files in HDFS (Hive), sorts it, converts and encodes it using Trafodion expressions, and then produces the KeyValue pairs and writes them to HFiles on the Hadoop file system directly, bypassing the HBase client and the write path.
3.2. Approach 2: in this approach, data preparation is done using custom MapReduce jobs. In this case we may need to replicate the Trafodion encoding/decoding logic in the MapReduce jobs.
4. Approach selected
4.1. For now I have started implementing Approach 1 and have a working prototype. Approach 2 can be implemented later if needed.
To load data the user can use the LOAD command; its syntax and an example are shown in the Whiteboard section below.
5. What is supported and not supported for the first delivery after approval
5.1. The preparation phase can generate HFiles in parallel.
5.2. Indexes and check constraints are not supported in the first deliverable and will be supported in future deliverables.
5.3. Transactions are not supported, and if the table is not empty and something goes wrong, the table may not be recoverable. A check on whether the table is empty can be added.
5.4. The HFiles produced by the preparation phase are stored in HDFS and may take as much space as the original source files. Once the loading phase is done, the HFiles are removed.
5.5. Exception and error-row logging is not supported in this deliverable and will be supported in future deliverables.
5.6. I implemented CIF for the move expression in the HDFSScan.
5.7. Performance optimizations will be done in future deliveries and may include changes to the plan and other optimizations.
Blueprint information
- Status:
- Complete
- Approver:
- None
- Priority:
- High
- Drafter:
- Khaled Bouaziz
- Direction:
- Approved
- Assignee:
- Khaled Bouaziz
- Definition:
- Approved
- Series goal:
- None
- Implementation:
- Implemented
- Milestone target:
- None
- Started by:
- Suresh Subbiah
- Completed by:
- Suresh Subbiah
Whiteboard
1. What is Trafodion Bulk Load
Trafodion Bulk Load is the process of preparing HFiles and loading them directly into the region servers, bypassing the normal write path and the cost associated with it. The write path begins at a client, moves to a region server, and ends when the data is eventually written to an HBase data file called an HFile.
The Trafodion bulk load process takes place in two phases:
1. Preparation phase
o In this step Trafodion reads the data from the source files in Hive/HDFS, partitions the data based on the target table's partitioning scheme, sorts the data, and then generates the KeyValue pairs that will populate the HFiles. Trafodion also encodes the data for faster storage and retrieval.
2. Loading the files into HBase phase
o This step uses the HBase LoadIncrementalHFiles tool.
2. Syntax
The Trafodion Bulk Load syntax is described below:
LOAD [WITH option[, option, ...]] INTO <target-table> SELECT ... FROM <source-table>
Where:
• <target-table> is the target Trafodion table where the data will be loaded.
• <source-table> can be a Trafodion table or a Hive table that holds the source data. Hive tables can be accessed in Trafodion using the hive.hive schema (example: hive.hive.orders). The Hive table needs to already exist in Hive before Trafodion can access it. If users want to load data that is already in an HDFS folder, they need to create an external Hive table with the right fields, pointing to the HDFS folder containing the data. Users can also specify a WHERE clause on the source data as a filter.
• [WITH option[, option, ...]] is a set of options the user can specify; they can be zero or more of the following:
o TRUNCATE TABLE: By default the target table is not truncated before loading data. If the TRUNCATE TABLE option is specified, the target table is truncated before starting the load.
o NO RECOVERY: By default Trafodion Bulk Load handles recovery using the HBase snapshots mechanism. If the NO RECOVERY option is specified, snapshots are not used.
o NO POPULATE INDEXES: By default index maintenance is handled by Trafodion Bulk Load: indexes are disabled before starting the load and populated after the load is complete. If the NO POPULATE INDEXES option is specified, indexes are not handled by the load.
o NO DUPLICATE CHECK: This option applies only to duplicates coming from the source. By default an error is generated when duplicates are detected in the source data. If the NO DUPLICATE CHECK option is specified, duplicates are ignored.
o NO OUTPUT: By default the LOAD command prints status messages listing the steps being executed. If NO OUTPUT is specified, no status messages are displayed.
• Example: below is an example of running the LOAD statement to load customer_
>>load with populate indexes into customer_
+>select * from hive.hive.
Task: LOAD Status: Started Object: TRAFODION.
Task: DISABLE INDEXE Status: Started Object: TRAFODION.
Task: DISABLE INDEXE Status: Ended Object: TRAFODION.
Task: PREPARATION Status: Started Object: TRAFODION.
Rows Processed: 5000
Task: PREPARATION Status: Ended ET: 00:00:03.199
Task: COMPLETION Status: Started Object: TRAFODION.
Task: COMPLETION Status: Ended ET: 00:00:00.331
Task: POPULATE INDEX Status: Started Object: TRAFODION.
Task: POPULATE INDEX Status: Ended ET: 00:00:05.262
3. Configuration
3.1. Staging folder for Hfiles
Trafodion Bulk Load uses an HDFS folder as a staging area for the HFiles prior to calling the HBase APIs that merge them into the Trafodion table. By default Trafodion uses “/bulkload/” as the staging folder. This folder needs to be owned by the same user as the one under which Trafodion runs, and Trafodion needs to have full permissions on it. The HBase user (the user under which HBase runs) needs to have read/write access to this folder.
Example: drwxr-xr-x - trafodion trafodion 0 2014-07-07 09:49 /bulkload
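A sketch of how the staging folder could be set up with standard HDFS commands, assuming the default user names trafodion and hbase (adjust for your cluster; an HDFS ACL, described in the next section, is one way to give the hbase user write access):

```shell
# Create the staging folder, owned by the user Trafodion runs as.
sudo -u hdfs hadoop fs -mkdir -p /bulkload
sudo -u hdfs hadoop fs -chown trafodion:trafodion /bulkload
# Matches the example listing above (drwxr-xr-x); the hbase user still needs
# write access, e.g. via group membership or an HDFS ACL.
sudo -u hdfs hadoop fs -chmod 755 /bulkload
```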
3.2 Access Control Lists (ACL)
As of Trafodion 0.9, users can also use ACLs to set the right permissions on the /bulkload folder.
More information on ACLs can be found at: http://
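For example, using the standard HDFS ACL commands (the hbase user name and the default /bulkload path are assumptions; adjust for your installation):

```shell
# Grant the hbase user read/write/execute on the staging folder without
# loosening the base permissions for other users.
hdfs dfs -setfacl -m user:hbase:rwx /bulkload
# Inspect the resulting ACL entries.
hdfs dfs -getfacl /bulkload
```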
3.3. HDFS
To load data stored in HDFS, we need to create a Hive table with the right fields and types, pointing to the HDFS folder containing the data, before we start the load. The data files under the HDFS folder need to be in delimited format. The row delimiter is the Hive row delimiter, which is '\n' (ASCII 0x0A). The field delimiter also needs to be one byte in length, with values between 1 and 255 (decimal). The data in the data files needs to be in either the ISO-8859-1 or the UTF-8 character set; character sets other than these two are currently not supported.
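As a quick local sanity check (a sketch; the file path and contents are made up), a sample data file in this delimited format can be generated and its field count verified before pointing a Hive table at it:

```shell
# Two sample rows: '\n' (ASCII 0x0A) row delimiter, '|' field delimiter.
printf '1|M\n2|F\n' > /tmp/customer_sample.dat
# Every row should report the same number of fields (here, 2).
awk -F'|' '{ print NF }' /tmp/customer_sample.dat
```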
3.4. Snapshots
If the NO RECOVERY option is not specified, Trafodion Bulk Load uses HBase snapshots as the recovery mechanism. Snapshots are a lightweight operation in which some metadata is copied (the data itself is not copied). A snapshot is taken before the load starts and removed after the load completes successfully. If something goes wrong and recovery is possible, the snapshot is used to restore the table to its state before the load started. HBase also needs to be configured to allow snapshots.
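The underlying HBase shell operations look roughly like this sketch (the table and snapshot names are made up; Trafodion drives the equivalent steps through the HBase API):

```shell
# Take a snapshot before the load starts (metadata only, no data is copied).
echo "snapshot 'TRAFODION.SCH.T1', 't1_pre_load'" | hbase shell
# On failure, restore the table to its pre-load state (table must be disabled).
printf "disable 'TRAFODION.SCH.T1'\nrestore_snapshot 't1_pre_load'\nenable 'TRAFODION.SCH.T1'\n" | hbase shell
# On success, remove the snapshot.
echo "delete_snapshot 't1_pre_load'" | hbase shell
```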
4. Examples:
4.1. Example 1:
For customer demographics data residing in /hive/tpcds/ the following external Hive table can be created:
create external table customer_
(
cd_demo_sk int,
cd_gender string,
cd_
cd_
cd_
cd_
cd_dep_count int,
cd_
cd_
)
row format delimited fields terminated by '|'
location '/hive/
The Trafodion table where we want to load the data is defined using the DDL:
create table customer_
(
cd_demo_sk int not null,
cd_gender char(1),
cd_marital_status char(1),
cd_education_
cd_purchase_
cd_credit_rating char(10),
cd_dep_count int,
cd_dep_
cd_dep_
primary key (cd_demo_sk)
)
salt using 4 partitions on (cd_demo_sk);
To load the data we can use the statement:
load with populate indexes into customer_
4.2. Example 2:
4.2.1. Mapping HDFS data to a Hive table
To load data stored in HDFS, a Hive table with the right fields and types needs to be created, pointing to the HDFS folder containing the data. The Hive file must be of format TEXTFILE. For example, if a table contains data with 2 fields (a string and an int) delimited by '|' in folder '/hive/ the following statement needs to be run in the Hive shell:
create external table customer_
(
cd_demo_sk int,
cd_name string
)
row format delimited fields terminated by '|'
location '/hive/
4.2.2. Staging folder for HFiles
Trafodion Bulk Load uses an HDFS folder as a staging area for temporary HFiles prior to calling the HBase APIs that merge them into the target Trafodion table. By default Trafodion uses “/bulkload/” as the staging folder. This folder is created by the installer with appropriate permissions. On some systems there may be access permission errors related to this folder; if such an error is seen, please contact Trafodion development.
4.2.3. Trafodion table
A Trafodion table needs to be created into which the data will be loaded.
For example, the table corresponding to the HDFS data from the previous section would be:
create table Trafodion.
(
demo_sk int not null,
name varchar(100),
primary key (demo_sk)
)
hbase_options
(
DATA_
COMPRESSION = 'SNAPPY'
)
salt using 8 partitions on (demo_sk) ;
Other compression types such as ‘GZ’ can also be used. For salting, some planning should be done to determine how many partitions a table should have and what the partitioning keys should be.
4.2.4. Loading data
Once both the Hive and Trafodion tables have been created, data can be loaded using the following command:
load with populate indexes into trafodion.
select * from hive.hive.
This will load the data into the Trafodion table and populate all indexes created on that table. If tables have indexes and disk space is a concern, each index may need to have DATA_BLOCK_ENCODING and COMPRESSION enabled. For an index this cannot be done in the CREATE INDEX statement; please contact Trafodion development on how these options can be enabled for an index.