Bulk Loader for Trafodion

Registered by Hans Zeller

HBase has a feature to produce HFiles from bulk data and to integrate them into an existing table. Khaled has designed and implemented a bulk loader that is based on this feature.


1. What is HBase bulk load:
HBase bulk load is the process of preparing and loading HFiles, HBase's own storage format, directly into the region servers, bypassing the write path and the cost associated with it. The write path begins at a client, moves to a region server, and ends when the data is eventually written to an HBase data file called an HFile.
2. The HBase bulk load process takes place in two phases:
2.1. Preparation phase
2.1.1. This step is usually done as a MapReduce job that extracts the data from the source files and generates the KeyValue pairs that will populate the HFiles. The KeyValue pairs need to be sorted, and the MapReduce job should be configured to do the sorting on the generated KeyValues.
2.1.2. In the case of Trafodion, one more thing needs to be done: the data needs to be encoded.
2.2. Loading the files into Hbase phase
2.2.1. This step uses the LoadIncrementalHFiles utility (also known as the completebulkload tool) to load the generated HFiles into the region servers; a usage sketch follows this list. This step should not take much time or many resources.
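For reference, outside of Trafodion this loading phase corresponds to invoking the standard HBase tool against a folder of prepared HFiles; the staging path and table name below are hypothetical:

    # merge prepared HFiles from a staging folder into an existing table
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /bulkload/orders TRAFODION.SCH.ORDERS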
3. Integration of Trafodion and HBase bulk load.
Two approaches were studied:
3.1. Approach 1: data preparation is done by Trafodion itself. Trafodion reads the data from the source files in HDFS (Hive), converts and encodes the data using Trafodion expressions, sorts it, and then produces KeyValue pairs and writes them to HFiles directly on the Hadoop file system, bypassing the HBase client and the write path.
3.2. Approach 2: data preparation is done using custom MapReduce jobs. In this case we may need to replicate the encoding/expression mechanism that is used in Trafodion. It also requires querying the metadata to figure out all the conversions.
4. Approach selected
4.1. For now I have started implementing Approach 1 and have a working prototype. Approach 2 can be implemented later if we need it.
To load data, the user can use the LOAD command:
                   load into trafodion.sch.orders select o_orderkey, o_custkey, o_orderstatus, o_totalprice, cast(o_orderdate as date),
                   o_orderpriority, o_clerk, o_shippriority, o_comment from hive.hive.orders;
(Explain plan figure omitted.)
5. What is supported and not supported in the first delivery after approval:
5.1. The preparation phase can generate HFiles in parallel.
5.2. Indexes and check constraints are not supported in the first deliverable and will be supported in future deliverables.
5.3. Transactions are not supported; if the table is not empty and something goes wrong, the table may not be recoverable. A check for whether the table is empty can be added.
5.4. The HFiles produced by the preparation phase are stored in HDFS and may take as much space as the original source files. Once the loading phase is done, the HFiles are removed.
5.5. Exception and error-row logging is not supported in this deliverable and will be supported in future deliverables.
5.6. I implemented CIF (the Compressed Internal Format) for the move expression in HDFSScan.
5.7. Performance optimizations will be done in future deliverables and may include changes to the plan and other optimizations.

Blueprint information

Status:
Complete
Approver:
None
Priority:
High
Drafter:
Khaled Bouaziz
Direction:
Approved
Assignee:
Khaled Bouaziz
Definition:
Approved
Series goal:
None
Implementation:
Implemented
Milestone target:
None
Started by:
Suresh Subbiah
Completed by:
Suresh Subbiah


Whiteboard

1. What is Trafodion Bulk Load
Trafodion bulk load is the process of preparing and loading HFiles directly in the region servers, bypassing the write path and the cost associated with it. The write path begins at a client, moves to a region server, and ends when the data is eventually written to an HBase data file called an HFile.
The Trafodion bulk load process takes place in two phases:
1. Preparation phase
o In this step, Trafodion reads the data from the source files in Hive/HDFS, partitions the data based on the target table's partitioning scheme, sorts the data, and then generates the KeyValue pairs that will populate the HFiles. Trafodion also encodes the data for faster storage and retrieval.
2. Loading the files into HBase phase
o This step uses the LoadIncrementalHFiles utility (also known as the completebulkload tool) to load the generated HFiles into the region servers.
2. Syntax
The Trafodion bulk load syntax is described below:

    LOAD [WITH option[, option, ...]] INTO <target-table> SELECT ... FROM <source-table>

Where:
• <target-table> is the target Trafodion table where the data will be loaded.
• <source-table> can be a Trafodion table or a Hive table that holds the source data. Hive tables can be accessed in Trafodion using the hive.hive schema (for example, hive.hive.orders). The Hive table needs to already exist in Hive before Trafodion can access it. If users want to load data that is already in an HDFS folder, they need to create an external Hive table with the right fields, pointing to the HDFS folder containing the data. Users can also specify a WHERE clause on the source data as a filter.
• WITH option[, option, ...] is a set of zero or more of the following options (an example combining several options follows the sample output below):
o TRUNCATE TABLE: by default, the target table is not truncated before loading data. If the TRUNCATE TABLE option is specified, the target table is truncated before the load starts.
o NO RECOVERY: by default, Trafodion bulk load handles recovery using the HBase snapshot mechanism. If the NO RECOVERY option is specified, snapshots are not used.
o NO POPULATE INDEXES: by default, index maintenance is handled by Trafodion bulk load: indexes are disabled before the load starts and populated after the load completes. If the NO POPULATE INDEXES option is specified, indexes are not handled by the load.
o NO DUPLICATE CHECK: this option concerns only duplicates coming from the source. By default, an error is generated when duplicates are detected in the source data. If the NO DUPLICATE CHECK option is specified, duplicates are ignored.
o NO OUTPUT: by default, the LOAD command prints status messages listing the steps being executed. If NO OUTPUT is specified, no status messages are displayed.
• Example: below is an example of running the LOAD statement to load the customer_demographics_salt table from the Hive table hive.hive.customer_demographics:
         >>load with populate indexes into customer_demographics_salt
+>select * from hive.hive.customer_demographics where cd_demo_sk <= 5000;
Task: LOAD Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
Task: DISABLE INDEXE Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
Task: DISABLE INDEXE Status: Ended Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
Task: PREPARATION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
       Rows Processed: 5000
Task: PREPARATION Status: Ended ET: 00:00:03.199
Task: COMPLETION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
Task: COMPLETION Status: Ended ET: 00:00:00.331
Task: POPULATE INDEX Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
Task: POPULATE INDEX Status: Ended ET: 00:00:05.262
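Options can also be combined in a single statement, per the syntax above. As an illustration (a sketch, reusing table names that appear elsewhere in this document), the following statement truncates the target table, skips index population, and suppresses status messages:

    load with truncate table, no populate indexes, no output
      into trafodion.sch.customer_demographics
      select * from hive.hive.customer_demographics;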

3. Configuration
3.1. Staging folder for HFiles
Trafodion bulk load uses an HDFS folder as a staging area for the HFiles prior to calling HBase APIs to merge them into the Trafodion table. By default Trafodion uses "/bulkload/" as the staging folder. This folder needs to be owned by the same user as the one under which Trafodion runs, and Trafodion needs full permissions on it. The HBase user (the user under which HBase runs) needs to have read/write access to this folder; a setup sketch is given at the end of section 3.2 below.
Example: drwxr-xr-x - trafodion trafodion 0 2014-07-07 09:49 /bulkload
3.2. Access Control Lists (ACLs)
As of Trafodion 0.9, users can also use ACLs to set the right permissions on the /bulkload folder.
More information on ACLs can be found at: http://hadoop.apache.org/docs/r0.18.3/hdfs_permissions_guide.html
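As an illustration, a minimal setup of the staging folder could look like the following, assuming users named trafodion and hbase; the exact commands and permissions depend on the deployment:

    # create the staging folder and give the Trafodion user ownership
    sudo -u hdfs hdfs dfs -mkdir /bulkload
    sudo -u hdfs hdfs dfs -chown trafodion:trafodion /bulkload
    # grant the HBase user read/write access via an HDFS ACL
    sudo -u hdfs hdfs dfs -setfacl -m user:hbase:rwx /bulkload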

3.3. HDFS
To load data stored in HDFS, we need to create a Hive table with the right fields and types, pointing to the HDFS folder containing the data, before we start the load. The data files under the HDFS folder need to be in delimited format. The row delimiter is the Hive row delimiter, which is '\n' (ASCII 0x0A). The field delimiter also needs to be one byte in length, with a value between 1 and 255 (decimal). The data in the data files needs to be in either the ISO-8859-1 or the UTF-8 character set; other character sets are currently not supported.
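For instance, a Hive DDL that makes these delimiters explicit could look like the following; the table name and location are hypothetical:

    create external table orders_raw
    (
        o_orderkey int,
        o_comment string
    )
    row format delimited
    fields terminated by '|'   -- single-byte field delimiter (1-255 decimal)
    lines terminated by '\n'   -- the Hive row delimiter
    stored as textfile
    location '/user/data/orders_raw';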

3.4. Snapshots
If the NO RECOVERY option is not specified, Trafodion bulk load uses HBase snapshots as its recovery mechanism. Taking a snapshot is a lightweight operation in which some metadata is copied (the data itself is not copied). A snapshot is taken before the load starts and removed after the load completes successfully. If something goes wrong and recovery is possible, the snapshot is used to restore the table to its initial state before the load started. HBase also needs to be configured to allow snapshots.
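Conceptually, this recovery mechanism corresponds to the following HBase shell operations; the snapshot name below is hypothetical:

    # in the HBase shell (snapshots must be enabled in hbase-site.xml)
    snapshot 'TRAFODION.SCH.ORDERS', 'orders_preload'   # taken before the load starts
    delete_snapshot 'orders_preload'                    # removed after a successful load
    # on failure, the table is disabled and restored from the snapshot instead:
    disable 'TRAFODION.SCH.ORDERS'
    restore_snapshot 'orders_preload'
    enable 'TRAFODION.SCH.ORDERS'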

4. Examples:
4.1. Example 1:
For customer demographics data residing in /hive/tpcds/customer_demographics, we create an external Hive table using the following Hive SQL:
create external table customer_demographics
(
    cd_demo_sk int,
    cd_gender string,
    cd_marital_status string,
    cd_education_status string,
    cd_purchase_estimate int,
    cd_credit_rating string,
    cd_dep_count int,
    cd_dep_employed_count int,
    cd_dep_college_count int
)
row format delimited fields terminated by '|'
location '/hive/tpcds/customer_demographics';

The Trafodion table where we want to load the data is defined using the following DDL:

create table customer_demographics_salt
(
  cd_demo_sk int not null,
  cd_gender char(1),
  cd_marital_status char(1),
  cd_education_status char(20),
  cd_purchase_estimate int,
  cd_credit_rating char(10),
  cd_dep_count int,
  cd_dep_employed_count int,
  cd_dep_college_count int,
  primary key (cd_demo_sk)
)
salt using 4 partitions on (cd_demo_sk);
To load the data we can use the statement:
load with populate indexes into customer_demographics_salt select * from hive.hive.customer_demographics where cd_demo_sk <= 5000;

4.2. Example 2:

4.2.1. Mapping HDFS data to a Hive table
To load data stored in HDFS, a Hive table with the right fields and types needs to be created, pointing to the HDFS folder containing the data. The Hive table must use the TEXTFILE format.

For example, if the data has two fields (an int and a string) delimited by '|' in the folder '/hive/tpcds/customer_demographics', then the following CREATE statement needs to be run in the Hive shell:
create external table customer_demographics
(
    cd_demo_sk int,
    cd_name string
)
row format delimited fields terminated by '|'
location '/hive/tpcds/customer_demographics';

4.2.2. Staging folder for HFiles
Trafodion bulk load uses an HDFS folder as a staging area for temporary HFiles prior to calling HBase APIs to merge them into the target Trafodion table. By default Trafodion uses "/bulkload/" as the staging folder. This folder is created by the installer with appropriate permissions. On some systems there may be access permission errors related to this folder; if such an error is seen, please contact Trafodion development.

4.2.3. Trafodion table
A Trafodion table needs to be created into which the data will be loaded.

For example, the table corresponding to the HDFS data from the previous section could be:
create table Trafodion.sch.customer_demographics
(
demo_sk int not null,
name varchar(100),
primary key (demo_sk)
)
hbase_options
  (
    DATA_BLOCK_ENCODING = 'FAST_DIFF',
    COMPRESSION = 'SNAPPY'
  )
salt using 8 partitions on (demo_sk) ;

Other compression types, such as 'GZ', could also be used; an example clause follows. For salting, some planning should be done to determine how many partitions a table should have and what the partitioning keys should be.
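For instance, the same hbase_options clause with GZ compression would read:

    hbase_options
      (
        DATA_BLOCK_ENCODING = 'FAST_DIFF',
        COMPRESSION = 'GZ'
      )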
4.2.4. Loading data
Once both the Hive and Trafodion tables have been created, the data can be loaded using the following command:

load with populate indexes into trafodion.sch.customer_demographics
           select * from hive.hive.customer_demographics;

This will load the data into the Trafodion table and populate all indexes that are created on that table. If tables have indexes and disk space is a concern, each index may need to have data_block_encoding and compression enabled. For indexes this cannot be done in the CREATE INDEX statement; please contact Trafodion development on how these options can be enabled for an index.
