Handling of region rebalancing and splits

Registered by Oliver Bucaojit

Region split and balance operations that run while transactions are active are handled with similar designs, so both are covered in this blueprint. Currently the HBase balance feature is disabled to avoid losing transaction state when a region transitions to OFFLINE. A region split is delayed while there are transactions in the prepared or active state; once those maps are clear, the split operation resumes.

The in-memory transactional state does not persist across a balance or split. HBase proper handles these operations by flushing the memstore to disk before bringing down the region, then pointing the newly opened region at the file system location of the persisted data. This design follows the same process: transactional data is flushed, then read back into the coprocessor when the regions are reopened.

Shared Region Map:
The static region map will hold the objects for the TrxRegionEndpoint and SsccRegionEndpoint coprocessors. When the observer code needs to issue the flush or read methods on the region, it will get the coprocessor object and call the public methods to flush or read the data from the filesystem. This is especially useful for the split operation as the parent region will be able to coordinate the flush/read and the setting of the ‘closing’ flag so that new external transactions are not started while the persisted transactions are being read.
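
A minimal sketch of such a shared map is shown here. The interface and its method names (flushTransactionalState, readTransactionalState, setClosing) are hypothetical stand-ins for the public methods of TrxRegionEndpoint and SsccRegionEndpoint, and keying the map by region name is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the public flush/read surface of
// TrxRegionEndpoint and SsccRegionEndpoint; method names are assumptions.
interface TransactionalEndpoint {
    void flushTransactionalState(String path);
    void readTransactionalState(String path);
    void setClosing(boolean closing);
}

// Static registry shared by the observer and endpoint coprocessors in one JVM.
final class SharedRegionMap {
    private static final ConcurrentMap<String, TransactionalEndpoint> ENDPOINTS =
            new ConcurrentHashMap<>();

    private SharedRegionMap() {}

    // Called by an endpoint when its region starts.
    static void register(String regionName, TransactionalEndpoint endpoint) {
        ENDPOINTS.put(regionName, endpoint);
    }

    // Called by the observer to reach a region's endpoint; returns null if absent.
    static TransactionalEndpoint lookup(String regionName) {
        return ENDPOINTS.get(regionName);
    }

    // Called when the region closes so stale objects are not retained.
    static void unregister(String regionName) {
        ENDPOINTS.remove(regionName);
    }
}
```

With a map like this, the observer of a parent region could look up each daughter's endpoint by name and set its closing flag before triggering the read.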

Persisting Data:
In-memory transactional state will be serialized into Google protocol buffers and flushed to an HFile during the preClose() process of the region observer. Two types of protocol buffers are used: one for each TransactionState object, and another to recreate the region-level transactional information, such as transactionsById, commitedTransactionsBySequenceNumber, and nextSequenceId.

HFiles will be written to the same directory in which the region information is saved.

Reading Data:
For the balance operation, the read will be done on postOpen() of the region. This allows the HFile read and replay operation to complete while the region is not available.

For the split operation, the read will be coordinated by the parent region during postSplit(). This is necessary because the parent region is aware of the split status, the coordination, and the HFile path. Another problem that needed to be solved was that the HRegion information is not visible to the RegionObserver until postSplit(), which is after the regions have been opened. This matters because there is otherwise no way of letting a region know that it is starting as a daughter region rather than as a non-split region.

Incorporating Delays:
This design will incorporate a delay for prepared transactions and scans. These operations complete relatively quickly, so the delay check will be 0.5 seconds or so. In future implementations these may be persisted instead; tests will need to be run to determine whether that is necessary.
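
The delay can be sketched as a simple polling loop. This is an illustration only: it reads the 0.5 seconds as a polling interval, and the class name, the busy check, and the upper bound on the number of checks are all assumptions that do not come from the design.

```java
import java.util.function.BooleanSupplier;

final class DrainDelay {
    private DrainDelay() {}

    // Polls until `busy` reports false or maxChecks polls have been made.
    // `busy` stands in for "prepared transactions or scans are present".
    // Returns true if the region drained, false if it is still busy.
    static boolean waitUntilDrained(BooleanSupplier busy, long intervalMillis, int maxChecks) {
        for (int i = 0; i < maxChecks; i++) {
            if (!busy.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return !busy.getAsBoolean();
    }
}
```

A call such as DrainDelay.waitUntilDrained(check, 500L, 20) would poll every half second for up to ten seconds before giving up.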

ZooKeeper Coordination:
ZooKeeper will be used for the regions to know whether a balance or split is occurring, and to store the path of the flushed HFile. As an alternative, this information can be stored in HDFS or any other persistent store. The format of the ZK node will be:
/hbase/Trafodion/splitbalance/<TABLE NAME>/<SPLIT OR BALANCE>/
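
As a sketch, the node path could be built with a small helper like the one below. The class and enum names are hypothetical, and using the literal tokens SPLIT and BALANCE for the last path element is an assumption. The trailing slash is dropped because ZooKeeper does not accept paths that end in "/"; the path of the flushed HFile would be stored as the data of the node.

```java
final class SplitBalancePaths {
    // Assumed tokens for the <SPLIT OR BALANCE> element of the path.
    enum Operation { SPLIT, BALANCE }

    static final String ROOT = "/hbase/Trafodion/splitbalance";

    private SplitBalancePaths() {}

    // Builds /hbase/Trafodion/splitbalance/<TABLE NAME>/<SPLIT OR BALANCE>
    static String nodePath(String tableName, Operation op) {
        return ROOT + "/" + tableName + "/" + op.name();
    }
}
```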

Client Retry:
Client-side calls will need to have a retry and wait added depending on the Exception. Currently I have added a one second delay on retry with up to 15 retries, so the maximum delay for the client is 15 seconds. The retry will trigger if the region has thrown a ‘closing region’ exception or if the region was unavailable (the coprocessor call returned no results).
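
The retry behavior described above can be sketched as follows. RegionClosingException and RetryingCaller are hypothetical names, not the actual client classes, and a null result is used here to model a coprocessor call that returned no results.

```java
import java.util.function.Supplier;

// Hypothetical exception standing in for the 'closing region' error.
class RegionClosingException extends RuntimeException {
    RegionClosingException(String message) {
        super(message);
    }
}

final class RetryingCaller {
    private RetryingCaller() {}

    // Makes one initial attempt plus up to maxRetries retries, sleeping
    // delayMillis before each retry. With 15 retries and a 1000 ms delay the
    // total added wait is at most 15 seconds.
    static <T> T callWithRetry(Supplier<T> call, int maxRetries, long delayMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                T result = call.get();
                if (result != null) {
                    return result;
                }
                last = null; // no results: region unavailable, retry
            } catch (RegionClosingException e) {
                last = e;    // region is closing, retry
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalStateException("region unavailable after " + maxRetries + " retries");
    }
}
```

Any other exception type propagates immediately, which matches the idea that the retry depends on the Exception that was thrown.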

Transactional state objects to persist will include:

TransactionStates that are in transactionsById and commitedTransactionsBySequenceNumber
transactionsById
commitedTransactionsBySequenceNumber
nextSequenceId
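
To illustrate the region-level state that has to survive the flush and read, the sketch below round-trips a simplified snapshot through a byte array. It uses java.io data streams as a plain stand-in for the protocol buffers and HFile named in the design, and it reduces each TransactionState to its transaction id. The class name and field types are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class RegionTxnSnapshot {
    final long nextSequenceId;
    final List<Long> transactionIds;            // keys of transactionsById
    final SortedMap<Long, Long> committedBySeq; // sequence number -> transaction id

    RegionTxnSnapshot(long nextSequenceId, List<Long> transactionIds,
                      SortedMap<Long, Long> committedBySeq) {
        this.nextSequenceId = nextSequenceId;
        this.transactionIds = transactionIds;
        this.committedBySeq = committedBySeq;
    }

    // Serializes the snapshot; the real design would write protocol buffers.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(nextSequenceId);
            out.writeInt(transactionIds.size());
            for (long id : transactionIds) {
                out.writeLong(id);
            }
            out.writeInt(committedBySeq.size());
            for (Map.Entry<Long, Long> e : committedBySeq.entrySet()) {
                out.writeLong(e.getKey());
                out.writeLong(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Rebuilds the snapshot in the same field order used by toBytes().
    static RegionTxnSnapshot fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            long next = in.readLong();
            int idCount = in.readInt();
            List<Long> ids = new ArrayList<>();
            for (int i = 0; i < idCount; i++) {
                ids.add(in.readLong());
            }
            int committedCount = in.readInt();
            SortedMap<Long, Long> committed = new TreeMap<>();
            for (int i = 0; i < committedCount; i++) {
                long seq = in.readLong();
                long txId = in.readLong();
                committed.put(seq, txId);
            }
            return new RegionTxnSnapshot(next, ids, committed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```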

__Design for Balance__
preClose() --
   If RS is closing, return
   Check for prepared transactions or scans, delay if present
   Block new transactions
   Write out transactional state
   ZooKeeper check -- if no split node set, set balance node

-- Close() --

-- Region open --
postOpen() --
   If no ZK split node set then read transactional information
   Reinstate the objects
   Clear ZK balance node
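
The ZooKeeper decisions in the balance flow can be sketched as below. A plain Set stands in for the ZooKeeper nodes, the Runnable arguments stand in for the flush and read of transactional state, and all names are hypothetical. The delay and the blocking of new transactions are left out to keep the sketch short.

```java
import java.util.Set;

final class BalanceCoordinator {
    // Stand-in for ZooKeeper: holds "<table>/SPLIT" and "<table>/BALANCE" markers.
    private final Set<String> zkNodes;

    BalanceCoordinator(Set<String> zkNodes) {
        this.zkNodes = zkNodes;
    }

    // preClose(): returns true if the balance node was set by this call.
    boolean onPreClose(String table, boolean regionServerClosing, Runnable writeState) {
        if (regionServerClosing) {
            return false;              // RS is closing, nothing to do
        }
        writeState.run();              // write out transactional state
        if (zkNodes.contains(table + "/SPLIT")) {
            return false;              // a split is in progress, leave the balance node unset
        }
        return zkNodes.add(table + "/BALANCE");
    }

    // postOpen(): returns true if the transactional state was read back.
    boolean onPostOpen(String table, Runnable readState) {
        if (zkNodes.contains(table + "/SPLIT")) {
            return false;              // the split path reads the state in postSplit()
        }
        readState.run();               // read state and reinstate the objects
        zkNodes.remove(table + "/BALANCE");
        return true;
    }
}
```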

__Design for Split__
preSplit() --
   Check for prepared transactions or scans, delay if present
   Block new transactions
   Set split ZK node

preClose() --
   Write out transactional state

-- Close() --

-- Daughter regions open --
postSplit() --
   From parent region, coordinate the daughter regions’ reading of transactional state from parent
   Set region to closing so no new transactions are started
   if(this.m_Region.rowIsInRange(this.regionInfo, row)) { /* process data */}
   Reinstate the objects
   Set region closing flag to ‘false’
   Clear split ZK node
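
The range check in postSplit() decides which persisted rows each daughter region should process. The sketch below reproduces that idea without HBase classes: KeyRange is a hypothetical type, and it assumes the usual region convention of an inclusive start key, an exclusive end key, and an empty key meaning unbounded.

```java
final class KeyRange {
    private final byte[] startKey; // inclusive; empty means unbounded
    private final byte[] endKey;   // exclusive; empty means unbounded

    KeyRange(byte[] startKey, byte[] endKey) {
        this.startKey = startKey;
        this.endKey = endKey;
    }

    // True if the row belongs to this range, so this daughter should process it.
    boolean contains(byte[] row) {
        boolean afterStart = startKey.length == 0 || compare(row, startKey) >= 0;
        boolean beforeEnd = endKey.length == 0 || compare(row, endKey) < 0;
        return afterStart && beforeEnd;
    }

    // Unsigned lexicographic byte comparison.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With a split point of "m", a row "apple" would be processed only by the lower daughter and a row "zebra" only by the upper one, while a row equal to the split point goes to the upper daughter.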

--------------------------------------------------------------------------------------

This current design will implement the transaction state flush/read feature. Depending on the amount of time the region is unavailable, other changes may be necessary. This is still something to be determined with further testing.

One idea would be to implement both the active transaction wait and the flush/read feature. The active transaction wait could be configured to always wait as long as the number of transactions for a given region is above a specified threshold. Another idea would be to have the active wait feature delay until a maintenance window, where the administrator can choose to allow the split/balance operations to continue.

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: Oliver Bucaojit
Direction: Needs approval
Assignee: Oliver Bucaojit
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

Have completed code changes and passed the following tests:

Balance -- Loading a table through sqlci and ‘disabling’ then ‘enabling’ the table
   - Client-side retries until the region is re-enabled (configurable wait limit, currently 15 sec)
Balance -- Having 2 RegionServers on a workstation and running the balancer while the table is being loaded
   - Making sure the region was moved to a different RS while loading
Split -- For the split test, I create a table and set MAX_FILESIZE and MEMSTORE_FLUSHSIZE to a small size, 4342177, so that it splits with a small load
   a) Load the table until it splits into 2 regions, then check that the number of rows in the table matches the rows inserted
   b) Load the table with 3 separate sessions until it splits into 2 regions
   c) Repeat the loading and monitor the region numbers; successfully tested 11 region splits
