Skew insensitive parallel hash joins (skew buster)

Registered by QF Chen

Trafodion partitions tables across all processing nodes and distributes the rows of the tables evenly across all disk partitions using a hashing algorithm. The columns used for the hash key are chosen by the DBA to get as even a distribution of rows as possible. This enables all of the processing nodes to have equal-size private subsets of each table to process, thereby giving each node the same amount of work to do in parallel. As long as large tables are simply scanned, there is no problem. Or if large tables are joined or grouped by the same columns used for the hashed distribution in data loading, then 100 percent parallelism is achieved. However, this does not prevent data skew in the intermediate tables that must be joined or processed with a group by operation in complex, multi-stage queries.

Data skew comes in multiple forms. The most common is the “most popular values” data skew, which happens when certain values in an important data column occur significantly more often than other values. In intermediate stages of complex queries, two large tables might need to be joined on a join key that differs from the original key used to hash the distribution of the tables. The values in join columns can turn out to be very skewed (a very large percentage of rows have the same value). A good hashing algorithm evenly distributes the values across nodes, but if there are a disproportionate number of rows with the same value in the hash key columns, all of these rows end up on the same node for processing. One or a few nodes will be overworked, while other nodes will have little to do.

The skew buster feature in Trafodion can recognize situations where data is skewed in intermediate stages of a query and adjust the query plan and execution time redistribution of intermediate data to ensure that all data is evenly distributed over all processing nodes.

Blueprint information

Status:
Complete
Approver:
None
Priority:
Medium
Drafter:
QF Chen
Direction:
Needs approval
Assignee:
QF Chen
Definition:
Approved
Series goal:
None
Implementation:
Implemented
Milestone target:
None
Started by
QF Chen
Completed by
QF Chen

Related branches

Sprints

Whiteboard

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.