Integrate Update Statistics with Bulk Load for Faster Initial Stats
The cost of generating and updating statistics has long been a problem for Trafodion and its predecessor systems. Many performance improvements have been made, but Update Statistics remains a bottleneck for many applications. This blueprint describes an alternative approach to constructing histograms that preserves their existing form. Rather than sorting the data and grouping common values, a Counting Bloom Filter (CBF) records the frequency of each distinct value. Those frequencies are then used to build the histogram intervals and to derive the distribution of frequency values needed to estimate UECs (unique entry counts) from the sample.

The immediate work proposed in this blueprint addresses the initial collection of statistics for a table that is first populated by a bulk load. It includes adding Update Statistics as an optional task of the bulk load utility (controlled initially by a CQD, but possibly by amending the bulk load syntax in the future); creating the sample table by randomly selecting rows as they pass through the loader and writing them to a Hive table; and using the sample table, together with CBFs and a new "fast stats" algorithm, to create histograms for the table's columns when the bulk load completes.

Subsequent work will address making the sample table persistent and updating it (and perhaps the histograms as well) when the HBase flush mechanism creates a new HFile from a MemStore. That work will be undertaken post-1.1.
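The CBF-based counting described above can be illustrated with a minimal Python sketch. It is not the proposed "fast stats" implementation. The class name, counter array size, number of hash functions, choice of BLAKE2b, and the rule for detecting a first occurrence are all assumptions made for illustration. The sketch counts each value of a sampled column without sorting, then derives the frequency-of-frequencies distribution (how many distinct values occur once, twice, and so on) that a UEC estimator would consume.

```python
import hashlib
from collections import Counter


class CountingBloomFilter:
    """Array of counters addressed by k hash functions (sizes are hypothetical)."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _slots(self, value):
        # Derive k slot indexes from one 128-bit digest via double hashing.
        digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_counters for i in range(self.num_hashes)]

    def add(self, value):
        """Count one occurrence; return True if the value looked new."""
        slots = self._slots(value)
        looked_new = min(self.counters[s] for s in slots) == 0
        for s in slots:
            self.counters[s] += 1
        return looked_new

    def count(self, value):
        # The minimum counter never undercounts; collisions can only inflate it.
        return min(self.counters[s] for s in self._slots(value))


def frequency_of_frequencies(sample_column):
    """Return (cbf, distinct_values, f), where f[i] = number of values seen i times."""
    cbf = CountingBloomFilter()
    distinct_values = []  # assumption: keep each value the first time it looks new
    for value in sample_column:
        if cbf.add(value):
            distinct_values.append(value)
    f = Counter(cbf.count(v) for v in distinct_values)
    return cbf, distinct_values, f


# Hypothetical sampled column values.
sample = ["NY", "CA", "NY", "TX", "NY", "CA"]
cbf, distinct_values, f = frequency_of_frequencies(sample)
```

Barring hash collisions, `cbf.count("NY")` reports 3 for this sample and `f` records one value at each of the frequencies 1, 2, and 3. Collisions can only inflate a count, never reduce it, so the counter array has to be sized generously relative to the number of distinct values expected in the sample.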
Blueprint information
- Status: Not started
- Approver: QF Chen
- Priority: Undefined
- Drafter: Barry Fritchman
- Direction: Needs approval
- Assignee: Barry Fritchman
- Definition: New
- Series goal: None
- Implementation: Unknown
- Milestone target: r2.0
- Started by
- Completed by
Whiteboard
Gerrit topic: https:/
Addressed by: https:/
New ustat algorithm and bulk load integration