Integrate Update Statistics with Bulk Load for Faster Initial Stats

Registered by Barry Fritchman on 2015-03-08

The cost of generating and updating statistics has long been a problem for Trafodion and its predecessor systems. Many performance improvements have been made, but updating statistics remains a bottleneck for many applications.

This blueprint describes an alternative approach to constructing histograms that preserves the existing form of the histograms. Rather than sorting the data and grouping common values, a Counting Bloom Filter (CBF) can be used to record the frequency of each distinct value. Those frequencies are then used to build the histogram intervals and to derive the distribution of frequency values needed for estimating UECs (unique entry counts) based on the sample used.

The immediate work proposed in this blueprint addresses the initial collection of statistics for a table that is first populated by a bulk load. It includes three pieces: adding Update Statistics as an optional task of the bulk load utility (controlled initially by a CQD, but possibly by amending the bulk load syntax in the future); creating the sample table by randomly selecting rows as they pass through the loader and writing them to a Hive table; and using the sample table in conjunction with CBFs and a new "fast stats" algorithm to create histograms for the table's columns when the bulk load concludes.

Subsequent work will address making the sample table persistent and updating it (and perhaps the histograms as well) when the HBase flush mechanism creates a new HFile from a MemStore. That work will be undertaken post-1.1.
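As a rough illustration of the CBF idea described above, the following Python sketch counts value frequencies in a sample with a counting Bloom filter, then derives equal-height histogram intervals and the frequency-of-frequencies distribution. It is not the Trafodion implementation: the names CountingBloomFilter and fast_stats, the filter sizing, the SHA-256 based hashing, and the equal-height interval rule are all assumptions made for this example. For simplicity the sketch also sorts the distinct values (not the full sample) to place interval boundaries, which may differ from what the proposed algorithm does.

```python
import hashlib
from collections import Counter


class CountingBloomFilter:
    """Hypothetical counting Bloom filter: k counters are bumped per value."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        # num_counters is a power of two and the hash stride is odd,
        # so the k slots chosen for one value are always distinct.
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _slots(self, value):
        # Double hashing from one SHA-256 digest (an illustrative choice).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        """Record one occurrence; return True if the value looked unseen."""
        slots = self._slots(value)
        looked_new = min(self.counters[s] for s in slots) == 0
        for s in slots:
            self.counters[s] += 1
        return looked_new

    def count(self, value):
        """Estimated frequency: never too low, may be too high on collisions."""
        return min(self.counters[s] for s in self._slots(value))


def fast_stats(sample, num_intervals):
    """Return (intervals, freq_of_freqs) for one column's sampled values.

    intervals is a list of (upper_boundary, row_count, uec) tuples;
    freq_of_freqs maps a frequency f to the number of distinct values
    that occur f times in the sample.
    """
    cbf = CountingBloomFilter()
    distinct = []
    for value in sample:
        # A Bloom filter false positive can hide a genuinely new value,
        # so the distinct list is approximate for very large samples.
        if cbf.add(value):
            distinct.append(value)

    freqs = {v: cbf.count(v) for v in distinct}
    freq_of_freqs = Counter(freqs.values())

    # Equal-height intervals: close an interval once it holds its share of rows.
    target = len(sample) / num_intervals
    intervals = []
    rows = uec = 0
    last = None
    for v in sorted(distinct):
        rows += freqs[v]
        uec += 1
        last = v
        if rows >= target:
            intervals.append((v, rows, uec))
            rows = uec = 0
    if uec:
        # Leftover values form a final, shorter interval.
        intervals.append((last, rows, uec))
    return intervals, freq_of_freqs


if __name__ == "__main__":
    column_sample = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
    print(fast_stats(column_sample, num_intervals=2))
```

The frequency-of-frequencies result is the kind of input a UEC estimator would need to scale sample counts up to the full table; the estimator itself is outside the scope of this sketch.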

Blueprint information

Not started
QF Chen
Barry Fritchman
Needs approval
Barry Fritchman
Series goal:
Milestone target:
r2.0

Related branches



Gerrit topic: bp/ustat-bulk-load

Addressed by:
    New ustat algorithm and bulk load integration


Work Items
