Implements GROUP BY operations in API

Registered by Julien Danjou

The API needs to provide some sort of GROUP BY operation to support certain types of queries.

Blueprint information

Status:
Complete
Approver:
Julien Danjou
Priority:
High
Drafter:
Julien Danjou
Direction:
Approved
Assignee:
Terri Yu
Definition:
Approved
Series goal:
Accepted for havana
Implementation:
Implemented
Milestone target:
2013.2
Started by
Terri Yu
Completed by
Julien Danjou

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/api-group-by,n,z

Wiki: https://wiki.openstack.org/wiki/Ceilometer/blueprints/api-group-by

The wiki mirrors the contents of this blueprint's whiteboard and also has some additional notes and discussion.

Summary
-------------

Enhance API v2 with new arguments that perform GROUP BY operations when calculating meter statistics.

If the user requests query filtering and/or period grouping, these operations are applied first, then the GROUP BY operations are applied second.

General design comments
------------------------------------

### Metadata fields

Decided not to implement group by for metadata fields for now; this is deferred to a later date.

### Ordering of parameters applied

Meter statistics can be called with three parameters: query filter, period, and groupby. Each parameter corresponds to an operation. It is important to note the order in which these operations are applied. Query filtering is always applied first. Then what about period and group by?

Since period grouping is basically a group by for time range, there is an ambiguity when **both** period and group by for other field(s) are requested. Conceivably, you could have

1. Period grouping is applied first, followed by group by on other field(s)
2. Group by on field(s) is applied first, followed by period grouping

We've chosen to implement the first possibility, where period grouping is performed first.

To summarize, the order of application is as follows (see the example request after this list):

1. Query filters
2. Period grouping
3. Group by on other field(s)
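
For example, a request like the following (parameter values are illustrative)

/v2/meters/instance/statistics?q.field=timestamp&q.op=gt&q.value=2013-08-01T10:11:00&period=7200&groupby=user_id

first filters out samples at or before the given timestamp, then buckets the remaining samples into 7200-second periods, and finally splits each period's samples into one group per user_id.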

Major components
-------------------------

1. Storage driver tests to check group by statistics
2. SQLAlchemy driver group by implementation
3. MongoDB driver group by implementation
4. API tests to check group by statistics
5. API group by statistics implementation

Storage driver tests to check group by statistics
---------------------------------------------------------------

Addressed by: https://review.openstack.org/41597
"Add SQLAlchemy implementation of groupby"

Created a new class StatisticsGroupByTest in tests/storage/base.py that contains the storage tests for group by statistics and has its own test data.

The storage tests check group by statistics for:
 1) single field, "user-id"
 2) single field, "resource-id"
 3) single field, "project-id"
 4) single field, "source"
 5) single field with invalid/unknown field value
 6) single metadata field (not yet implemented)
 7) multiple fields
 8) multiple metadata fields (not yet implemented)
 9) multiple mixed fields, regular and metadata (not yet implemented)
10) single field groupby with query filter
11) single metadata field groupby with query filter (not yet implemented)
12) multiple field group by with multiple query filters
13) multiple metadata field group by with multiple query filters (not yet implemented)
14) single field with period
15) single metadata field with period (not yet implemented)
16) single field with query filter and period
17) single metadata field with query filter and period (not yet implemented)

The test data is constructed such that the measurements are integers (specified by the "volume" attribute of the sample) and the averages in the statistics are also integers. This helps avoid floating point errors when checking the statistics attributes (e.g. min, max, avg) in the tests.

Currently, metadata group by tests are not implemented. Supporting metadata fields is a more complicated case, so we leave it for future work. The test data already contains metadata fields, as a starting point for that work.

The group by period tests and test data are constructed so that there are periods with no samples. For the group by period tests, statistics are calculated for the periods 10:11 - 12:11, 12:11 - 14:11, 14:11 - 16:11, and 16:11 - 18:11. However, there are no samples with timestamps in the period 12:11 - 14:11. It's important to have this case, to check that the storage drivers behave properly when there are no samples in a period.
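
As an illustration of the bucketing (a minimal sketch, not driver code), a sample's period bucket can be derived from the query start time and the period length in seconds:

```python
import datetime

def period_start_for(timestamp, start, period):
    # Whole number of periods elapsed since the query start time.
    offset = int((timestamp - start).total_seconds() // period) * period
    return start + datetime.timedelta(seconds=offset)

start = datetime.datetime(2013, 8, 1, 10, 11)
sample_time = datetime.datetime(2013, 8, 1, 14, 59)
# Prints 2013-08-01 14:11:00, i.e. the sample falls in the 14:11 - 16:11 bucket.
print(period_start_for(sample_time, start, 7200))
```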

SQLAlchemy driver group by implementation
-------------------------------------------------------------

Addressed by: https://review.openstack.org/41597
"Add SQLAlchemy implementation of groupby"

Decided to only implement group by for the "user-id", "resource-id", and "project-id" fields. The "source" and metadata fields are not supported. It turned out that supporting "source" in SQLAlchemy is much more complicated than supporting "user-id", "resource-id", and "project-id".
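
The gist of the approach, as a minimal sketch (a stand-in model, not Ceilometer's actual schema or driver code): the requested fields become group_by() columns on the statistics query.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Meter(Base):
    # Stand-in model; the real meter table has many more columns.
    __tablename__ = 'meter'
    id = Column(Integer, primary_key=True)
    user_id = Column(String)
    project_id = Column(String)
    resource_id = Column(String)
    counter_volume = Column(Integer)

def stats_query(session, groupby):
    # One column per requested groupby field; aggregates are computed per group.
    group_cols = [getattr(Meter, field) for field in groupby]
    return (session.query(func.min(Meter.counter_volume).label('min'),
                          func.max(Meter.counter_volume).label('max'),
                          func.avg(Meter.counter_volume).label('avg'),
                          func.count(Meter.counter_volume).label('count'),
                          *group_cols)
            .group_by(*group_cols))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
for row in stats_query(session, ['user_id', 'project_id']):
    print(row)
```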

MongoDB driver group by implementation
--------------------------------------------------------

Addressed by: https://review.openstack.org/43043
"Adds group by statistics for MongoDB driver"

Adds group by meter statistics in the MongoDB driver for the case where the groupby fields are a combination of 'user-id', 'resource-id', 'project-id', and 'source' (metadata fields are not implemented, as noted in the "General design comments" section above).

### Design for aggregation method

Summary: Decided to continue using the mapReduce() MongoDB aggregation method, even though there are other options in the API.

There are three MongoDB aggregation commands: aggregate(), mapReduce(), and group().

The MongoDB manual has a comparison of these three types:

http://docs.mongodb.org/manual/reference/aggregation-commands-comparison/

Apparently, MongoDB now recommends aggregate() (aka "the aggregation pipeline", a new feature since MongoDB version 2.2) when possible:

"For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline."

Reference: http://docs.mongodb.org/manual/core/map-reduce/

Ceilometer currently uses the mapReduce() method to calculate meter statistics, but we could conceivably switch to using aggregate() or group().

Decided to stick with mapReduce() because it's the most flexible. mapReduce() can support non-standard aggregation operators (operations that are not min, max, avg, etc.), whereas aggregate() cannot.

For example, the blueprint "Improvements for API v2" suggests this improvement:

"Provide additional statistical function (Deviation, Median, Variation, Distribution, Slope, etc...) which could be given as multiple results for a given data set collection"

Reference: https://blueprints.launchpad.net/ceilometer/+spec/api-v2-improvement

Functions like deviation and median are not standard aggregation operators in the MongoDB aggregation pipeline aggregate().

The group() aggregation command is less flexible than mapReduce(), slower than aggregate(), and doesn't support sharded collections (i.e. a database distributed across multiple servers).

Also, for the aggregate() and group() methods, the result set must fit within the maximum BSON document size limit (16 MB), whereas:

"Additionally, map-reduce operations can have output sets that are that exceed the 16 megabyte output limitation of the aggregation pipeline."

Reference: http://docs.mongodb.org/manual/core/aggregation-introduction/#map-reduce

### Design for map functions

It's straightforward to implement group by statistics in MongoDB. The statistics are calculated using the mapReduce() method. As mapReduce() is used in Ceilometer, it needs a map() function, a reduce() function, and a finalize() function.

Reference on these different functions: http://docs.mongodb.org/manual/reference/command/mapReduce/
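
Roughly, the pieces fit together like this (a sketch using pymongo's map_reduce() helper, which exists in the pymongo versions of that era; the JavaScript bodies are trivial stand-ins for the real MAP_*/REDUCE/FINALIZE functions):

```python
import pymongo
from bson.code import Code

MAP = Code("function () { emit('statistics', {count: 1}); }")
REDUCE = Code("""function (key, values) {
    var res = {count: 0};
    values.forEach(function (v) { res.count += v.count; });
    return res;
}""")
FINALIZE = Code("function (key, value) { return value; }")

db = pymongo.MongoClient().ceilometer
results = db.meter.map_reduce(
    MAP, REDUCE,
    {'inline': 1},               # return results inline, not into a collection
    finalize=FINALIZE,
    query={'counter_name': 'instance'},
)
for r in results['results']:
    print(r['_id'], r['value'])
```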

To compute meter statistics in MongoDB, there are four cases that need to be accounted for:

1. no period, no group by
2. period only
3. group by only
4. period and group by

All the cases can be implemented by using slightly different map functions with the same reduce and finalize functions. The map() function works by processing each document and emitting a key-value pair. Each case requires a different key:

1. no period, no group by --> key can be anything as long as it's a constant, e.g. 'statistics'
2. period only --> key is the variable "period_start"
3. group by only --> key is the variable "groupby"
4. period and group by --> key is the combination of variables "period_start" and "groupby"

Then we just need to pass the right values for the "groupby", "period_start", and "period_end" objects in the emitted values.

Tried to minimize duplicate code by using string substitutions as much as possible in the map functions MAP_STATS, MAP_STATS_PERIOD, MAP_STATS_GROUPBY, and MAP_STATS_PERIOD_GROUPBY.
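
A sketch of the substitution idea (abridged; the real templates emit more fields, such as the duration bounds, and build the groupby object from the requested fields):

```python
# Shared emit value, reused by all four map functions via substitution.
EMIT_VALUE = """{
    min: this.counter_volume, max: this.counter_volume,
    sum: this.counter_volume, count: NumberInt(1)
}"""

MAP_STATS = "function () { emit('statistics', %s); }" % EMIT_VALUE

# For group by, the key is the stringified groupby object; user_id is shown
# here purely as an illustration.
MAP_STATS_GROUPBY = """function () {
    var groupby = {user_id: this.user_id};
    emit(JSON.stringify(groupby), %s);
}""" % EMIT_VALUE
```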

API tests to check group by statistics
-------------------------------------------------

Addressed by: https://review.openstack.org/44130
"Add group by statistics tests in API v2 tests"

### Add API tests for group by statistics

The API group by statistics tests are in a new class StatisticsGroupByTest in tests/api/v2/test_statistics_scenarios.py.

The tests implemented cover group by:

 1) single field, "user-id"
 2) single field, "resource-id"
 3) single field, "project-id"
 4) single field, "source" (*)
 5) single field with invalid/unknown field value
 6) multiple fields
 7) single field groupby with query filter
 8) multiple field group by with multiple query filters
 9) single field with start timestamp after all samples
10) single field with end timestamp before all samples
11) single field with start timestamp
12) single field with end timestamp
13) single field with start and end timestamps
14) single field with start and end timestamps and query filter
15) single field with start and end timestamps and period
16) single field with start and end timestamps, query filter, and period

(*) Group by source isn't supported in SQLAlchemy at this time, so we have to put this test in its own class TestGroupBySource.

The tests use the same data and test cases as the groupby storage tests in class StatisticsGroupByTest in tests/storage/test_storage_scenarios.py.

Group by metadata fields is not implemented at this time, so there aren't any tests for metadata fields.

### Add tests for new function _validate_groupby_fields() in test_query.py

A new function _validate_groupby_fields() was added in ceilometer/api/controllers/v2.py, so there need to be tests for it. The logical place to put the tests is tests/api/v2/test_query.py.

The tests check for valid fields, invalid fields, and duplicate fields.
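
The shape of those checks, roughly (illustrative, not the actual test code):

```python
import pytest

from ceilometer.api.controllers.v2 import _validate_groupby_fields

def test_duplicate_fields_removed():
    assert _validate_groupby_fields(['user_id', 'user_id']) == ['user_id']

def test_invalid_field_rejected():
    # The real code raises a wsme client-side error; Exception keeps this generic.
    with pytest.raises(Exception):
        _validate_groupby_fields(['not_a_valid_field'])
```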

### Add groupby parameter in stubs in test_compute_duration_by_resource_scenarios.py

In tests/api/v2/test_compute_duration_by_resource_scenarios.py, the function _stub_interval_func() stubs out get_meter_statistics(). Since the get_meter_statistics() function now accepts a groupby parameter, the stubs should also have a groupby parameter.

An additional groupby parameter was added to the function get_interval().

### Revise get_json() to accept groupby parameter

The method get_json() in ceilometer/tests/api.py simulates an HTTP GET request for testing purposes. It has been modified to accept a groupby parameter.

API group by statistics implementation
---------------------------------------------------

Addressed by: https://review.openstack.org/44130
"Add group by statistics tests in API v2 tests"

The additions below were made to ceilometer/api/controllers/v2.py.

### Add groupby attribute to class Statistics

The API has a class Statistics that holds all the computed statistics from a meter/meter_name/statistics request. The class has been updated to include a "groupby" attribute for the group, so that we know which group the statistics are associated with. For example, if we request group by user_id, "groupby" might be {'user_id': 'user-1'}, indicating that these are the statistics for all samples with user_id 'user-1'.
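
A minimal sketch of the new attribute, assuming wsme's Base type directly (the real class derives from the module's own _Base helper and has many more attributes):

```python
from wsme import types as wtypes

class Statistics(wtypes.Base):
    groupby = {wtypes.text: wtypes.text}
    "Dictionary of field names for group, if groupby statistics are requested"

    # ... the existing min/max/avg/sum/count/period attributes are unchanged ...
```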

### Add groupby parameter to API method statistics()

The API has a method statistics() which is called when the user submits an HTTP GET request of the form "meter/meter_name/statistics".

This method has been updated so it can accept groupby parameters like

/v2/meters/instance/statistics?groupby=user_id&groupby=source

The groupby fields are assumed to be unicode strings, such that the groupby parameter passed to statistics() is a list of unicode strings. For the above example, the groupby parameter would be ['user_id', 'source'].

The API method statistics() then validates the groupby fields using the new method _validate_groupby_fields() and, if the fields are valid, calls the get_meter_statistics() method of the current storage driver with those groupby fields.

The method _validate_groupby_fields() validates the groupby parameter and removes duplicate fields. This method is useful because it throws an error if an invalid field is given, i.e. a field that is not in the set ['user_id', 'resource_id', 'project_id', 'source']. Note that the duplicate fields are removed using list(set(groupby_fields)), which does not preserve the order of the groupby fields. So if a request

/v2/meters/instance/statistics?groupby=user_id&groupby=source

is made, the order could be switched from ['user_id', 'source'] to ['source', 'user_id'].
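
A minimal sketch of that validation (the exception type here is illustrative; the real code raises a wsme client-side error):

```python
def _validate_groupby_fields(groupby_fields):
    """Check that the requested groupby fields are valid and, if they all
    are, return them with duplicates removed."""
    valid_fields = set(['user_id', 'resource_id', 'project_id', 'source'])

    invalid_fields = set(groupby_fields) - valid_fields
    if invalid_fields:
        raise ValueError('Invalid groupby fields: %s' % invalid_fields)

    # Deduplicate; note that set() does not preserve the request order.
    return list(set(groupby_fields))
```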

Work Items

Work items:
Storage driver tests: DONE
SQLAlchemy implementation: DONE
MongoDB implementation: DONE
API tests: DONE
API implementation: DONE
