Monasca Cassandra Time Series DB
Blueprint information
- Status: Started
- Approver: Roland Hochmuth
- Priority: High
- Drafter: Deklan Dieterly
- Direction: Approved
- Assignee: Shinya Kawabata
- Definition: Approved
- Series goal: Accepted for trunk
- Implementation: Needs Code Review
- Milestone target: None
- Started by: Roland Hochmuth
- Completed by:
Whiteboard
This blueprint has been implemented - see 20337572e309439
-----------
Initial proposed data model for the Monasca Cassandra Time Series DB.
Premise:
Each unique set of dimensions plus metric name is hashed using the SHA-1 hash function to derive a unique identifier. This identifies the unique “resource” that is being measured.
There will be three tables.
I. A lookup table comprised of the following:
Tenant ID
Region
Dimensions/metric name SHA-1 hash
Map of dimension names/values and the metric name
The partition key is (tenant ID, region); the dimensions/metric name SHA-1 hash is a clustering column (see the schema below).
Inserts into this table are upserts: depending on whether the value already exists, Cassandra automatically performs an insert or an update. So there will be one row and one column for each unique set of dimensions/metric name per tenant per region.
There will be an index on the entries of the map. The index allows queries of the form where metric_map['host'] = 'host1'.
The key 'name' must be reserved for the metric name, so no dimension will be able to use the key 'name'. We could make this a very unusual key, such as 'xxx_name_xxx', to allow a dimension to have the key 'name'.
The hash input is formed by URL-encoding the name/value pairs and concatenating them with equals signs ("=").
This allows fast look ups of unique resources given a partial map of dimension name/values and a metric name. The results from this query will be used to query the measurements table.
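The hashing scheme above can be sketched in Python. This is only a sketch: the sorted ordering of pairs and using 'name' as the reserved key for the metric name are assumptions about the canonical form, which the blueprint does not fully pin down.

```python
import hashlib
import urllib.parse

def metric_hash(metric_name, dimensions):
    """SHA-1 identifier for a metric name plus its dimension set.

    Name/value pairs are URL-encoded and joined with '=' as described
    above. Sorting the pairs (an assumption) makes the hash independent
    of dimension order; 'name' is the reserved key for the metric name.
    """
    pairs = dict(dimensions)
    pairs['name'] = metric_name
    encoded = '='.join(
        urllib.parse.quote(part)
        for key in sorted(pairs)
        for part in (key, pairs[key]))
    return hashlib.sha1(encoded.encode('utf-8')).digest()

# Two distinct dimension sets yield two distinct 20-byte identifiers.
h1 = metric_hash('cpu', {'host': 'host1'})
h2 = metric_hash('cpu', {'host': 'host2'})
```

The 20-byte digest maps directly onto the metric_hash blob column in the schema below.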
II. A measurements table comprised of the following:
Tenant ID
Region
Dimensions/metric name SHA-1 hash
Time_stamp
Value
value_meta
Partition key is tenant ID, region, dimensions/metric name SHA-1 hash.
Primary key is tenant ID, region, dimensions/metric name hash, time_stamp.
Another approach would be to leave the metric name out of the hash. Then the metric name would be a column in the measurement table. However, this could lead to too much data in one row.
III. An alarm state history table comprised of the following:
Tenant ID
Region
Alarm ID
Metric
New state
Old state
Reason
Reason data
Sub alarms
Time_stamp
Example Schema
drop table if exists monasca.metric_map;
drop table if exists monasca.measurements;
drop table if exists monasca.alarm_state_history;
drop schema if exists monasca;
create schema monasca
with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use monasca;
create table monasca.metric_map (
tenant_id text,
region text,
metric_hash blob,
metric_map map<text, text>,
primary key ((tenant_id, region), metric_hash)
);
create index on monasca.metric_map (entries(metric_map));
create table monasca.measurements (
tenant_id text,
region text,
metric_hash blob,
time_stamp timestamp,
value float,
value_meta text,
primary key ((tenant_id, region, metric_hash), time_stamp)
);
create table monasca.alarm_state_history (
tenant_id text,
region text,
alarm_id text,
metrics text,
new_state text,
old_state text,
reason text,
reason_data text,
sub_alarms text,
time_stamp timestamp,
primary key ((tenant_id, region), alarm_id, time_stamp)
);
Example query to get a list of metric_hash.
select * from monasca.metric_map where tenant_id = 'hpe' and region = 'uswest' and metric_map['host'] = 'host1';
Example with data.
insert into monasca.metric_map (tenant_id, region, metric_hash, metric_map) values ('hpe', 'uswest', 0xaa, {'host': 'host1', 'name': 'cpu'});
insert into monasca.metric_map (tenant_id, region, metric_hash, metric_map) values ('hpe', 'uswest', 0xbb, {'host': 'host2', 'name': 'cpu'});
select * from monasca.metric_map;
tenant_id | region | metric_hash | metric_map
-------
hpe | uswest | 0xaa | {'host': 'host1', 'name': 'cpu'}
hpe | uswest | 0xbb | {'host': 'host2', 'name': 'cpu'}
(2 rows)
select * from monasca.metric_map where tenant_id = 'hpe' and region = 'uswest' and metric_map['host'] = 'host1';
tenant_id | region | metric_hash | metric_map
-------
hpe | uswest | 0xaa | {'host': 'host1', 'name': 'cpu'}
(1 rows)
select * from monasca.metric_map where tenant_id = 'hpe' and region = 'uswest' and metric_map['host'] = 'host1' and metric_map['name'] = 'cpu' allow filtering;
tenant_id | region | metric_hash | metric_map
-------
hpe | uswest | 0xaa | {'host': 'host1', 'name': 'cpu'}
(1 rows)
"allow filtering" is needed on the query if there are more than one contains clauses.
See http://
cqlsh:monasca> select * from monasca.
tenant_id | metric_hash | time_stamp | value | value_meta
-------
hpe | 0xaa | 1970-01-17 18:42:41+0000 | 0.99 | {}
hpe | 0xaa | 1970-01-17 18:42:41+0000 | 0.99 | {}
hpe | 0xaa | 1970-01-17 18:42:41+0000 | 0.79 | {}
hpe | 0xaa | 1970-01-17 18:42:41+0000 | 0.69 | {}
(4 rows)
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730, 1.01, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730, 1.02, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730, 1.03, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730, 1.04, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730, 1.11, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730, 1.12, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730, 1.13, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730, 1.14, '{}');
cqlsh> select * from monasca.measurements;
tenant_id | region | metric_hash | time_stamp | value | value_meta
-------
hpe | uswest | 0xaa | 1970-01-17 18:42:41+0000 | 1.04 | {}
hpe | uswest | 0xbb | 1970-01-17 18:42:41+0000 | 1.14 | {}
(2 rows)
Only 2 rows (columns) are in the table because the same timestamp was used for every insert to a given metric_hash. If a row with the same primary key already exists, Cassandra simply overwrites (upserts) its values.
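The 1970 dates in the output above are what you get when an epoch value in seconds is stored into a CQL timestamp column, which encodes milliseconds since the epoch. A quick check of that interpretation:

```python
from datetime import datetime, timezone

# 2015-12-10 15:35:30 UTC expressed as an epoch value in *seconds*
epoch_seconds = 1449761730

# A CQL timestamp counts *milliseconds* since the epoch, so the same
# number read as milliseconds lands in January 1970.
as_ms = datetime.fromtimestamp(epoch_seconds / 1000.0, tz=timezone.utc)
print(as_ms)  # 1970-01-17 18:42:41.730000+00:00
```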
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730001, 1.01, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730002, 1.02, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730003, 1.03, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xaa, 1449761730004, 1.04, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730001, 1.11, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730002, 1.12, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730003, 1.13, '{}');
cqlsh> insert into monasca.measurements (tenant_id, region, metric_hash, time_stamp, value, value_meta) values ('hpe', 'uswest', 0xbb, 1449761730004, 1.14, '{}');
cqlsh> select * from monasca.measurements;
tenant_id | region | metric_hash | time_stamp | value | value_meta
-------
hpe | uswest | 0xaa | 1970-01-17 18:42:41+0000 | 1.04 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.01 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.02 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.03 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.04 | {}
hpe | uswest | 0xbb | 1970-01-17 18:42:41+0000 | 1.14 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.11 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.12 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.13 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.14 | {}
(10 rows)
Now, when using timestamps that differ in their milliseconds (which cqlsh does not display), we see that all the expected rows (columns) are there.
cqlsh> select * from monasca.measurements where tenant_id = 'hpe' and region = 'uswest' and metric_hash = 0xaa;
tenant_id | region | metric_hash | time_stamp | value | value_meta
-------
hpe | uswest | 0xaa | 1970-01-17 18:42:41+0000 | 1.04 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.01 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.02 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.03 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.04 | {}
(5 rows)
cqlsh> select * from monasca.measurements;
tenant_id | region | metric_hash | time_stamp | value | value_meta
-------
hpe | uswest | 0xaa | 1970-01-17 18:42:41+0000 | 1.04 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.01 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.02 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.03 | {}
hpe | uswest | 0xaa | 2015-12-10 15:35:30+0000 | 1.04 | {}
hpe | uswest | 0xbb | 1970-01-17 18:42:41+0000 | 1.14 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.11 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.12 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.13 | {}
hpe | uswest | 0xbb | 2015-12-10 15:35:30+0000 | 1.14 | {}
(10 rows)
If we turn on tracing ('tracing on') we can see that the index is used for lookups using the dimensions/metric name set.
cqlsh> select * from monasca.metric_map where tenant_id = 'hpe' and region = 'uswest' and metric_map['name'] = 'cpu' and metric_map['host'] = 'host1' allow filtering;
tenant_id | region | metric_hash | metric_map
-------
hpe | uswest | 0xaa | {'host': 'host1', 'name': 'cpu'}
(1 rows)
Tracing session: f12650e0-
activity | timestamp | source | source_elapsed
-------
Parsing select * from monasca.metric_map where tenant_id = 'hpe' and region = 'uswest' and metric_map['name'] = 'cpu' and metric_map['host'] = 'host1' allow filtering; [SharedPool-
cqlsh>
All aggregation functions will need to be implemented and performed by the API.
It appears that group by functionality was recently added to Cassandra. https:/
We will need to do inserts efficiently using individual (non-batched) writes. https:/
Questions:
1. It wasn't clear what the schema was in I.
Look at the monasca.metric_map table shown above.
2. What is the possibility of a collision with SHA-1 hashes based on the number of metrics?
Please see this webpage for the calculation of the probability. http://
For a more concrete number: we would need to have seen about 2^80 (roughly 10^24) unique sets of dimensions/metric name before we had a 50% chance of a collision.
To put that into perspective, if we saw 1,000,000 new unique sets per second per tenant for 10 years, we would only have seen less than 10^16 unique sets. That is well below 10^24.
So, in practical terms, the chances that there will be a collision are so low as to not need to be considered. You are more likely to be struck by lightning on your birthday.
QED
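The birthday-bound arithmetic above can be checked directly, using the usual approximation that for n random values in a space of 2^160, p(collision) ≈ n²/2^161:

```python
HASH_BITS = 160  # SHA-1 digest size in bits

def collision_probability(n):
    """Birthday-bound approximation: p(collision) ~= n^2 / 2^(bits + 1)."""
    return n * n / 2.0 ** (HASH_BITS + 1)

# 1,000,000 new unique sets per second for 10 years:
n = 1_000_000 * 3600 * 24 * 365 * 10
print(f"unique sets seen: {n:.2e}")   # ~3.15e14, well below 10^16
print(f"p(collision): {collision_probability(n):.2e}")
```

At n ≈ 2^80 the same formula gives p ≈ 0.5, matching the 50% figure quoted above.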
3. Is the date field used for partitioning?
We need to do a back-of-the-envelope calculation to determine whether the date field must be part of the partition key.
Doing the calculation shows that we do not need to partition using date.
If we assume 1 measurement per second for each metric, then it would take 63 years before we reached the 2 billion maximum allowed columns in Cassandra.
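That back-of-the-envelope calculation works out as claimed, using Cassandra's roughly 2 billion cells-per-partition limit:

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000   # Cassandra's per-partition cell limit
SECONDS_PER_YEAR = 3600 * 24 * 365

# One measurement per second per metric:
years_to_fill_partition = MAX_CELLS_PER_PARTITION / SECONDS_PER_YEAR
print(f"{years_to_fill_partition:.1f} years")  # ~63.4 years
```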
4. Why isn't the metric name just a column in the metric lookup table?
Good question. There are 2 ways to include the metric name as a separate field (column) in the metric_map table (column family). In either case, metric name would need to be part of the primary key for fast look ups.
1. Make metric name part of the partition key. This would create too many partitions, and searching across partitions is expensive. Queries that look for a particular set of dimensions would have to search across all partitions.
2. Make metric name part of the primary key, but not part of the partition key. This would make metric name a clustering key. Again, clustering on metric name does not help searches unless the metric name is given. In many cases a search supplies only a set of dimensions, not a metric name, so the extra overhead of maintaining the clustering does not buy anything.
Including the metric name in the metric_map, and thus in the metric hash, avoids both problems.
5. I don't see anything that would handle Brad Klein's change: https:/
We will implement that as we do currently. I.e., we will get a list of metrics and then query for date ranges. The alternative is to try to keep track of first seen dates and last seen dates in another table. But this reduces to the way it is currently implemented.
Gerrit topic: https:/
Addressed by: https:/
Add Cassandra support to devstack
Addressed by: https:/
Enable Cassandra Database
Addressed by: https:/
Revival of Cassandra repository
Addressed by: https:/
Add java implementation of Cassandra support