Using SQL backend is non-functionally slow

Bug #1180438 reported by Aimon Bustardo
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Ceilometer
Fix Released
Undecided
You Yamagata

Bug Description

On a new system 'ceilometer meter-list' replies within a few seconds. Once a few days of data have been collected 'ceilometer meter-list' slows to 10+ minutes to reply. This is true when using any SQL backend. During the query it appears there is an excessive amount of SQL happening.

This was tested on Ubuntu Precise using following packages:
ii ceilometer-api 2013.1-0ubuntu1~cloud ceilometer api service
ii ceilometer-collector 2013.1-0ubuntu1~cloud ceilometer collector service
ii ceilometer-common 2013.1-0ubuntu1~cloud ceilometer common files
ii python-ceilometer 2013.1-0ubuntu1~cloud ceilometer python libraries
ii python-ceilometerclient 1.0.0-0ubuntu1~cloud0 Client library for Openstack ceilometer server.

Aimon Bustardo (aimonb)
description: updated
description: updated
Revision history for this message
Aimon Bustardo (aimonb) wrote :

From what I am gathering.. at request time there is a large join.. which doesn't take that much time.. Then there are thousands of small select statements recursively looping through the joined data. This is highly ineffective and there is probability that the issue is exponential.

Aimon Bustardo (aimonb)
summary: - 'ceilometer meter-list' too slow to respond
+ 'ceilometer *-list' too slow to respond
Revision history for this message
Feilong Wang (flwang) wrote : Re: 'ceilometer *-list' too slow to respond

Aimon, by default, if you just call "ceilometer meter-list", it means there is no any filter applied. In other words, the query will try to get all the data from DB. I think that's why it's slow.
Besides, generally, Ceilometer's data is consumed by REST API. So it means consumer will query data from Ceilometer with some filter parameters, just like this:
http://127.0.0.1:8777/v2/meters/cpu_util?q[0].field=timestamp&q[0].op=ge&q[0].value=2013-05-10T03:17:42&q[1].field=resource_id&q[1].op=eq&q[1].value=5ff7850e-c559-4270-beb6-baccf7219181

Please try above query (replace the value with your env) and I'm glad to know the result.

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

This is doesn't seem to be a valid Ubuntu bug. Marking it as invalid.

Changed in ceilometer (Ubuntu):
status: New → Invalid
Revision history for this message
Aimon Bustardo (aimonb) wrote :

I have determined this is caused by SQL backends. a SQL backend on ceilometer apparently is a very bad idea. This should be made into a documentation bug.

no longer affects: ceilometer (Ubuntu)
summary: - 'ceilometer *-list' too slow to respond
+ Using SQL backend is non-functional
description: updated
Revision history for this message
Aimon Bustardo (aimonb) wrote : Re: Using SQL backend is non-functional

I have updated the ticket title and description to reflect true issue which is that using SQL behing ceilometer is a horrible idea. The documentation does not properly reflect this and should,

summary: - Using SQL backend is non-functional
+ Using SQL backend is non-functionally slow
Revision history for this message
Feilong Wang (flwang) wrote :

Aimon, can you post the root cause you found?

Revision history for this message
Aimon Bustardo (aimonb) wrote :

flwang , Simply put.. SQL is a bad idea for any data set this numerous. MongoDB was designed specifically to tackle items like ceilometer meter storage and it does it very well. We won't be attempting to use MySQL again nor do we think SQL will ever work at scale so there really is no need to search for a root cause.

Revision history for this message
Aimon Bustardo (aimonb) wrote :

BTW, with MySQL backend our requests took up to 30 minutes to complete. With MongoDB they took less then 3 seconds.

Revision history for this message
Feilong Wang (flwang) wrote :

hmm... I'm very interested in this issue. Since I'm doing some investigation for this blueprint: https://blueprints.launchpad.net/ceilometer/+spec/sqlalchemy-metadata-query, so it would be nice if we can get some lessen learn from this bug.

Revision history for this message
Aimon Bustardo (aimonb) wrote :

Have there been any successful use cases with SQL backends to ceilometer? Our test environment was fairly small at <100Vms .

Revision history for this message
gordon chung (chungg) wrote :

tried using sqlite to query ~12500 records... took ~6mins. definitely needs an investigation.

Changed in ceilometer:
status: New → Confirmed
Revision history for this message
xingzhou (xingzhou) wrote :

Seems one difference between mongo and sql backend is the resource collection in mongo db has meter list stored, while in sql backend, the resource table have to join meter to get meter info, and the meter list have to filter out the duplicate instances under the same meter, this might cause some performance issue

Revision history for this message
Aimon Bustardo (aimonb) wrote :

A Typical installation gets in the millions of entries per day. You can see that if it takes 6 minutes to query 12k records we have a serious problem. There is a good doc laying out the data amounts here: https://docs.google.com/a/morphlabs.com/spreadsheet/ccc?key=0AtziNGvs-uPudDhRbEJJOHFXV3d0ZGc1WE9NLTVPX0E#gid=0

@xingzhou This sound correct from what I saw in logs. If you enable query logging in your DB and issue a 'ceilometer meter list' you will see queries happening for the full response time duration (in our case 30+ minutes of joins/subqueries).

Revision history for this message
Aimon Bustardo (aimonb) wrote :

The issue here appears to be that we are using a relational DB in a relational way... which is slow.. If we use SQL at all then it must be a single table which mirrors a mongo 'collection'. I don't see any complex query or join working ever for this.

Revision history for this message
Aimon Bustardo (aimonb) wrote :

Here are the questions I think we need to ask ourselves before any work in done on this:

1. Can SQL ever be efficient enough? (If not then sql is moot and issue should turn into a doc bug)
2. Given that MySQL/PgSQL will NEVER perform as well as the in memory mongo.. What is the driver/motivation for working on this?

Revision history for this message
Julien Danjou (jdanjou) wrote :

Don't rush any analysis. Are you sure the indexes are correct for example?

Changed in ceilometer:
assignee: nobody → You Yamagata (y-yamagata)
Revision history for this message
You Yamagata (y-yamagata) wrote :

I think the cause of the problem is that get_meters() exec full scan query to join of resource and meter tables and then remove duplicated record in python codes.

https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L274

 Using 'group by' clause to required columns in meter table and check resource_id existense using subquery is fast in my test.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/31596

Changed in ceilometer:
status: Confirmed → In Progress
Revision history for this message
Aimon Bustardo (aimonb) wrote :

I hope I spoke too soon when I said it won't scale. Has anyone tried it with a few million records? WHat is the performance difference between the two?

Revision history for this message
You Yamagata (y-yamagata) wrote :

I did test my patch with 420k records in meter table with sqlite.

sqlite> select count(*) from meter;
421083

sqlite> select count(*) from resource;
15

$ time ceilometer meter-list
...
real 0m11.691s
user 0m0.200s
sys 0m0.060s

$ time sqlite3 /opt/stack/var/lib/ceilometer/ceilometer.sqlite 'SELECT resource.id AS resource_id, resource.resource_metadata AS resource_resource_metadata, resource.user_id AS resource_user_id, resource.project_id AS resource_project_id, anon_1.id AS anon_1_id, anon_1.counter_name AS anon_1_counter_name, anon_1.user_id AS anon_1_user_id, anon_1.project_id AS anon_1_project_id, anon_1.resource_id AS anon_1_resource_id, anon_1.resource_metadata AS anon_1_resource_metadata, anon_1.counter_type AS anon_1_counter_type, anon_1.counter_unit AS anon_1_counter_unit, anon_1.counter_volume AS anon_1_counter_volume, anon_1.timestamp AS anon_1_timestamp, anon_1.message_signature AS anon_1_message_signature, anon_1.message_id AS anon_1_message_id FROM resource JOIN (SELECT meter.id AS id, meter.counter_name AS counter_name, meter.user_id AS user_id, meter.project_id AS project_id, meter.resource_id AS resource_id, meter.resource_metadata AS resource_metadata, meter.counter_type AS counter_type, meter.counter_unit AS counter_unit, meter.counter_volume AS counter_volume, meter.timestamp AS timestamp, meter.message_signature AS message_signature, meter.message_id AS message_id FROM meter WHERE meter.id IN (SELECT max(meter.id) AS max_1 FROM meter GROUP BY meter.resource_id)) AS anon_1 ON resource.id = anon_1.resource_id'
...
real 0m2.578s
user 0m1.112s
sys 0m1.456s

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/31596
Committed: http://github.com/openstack/ceilometer/commit/327bd26e2f07ebc16ee1b025f2c029ea7d95baae
Submitter: Jenkins
Branch: master

commit 327bd26e2f07ebc16ee1b025f2c029ea7d95baae
Author: You Yamagata <email address hidden>
Date: Tue Jun 4 10:26:11 2013 +0900

    sqlalchemy: fix performance issue on get_meters()

    Change SQL query to fix performance issue on Connection.get_meters().
    It affects response time of 'ceilometer meter-list'.

    Fixes: Bug #1180438

    Change-Id: I8ff6e03b9487705fea277066fbc51f554c54bba5

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: havana-2 → 2013.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.