Improve the ceilometer-compute-agent to decrease the load of Nova API

Registered by Liusheng

This blueprint has been superseded. See the newer blueprint "resource metadata caching" for updated plans.

ceilometer-compute-agent runs on every compute host. It polls the instances on that host and then polls each instance's resources (vCPU, vNIC, disk, etc.). This means that the compute agent on every host calls the Nova API to get instances every polling period. In a large-scale environment (hundreds of hosts), the Nova API receives hundreds of instance-listing calls every ceilometer polling period, which puts a heavy load on the Nova API.

The basic idea to improve the ceilometer-compute-agent:
1. Don't poll instances every polling period. Instead, the compute agent uses the instance notification info already received by the ceilometer notification-agent to get the instances' basic info from Nova, and then uses the instance name to inspect the instance in libvirt.
2. As an alternative, use the central-agent to list all instances by calling the Nova API; each compute agent then uses the instance info (filtered by instance['host']) and inspects the instance's libvirt info.
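
The second option can be sketched roughly as follows. This is only an illustration under stated assumptions, not ceilometer code: `group_instances_by_host` and `nova_calls_per_period` are hypothetical helper names, and instances are modelled as plain dicts carrying a 'host' key, as in the instance['host'] filter mentioned above.

```python
from collections import defaultdict


def group_instances_by_host(instances):
    """Split one full instance listing into per-host lists.

    `instances` is assumed to be an iterable of dicts with a 'host' key
    (hypothetical shape, chosen only for this sketch).
    """
    by_host = defaultdict(list)
    for inst in instances:
        by_host[inst["host"]].append(inst)
    return dict(by_host)


def nova_calls_per_period(num_hosts, per_host_polling):
    """Count instance-listing calls made to Nova in one polling period.

    With per-host polling every compute agent issues its own call;
    with a single central listing there is one call in total.
    """
    return num_hosts if per_host_polling else 1


# One central listing, then each compute agent picks up its own share.
listing = [
    {"id": "vm-1", "host": "compute-1"},
    {"id": "vm-2", "host": "compute-2"},
    {"id": "vm-3", "host": "compute-1"},
]
per_host = group_instances_by_host(listing)
my_instances = per_host.get("compute-1", [])
```

How the per-host lists would actually reach each compute agent is left open here; the blueprint does not specify the transport.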

Blueprint information

Status:
Complete
Approver:
None
Priority:
Undefined
Drafter:
Liusheng
Direction:
Needs approval
Assignee:
Liusheng
Definition:
Superseded
Series goal:
None
Implementation:
Unknown
Milestone target:
None
Completed by
gordon chung

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/improve-compute-agent,n,z

Addressed by: https://review.openstack.org/101814
    Specs for improve-compute-agent

we can open alternative to maybe poll ceilometer based on events -- gordc (10.07.15)

I have read the resource-metadata-caching spec, but I wonder whether metadata caching using 'changes-since' in the Nova API can significantly improve things. IMO the main performance pain is compute-agent polling, because every ceilometer polling agent will invoke 'get_instances_by_host' every interval. That will cause a large number of Nova API calls every polling period, because we may have many compute-agents in our deployment (the central-agent situation is better). The metadata caching only reduces the amount of data queried, not the number of API calls. I am not sure whether I have missed something, because the spec is described only briefly. -- liusheng (2015.7.13)

what is the nova issue: the api load or the backend query load? the changes-since call doesn't minimise the number of api requests but it does limit the amount of calculations each request makes... regarding api load, if there are too many requests on the api made by ceilometer, most deployers have a read-only nova api just for ceilometer to poll and allow normal access via another api.

the problem with the proposed solution is that there are synchronisation/dependency issues. in the first solution, if an event is missed, you won't know the instance exists. also, both solutions require heavy polling of ceilometer storage, which some users/deployers don't actually use. these are the main issues in this design that you must address if you want to continue with this solution -- gordc (14.07.15)
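
The changes-since trade-off discussed above can be illustrated with a small sketch: the number of requests per polling period stays the same, but each request after the first returns only what changed. `InstanceCache` and the `fetch_changed` callable are hypothetical names for this illustration; `fetch_changed` stands in for a Nova server listing with a changes-since filter, and the sketch assumes deleted instances come back with a 'DELETED' status.

```python
class InstanceCache:
    """Local instance cache refreshed with incremental listings (sketch)."""

    def __init__(self, fetch_changed):
        # fetch_changed(since) is a hypothetical callable returning the
        # instances changed after `since` (None means "everything").
        self._fetch_changed = fetch_changed
        self._instances = {}
        self._last_poll = None
        self.api_calls = 0

    def refresh(self, now):
        """Apply one incremental listing; return how many records it held."""
        changed = self._fetch_changed(self._last_poll)
        self.api_calls += 1  # still one request per polling period
        for inst in changed:
            if inst.get("status") == "DELETED":
                # Assumption: deletions are reported in the delta.
                self._instances.pop(inst["id"], None)
            else:
                self._instances[inst["id"]] = inst
        self._last_poll = now
        return len(changed)

    def instances(self):
        """Return the currently cached instances."""
        return list(self._instances.values())
```

The cache shrinks the payload and the work done per request, which matches the point made above; it does nothing to reduce the request count.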

Hi gordon, sorry for the late reply. Yes, changes-since only minimises the size of the API query result. I have discussed this with other people: in common deployments the number of instances on a host will not be large, usually no more than hundreds. So I personally think the main difference in Nova API load is not between querying 100 servers and 10 servers, but between 1 API request and 100 API requests. We may have hundreds of hosts with ceilometer-compute-agent deployed on each, and each compute-agent will call get_instance_by_host; that request is sent from each compute-agent to the controller node, and the result is returned to each compute-agent.

Yes, the first solution is unreasonable, as you explained. I have a proposal based on the second solution, combined with the changes-since approach. We can query all instance info from the collector (the central-agent may be unsuitable) with the *changes-since* filter parameter, which will reduce the size of the result. The compute-agent will only collect instance info from the virt layer and then push it to the collector. The collector will assemble the instance info from the Nova API with the info from the virt layer; orphan instance info will be filtered out and dropped. What do you think of this idea?

I can write a spec about this proposal. -- liusheng (2015.7.23)
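
The collector-side assembly proposed above might look roughly like the sketch below. It is an illustration only: `assemble_samples` is a hypothetical function, and both inputs are modelled as plain dicts (Nova records keyed by 'id', virt-layer samples carrying an 'instance_id'), which is an assumption rather than ceilometer's actual data model.

```python
def assemble_samples(nova_instances, virt_samples):
    """Join virt-layer samples with Nova instance info, dropping orphans.

    A sample is treated as an orphan when its 'instance_id' does not
    match any instance known from the Nova listing.
    """
    known = {inst["id"]: inst for inst in nova_instances}
    merged = []
    for sample in virt_samples:
        inst = known.get(sample["instance_id"])
        if inst is None:
            continue  # orphan: the virt layer reports it, Nova does not
        merged.append({**sample, "metadata": inst})
    return merged


nova_instances = [{"id": "vm-1", "host": "compute-1", "name": "web"}]
virt_samples = [
    {"instance_id": "vm-1", "cpu_time": 1200},
    {"instance_id": "vm-9", "cpu_time": 300},  # unknown to Nova
]
assembled = assemble_samples(nova_instances, virt_samples)
```

Only the sample for vm-1 survives the join; the vm-9 sample is dropped as an orphan.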


Work Items
