Monitoring of Scheduler Queues
Bug #1015532 is a good example of a place where LAVA lacks important monitoring. We need to monitor the job queues in the scheduler to make sure jobs are executed in a timely manner. We should check for conditions such as:
* no devices of a given device type are online
* job queues growing too large (so that jobs take too long to execute)
* jobs that appear to be hung
If any of these situations occurs, we should send an email alert so the team is aware of it.
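The script behind the cron job is not included in this blueprint. As a minimal sketch of the three checks, assuming simplified, hypothetical Device and Job records rather than LAVA's actual scheduler models, they could be expressed as plain functions over the current device and job state. The thresholds (max_queued, max_wait, max_runtime) are illustrative parameters, not values taken from the blueprint.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


# Hypothetical, simplified records; the real LAVA scheduler models differ.
@dataclass
class Device:
    hostname: str
    device_type: str
    online: bool


@dataclass
class Job:
    job_id: int
    device_type: str
    submit_time: datetime
    start_time: Optional[datetime] = None  # None while the job is still queued
    end_time: Optional[datetime] = None    # None until the job finishes


def device_types_without_online_devices(devices: Iterable[Device]) -> List[str]:
    """Return device types for which no device is currently online."""
    any_online: Dict[str, bool] = {}
    for d in devices:
        any_online[d.device_type] = any_online.get(d.device_type, False) or d.online
    return sorted(t for t, online in any_online.items() if not online)


def overloaded_queues(jobs: Iterable[Job], now: datetime,
                      max_queued: int, max_wait: timedelta) -> List[str]:
    """Return device types whose queue is too long or holds a job waiting too long."""
    queued: Dict[str, List[Job]] = {}
    for j in jobs:
        if j.start_time is None:  # not yet started, so still in the queue
            queued.setdefault(j.device_type, []).append(j)
    return sorted(
        t for t, js in queued.items()
        if len(js) > max_queued or any(now - j.submit_time > max_wait for j in js)
    )


def hung_jobs(jobs: Iterable[Job], now: datetime, max_runtime: timedelta) -> List[int]:
    """Return IDs of jobs that started but have been running longer than max_runtime."""
    return sorted(
        j.job_id for j in jobs
        if j.start_time is not None and j.end_time is None
        and now - j.start_time > max_runtime
    )
```

Each check returns a list that is empty when everything is healthy, so a caller can simply alert on any non-empty result.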
Whiteboard
[2012-07-26]: I have this set up as a cron job on control under /home/doanac/
Meta:
Headline: Monitoring added for LAVA job queues
Acceptance: Alerts will be raised when the scheduler is experiencing overloaded queues.
Roadmap id: CARD-128
Work Items
create query to detect when no devices of a given device type are online: DONE
create query to detect job queues growing too large or a job being queued too long: DONE
create query to find jobs that seem to be hung: DONE
report this information: DONE
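The blueprint does not say how the report is delivered beyond an email alert. A minimal sketch of that step, using Python's standard email and smtplib modules, could build a plain-text message from whatever problems the checks found and hand it to an SMTP server. The sender and recipient addresses and the assumption of an SMTP server on localhost are hypothetical.

```python
import smtplib
from email.message import EmailMessage
from typing import List, Optional


def build_alert(problems: List[str], sender: str,
                recipients: List[str]) -> Optional[EmailMessage]:
    """Return an alert email listing the problems, or None if there are none."""
    if not problems:
        return None  # nothing to report, so no email is sent
    msg = EmailMessage()
    msg["Subject"] = f"LAVA scheduler alert: {len(problems)} problem(s)"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("\n".join(f"* {p}" for p in problems) + "\n")
    return msg


def send_alert(msg: EmailMessage, host: str = "localhost") -> None:
    """Hand the message to an SMTP server (assumed to be reachable at host)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Keeping message construction separate from sending means a cron-driven script can call send_alert only when build_alert returns a message, and the message contents can be checked without a mail server.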