Monitoring Of Scheduler Queues

Registered by Andy Doan on 2012-07-01

Bug #1015532 is a good example of a place where LAVA lacks important monitoring. We need to monitor the scheduler's job queues to make sure jobs are executed in a timely manner. We should check for things like:

 * no devices of a given device type online
 * job queues growing too large (and therefore taking too long to execute)
 * jobs that seem to be hung

If any of these situations occurs, we should email an alert so the team is aware of it.
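As a rough illustration of the three checks listed above, here is a minimal Python sketch that works on plain in-memory records. The `Device` and `Job` classes, the status strings, and the thresholds (`max_queued`, `max_runtime`) are all hypothetical and chosen for the example; they are not the LAVA scheduler's actual models or the queries used by the deployed scripts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Device:
    # Hypothetical stand-in for a scheduler device record.
    hostname: str
    device_type: str
    online: bool


@dataclass
class Job:
    # Hypothetical stand-in for a scheduler job record.
    job_id: int
    device_type: str
    status: str  # assumed values: "submitted" (queued) or "running"
    submit_time: datetime
    start_time: Optional[datetime] = None


def offline_device_types(devices):
    """Return device types that have no device online at all."""
    all_types = {d.device_type for d in devices}
    online_types = {d.device_type for d in devices if d.online}
    return sorted(all_types - online_types)


def overloaded_queues(jobs, max_queued=10):
    """Return {device_type: queued_count} for queues above max_queued."""
    counts = {}
    for job in jobs:
        if job.status == "submitted":
            counts[job.device_type] = counts.get(job.device_type, 0) + 1
    return {t: n for t, n in counts.items() if n > max_queued}


def hung_jobs(jobs, now, max_runtime=timedelta(hours=24)):
    """Return ids of running jobs that started more than max_runtime ago."""
    return [
        job.job_id
        for job in jobs
        if job.status == "running"
        and job.start_time is not None
        and now - job.start_time > max_runtime
    ]
```

Each function returns an empty result when nothing is wrong, so a caller only needs to alert when one of them returns something non-empty.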

Blueprint information

Status:
Complete
Approver:
Andy Doan
Priority:
Medium
Drafter:
None
Direction:
Approved
Assignee:
Andy Doan
Definition:
Approved
Series goal:
Accepted for trunk
Implementation:
Implemented
Milestone target:
2012.07
Started by
Andy Doan on 2012-07-25
Completed by
Andy Doan on 2012-07-25

Whiteboard

[2012-07-26]: I have this set up as a cronjob on control under /home/doanac/lava-scripts

Meta:
Headline: Monitoring added for LAVA job queues
Acceptance: Alerts will be raised when the scheduler is experiencing overloaded queues.
Roadmap id: CARD-128

Work Items

Work items:
create query to list device types that have no devices online: DONE
create query for job queues growing too large or having a job queued too long: DONE
create query to find jobs that seem to be hung: DONE
report this information: DONE
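The reporting step could look something like the sketch below, which turns a list of problem descriptions into an email alert using Python's standard `email.message.EmailMessage`. The function name `build_alert`, the subject wording, and the addresses in the usage note are assumptions made for illustration; the blueprint does not describe how the actual scripts format or send their reports.

```python
from email.message import EmailMessage


def build_alert(problems, sender, recipient):
    """Build an alert email from problem descriptions, or None if all is well.

    `problems` is a list of human-readable strings, one per detected issue.
    """
    if not problems:
        return None  # nothing to report, so no email should be sent

    msg = EmailMessage()
    msg["Subject"] = f"LAVA scheduler alert: {len(problems)} problem(s)"
    msg["From"] = sender
    msg["To"] = recipient
    # One bullet per problem, in the order the checks reported them.
    msg.set_content("\n".join(f" * {p}" for p in problems) + "\n")
    return msg
```

A cron-driven script could pass the returned message to `smtplib.SMTP("localhost").send_message(msg)`, assuming a local mail relay is available on the host, and skip sending entirely when `build_alert` returns `None`.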
