Force all LAVA devices to recheck health status at next opportunity

Registered by Paul Larson on 2012-03-08

When we upgrade the system, we want to make sure that things are behaving well and that there are no obvious problems introduced that are going to interfere with normal job progression. We can do that using the health checks.

If the health status of every device is set to unknown before the scheduler daemon is brought back online, idle devices will start a health check as soon as the scheduler comes back up. A device that was already processing a job during the update (which will still be running the old, pre-update code) will finish that job, but its next job (running the newly deployed code) will be a health check.

If the health check fails, the board will be marked offline, and the failure will need to be investigated before the board is allowed to resume processing normal jobs.
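
The behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the intended logic, not LAVA's scheduler code: the `Device`, `Health` and `Status` types and the three helper functions are hypothetical names invented for this sketch, and the real implementation is a Django admin action operating on the scheduler's device records.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    UNKNOWN = "unknown"
    PASS = "pass"
    FAIL = "fail"


class Status(Enum):
    IDLE = "idle"
    RUNNING = "running"
    OFFLINE = "offline"


@dataclass
class Device:
    # Hypothetical stand-in for a scheduler device record.
    hostname: str
    status: Status = Status.IDLE
    health: Health = Health.PASS


def force_health_unknown(devices):
    """Reset every device's health so that its next job is a health check."""
    for device in devices:
        device.health = Health.UNKNOWN


def next_job_kind(device):
    """Return what an available device should run next, or None if it cannot take a job."""
    if device.status is not Status.IDLE:
        return None  # still running a pre-upgrade job, or offline
    if device.health is Health.UNKNOWN:
        return "health_check"
    return "normal"


def record_health_check(device, passed):
    """Store the health check result; a failure takes the board offline."""
    if passed:
        device.health = Health.PASS
        device.status = Status.IDLE
    else:
        device.health = Health.FAIL
        device.status = Status.OFFLINE
```

In this sketch, calling `force_health_unknown` while the scheduler is stopped means that `next_job_kind` returns `"health_check"` for every idle device on restart, and for every busy device as soon as it becomes idle again. A board whose check fails goes offline and receives no further jobs until someone intervenes.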

Blueprint information

Status:
Complete
Approver:
Paul Larson
Priority:
Medium
Drafter:
Paul Larson
Direction:
Approved
Assignee:
Michael Hudson-Doyle
Definition:
Approved
Series goal:
Accepted for trunk
Implementation:
Implemented
Milestone target:
2012.03
Started by
Michael Hudson-Doyle on 2012-03-16
Completed by
Michael Hudson-Doyle on 2012-04-02

Whiteboard

Meta:
Headline: It is possible to force health checks to run on all boards after a deployment
Acceptance: there is a Django admin action to set the health of all devices to unknown
Roadmap id: LAVA2012-LAVA-HEALTH-MANAGEMENT

Work Items

Work items:
add admin action to set all devices health to unknown: DONE
