Making sure LAVA test boards stay healthy

Registered by Paul Larson

We've started running health check jobs in the lab every day to help us spot problems in the boards, the infrastructure, or LAVA itself. There are still a few things we can do to improve this:
1. Health check UI
Spring has been working on this. We should review what he has so far and discuss what we'd like to see here to help us track the health of machines more easily.
2. Automatic detection and response to problems
Once we are comfortable that these jobs will fail ONLY when there's a real problem, we should make sure the pieces are in place to automatically take the board offline and notify the team that something needs to be looked at.
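
The automatic response described in item 2 could look roughly like the sketch below. It is only an illustration under stated assumptions and does not use LAVA's actual code or API: the `Board` record, the `handle_health_result` function, the `notify` callback, and the `failure_threshold` parameter are all hypothetical names invented for this example. The default threshold of 1 reflects the goal that a health check fails only when there is a real problem.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Board:
    """Hypothetical record for one test board (not a LAVA model)."""
    name: str
    online: bool = True
    # History of health check outcomes, oldest first; True means the job passed.
    recent_results: List[bool] = field(default_factory=list)


def handle_health_result(
    board: Board,
    passed: bool,
    notify: Callable[[str], None],
    failure_threshold: int = 1,
) -> bool:
    """Record a health check result and react to failures.

    Returns True if this call took the board offline, False otherwise.
    """
    board.recent_results.append(passed)

    # Nothing to do if the job passed or the board is already offline.
    if passed or not board.online:
        return False

    # Count the run of consecutive failures at the end of the history.
    consecutive = 0
    for result in reversed(board.recent_results):
        if result:
            break
        consecutive += 1

    if consecutive < failure_threshold:
        return False

    # Take the board offline and tell the team that it needs attention.
    board.online = False
    notify(
        f"Board {board.name} taken offline after "
        f"{consecutive} failed health check(s)"
    )
    return True


# Example usage with a list standing in for the notification channel.
messages: List[str] = []
example_board = Board("example-board-01")  # hypothetical board name
handle_health_result(example_board, True, messages.append)
handle_health_result(example_board, False, messages.append)
```

Passing the notifier in as a callback keeps the decision logic independent of how the team is actually contacted (email, IRC, or anything else), and a threshold above 1 could be used while the health check jobs are still producing occasional false failures.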

Blueprint information

Status: Complete
Approver: Paul Larson
Priority: Undefined
Drafter: Spring Zhang
Direction: Needs approval
Assignee: Spring Zhang
Definition: Superseded
Series goal: None
Implementation: Unknown
Milestone target: None
Completed by: Paul Larson
