Monitoring and managing device health in LAVA

Registered by Paul Larson

We have begun running automated daily "health" tests on the devices in LAVA. These jobs use known-good images that have previously passed. The goal of this testing is to expose issues with the hardware, infrastructure, or LAVA itself that could prevent test jobs from passing when they normally should. The jobs are intentionally quick-running so that they interfere with other jobs as little as possible. We should consider better ways of visualizing the results so that we can more easily see when things look ill. We should also consider how LAVA could eventually respond by offlining boards, once we can be sure that failures in the health checks equate to a problem with the device itself.

Some of the things we should discuss are:
1. Health check UI
Spring has been working on this. We should review what he has so far and talk about what we'd like to see here to help us track the health of machines more easily.
2. Automatic detection and response to problems
Once we are comfortable that these jobs will ONLY fail when there is a real problem, we should make sure we have the pieces in place to automatically offline the board and notify the team that something needs to be looked at.
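
As a starting point for the discussion of item 2, the following is a minimal sketch of one possible policy: offline a board only after several consecutive health check failures, and notify the team once when that happens. Everything here is an assumption made for illustration. The HealthMonitor class, the failure_threshold value, and the notify callback are hypothetical names and are not part of LAVA's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthMonitor:
    """Hypothetical helper for illustration; not part of LAVA itself."""

    # Assumed policy: require this many failures in a row before offlining.
    failure_threshold: int = 2
    # Assumed notification hook; by default it just prints the message.
    notify: Callable[[str], None] = print
    consecutive_failures: Dict[str, int] = field(default_factory=dict)
    offline: List[str] = field(default_factory=list)

    def record(self, device: str, passed: bool) -> None:
        """Record one health check result for a device."""
        if passed:
            # A passing run resets the failure streak.
            self.consecutive_failures[device] = 0
            return
        count = self.consecutive_failures.get(device, 0) + 1
        self.consecutive_failures[device] = count
        # Offline and notify only once, when the threshold is first reached.
        if count >= self.failure_threshold and device not in self.offline:
            self.offline.append(device)
            self.notify(
                f"{device} taken offline after {count} "
                "consecutive health check failures"
            )
```

Requiring consecutive failures rather than a single one is a design choice intended to reduce the chance of offlining a board because of a transient infrastructure problem. The right threshold would need to be settled once we know how reliable the health jobs are in practice.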

Blueprint information

Status: Not started
Approver: Paul Larson
Priority: Undefined
Drafter: Spring Zhang
Direction: Needs approval
Assignee: Spring Zhang
Definition: Discussion
Series goal: None
Implementation: Unknown
Milestone target: None

Related branches

Sprints

Whiteboard

Work Items