pre-built image job reliability

Registered by Andy Doan

Pre-built images are submitted to LAVA for testing each day, for example:
 http://validation.linaro.org/lava-server/dashboard/streams/private/team/linaro/pre-built-leb-origen

According to our reports view:
 http://validation.linaro.org/lava-server/scheduler/reports

Our failure rate for these test jobs ranges from 50% to about 65%. We should analyze these jobs for common causes of failure, in the same way we did for health job failures. That investigation should identify the most common causes of failure so that we can adjust LAVA to correct them.

Blueprint information

Status:
Complete
Approver:
None
Priority:
High
Drafter:
Andy Doan
Direction:
Approved
Assignee:
Spring Zhang
Definition:
Approved
Series goal:
Accepted for trunk
Implementation:
Implemented
Milestone target:
2012.07
Started by
Spring Zhang
Completed by
Spring Zhang

Whiteboard

[qzhang, 20120719] Result on https://wiki.linaro.org/Platform/Validation/PrebuiltImageReliability
[qzhang, 20120720] Origen 27 jobs: 25883~25908; Panda 25 jobs: 25495-25519, 15 nano and 10 leb; Snowball 25 jobs: 25858~25882
[qzhang, 20120723] The previous jobs are invalid because they all failed on updated dispatcher code; re-running them now. Origen 25 jobs: 26226~26251; Snowball 25 jobs: 26253~26277
[qzhang, 20120729] Converted the wiki to a spreadsheet at https://docs.google.com/spreadsheet/ccc?key=0AqSRlHjy1cqjdDh5bXVoUkxWY01iZ3U5bEs2c0ZCbWc

Meta:
Headline: pre-built image testing improved
Acceptance: we have metrics (if not code fixes) for the most common LAVA failures in pre-built image testing
Roadmap id: CARD-128

Work Items

Work items:
pick a few daily builds of Origen and re-submit 25 jobs for each build: DONE
pick a few daily builds of Panda and re-submit 25 jobs for each build: DONE
pick a few daily builds of Snowball and re-submit 25 jobs for each build: DONE
create a spreadsheet of failures organized by "lava failure", "image failure", "don't know": DONE
write a summary of the most common problems we are seeing: DONE
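
The categorization and summary work items above could be tallied with a short script. This is only a minimal sketch under stated assumptions: the `summarize` helper, the record layout of (job id, category) pairs, and the sample records are all hypothetical and are not taken from LAVA or from the actual spreadsheet. Only the three category names come from the work item.

```python
from collections import Counter

# Category names taken from the work item above.
CATEGORIES = ("lava failure", "image failure", "don't know")


def summarize(jobs):
    """Return (failure_rate, Counter of failure categories).

    `jobs` is an iterable of (job_id, category) pairs (a hypothetical
    layout), where category is None for a job that completed and one of
    CATEGORIES for a job that failed.
    """
    jobs = list(jobs)
    if not jobs:
        raise ValueError("no jobs to summarize")
    counts = Counter()
    for job_id, category in jobs:
        if category is None:
            continue  # completed job, nothing to count
        if category not in CATEGORIES:
            raise ValueError(f"job {job_id}: unknown category {category!r}")
        counts[category] += 1
    failure_rate = sum(counts.values()) / len(jobs)
    return failure_rate, counts


# Made-up placeholder records for illustration, not real LAVA results.
sample_jobs = [
    (1, None),
    (2, "lava failure"),
    (3, "image failure"),
    (4, "lava failure"),
]

rate, by_category = summarize(sample_jobs)
print(f"failure rate: {rate:.0%}")
for category, count in by_category.most_common():
    print(f"{category}: {count}")
```

Sorting with `most_common()` puts the largest category first, which matches the goal of finding the most common causes of failure before deciding where to adjust LAVA.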
