Implement monitoring for EC2 Builders infrastructure

Registered by Paul Sokolovsky

A large part of the Android Build infrastructure is EC2-based, which makes it subject to various non-deterministic errors inherent in the complexity of a cloud environment. Because it is a multilayer architecture, we should also monitor it on multiple layers, including with standalone scripts that catch errors we can't handle in Jenkins or the build scripts.

Blueprint information

Status:
Complete
Approver:
Данило Шеган
Priority:
Medium
Drafter:
Paul Sokolovsky
Direction:
Approved
Assignee:
Paul Sokolovsky
Definition:
Approved
Series goal:
Accepted for trunk
Implementation:
Implemented
Milestone target:
2013.02
Started by
Paul Sokolovsky
Completed by
Paul Sokolovsky

Whiteboard

[pfalcon 2012-02-27] Created based on lp:940226
[pfalcon 2012-04-04] Extend scope to all EC2 infra
[pfalcon 2012-04-04] Essential based on EC2 over-budget we have
[pfalcon 2012-04-04] Starting on "Improve handling on Jenkins build slave logs" right away - that's needed to provide feedback to lp:932088 as requested by Ubuntu EC2 team.
[pfalcon 2012-04-06] Nagios and stuff for sure won't fit into 12.04. Also, who said it will be Nagios? ;-) It should start with the WI "Review and select system monitoring solution". In this regard, just setting up a dumb df-based cron script is much more practical...
[fboudra 2012-04-06] Nagios was proposed because we're looking for a monitoring solution for other systems like validation servers as well. apt-get install nagios-nrpe-plugin won't take more time than the dumb df-based cron script ;)
[pfalcon 2012-04-09] ^ Yes, once we have Nagios set up and proven to work reliably. In the meantime, a dumb cronjob has been set up.
[pfalcon 2012-04-27] Due to the urgent Android restricted builds BP, less than 50% of this BP was implemented; proposing to move it altogether to 2012.04.
[dzin 2012-04-27] Move to 12.05
[danilo 2012-05-08] Splitting and cleaning up with David, Paul.
[danilo 2012-06-06] Whatever's implemented doesn't match the acceptance criteria, so moving to backlog after discussion with David.
[pfalcon 2013-02-28] This has been fully implemented now.
[pfalcon 2013-03-01] Re:"Setup bot EC2 user which can shutdown zombie slave instances" - initially planned, but turned out not needed.

Meta:
Headline: A watchdog script is now running to keep EC2 slaves for android-build under control.
Acceptance: A watchdog script is in place that kills any runaway EC2 slaves whose jobs run too long.


Work Items

Work items:
Implement initial script to catch long-running EC2 build slaves and send emails to appropriate parties: DONE
Deploy the monitoring script as cronjob: DONE
Elaborate monitoring script to avoid false positives: DONE
Get more information on long-running builds on ci.*: DONE
Add support for different groups of builds with different timeouts: DONE
Setup bot EC2 user which can shutdown zombie slave instances: POSTPONED
Extend the monitoring script to automatically shutdown zombie instances: DONE
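
The core check behind the work items above (per-group timeouts, skipping idle slaves to avoid false positives) might look roughly like the sketch below. This is not the deployed script: the Slave record, the group names, and the timeout values are all hypothetical, and the steps that query Jenkins/EC2, send emails, and shut down instances are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical per-group limits; the real groups and values are not stated
# in this blueprint.
GROUP_TIMEOUTS = {
    "default": timedelta(hours=4),
    "restricted": timedelta(hours=8),
}


@dataclass
class Slave:
    """Illustrative record for one EC2 build slave (assumed shape)."""
    instance_id: str
    group: str
    job_started: Optional[datetime]  # None means the slave is idle


def overdue_slaves(slaves, now, timeouts=GROUP_TIMEOUTS):
    """Return slaves whose current job has run longer than its group's limit."""
    overdue = []
    for slave in slaves:
        if slave.job_started is None:
            continue  # idle slave: nothing to time out
        # Unknown groups fall back to the default limit.
        limit = timeouts.get(slave.group, timeouts["default"])
        if now - slave.job_started > limit:
            overdue.append(slave)
    return overdue
```

A cron-driven wrapper would build the slave list from whatever the monitoring host can query, call overdue_slaves() with the current time, and then notify or shut down each returned instance.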

This blueprint contains Public information 
Everyone can see this information.