Implement monitoring for EC2 Builders infrastructure
A large part of the Android Build infrastructure is EC2-based, which makes it subject to various non-deterministic errors inherent in the complexity of a cloud environment. Because it is a multilayer architecture, we should also monitor it at multiple layers, including with standalone scripts that catch errors we can't handle in Jenkins or the build scripts.
Blueprint information
- Status:
- Complete
- Approver:
- Данило Шеган
- Priority:
- Medium
- Drafter:
- Paul Sokolovsky
- Direction:
- Approved
- Assignee:
- Paul Sokolovsky
- Definition:
- Approved
- Series goal:
- Accepted for trunk
- Implementation:
- Implemented
- Milestone target:
- 2013.02
- Started by
- Paul Sokolovsky
- Completed by
- Paul Sokolovsky
Whiteboard
[pfalcon 2012-02-27] Created based on lp:940226
[pfalcon 2012-04-04] Extend scope to all EC2 infra
[pfalcon 2012-04-04] Essential based on EC2 over-budget we have
[pfalcon 2012-04-04] Starting on "Improve handling on Jenkins build slave logs" right away - that's needed to provide feedback to lp:932088 as requested by Ubuntu EC2 team.
[pfalcon 2012-04-06] Nagios and stuff for sure won't fit into 12.04. Also, who said it will be Nagios? ;-) It should start with WI "Review and select system monitoring solution". In this regard, just setting up dumb df-based cron script is much more practical...
[fboudra 2012-04-06] Nagios was proposed because we're looking for monitoring solution for other systems like validation servers as well. apt-get install nagios-nrpe-plugin won't take more time than the dumb df-based cron script ;)
[pfalcon 2012-04-09] ^ Yes, once we have Nagios set up and proven to work reliably. In the meantime, a dumb cronjob has been set up.
[pfalcon 2012-04-27] Due to urgent Android restricted builds BP, less than 50% of this BP was implemented, proposed to move it altogether to 2012.04.
[dzin 2012-04-27] Move to 12.05
[danilo 2012-05-08] Splitting and cleaning up with David, Paul.
[danilo 2012-06-06] Whatever's implemented doesn't match the acceptance criteria, so moving to backlog after discussion with David.
[pfalcon 2013-02-28] This has been fully implemented now.
[pfalcon 2013-03-01] Re:"Setup bot EC2 user which can shutdown zombie slave instances" - initially planned, but turned out not needed.
Meta:
Headline: A watchdog script is now running to keep EC2 slaves for android-build under control.
Acceptance: A watchdog script is in place that kills any runaway EC2 slaves whose jobs run too long.
Work Items
Work items:
Implement initial script to catch long-running EC2 build slaves and send emails to appropriate parties: DONE
Deploy the monitoring script as a cronjob: DONE
Refine the monitoring script to avoid false positives: DONE
Get more information on long-running builds on ci.*: DONE
Add support for different groups of builds with different timeouts: DONE
Set up a bot EC2 user which can shut down zombie slave instances: POSTPONED
Extend the monitoring script to automatically shut down zombie instances: DONE
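
The work items above can be sketched roughly as follows. This is only an illustration of the watchdog idea, not the deployed script: the group names, the timeout values, the Slave record, and the notify/terminate callbacks are all hypothetical, and the rule of terminating only after a slave has been flagged on two consecutive runs is just one assumed way of avoiding false positives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List

# Hypothetical per-group timeouts; the real groups and limits are not
# stated in this blueprint.
TIMEOUTS: Dict[str, timedelta] = {
    "default": timedelta(hours=6),
    "restricted": timedelta(hours=10),
}


@dataclass(frozen=True)
class Slave:
    """Hypothetical record describing one EC2 build slave."""
    instance_id: str
    group: str
    job_started: datetime
    previous_hits: int = 0  # earlier watchdog runs that already flagged it


def find_overdue(slaves: Iterable[Slave], now: datetime,
                 timeouts: Dict[str, timedelta] = TIMEOUTS) -> List[Slave]:
    """Return slaves whose current job has run longer than its group allows."""
    overdue = []
    for slave in slaves:
        # Unknown groups fall back to the default timeout.
        limit = timeouts.get(slave.group, timeouts["default"])
        if now - slave.job_started > limit:
            overdue.append(slave)
    return overdue


def watchdog(slaves: Iterable[Slave], now: datetime,
             notify: Callable[[Slave], None],
             terminate: Callable[[Slave], None],
             min_hits: int = 2) -> List[str]:
    """Notify about every overdue slave; terminate only repeat offenders.

    A slave is terminated once it has been flagged on min_hits runs,
    counting the current one (an assumed false-positive guard).
    """
    killed = []
    for slave in find_overdue(slaves, now):
        notify(slave)
        if slave.previous_hits + 1 >= min_hits:
            terminate(slave)
            killed.append(slave.instance_id)
    return killed
```

In a real deployment, notify would send the email and terminate would call the EC2 API; both are left as callbacks here so the timeout logic can be run and tested without cloud credentials, for example from a cronjob wrapper.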