Production debugging

Registered by Julian Edwards on 2013-10-09

Debugging failures in MAAS is hard. We should aim for everything we ship with MAAS (i.e. outside of its core) to work through the API, including debugging tools. Customers, or PES, should not have to perform brain surgery on MAAS to diagnose a tagging issue, for example. Any solution we come up with has to work on site, in production.

Examples of failures:
 * node fails to boot
 * node fails to commission
 * power scripts not working
 * DNS not working

Blueprint information

Daniel Westervelt
Series goal:
Informational
Milestone target:
Completed by Adam Collard on 2019-10-09


= Debugging MAAS =

== Ideas ==

- IPMI console (conserver), view it in the UI when requested
- Kernel param net console
- Better error messages
- Keep track of the major events in the lifecycle of a node
- Add a command line option ("debug mode") to "block" during enlistment/commissioning to let the user log in and debug (i.e. similar to the backdoor feature that "exists" today).
- MaaSTest permanently in debug mode so you can always get a back door into the system.
- cloud-init logs on region/UI
- Consolidate logs into one file (→ (r)syslog)
- Log level changes on the fly (this needs a bug)
- Set software clock on enlistment
- Customize the DHCP template from the UI.
- Improved notifications UI.
- SOS report (in saucy main): gathers logs and a DB dump and creates a tarball with all of that to help remote debugging.
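
The SOS report idea above boils down to "collect a known set of files into one tarball". A minimal sketch of that step, using only the Python standard library, might look like the following. The function name build_sos_report and the file names in the demo are hypothetical; the real log and DB dump locations depend on the deployment and are not specified in this blueprint.

```python
import tarfile
import tempfile
from pathlib import Path


def build_sos_report(sources, output_path):
    """Bundle every existing path in `sources` into a gzip tarball.

    Missing paths are skipped and returned so the caller can report them.
    """
    missing = []
    with tarfile.open(output_path, "w:gz") as tar:
        for source in map(Path, sources):
            if source.exists():
                # Store under the bare name to keep absolute paths out of the archive.
                tar.add(source, arcname=source.name)
            else:
                missing.append(str(source))
    return missing


if __name__ == "__main__":
    # Demo with throwaway files; real locations are deployment-specific assumptions.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "example.log"
        log.write_text("node failed to commission\n")
        report = Path(tmp) / "sos-report.tar.gz"
        skipped = build_sos_report([log, Path(tmp) / "absent.log"], report)
        print("skipped:", skipped)
```

Because entries are stored under their bare file names, two sources with the same name would collide in the archive; a real tool would probably keep part of the directory structure.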


Work Items

IPMI console on demand, net-console by default (guesstimate 2w): TODO
Replace mod_wsgi as it doesn't support websockets (what with?): TODO
State management "threads" for each node and audit state history in DB (guesstimate 2w): TODO
Audit log UI / DB for a node (guesstimate 1w): TODO
Better error messages (review): TODO
Debug mode (backdoor, ipmi console on) (2w): TODO
Consolidate logs to syslog (tiered logging via syslog - store in postgres) (2w): TODO
Cloud-init logs in UI/API - redirect cloud-init logs and do a UI (2w): TODO
Log level changes on the fly (estimate unknown): TODO
Set software clock on enlistment/commissioning (1d): TODO
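
The state management and audit log items above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the MAAS design: the state names, the ALLOWED transition table, the node_event table and all function names are hypothetical, and sqlite3 stands in for the real database only so the example is self-contained.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical lifecycle: the states a node may move to from each state.
ALLOWED = {
    "new": {"commissioning"},
    "commissioning": {"ready", "failed"},
    "failed": {"commissioning"},
    "ready": {"deploying"},
    "deploying": {"deployed", "failed"},
}


def open_audit_db(path=":memory:"):
    """Open the database and create the audit table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_event ("
        "id INTEGER PRIMARY KEY, node TEXT, old_state TEXT, "
        "new_state TEXT, detail TEXT, at TEXT)"
    )
    return conn


def current_state(conn, node):
    """Return the node's latest recorded state, or "new" if it has no history."""
    row = conn.execute(
        "SELECT new_state FROM node_event WHERE node = ? "
        "ORDER BY id DESC LIMIT 1",
        (node,),
    ).fetchone()
    return row[0] if row else "new"


def transition(conn, node, new_state, detail=""):
    """Validate a state change and append it to the audit history."""
    old_state = current_state(conn, node)
    if new_state not in ALLOWED.get(old_state, set()):
        raise ValueError(f"{node}: {old_state} -> {new_state} is not allowed")
    conn.execute(
        "INSERT INTO node_event (node, old_state, new_state, detail, at) "
        "VALUES (?, ?, ?, ?, ?)",
        (node, old_state, new_state, detail,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def history(conn, node):
    """Return the node's transitions in the order they were recorded."""
    return conn.execute(
        "SELECT old_state, new_state, detail FROM node_event "
        "WHERE node = ? ORDER BY id",
        (node,),
    ).fetchall()
```

An audit log UI or API call for a node would then only need to read history(conn, node), and a rejected transition raises ValueError instead of silently corrupting the recorded lifecycle.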
