Enterprise Service: Logging and Monitoring, etc
[RATIONALE]
Operating cloud environments imposes, among other things, requirements on reliability and scalability. Operators therefore need continuous, near-real-time information about
- the current situation regarding emerging errors and failures,
- the load of the system and
- the usage of resources.
This information has to be filtered and aggregated to give a differentiated view even in large environments. Aggregated information on a dashboard gives a quick overview of the system state and of tasks that may be needed immediately (e.g. scaling up highly loaded services), while detailed log information filtered for a given context helps to detect and correct the causes of errors.
The whole logging and monitoring system has to be designed to be scalable and fast enough to process a high-frequency data stream in very large environments.
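The two views described above (an aggregated overview and a filtered detail view) can be illustrated with a minimal sketch. It assumes a hypothetical `LogRecord` shape and helper names; the blueprint itself defines no record format, and the service names in the sample data are only placeholders.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # All field names are assumptions made for this sketch.
    timestamp: float   # seconds since the epoch
    service: str       # emitting service or unit
    level: str         # e.g. "INFO" or "ERROR"
    message: str


def aggregate_errors(records):
    """Count ERROR records per service, as a dashboard overview might."""
    return Counter(r.service for r in records if r.level == "ERROR")


def filter_context(records, service, start, end):
    """Return the detailed records of one service within [start, end]."""
    return [r for r in records
            if r.service == service and start <= r.timestamp <= end]


# Placeholder data, not taken from any real environment.
records = [
    LogRecord(10.0, "mysql", "INFO", "started"),
    LogRecord(12.5, "wordpress", "ERROR", "db connection refused"),
    LogRecord(13.0, "wordpress", "ERROR", "db connection refused"),
    LogRecord(20.0, "mysql", "ERROR", "disk full"),
]

overview = aggregate_errors(records)                          # per-service error counts
details = filter_context(records, "wordpress", 12.0, 15.0)   # drill-down for one context
```

In a large environment the same two operations would presumably run incrementally over a stream rather than over an in-memory list; the sketch only shows the shape of the result each view needs.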
[GOAL]
- Provide an internal API for logging and status information (such as counters, timings, etc.)
- Provide a tool set to gather monitoring-relevant information and pass it to the API
- Provide a scalable backend for the configurable processing of the logging and monitoring data stream
- Provide an API for a dashboard
- Provide an API for log analysis
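As a rough illustration of the first goal, an internal API for counters and timings might look like the sketch below. The `Metrics` class, its method names, and the metric names are hypothetical; the blueprint does not specify an interface.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process collector for counters and timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, delta=1):
        """Increase the named counter by delta."""
        self.counters[name] += delta

    @contextmanager
    def timed(self, name):
        """Record the wall-clock duration of the enclosed block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self):
        """Return aggregated values, e.g. for a dashboard API to serve."""
        return {
            "counters": dict(self.counters),
            "timings": {
                name: {"count": len(values), "total_seconds": sum(values)}
                for name, values in self.timings.items()
            },
        }


metrics = Metrics()
metrics.incr("api.requests")
metrics.incr("api.requests")
with metrics.timed("api.latency"):
    sum(range(1000))  # stand-in for real work
snap = metrics.snapshot()
```

Keeping collection (`incr`, `timed`) separate from reporting (`snapshot`) is one way to let a backend aggregate many such snapshots without the instrumented code knowing about it.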
Blueprint information
- Status: Complete
- Approver: Mark Ramm
- Priority: High
- Drafter: None
- Direction: Needs approval
- Assignee: Tim Penhey
- Definition: Obsolete
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by: Katherine Cox-Buday
Whiteboard
[USER STORIES]
As a sysop I want to be immediately alerted in case of a failure.
As a sysop I want to be able to drill down into a failure situation to discover the cause of the failure.
As a sysop I want to see mid- to long-term error aggregations that help me discover error trends (regarding the provider, instances or charms).
As a sysop I want to see the resource consumption of my environment.
As a sysop I want to see trends allowing me to discover when resource limits of my environment are reached.
As a sysop I want to correlate the resource consumption with external events (e.g. marketing activities) to prepare for future peaks.
As a devop I want to analyse the error logs to discover and correlate related output leading to failures (e.g. outages, lack of resources, network latency, routing problems, etc.).
[ASSUMPTIONS]
[RISKS]
[IN SCOPE]
[OUT OF SCOPE]
[USER ACCEPTANCE]
[RELEASE NOTE/BLOG]