Enterprise Service: Logging and Monitoring, etc

Registered by Frank Mueller

[RATIONALE]

Operating cloud environments brings, among other things, requirements for reliability and scalability. To meet them, operators need continuous and timely information about

- the current situation regarding emerging errors and failures,
- the load of the system and
- the usage of resources.

This information has to be filtered and aggregated to give a differentiated view even in large environments. Aggregated information on a dashboard gives a quick overview of the state and of any tasks needed immediately (e.g. scaling up heavily loaded services), while detailed log information filtered for a given context helps to detect and correct the causes of errors.
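
The two views described above can be illustrated with a minimal Go sketch. The `Record` type, its fields, the `Level` scale and both function names are assumptions made for this example and are not part of the blueprint: `ErrorsByService` stands in for the aggregated dashboard view, `Filter` for the detailed view narrowed to one context.

```go
package main

import (
	"fmt"
	"sort"
)

// Level is a hypothetical severity scale used only in this sketch.
type Level int

const (
	Info Level = iota
	Warning
	Error
)

// Record is a hypothetical log record; the field names are assumptions.
type Record struct {
	Service string
	Level   Level
	Message string
}

// Filter returns the records for which keep reports true (the detailed view).
func Filter(records []Record, keep func(Record) bool) []Record {
	var out []Record
	for _, r := range records {
		if keep(r) {
			out = append(out, r)
		}
	}
	return out
}

// ErrorsByService counts Error-level records per service (the dashboard view).
func ErrorsByService(records []Record) map[string]int {
	counts := make(map[string]int)
	for _, r := range records {
		if r.Level == Error {
			counts[r.Service]++
		}
	}
	return counts
}

func main() {
	records := []Record{
		{"db", Error, "connection refused"},
		{"web", Info, "request served"},
		{"db", Error, "connection refused"},
		{"web", Error, "upstream timeout"},
	}

	// Aggregated overview: error count per service, in a stable order.
	counts := ErrorsByService(records)
	services := make([]string, 0, len(counts))
	for s := range counts {
		services = append(services, s)
	}
	sort.Strings(services)
	for _, s := range services {
		fmt.Printf("%s: %d errors\n", s, counts[s])
	}

	// Detailed view: every record belonging to one service.
	dbOnly := Filter(records, func(r Record) bool { return r.Service == "db" })
	for _, r := range dbOnly {
		fmt.Println(r.Message)
	}
}
```

In a large environment the same split would probably be applied to a stream rather than to a slice held in memory; the slice only keeps the sketch short.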

The whole logging and monitoring system has to be designed to be scalable and fast enough to process a high-frequency data stream in very large environments.

[GOAL]

- Provide an internal API for logging and status information (such as counters, timings, etc.)
- Provide a tool set to gather monitoring-relevant information and pass it to the API
- Provide a scalable backend for configurable processing of the logging and monitoring data stream
- Provide an API for a dashboard
- Provide an API for log analysis
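
As an illustration of the first goal, an internal API for status information such as counters and timings might look roughly like the following Go sketch. The `Registry` type and the method names `Incr`, `Observe`, `Counter` and `MeanTiming` are hypothetical and chosen only for this example; the blueprint does not define an interface.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Registry is a hypothetical in-process collector for counters and timings.
// It is safe for concurrent use.
type Registry struct {
	mu       sync.Mutex
	counters map[string]int64
	timings  map[string][]time.Duration
}

// NewRegistry returns an empty Registry.
func NewRegistry() *Registry {
	return &Registry{
		counters: make(map[string]int64),
		timings:  make(map[string][]time.Duration),
	}
}

// Incr adds delta to the named counter.
func (r *Registry) Incr(name string, delta int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counters[name] += delta
}

// Observe records one duration measurement under the given name.
func (r *Registry) Observe(name string, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.timings[name] = append(r.timings[name], d)
}

// Counter returns the current value of the named counter (0 if unknown).
func (r *Registry) Counter(name string) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.counters[name]
}

// MeanTiming returns the mean of all observations for name, or 0 if none exist.
func (r *Registry) MeanTiming(name string) time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	ds := r.timings[name]
	if len(ds) == 0 {
		return 0
	}
	var sum time.Duration
	for _, d := range ds {
		sum += d
	}
	return sum / time.Duration(len(ds))
}

func main() {
	reg := NewRegistry()
	reg.Incr("api.requests", 1)
	reg.Incr("api.requests", 1)
	reg.Observe("api.latency", 10*time.Millisecond)
	reg.Observe("api.latency", 30*time.Millisecond)
	fmt.Println("requests:", reg.Counter("api.requests"))
	fmt.Println("mean latency:", reg.MeanTiming("api.latency"))
}
```

A real backend would likely forward these values to the processing pipeline instead of keeping every observation in memory; the in-memory maps are a simplification for this sketch.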

Blueprint information

Status: Complete
Approver: Mark Ramm
Priority: High
Drafter: None
Direction: Needs approval
Assignee: Tim Penhey
Definition: Obsolete
Series goal: None
Implementation: Unknown
Milestone target: None
Completed by: Katherine Cox-Buday

Whiteboard

[USER STORIES]
As a sysop I want to be immediately alerted in case of a failure.

As a sysop I want to be able to drill down into a failure situation to discover the cause of the failure.

As a sysop I want to see mid- to long-term error aggregations that help me to discover error trends (regarding the provider, instances or charms).

As a sysop I want to see the resource consumption of my environment.

As a sysop I want to see trends that allow me to discover when the resource limits of my environment will be reached.

As a sysop I want to correlate the resource consumption with external events (e.g. marketing campaigns) to prepare for future peaks.

As a devop I want to analyse the error logs to discover and correlate related output that leads to failures (e.g. outages, lack of resources, network latency, routing problems, etc.).

[ASSUMPTIONS]
[RISKS]
[IN SCOPE]
[OUT OF SCOPE]
[USER ACCEPTANCE]
[RELEASE NOTE/BLOG]
