juju-core

Enterprise Service: Logging and Monitoring, etc

Registered by Frank Mueller on 2013-04-30

[RATIONALE]

Operating cloud environments involves, among other things, requirements for reliability and scalability. Operators therefore need continuous, near-real-time information about

- the current situation regarding emerging errors and failures,
- the load of the system and
- the usage of resources.

This information has to be filtered and aggregated to give a differentiated view even in large environments. While the aggregated information for a dashboard gives a quick overview of the state and of any tasks that need immediate attention (e.g. scaling up highly loaded services), the detailed log information, filtered for a given context, helps to detect and correct the causes of errors.

The whole logging and monitoring system has to be designed to be scalable and fast enough to process a high-frequency data stream in very large environments.

[GOAL]

- Provide an internal API for logging and status information (such as counters, timings, etc.)
- Provide a tool set to gather monitoring-relevant information and pass it to the API
- Provide a scalable backend for configurable processing of the logging and monitoring data stream
- Provide an API for a dashboard
- Provide an API for log analysis

Blueprint information

Status: Complete

Approver: Mark Ramm

Priority: High

Drafter: None

Direction: Needs approval

Assignee: Tim Penhey

Definition: Obsolete

Series goal: None

Implementation: Unknown

Milestone target: None

Completed by: Katherine Cox-Buday on 2015-06-11

Related branches

Related bugs

Sprints

cloud-1305

Whiteboard

[USER STORIES]
As a sysop I want to be immediately alerted in case of a failure.

As a sysop I want to be able to drill down into a failure situation to discover the cause of the failure.

As a sysop I want to see mid- to long-term error aggregations that help me to discover error trends (regarding the provider, instances or charms).

As a sysop I want to see the resource consumption of my environment.

As a sysop I want to see trends allowing me to discover when resource limits of my environment are reached.

As a sysop I want to correlate the resource consumption with external events (e.g. by marketing) to prepare for future peaks.

As a devop I want to analyse the error logs to discover and correlate related output leading up to failures (e.g. outages, lack of resources, network latency, routing problems etc.).

[ASSUMPTIONS]
[RISKS]
[IN SCOPE]
[OUT OF SCOPE]
[USER ACCEPTANCE]
[RELEASE NOTE/BLOG]

Work Items

This blueprint contains Public information

Everyone can see this information.

Subscribers

Dave Cheney

Ian Booth

John A Meinel

Kapil Thangavelu

Richard Harding

Tim Penhey

William Reade