Ubuntu Enterprise Cloud Monitoring and Graphing

Registered by Dustin Kirkland 

If you're running a production UEC, you're probably curious what's actually going on in your Cloud!

You might want an integrated, instantaneous view of your Cloud's usage, from a cpu/memory/disk/network/etc perspective. And you really might want to see these collected and graphed over time.

The current mechanisms for viewing this sort of data in UEC are really primitive, if extant at all. There are a handful of euca2ools commands (euca-describe-*, euca_conf --list-*, etc) that can tell you a bit about what your cloud is doing and where. And byobu can tell you what services are running on the current machine.

But perhaps a real SNMP monitoring system would be useful. Tools like Munin and Nagios are also often found in Linux data centers.

What do we think about integrating something like SNMP, OpenNMS, Munin, or Nagios into UEC? Would this require Eucalyptus changes? What do you want to see, as a UEC administrator?

Blueprint information

Status:
Not started
Approver:
Jos Boumans
Priority:
Medium
Drafter:
Dustin Kirkland 
Direction:
Approved
Assignee:
Dave Walker
Definition:
Approved
Series goal:
Accepted for maverick
Implementation:
Deferred
Milestone target:
ubuntu-10.10-beta

Whiteboard

Status:
Plan defined for alpha3. collectd MIR stalled awaiting guidance from ubuntu-devel. Munin plugin submitted as merge proposal for eucalyptus. collectd postponed, unlikely for maverick given size and complexity of MIR and proximity to feature freeze.

Complexity:
maverick-alpha-3: 4

Work items for maverick-alpha-2:
[kirkland] Fix Bug #595588, package/install eucalyptus extras/* scripts: DONE

Work items for maverick-alpha-3:
integrate said scripts with nagios/munin/logging: POSTPONED
create data abstraction for web view in UEC part 1: POSTPONED
Integrate data into UEC: POSTPONED
[ivanka] update UEC frontend theme to new Ubuntu aubergine branding (find someone in Design team ?): POSTPONED
[clint-fewbar] merge rrdtool >= 1.4 merged, FTBFS until libdbi0 clears MIR and is moved to main (LP: #605871): DONE
[clint-fewbar] MIR libdbi0 (new dependency of rrdtool) (LP: #608552) and (LP: #608556): DONE
[clint-fewbar] update libdbi to latest version to resolve issues with rrdtool 1.4's dbi support: DONE
[clint-fewbar] MIR for collectd: POSTPONED
[clint-fewbar] adapt ganglia script for collectd and/or munin: DONE
[clint-fewbar] update seeds with collectd/libdbi in main: POSTPONED

Work items for ubuntu-10.10-beta:
[clint-fewbar] produce custom templates for eucalyptus plugins (Not doing this, not worth running two copies of munin): DONE
[clint-fewbar] test automatic configuration of plugins on installation (README.Debian file added rather than re-writing plugin script): DONE

20100806: The following work items must be re-evaluated at UDS-N (clint-fewbar):
 [clint-fewbar] MIR for collectd: INPROGRESS
 [clint-fewbar] update seeds with collectd/libdbi in main: TODO

view 20100602:
* Targets should include:
  * packaging the provided monitoring/logging scripts from eucalyptus
  * Integrate with our nagios/munin/logging
  * Nice to have: frontend available on cloud/cluster controller

Monitoring UEC instances vs monitoring applications running in instances? -- mathiaz

== Cloud monitoring and graphing ==

Use case: optimize the number of instances on a cloud

=== Data collection ===
 * Node controller:
   * number of instances running
   * resources used by each instance: number of cores, disk available, memory
   * generic stats: network io, disk io, power consumption
   * statistics about each instance: kvm information, cpu load
   * ksm
   * disk io per instance
 * Cluster controller:
   * network throughput:
     - In, Out:
       by NC, by security groups, by instance?
   * latency: delay added by the CC.
 * Storage controller:
   * disk io
   * network io
 * Cloud controller:
   * number of instances started/stopped (Counter)
   * number of instances by user, by security group.
   * ebs usage
   * reserved ips
   all the resources that a user can create/request.
 * Walrus:

 Error messages on each component.
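
A minimal sketch of how one of the counters above (instances by state) could be exposed as a Munin plugin. This is not the plugin submitted in the merge proposal. It assumes that `euca-describe-instances` prints tab-separated `INSTANCE` lines with the state in the sixth field (EC2-style output), and the graph title, category and field names are made up for illustration. The Munin side follows the usual plugin protocol: print graph metadata when called with `config`, otherwise print `<field>.value <n>` lines.

```python
from collections import Counter

# States reported on the graph (assumed set, EC2-style naming).
STATES = ("pending", "running", "shutting-down", "terminated")


def count_states(describe_output):
    """Count instances per state from euca-describe-instances text.

    Assumption: INSTANCE lines are tab-separated and the state is the
    sixth field.
    """
    counts = Counter()
    for line in describe_output.splitlines():
        fields = line.split("\t")
        if fields[0] == "INSTANCE" and len(fields) > 5:
            counts[fields[5]] += 1
    return counts


def field_name(state):
    """Munin field names may not contain '-', so map it to '_'."""
    return state.replace("-", "_")


def munin_config():
    """Lines to print when Munin calls the plugin with 'config'."""
    lines = [
        "graph_title UEC instances by state",  # hypothetical title
        "graph_category cloud",                # hypothetical category
        "graph_vlabel instances",
    ]
    for state in STATES:
        lines.append("%s.label %s" % (field_name(state), state))
    return lines


def munin_values(counts):
    """Lines to print when Munin calls the plugin without arguments."""
    return ["%s.value %d" % (field_name(s), counts.get(s, 0)) for s in STATES]


def run(argv, describe_output):
    """Dispatch on the first plugin argument, as Munin would call it."""
    if len(argv) > 1 and argv[1] == "config":
        return munin_config()
    return munin_values(count_states(describe_output))
```

A real plugin would obtain `describe_output` by running `euca-describe-instances` with admin credentials and print the returned lines, one per line, on stdout.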

Package/ship scripts
 * extras/ganglia.sh
 * extras/nagios.sh

=== Collection/aggregation/graphing/alerting frameworks ===

extras/: nagios & ganglia scripts.
 * nagios script is a statically configured set of passive checks
   * might be better as a series of active checks ("active" in the sense of
     nagios terminology) which are pulling information about which resources to
     check from an authoritative source (at worst the eucalyptus config files,
     at best the running eucalyptus itself)
   * basically just doing a wget on the various web service components to check
     the status, so it's pretty cheap and easy to perform.
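
A rough sketch of what such a check could look like if rewritten as a standalone probe, not the contents of extras/nagios.sh. It fetches each component's web service URL and maps the result onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The service list is passed in as a dict so that it can come from an authoritative source. The host names, ports and paths in `EXAMPLE_SERVICES` are assumptions for illustration only.

```python
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical endpoints; a real check would derive these from the
# eucalyptus configuration or from the running cloud.
EXAMPLE_SERVICES = {
    "CC": "http://cc.example.com:8774/axis2/services/EucalyptusCC",
    "NC": "http://nc1.example.com:8775/axis2/services/EucalyptusNC",
}


def http_status(url, timeout=5.0):
    """Fetch url and return the HTTP status code (raises OSError on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_service(name, url, fetch=http_status):
    """Return (exit_code, message) for one component, Nagios plugin style."""
    try:
        status = fetch(url)
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (name, exc)
    if status == 200:
        return OK, "OK - %s answered HTTP 200" % name
    return WARNING, "WARNING - %s answered HTTP %d" % (name, status)


def check_all(services, fetch=http_status):
    """Check every service; the overall code is the worst individual code."""
    results = [check_service(n, u, fetch) for n, u in services.items()]
    worst = max((code for code, _ in results), default=OK)
    return worst, [msg for _, msg in results]
```

The `fetch` parameter exists only so the logic can be exercised without a running cloud. A plugin wrapper would print the messages and call `sys.exit()` with the returned code.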

Plot against users.

 * Instantaneous views
 * Views over time
 * Error conditions
  * with links to documentation about said errors

Graphing:
 by users, by security groups
 critical resources that can be given out, or performance (operated):
  - free ip vs allocated ip
  - S3/walrus, ebs module
  - load, latency
  - capacity of instances used vs available instances
    -> graphical view of euca-describe-availability-zones verbose
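
As a starting point for the capacity graph, the numbers could be pulled out of the `euca-describe-availability-zones verbose` text before being fed to a grapher. The sketch below assumes the output has one `AVAILABILITYZONE <cluster> <address>` line per cluster followed by lines of the form `AVAILABILITYZONE |- m1.small 0004 / 0004 ...` (free / max per VM type). That layout is an assumption and should be checked against the installed euca2ools version.

```python
import re

# Assumed capacity line: AVAILABILITYZONE  |- <vmtype>  <free> / <max> ...
VMTYPE_RE = re.compile(
    r"^AVAILABILITYZONE\s+\|-\s+(?P<vmtype>\S+)\s+(?P<free>\d+)\s*/\s*(?P<max>\d+)"
)
# Assumed cluster line: AVAILABILITYZONE  <cluster name>  <address>
ZONE_RE = re.compile(r"^AVAILABILITYZONE\s+(?P<zone>[^|\s]\S*)")


def parse_capacity(text):
    """Return {cluster: {vmtype: {"used", "free", "max"}}} from verbose output."""
    capacity = {}
    zone = None
    for line in text.splitlines():
        vm = VMTYPE_RE.match(line)
        if vm:
            free = int(vm.group("free"))
            total = int(vm.group("max"))
            capacity.setdefault(zone, {})[vm.group("vmtype")] = {
                "used": total - free,
                "free": free,
                "max": total,
            }
            continue
        z = ZONE_RE.match(line)
        if z:
            # Remember which cluster the following vm type lines belong to.
            zone = z.group("zone")
    return capacity
```

The column header line (`|- vm types free / max ...`) is skipped because it carries no numeric free/max pair.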

Alerting:
 - passive checks:
   Java services (wget) (CC, SC, Walrus, NC)
 - active checks:
   Cluster controllers are still running.

Powersave Scheduler Stats
 * which systems are powered on/off
 * total time each system has spent on/off, used/unused
 * power utilization on running nodes
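
One way the "total time on/off" figure could be derived, assuming the powersave scheduler (or its log) can provide a time-ordered list of state changes per node. The event format here, `(timestamp_in_seconds, state)`, is hypothetical.

```python
def time_in_states(events, end):
    """Sum the seconds a node spent in each state.

    events: time-ordered list of (timestamp_seconds, state) transitions
            (hypothetical format); each state lasts until the next event.
    end:    timestamp closing the observation window.
    """
    totals = {}
    # Pair each event with the one that follows it; the last event is
    # closed by the end of the observation window.
    following = events[1:] + [(end, None)]
    for (start, state), (stop, _) in zip(events, following):
        totals[state] = totals.get(state, 0) + (stop - start)
    return totals
```

For example, a node that is switched on at t=0, off at t=100 and on again at t=250, observed until t=300, yields 150 seconds on and 150 seconds off. The same function would cover used/unused if those transitions are logged in the same way.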

Location: cloud controller as a tab for the management panel?

(?)

Work Items