Nova exception and log handling policy

Registered by Nachi Ueno on 2011-10-11

Exception and logging policy
====================================================

The following are the OpenStack exception and logging levels.

CRITICAL
- Data is destroyed by the system.
- All Nova functions stop.
- Each component outputs a message to the log.
  - The message should specify which component caused the problem.
Ex: when Nova finds conflicting data

FATAL
- Invalid configuration value, or abnormal behavior of an external component
- Invalid configuration (environment)
- The daemon stops.
- The component outputs a message to the log.
  - The message should specify which configuration value and which environment
    caused the problem when the component started to run.
Ex: disk full
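
The FATAL behaviour described above could be sketched as follows. This is a minimal illustration, not Nova's startup code: the logger name, the config keys, and the functions `validate_config` and `start_daemon` are all hypothetical. Note that in Python's standard `logging` module `FATAL` is an alias for `CRITICAL`, so the sketch uses `LOG.critical()`.

```python
import logging
import sys

LOG = logging.getLogger("nova.startup_example")  # hypothetical logger name


def validate_config(config):
    """Return a list of problems found in a hypothetical config dict."""
    problems = []
    if not config.get("sql_connection"):
        problems.append("sql_connection is not set")
    if config.get("min_free_disk_mb", 0) < 0:
        problems.append("min_free_disk_mb must not be negative")
    return problems


def start_daemon(config):
    """Refuse to start when the configuration is invalid."""
    problems = validate_config(config)
    if problems:
        for problem in problems:
            # Name the offending setting so the operator can fix it.
            LOG.critical("invalid configuration: %s", problem)
        sys.exit(1)  # the daemon stops
    return "started"
```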

ERROR
- Limit exceeded, or API violation
- The daemon should continue to operate.
- The component outputs a message to the log.
  - The message should specify which process caused the exception.
  - The message should specify why the exception was raised.
Ex: no resources available when an instance starts to run
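
As a rough sketch of the ERROR rules above, the handler below logs which operation failed and why, then returns so that the daemon keeps running. The names `NoResourceError`, `start_instance`, and `handle_start_request` are made up for illustration and are not part of Nova.

```python
import logging

LOG = logging.getLogger("nova.compute_example")  # hypothetical logger name


class NoResourceError(Exception):
    """Hypothetical error raised when a host has no capacity left."""


def start_instance(instance_id, free_slots):
    # Hypothetical operation; fails when no capacity is available.
    if free_slots <= 0:
        raise NoResourceError("no free slots on host")
    return "running"


def handle_start_request(instance_id, free_slots):
    """Handle one request; log at ERROR and keep the daemon alive on failure."""
    try:
        return start_instance(instance_id, free_slots)
    except NoResourceError as exc:
        # Name the process (start_instance) and the reason (exc).
        LOG.error("start_instance failed for instance %s: %s", instance_id, exc)
        return None
```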

WARN
- Exception caused by external input
- Validation error
- The daemon continues to operate.
- The component outputs a message to the log.
  - The message should identify where the exception occurred.
  - The message should identify why the exception occurred.
Ex: call to an unauthorized API, mismatched input parameters
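
A validation failure logged at WARN might look like the sketch below. The request shape, the status codes, and the functions `validate_port_range` and `handle_request` are assumptions for illustration only; the local `InvalidInput` class merely borrows a name from the hierarchy later in this document.

```python
import logging

LOG = logging.getLogger("nova.api_example")  # hypothetical logger name


class InvalidInput(Exception):
    """Hypothetical validation error for a bad request parameter."""


def validate_port_range(from_port, to_port):
    """Reject port ranges that are reversed or outside 0-65535."""
    if not (0 <= from_port <= to_port <= 65535):
        raise InvalidInput("invalid port range %s-%s" % (from_port, to_port))


def handle_request(params):
    """Return an HTTP-like status; bad input is logged at WARN, not ERROR."""
    try:
        validate_port_range(params["from_port"], params["to_port"])
    except InvalidInput as exc:
        # Location (handle_request) and reason (exc) are both in the message.
        LOG.warning("handle_request rejected input: %s", exc)
        return 400
    return 200
```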

DEBUG
- Generate debug logs only if FLAGS.verbose is set.
- For diagnostic purposes only; useful to developers during bug root cause analysis (RCA).
- Should be very detailed, containing all inputs to the API, intermediate data processed by the method, and the output of the API.
- Use debug() to print intermediate values and statuses.
Ex: If an API call returns without error but the behaviour is not as expected, set verbosity to debug. Debug messages should help the developer see the chain of events and identify where the result is unexpected.
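
The DEBUG guidance above can be illustrated with a small sketch. The `verbose` argument stands in for FLAGS.verbose, and `configure_logging` and `resize_disk` are hypothetical helpers, not Nova functions. The point is only to show inputs, intermediate values, and output each getting a debug() call.

```python
import logging

LOG = logging.getLogger("nova.debug_example")  # hypothetical logger name


def configure_logging(verbose):
    """Emit DEBUG records only when verbose is set (stand-in for FLAGS.verbose)."""
    LOG.setLevel(logging.DEBUG if verbose else logging.INFO)


def resize_disk(size_gb, factor):
    """Hypothetical API call used only to show where debug() calls go."""
    LOG.debug("resize_disk called: size_gb=%r factor=%r", size_gb, factor)
    new_size = size_gb * factor
    LOG.debug("resize_disk intermediate: new_size=%r", new_size)
    result = {"size_gb": new_size}
    LOG.debug("resize_disk returning: %r", result)
    return result
```

Passing the values as arguments to debug(), rather than formatting the string first, means the formatting work is skipped when debug logging is disabled.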

INFO
- Use only to convey information.
- The log message string should not contain error or debugging messages.
- Ex: “Started nova-XX service”, “Shutting down instance”, “Instance rebooted successfully”,
  “live migration complete”, etc.

AUDIT
- Call this log method only for truly important events, for tracking purposes. It is generally useful for making billing and accountability easier.
- Ex: “New instance created”, “Instance stopped”, etc.
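
Python's standard `logging` module has no AUDIT level, so one has to be registered. The sketch below shows one way to do that with `logging.addLevelName`. Placing AUDIT just above INFO is an assumption made here for illustration, and the `audit` helper and logger name are hypothetical.

```python
import logging

# Assumption: AUDIT sits just above INFO; the stdlib defines no AUDIT level.
AUDIT = logging.INFO + 1
logging.addLevelName(AUDIT, "AUDIT")

LOG = logging.getLogger("nova.audit_example")  # hypothetical logger name


def audit(msg, *args):
    """Log a message at the custom AUDIT level."""
    LOG.log(AUDIT, msg, *args)
```

A call such as `audit("New instance created: %s", "i-0001")` then produces a record whose level name is AUDIT, which makes such events easy to filter for billing or accountability.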

Policy of Exception Class
=====================================
Do not use exception.Error. (This class conveys no information.)
AttributeError must be avoided.
  (It is hard to diagnose problems from an AttributeError.)
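
To illustrate both points, here is a sketch of a specific exception class replacing a generic error, with an explicit check that prevents an AttributeError on None. The class names NovaException and InstanceNotFound appear in the hierarchy later in this document, but the message-template mechanism, the `Instance` class, and `get_instance_state` are assumptions for this example.

```python
class NovaException(Exception):
    """Base class carrying a formatted, informative message."""
    message = "An unknown exception occurred."

    def __init__(self, **kwargs):
        super().__init__(self.message % kwargs)


class InstanceNotFound(NovaException):
    message = "Instance %(instance_id)s could not be found."


class Instance:
    """Hypothetical record type with a single state field."""

    def __init__(self, state):
        self.state = state


def get_instance_state(instances, instance_id):
    """Look up an instance's state without leaking AttributeError."""
    instance = instances.get(instance_id)
    if instance is None:
        # Without this check, instance.state would raise an AttributeError
        # on None that says nothing about which instance was missing.
        raise InstanceNotFound(instance_id=instance_id)
    return instance.state
```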

Policy of Exception Handling
=====================================
Wrap exceptions in a more informative exception class.
The intermediate state must be cleaned up.
Exceptions must be caught, especially in loops.

- Bad Example
  for resource in [A, B, C]:
      do_something(resource)

- Good Example
  for resource in [A, B, C]:
      try:
          do_something(resource)
      except Exception:
          cleanup(resource)  # or something
          raise  # if needed

In the bad example, if an exception is raised during do_something(B), nothing is done for C.
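
A runnable variant of the good example might look like the sketch below, under the assumption that the loop should carry on with the remaining resources rather than re-raise. `ResourceSetupError` and `process_all` are hypothetical names; the wrapper class records which resource failed, in line with the "wrap exceptions" rule.

```python
import logging

LOG = logging.getLogger("nova.loop_example")  # hypothetical logger name


class ResourceSetupError(Exception):
    """Hypothetical wrapper that records which resource failed."""

    def __init__(self, resource, cause):
        super().__init__("setup failed for %s: %s" % (resource, cause))
        self.resource = resource
        self.cause = cause


def process_all(resources, do_something, cleanup):
    """Process every resource; clean up and record failures instead of stopping."""
    failures = []
    for resource in resources:
        try:
            do_something(resource)
        except Exception as exc:
            cleanup(resource)  # undo intermediate state for this resource
            LOG.error("do_something failed for %s: %s", resource, exc)
            failures.append(ResourceSetupError(resource, exc))
    return failures
```

Returning the list of failures lets the caller decide whether to raise after all resources have been attempted.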

Blueprint information

Status: Started
Approver: Nachi Ueno
Priority: Medium
Drafter: Nachi Ueno
Direction: Needs approval
Assignee: Nachi Ueno
Definition: Approved
Series goal: None
Implementation: Good progress
Milestone target: 2012.1
Started by: Nachi Ueno on 2011-10-20

Whiteboard

Discussion about Design of Exception Class
===========================================

Analysis of Current Exception handling scheme in Nova (brief multilevel inheritance snapshot):

Python base class => Nova base class => Nova sub-class (level 1) => Nova sub-class (level 2) => Nova sub-class (level 3)

IOError
  => ProcessExecutionError
Exception
  => Error
       => APIError
       => BuildInProgress
       => DBError
       => VolumeIsBusy
  => NovaException
       => VirtualInterfaceCreateException
       => VirtualInterfaceMacAddressException
       => NotAuthorized
            => AdminRequired
       => Invalid
            => InvalidSignature
            => InvalidInput
            => InvalidInstanceType
            => InvalidPortRange
            => InvalidIpProtocol
            => InvalidContentType
            => ServiceUnavailable
                 => VolumeServiceUnavailable
                 => ComputeServiceUnavailable
            => etc.
       => NotFound
            => InstanceNotFound
            => VolumeNotFound
                 => VolumeNotFoundForInstance
            => SnapshotNotFound
            => DiskNotFound
            => ImageNotFound
                 => KernelNotFoundForImage
                 => RamdiskNotFoundForImage
            => NetworkNotFound
                 => NetworkNotFoundForBridge
       => Duplicate
            => KeyPairExists
            => UserExists
            => etc.

Issues and solutions:

1. This scheme does not let the user know which service raised the exception. There is no service-level categorization of exceptions.
   Example: If an rpc.cast or rpc.call from Compute raises an exception, the user cannot tell whether it came from Compute or from another service.

2. Persistent error messages? Do we need to store exception messages in a persistent store so that users can see them later in their dashboard or system logs? Something like event logs.
   E.g. a user wants to see which of their requests failed in the past and which parameters were invalid. Is this really useful for the user?

3. Feature-wise segregation of exceptions. Do we need to divide the exception classes based on Nova features (security group, instance, volume, etc.)? Currently the division is based on the type of error, e.g. NotFound.InstanceNotFound.
   - Create base classes based on features, and then subclass them by error category (Invalid, NotFound, Duplicate, etc.).
   - Or create base classes based on error categories, and then subclass them by feature.
   Which of the above options is feasible? Can they give the user more clarity than the current scheme?
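
To make the discussion concrete, the sketch below shows both layouts side by side, plus one possible way to record the originating service. None of this is decided by the blueprint: the `service` field, the class `InstanceError`, the name `InstanceNotFoundByFeature`, and the `describe` helper are all hypothetical.

```python
class NovaException(Exception):
    """Common base; `service` is a hypothetical field naming the raising service."""

    def __init__(self, message, service=None):
        super().__init__(message)
        self.service = service


# Option 1: feature-based bases, error category as the subclass.
class InstanceError(NovaException):
    pass


class InstanceNotFoundByFeature(InstanceError):
    pass


# Option 2: category-based bases, feature as the subclass (the current scheme).
class NotFound(NovaException):
    pass


class InstanceNotFound(NotFound):
    pass


def describe(exc):
    """Build a user-facing line that names the originating service."""
    return "[%s] %s: %s" % (exc.service or "unknown", type(exc).__name__, exc)
```

With option 1, a caller can write `except InstanceError` to handle every instance-related failure; with option 2, `except NotFound` handles every missing-resource failure regardless of feature. Which grouping callers need more often is the open question.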
