OpenStack Quality Assurance

Nova exception and log handling policy

Registered by Nachi Ueno on 2011-10-11

Exception and logging policy
====================================================

The following are the OpenStack exception and logging levels.

CRITICAL
- Data has been destroyed by the system.
- All Nova functions have stopped.
- Each component writes a message to the log.
- The message should specify which component caused the problem.
ex). Nova detects a data conflict.

FATAL
- Invalid configuration value, or abnormal behavior of an external component.
- Invalid configuration (environment).
- The daemon stops.
- The component writes a message to the log.
- The message should specify which configuration value and which environment caused the problem when the component started to run.
ex). The disk is full.

ERROR
- A limit value was exceeded, or the API was violated.
- The daemon should continue running.
- The component writes a message to the log.
- The message should specify which process caused the exception.
- The message should specify why the exception was raised.
ex). No resource is available when an instance starts to run.

WARN
- The exception was caused by input from outside.
- Validation error.
- The daemon continues operating.
- The component writes a message to the log.
- The message identifies where the exception occurred.
- The message identifies why the exception occurred.
ex). Call to an unauthorized API, mismatched input parameters.

DEBUG
- Generate debug logs only if FLAGS.verbose is set.
- For diagnostic purposes only; useful to developers during bug root cause analysis (RCA).
- Should be very detailed: all inputs to the API, the intermediate data processed by each method, and the output of the API.
- Use debug() to print intermediate values and statuses.
Ex: If an API call returns without error but does not behave as expected, set verbosity to debug. The debug messages should help the developer follow the chain of events and find where the result becomes unexpected.

INFO
- Use only to convey information.
- The log message string should not contain error or debugging messages.
- Ex: “Started nova-XX service”, “Shutting down instance”, “Instance rebooted successfully”, “live migration complete”, etc.

AUDIT
- Call this log method only for truly important events, for tracking purposes. It is generally useful for making billing and accountability easier.
- Ex: “New instance created”, “Instance stopped”, etc.
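
As a rough illustration of how the levels above map onto calls, the following sketch uses Python's standard logging module. The logger name, the start_instance function, the messages, and the numeric value chosen for AUDIT (just above INFO) are assumptions made for this example, not Nova's actual definitions.

```python
import logging

# Assumption: AUDIT is registered as a custom level just above INFO.
AUDIT = logging.INFO + 1
logging.addLevelName(AUDIT, "AUDIT")

LOG = logging.getLogger("nova.example")  # hypothetical logger name


def configure(verbose=False):
    """Emit DEBUG records only when verbose is set (cf. FLAGS.verbose)."""
    LOG.setLevel(logging.DEBUG if verbose else logging.INFO)


def start_instance(instance_id, free_slots):
    """Hypothetical operation that logs at several of the levels above."""
    LOG.debug("start_instance called: instance_id=%r free_slots=%r",
              instance_id, free_slots)
    if not isinstance(instance_id, str) or not instance_id:
        # Validation error caused by outside input: WARN, keep running.
        LOG.warning("start_instance: invalid instance_id %r", instance_id)
        return False
    if free_slots <= 0:
        # Limit reached: ERROR, and say where and why.
        LOG.error("start_instance: no resource for instance %s", instance_id)
        return False
    LOG.info("Instance %s started", instance_id)
    LOG.log(AUDIT, "New instance created: %s", instance_id)
    return True
```

With configure(verbose=False), the debug() call produces no output; the other calls are emitted regardless of verbosity.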

Policy of Exception Class
=====================================
Do not use exception.Error (this class conveys no information).
Avoid raising AttributeError (it is hard to diagnose a problem from an AttributeError).
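
A minimal sketch of this policy follows. The class and function here are written for illustration only (the constructor signature and the get_host helper are assumptions, not Nova code): an explicit check raises an exception that names what was missing, instead of letting an AttributeError surface from a None value.

```python
class InstanceNotFound(Exception):
    """Hypothetical informative exception: records what was missing."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        super().__init__("Instance %s could not be found." % instance_id)


def get_host(instances, instance_id):
    """Return the host of an instance, or raise InstanceNotFound."""
    instance = instances.get(instance_id)
    if instance is None:
        # Without this check, instance.host below would raise an
        # AttributeError on None, which hides the real cause.
        raise InstanceNotFound(instance_id)
    return instance.host
```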

Policy of Exception Handling
=====================================
Wrap exceptions in a more informative exception class.
Clean up any intermediate state.
Catch exceptions, especially inside loops.

- Bad Example
for resource in [A, B, C]:
    do_something(resource)

- Good Example
for resource in [A, B, C]:
    try:
        do_something(resource)
    except Exception:
        cleanup(resource)  # or something
        raise  # if needed

In the bad example, if an exception is raised during do_something(B), nothing is done for C. In the good example, B is cleaned up, and the loop moves on to C unless the exception is re-raised.
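
The three rules can be combined in one loop. The sketch below is one possible variant for the case where re-raising is not needed: each failure is cleaned up, wrapped in a more informative exception, and collected, so the remaining resources are still processed. ResourceSetupError and process_all are hypothetical names introduced for this example.

```python
class ResourceSetupError(Exception):
    """Hypothetical wrapper that records which resource failed and why."""

    def __init__(self, resource, cause):
        self.resource = resource
        self.cause = cause
        super().__init__("Setup failed for %s: %s" % (resource, cause))


def process_all(resources, do_something, cleanup):
    """Process every resource; collect failures instead of stopping early."""
    failures = []
    for resource in resources:
        try:
            do_something(resource)
        except Exception as exc:
            cleanup(resource)  # undo any intermediate state
            failures.append(ResourceSetupError(resource, exc))
    return failures
```

The caller can then decide whether to log the collected failures or raise one of them after the loop has finished.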

Blueprint information

Status: Started

Approver: Nachi Ueno

Priority: Medium

Drafter: Nachi Ueno

Direction: Needs approval

Assignee: Nachi Ueno

Definition: Approved

Series goal: None

Implementation: Good progress

Milestone target: 2012.1

Started by: Nachi Ueno on 2011-10-20

Related bugs

Bug #872475: nova.compute.manager throws invalid Error	Invalid
Bug #872489: utils.execute throws exception.ProcessExecutionError but it is not handled in many case	Fix Released
Bug #872490: The exception handling in utils.py should be improved	Fix Released
Bug #872504: stable/diablo: nova.compute.util.terminate_volumes should handle ApiError	Fix Released
Bug #872513: nova.network.manager should improve exception handling	Opinion
Bug #872522: NetworkNotFound should be raised when a network is not found in network manager	Won't Fix
Bug #872536: nova.compute.api should raise exceptions when the api process terminated	Won't Fix
Bug #872541: nova.compute.api don't raise exception when an instance is locked	Fix Released
Bug #872548: A garbage record in DB created if the AMQP stopped in nova.volume.api	Invalid
Bug #872615: nova.compute.api.create don't validate max_count is a valid number	Invalid
Bug #872618: get_instance_type don't handle exception.InstanceTypeNotFound	Fix Released
Bug #873686: nova.network.lease_fixed_ip doesn't raise correct exceptions	Invalid
Bug #883300: IOError should be FileNotFound on nova.crypt	Fix Released
Bug #883309: nova.virt.images.py should handle IOError	In Progress
Bug #883320: nova.crypto. revoke_certs_by_user should handle ProcessExecutionError	Fix Released
Bug #883322: nova.network.manager should handle exceptions and rollback.	Invalid
Bug #883326: nova.crypto. generate_key_pair should handle OSError	Fix Released

Whiteboard

Discussion about Design of Exception Class
===========================================

Analysis of the current exception handling scheme in Nova (brief snapshot of the multilevel inheritance). Each level of indentation is one level of subclassing, in this order: Python base class, Nova base class, Nova sub-class (level 1), Nova sub-class (level 2), Nova sub-class (level 3).

IOError
  => ProcessExecutionError
Exception
  => Error
       => APIError
       => BuildInProgress
       => DBError
       => VolumeIsBusy
  => NovaException
       => VirtualInterfaceCreateException
       => VirtualInterfaceMacAddressException
       => NotAuthorized
            => AdminRequired
       => Invalid
            => InvalidSignature
            => InvalidInput
            => InvalidInstanceType
            => InvalidPortRange
            => InvalidIpProtocol
            => InvalidContentType
            => ServiceUnavailable
                 => VolumeServiceUnavailable
                 => ComputeServiceUnavailable
            => etc.
       => NotFound
            => InstanceNotFound
            => VolumeNotFound
                 => VolumeNotFoundForInstance
            => SnapshotNotFound
            => DiskNotFound
            => ImageNotFound
                 => KernelNotFoundForImage
                 => RamdiskNotFoundForImage
            => NetworkNotFound
                 => NetworkNotFoundForBridge
       => Duplicate
            => KeyPairExists
            => UserExists
            => etc.

Issues and solutions:
- No service-level categorization of exceptions. This scheme does not let the user know which service raised the exception.
  Example: if an rpc.cast or rpc.call from Compute raises an exception, the user cannot tell whether it comes from Compute or from another service.
- Persistent error messages? Do we need to store exception messages in a persistent store so that users can see them later in their dashboard or system logs, something like event logs?
  Example: a user wants to see which of their requests failed in the past and which parameters were invalid. Is this really useful for the user?
- Feature-wise segregation of exceptions. Do we need to divide the exception classes by Nova feature (security group, instance, volume, etc.)? Currently the division is based on the type of error, e.g. NotFound.InstanceNotFound. There are two options:
  - Create base classes based on features, and then subclass them by error category (Invalid, NotFound, Duplicate, etc.).
  - Or create base classes based on error categories, and then subclass them by feature.
  Which of these options is feasible? Can either of them give the user more clarity than the current scheme?
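
To make the two options concrete, here is a small sketch of both layouts side by side. All class names are stand-ins chosen for this illustration (the Option 2 names are deliberately altered to avoid clashing with Option 1), and the handle helper exists only to show what each layout lets a caller catch in one except clause.

```python
class NovaException(Exception):
    """Stand-in base class for this sketch."""


# Option 1: base classes per feature, subclassed by error category.
class InstanceError(NovaException):
    pass


class VolumeError(NovaException):
    pass


class InstanceNotFound(InstanceError):
    pass


class VolumeNotFound(VolumeError):
    pass


# Option 2: base classes per error category, subclassed by feature.
class NotFound(NovaException):
    pass


class NotFoundInstance(NotFound):  # hypothetical name, avoids a clash
    pass


class NotFoundVolume(NotFound):  # hypothetical name, avoids a clash
    pass


def handle(exc_class, operation):
    """Run operation; report whether an except clause on exc_class caught it."""
    try:
        operation()
    except exc_class:
        return True
    except NovaException:
        return False
    return False
```

Under Option 1, a single except InstanceError clause covers every instance-related failure; under Option 2, a single except NotFound clause covers missing resources of every feature. Neither layout alone gives both views.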

Subscribers

Ramesh Sahu

Rochelle Grober