Supervision of OQ jobs

Registered by Muharem Hrnjadovic

We need to supervise OQ jobs so we can cope with crashed or hung processes and with software errors.

Blueprint information

Status:
Complete
Approver:
John Tarter
Priority:
High
Drafter:
Muharem Hrnjadovic
Direction:
Approved
Assignee:
Muharem Hrnjadovic
Definition:
Approved
Series goal:
None
Implementation:
Implemented
Milestone target:
None
Started by
Muharem Hrnjadovic
Completed by
Muharem Hrnjadovic

Whiteboard

= Job supervision =

We need to supervise OQ jobs so we can cope with crashed or hung
processes as well as with software failures.

== Introduction ==

An OQ job can execute on 1+ machines. Typically, there are
    - 1 control node, i.e. a machine that executes the openquake process
      (plus 0+ celeryd processes) and drives the calculation by
      partitioning it into a number of tasks, taking delivery of the
      results and serialising them.
    - 0+ worker nodes, i.e. machines that execute 1+ celeryd processes
      running the tasks defined by the "openquake" process on the
      control node.

Please note that the services used by all OQ jobs (e.g. RabbitMQ (the
message broker), redis (the NoSQL data store) and postgres (the
persistent data store)) will most likely be deployed on the control node
as well.
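
As an illustration of this division of labour, the sketch below shows a
control-node driver fanning work out to celeryd workers via celery. This
is a hypothetical sketch only; the task name, payloads and broker/backend
URLs are invented and do not reflect the actual openquake code base.

    # Hypothetical sketch only -- task name, payloads and URLs are invented.
    from celery import Celery

    app = Celery("oq", broker="amqp://localhost//", backend="redis://localhost/0")

    @app.task
    def compute_block(job_id, block_id, sites):
        # Runs inside a celeryd process on a worker node; would perform the
        # actual computation for one block of sites.
        return {"job_id": job_id, "block_id": block_id, "size": len(sites)}

    def run_job(job_id, blocks):
        # Runs inside the "openquake" process on the control node: partition
        # the calculation into tasks, dispatch them, collect and serialise
        # the results.
        async_results = [compute_block.delay(job_id, i, block)
                         for i, block in enumerate(blocks)]
        return [result.get() for result in async_results]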

== Failure scenarios ==

The following failure scenarios are possible:

  1 - control node failure (crash or hung machine)
  2 - worker node failure (crash or hung machine)
  3 - hung openquake process
  4 - hung celeryd process
  5 - crashed openquake process
  6 - crashed celeryd process
  7 - a failure in the openquake process
  8 - a failure in 1+ celeryd processes (on 1+ worker nodes)

A node/process is hung if it does not make *any* progress over a
protracted (and configurable) period of time (e.g. 3+ minutes).

But how is progress measured/perceived in the absence of a progress
feedback mechanism? By the absence of logging activity for a given
period of time?

Given the question above, dealing with scenarios 3 and 4 will be
postponed until we have a progress feedback mechanism in place.

Hung nodes will become more of an issue once we start pushing work into
the cloud (it is not uncommon for EC2 virtual machines to hang, for
example).

For the time being we will focus on scenarios 5-8, which constitute the
scope of this blueprint.

== Crashed openquake process ==

An abnormal termination of the openquake process on the control node
entails a failure of the entire OQ job.

We need to be able to
    - detect openquake process crashes
    - update the postgres database information pertaining to the OQ job
      whose process crashed:
        - change status to failed
        - add brief/detailed error information to be displayed to the
          end user
    - clean up after the crashed process
        - revoke celery tasks
        - purge the redis store

A separate supervisor process is needed in order to detect openquake
process crashes and/or failures. That supervisor is to be started by the
openquake process early on in its execution.
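
A minimal sketch of what such a supervisor could look like follows. The
table/column names, the redis key layout and the way the task ids are
handed over are assumptions made for illustration, not the final
implementation.

    # Hypothetical supervisor sketch (schema, key layout and helper names
    # are assumptions).
    import os
    import time

    import psycopg2
    import redis
    from celery import Celery

    app = Celery("oq", broker="amqp://localhost//")

    def mark_job_failed(job_id, reason):
        # Set the job status to "failed" and record brief error information
        # for the end user (assumed table/columns).
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("UPDATE oq_job SET status = 'failed', "
                                "error_msg = %s WHERE id = %s", (reason, job_id))
        finally:
            conn.close()

    def cleanup(job_id, task_ids):
        # Revoke the celery tasks still pending for this job and purge its
        # redis keys (assumes a job-scoped key prefix).
        for task_id in task_ids:
            app.control.revoke(task_id)
        kvs = redis.Redis()
        for key in kvs.keys("%s:*" % job_id):
            kvs.delete(key)

    def supervise(job_id, openquake_pid, task_ids):
        # Started by the openquake process early on; polls for its demise.
        # A real supervisor would also check for normal completion first
        # (see the solution outline below).
        while True:
            try:
                os.kill(openquake_pid, 0)   # probe only, no signal delivered
            except OSError:
                mark_job_failed(job_id, "openquake process terminated abnormally")
                cleanup(job_id, task_ids)
                return
            time.sleep(5)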

== Crashed celeryd workers ==

The following questions need to be investigated:
    - does celery restart crashed workers?
    - if not, should we have a supervisor that restarts celeryd workers?
    - what happens with reserved tasks when a worker crashes (reserved
      tasks are tasks that have been received, but are still waiting to
      be executed)? See e.g. [1]
    - what is a "WorkerLostError: Worker exited prematurely."? How
      should these be treated?

== Failure in the openquake process ==

A failure in the openquake process is to be treated like an openquake
process crash and requires a full clean-up. The only difference is that
the supervisor process needs to terminate the openquake process.

== Failure in one or more celeryd processes ==

A failure in one or more celeryd processes is to be treated like a
failure in the openquake process, i.e. the latter is to be terminated
and a full clean-up is to be performed.

== Questions ==

How do we deal with *supervisor* process crashes?

== Notes ==

In the course of this work package we will also consolidate the logs
pertaining to OQ jobs and store them in a central place for a
configurable period of time.
This is to facilitate post-mortem analysis and the detection of obscure
bugs, as well as the operation of the (gemsun) compute network in
general.

It would be desirable to store all log records pertaining to a
particular job in a contiguous block (and not interleaved with the logs
of other unrelated jobs).
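
One way to achieve this (a sketch only, assuming redis is acceptable as
a staging area; the key names are invented) would be to buffer a job's
log records in a per-job redis list and write them out as one block once
the job is over:

    # Hypothetical sketch: stage a job's log records in a per-job redis list
    # so they can later be written out contiguously.
    import logging
    import redis

    class JobLogHandler(logging.Handler):
        def __init__(self, job_id):
            logging.Handler.__init__(self)
            self.job_id = job_id
            self.kvs = redis.Redis()

        def emit(self, record):
            self.kvs.rpush("logs:%s" % self.job_id, self.format(record))

    def dump_job_logs(job_id, path):
        # Write all records for the given job as one contiguous block.
        kvs = redis.Redis(decode_responses=True)
        with open(path, "w") as fh:
            fh.write("\n".join(kvs.lrange("logs:%s" % job_id, 0, -1)))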

== Solution outline ==

How are failures detected/signalled? Every Python logging statement is
translated (by the RabbitMQ logging backend) into an AMQP message which
is published on the "signalling" topic exchange with the following
routing key:

    log.<severity>.<jobid>
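
A sketch of what the publishing side could look like, using kombu; the
broker URL is an assumption and this handler is an illustration, not the
actual RabbitMQ logging backend:

    # Hypothetical sketch of a logging handler that turns log records into
    # AMQP messages on the "signalling" topic exchange.
    import logging
    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    class AMQPLogHandler(logging.Handler):
        def __init__(self, job_id, broker="amqp://localhost//"):
            logging.Handler.__init__(self)
            self.job_id = job_id
            self.connection = Connection(broker)
            self.producer = self.connection.Producer()

        def emit(self, record):
            # e.g. "log.error.<jobid>" for a logging.error() call
            routing_key = "log.%s.%s" % (record.levelname.lower(), self.job_id)
            self.producer.publish(self.format(record), exchange=SIGNALLING,
                                  routing_key=routing_key, declare=[SIGNALLING])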

Each OQ supervisor process will subscribe to message topics pertaining
to the supervised job i.e.:

    log.[critical|fatal|error].<id-of-supervised-job>
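
A possible shape for that subscription, again using kombu (the queue
name and broker URL are assumptions; one binding is declared per
severity of interest):

    # Hypothetical sketch of the supervisor's subscription.
    from kombu import Connection, Consumer, Exchange, Queue

    SIGNALLING = Exchange("signalling", type="topic")

    def wait_for_first_failure(job_id, broker="amqp://localhost//"):
        queue_name = "supervisor-%s" % job_id
        queues = [Queue(queue_name, SIGNALLING,
                        routing_key="log.%s.%s" % (severity, job_id))
                  for severity in ("critical", "fatal", "error")]
        failures = []

        def on_message(body, message):
            failures.append(body)
            message.ack()

        with Connection(broker) as conn:
            with Consumer(conn, queues, callbacks=[on_message]):
                while not failures:
                    conn.drain_events()
        return failures[0]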

If and when the first of these failure messages is consumed
    - the failed OQ job will be terminated
    - the status of that OQ job will be set to "failed" in the postgres
      database
    - a message with the following routing key will be published:
        job.failed.<jobid>
    - a clean-up will be performed (revocation of pending celery tasks,
      initiation of redis garbage collection)
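
Put together, the supervisor's reaction to the first of these failure
messages could look roughly as follows (reusing the mark_job_failed and
cleanup helpers sketched earlier; the use of the "signalling" exchange
for the job.failed message and the signal handling details are
assumptions):

    # Hypothetical sketch of the failure handling steps listed above.
    import os
    import signal

    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    def handle_failure(job_id, openquake_pid, task_ids, broker="amqp://localhost//"):
        # 1. terminate the failed OQ job
        try:
            os.kill(openquake_pid, signal.SIGTERM)
        except OSError:
            pass                      # already gone (crash rather than failure)
        # 2. set the job status to "failed" in postgres
        mark_job_failed(job_id, "job failed, see the error logs")
        # 3. announce the failure
        with Connection(broker) as conn:
            conn.Producer().publish({"job_id": job_id}, exchange=SIGNALLING,
                                    routing_key="job.failed.%s" % job_id,
                                    declare=[SIGNALLING])
        # 4. clean up: revoke pending celery tasks, kick off redis GC
        cleanup(job_id, task_ids)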

Also, the supervisor will make sure that *all* related errors are stored
in a brief/detailed format in the postgres database and associated with
the failed job (so that we can provide proper feedback to the end user).
Is this a good idea? Should we limit ourselves to the first failure when
it comes to end user feedback?
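
If all related errors are to be kept, each consumed failure message
could be recorded along the following lines (the job_error table and its
columns are purely hypothetical):

    # Hypothetical sketch; the schema is invented for illustration.
    import psycopg2

    def record_error(job_id, brief, detailed):
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("INSERT INTO job_error (oq_job_id, brief, detailed) "
                                "VALUES (%s, %s, %s)", (job_id, brief, detailed))
        finally:
            conn.close()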

Furthermore, supervisors need to distinguish between an OQ job crash and
a normal completion (e.g. by way of a periodic process listing and
filtering on the supervised job pid):
    - normal successful termination: the OQ job process has terminated
      and its job status in the postgres database has been set to
      "succeeded"
    - crash/abnormal termination: the OQ job process has terminated but
      its job status in the postgres database is still "running"

In either case an appropriate message will be published
    - success: job.succeeded.<jobid>
    - crash: job.crashed.<jobid>

and the supervisor itself will terminate.
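
A sketch of that termination watch (the SQL, status values and routing
details follow this outline; connection parameters are assumptions):

    # Hypothetical sketch: tell a crash from a normal completion and publish
    # the corresponding message before the supervisor exits.
    import os
    import time

    import psycopg2
    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    def job_status(job_id):
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT status FROM oq_job WHERE id = %s", (job_id,))
                    return cur.fetchone()[0]
        finally:
            conn.close()

    def watch_until_done(job_id, openquake_pid, broker="amqp://localhost//"):
        # Periodic check of whether the supervised openquake process is gone.
        while True:
            try:
                os.kill(openquake_pid, 0)      # probe only
            except OSError:
                break                          # the process has terminated
            time.sleep(5)
        # The job status in postgres tells the two cases apart.
        outcome = "succeeded" if job_status(job_id) == "succeeded" else "crashed"
        with Connection(broker) as conn:
            conn.Producer().publish({"job_id": job_id}, exchange=SIGNALLING,
                                    routing_key="job.%s.%s" % (outcome, job_id),
                                    declare=[SIGNALLING])
        # ...after which the supervisor itself terminates.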

== References ==

[1] http://permalink.gmane.org/gmane.comp.python.amqp.celery.user/342
