Supervision of OQ jobs

Registered by Muharem Hrnjadovic

We need to supervise OQ jobs so we can cope with crashed or hung processes and with software errors.

Blueprint information

Status:
Complete
Approver:
John Tarter
Priority:
High
Drafter:
Muharem Hrnjadovic
Direction:
Approved
Assignee:
Muharem Hrnjadovic
Definition:
Approved
Series goal:
None
Implementation:
Implemented
Milestone target:
None
Started by
Muharem Hrnjadovic
Completed by
Muharem Hrnjadovic

Whiteboard

= Job supervision =

We need to supervise OQ jobs so we can cope with crashed or hung
processes as well as with software failures.

== Introduction ==

An OQ job can execute on 1+ machines. Typically, there are
    - 1 control node, i.e. a machine that executes the openquake process
      (plus 0+ celeryd processes) and drives the calculation by
      partitioning it into a number of tasks, taking delivery of the
      results and serialising them.
    - 0+ worker nodes, i.e. machines that execute 1+ celeryd processes
      running the tasks defined by the "openquake" process on the
      control node.

Please note that the services used by all OQ jobs (e.g. RabbitMQ (the
message broker), redis (the NoSQL data store) and postgres (the
persistent data store)) will most likely be deployed on the control node
as well.
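
As an illustration of this division of labour, the sketch below shows a
control-node driver fanning work out to celeryd workers via celery. This
is a hypothetical sketch only; the task name, payloads and broker/backend
URLs are invented and do not reflect the actual openquake code base.

    # Hypothetical sketch only -- task name, payloads and URLs are invented.
    from celery import Celery

    app = Celery("oq", broker="amqp://localhost//", backend="redis://localhost/0")

    @app.task
    def compute_block(job_id, block_id, sites):
        # Runs inside a celeryd process on a worker node; would perform the
        # actual computation for one block of sites.
        return {"job_id": job_id, "block_id": block_id, "size": len(sites)}

    def run_job(job_id, blocks):
        # Runs inside the "openquake" process on the control node: partition
        # the calculation into tasks, dispatch them, collect and serialise
        # the results.
        async_results = [compute_block.delay(job_id, i, block)
                         for i, block in enumerate(blocks)]
        return [result.get() for result in async_results]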

== Failure scenarios ==

The following failure scenarios are possible:

  1 - control node failure (crash or hung machine)
  2 - worker node failure (crash or hung machine)
  3 - hung openquake process
  4 - hung celeryd process
  5 - crashed openquake process
  6 - crashed celeryd process
  7 - a failure in the openquake process
  8 - a failure in 1+ celeryd processes (on 1+ worker nodes)

A node/process is hung if it does not make *any* progress over a
protracted (and configurable) period of time (e.g. 3+ minutes).

But how is progress measured/perceived in the absence of a progress
feedback mechanism? By the absence of logging activity for a given
period of time?

Given the question above, dealing with scenarios 3 and 4 will be
postponed until we have a progress feedback mechanism in place.

Hung nodes will become more of an issue once we start pushing work into
the cloud (it is not uncommon for EC2 virtual machines to hang, for
example).

For the time being we will focus on scenarios 5-8, which constitute the
scope of this blueprint.

== Crashed openquake process ==

An abnormal termination of the openquake process on the control node
entails a failure of the entire OQ job.

We need to be able to
    - detect openquake process crashes
    - update the postgres database information pertaining to the OQ job
      whose process crashed:
        - change status to failed
        - add brief/detailed error information to be displayed to the
          end user
    - clean up after the crashed process
        - revoke celery tasks
        - purge the redis store

A separate supervisor process is needed in order to detect openquake
process crashes and/or failures. That supervisor is to be started by the
openquake process early on in its execution.
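
A minimal sketch of what such a supervisor could look like follows. The
table/column names, the redis key layout and the way the task ids are
handed over are assumptions made for illustration, not the final
implementation.

    # Hypothetical supervisor sketch (schema, key layout and helper names
    # are assumptions).
    import os
    import time

    import psycopg2
    import redis
    from celery import Celery

    app = Celery("oq", broker="amqp://localhost//")

    def mark_job_failed(job_id, reason):
        # Set the job status to "failed" and record brief error information
        # for the end user (assumed table/columns).
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("UPDATE oq_job SET status = 'failed', "
                                "error_msg = %s WHERE id = %s", (reason, job_id))
        finally:
            conn.close()

    def cleanup(job_id, task_ids):
        # Revoke the celery tasks still pending for this job and purge its
        # redis keys (assumes a job-scoped key prefix).
        for task_id in task_ids:
            app.control.revoke(task_id)
        kvs = redis.Redis()
        for key in kvs.keys("%s:*" % job_id):
            kvs.delete(key)

    def supervise(job_id, openquake_pid, task_ids):
        # Started by the openquake process early on; polls for its demise.
        # A real supervisor would also check for normal completion first
        # (see the solution outline below).
        while True:
            try:
                os.kill(openquake_pid, 0)   # probe only, no signal delivered
            except OSError:
                mark_job_failed(job_id, "openquake process terminated abnormally")
                cleanup(job_id, task_ids)
                return
            time.sleep(5)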

== Crashed celeryd workers ==

The following questions need to be investigated:
    - does celery restart crashed workers?
    - if not, should we have a supervisor that restarts celeryd workers?
    - what happens with reserved tasks when a worker crashes (reserved
      tasks are tasks that have been received, but are still waiting to
      be executed)? See e.g. [1]
    - what is a "WorkerLostError: Worker exited prematurely."? How
      should these be treated?

== Failure in the openquake process ==

A failure in the openquake process is to be treated like an openquake
process crash and requires a full clean-up. The only difference is that
the supervisor process needs to terminate the openquake process.

== Failure in one or more celeryd processes ==

A failure in one or more celeryd processes is to be treated like a
failure in the openquake process, i.e. the latter is to be terminated
and a full clean-up is to be performed.

== Questions ==

How do we deal with *supervisor* process crashes?

== Notes ==

In the course of this work package we will also consolidate the logs
pertaining to OQ jobs and store them in a central place for a
configurable period of time.
This is to facilitate post-mortem analysis and the detection of obscure
bugs, as well as the operation of the (gemsun) compute network in
general.

It would be desirable to store all log records pertaining to a
particular job in a contiguous block (and not interleaved with the logs
of other unrelated jobs).
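
One way to achieve this (a sketch only, assuming redis is acceptable as
a staging area; the key names are invented) would be to buffer a job's
log records in a per-job redis list and write them out as one block once
the job is over:

    # Hypothetical sketch: stage a job's log records in a per-job redis list
    # so they can later be written out contiguously.
    import logging
    import redis

    class JobLogHandler(logging.Handler):
        def __init__(self, job_id):
            logging.Handler.__init__(self)
            self.job_id = job_id
            self.kvs = redis.Redis()

        def emit(self, record):
            self.kvs.rpush("logs:%s" % self.job_id, self.format(record))

    def dump_job_logs(job_id, path):
        # Write all records for the given job as one contiguous block.
        kvs = redis.Redis(decode_responses=True)
        with open(path, "w") as fh:
            fh.write("\n".join(kvs.lrange("logs:%s" % job_id, 0, -1)))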

== Solution outline ==

How are failures detected/signalled? Every Python logging statement is
translated (by the RabbitMQ logging backend) into an AMQP message which
is published on the "signalling" topic exchange with the following
routing key:

    log.<severity>.<jobid>
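
A sketch of what the publishing side could look like, using kombu; the
broker URL is an assumption and this handler is an illustration, not the
actual RabbitMQ logging backend:

    # Hypothetical sketch of a logging handler that turns log records into
    # AMQP messages on the "signalling" topic exchange.
    import logging
    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    class AMQPLogHandler(logging.Handler):
        def __init__(self, job_id, broker="amqp://localhost//"):
            logging.Handler.__init__(self)
            self.job_id = job_id
            self.connection = Connection(broker)
            self.producer = self.connection.Producer()

        def emit(self, record):
            # e.g. "log.error.<jobid>" for a logging.error() call
            routing_key = "log.%s.%s" % (record.levelname.lower(), self.job_id)
            self.producer.publish(self.format(record), exchange=SIGNALLING,
                                  routing_key=routing_key, declare=[SIGNALLING])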

Each OQ supervisor process will subscribe to message topics pertaining
to the supervised job i.e.:

    log.[critical|fatal|error].<id-of-supervised-job>
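
A possible shape for that subscription, again using kombu (the queue
name and broker URL are assumptions; one binding is declared per
severity of interest):

    # Hypothetical sketch of the supervisor's subscription.
    from kombu import Connection, Consumer, Exchange, Queue

    SIGNALLING = Exchange("signalling", type="topic")

    def wait_for_first_failure(job_id, broker="amqp://localhost//"):
        queue_name = "supervisor-%s" % job_id
        queues = [Queue(queue_name, SIGNALLING,
                        routing_key="log.%s.%s" % (severity, job_id))
                  for severity in ("critical", "fatal", "error")]
        failures = []

        def on_message(body, message):
            failures.append(body)
            message.ack()

        with Connection(broker) as conn:
            with Consumer(conn, queues, callbacks=[on_message]):
                while not failures:
                    conn.drain_events()
        return failures[0]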

If and when the first of these failure messages is consumed
    - the failed OQ job will be terminated
    - the status of that OQ job will be set to "failed" in the postgres
      database
    - a message with the following routing key will be published:
        job.failed.<jobid>
    - a clean-up will be performed (revocation of pending celery tasks,
      initiation of redis garbage collection)
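
Put together, the supervisor's reaction to the first of these failure
messages could look roughly as follows (reusing the mark_job_failed and
cleanup helpers sketched earlier; the use of the "signalling" exchange
for the job.failed message and the signal handling details are
assumptions):

    # Hypothetical sketch of the failure handling steps listed above.
    import os
    import signal

    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    def handle_failure(job_id, openquake_pid, task_ids, broker="amqp://localhost//"):
        # 1. terminate the failed OQ job
        try:
            os.kill(openquake_pid, signal.SIGTERM)
        except OSError:
            pass                      # already gone (crash rather than failure)
        # 2. set the job status to "failed" in postgres
        mark_job_failed(job_id, "job failed, see the error logs")
        # 3. announce the failure
        with Connection(broker) as conn:
            conn.Producer().publish({"job_id": job_id}, exchange=SIGNALLING,
                                    routing_key="job.failed.%s" % job_id,
                                    declare=[SIGNALLING])
        # 4. clean up: revoke pending celery tasks, kick off redis GC
        cleanup(job_id, task_ids)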

Also, the supervisor will make sure that *all* related errors are stored
in a brief/detailed format in the postgres database and associated with
the failed job (so that we can provide proper feedback to the end user).
Is this a good idea? Should we limit ourselves to the first failure when
it comes to end user feedback?
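
If all related errors are to be kept, each consumed failure message
could be recorded along the following lines (the job_error table and its
columns are purely hypothetical):

    # Hypothetical sketch; the schema is invented for illustration.
    import psycopg2

    def record_error(job_id, brief, detailed):
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("INSERT INTO job_error (oq_job_id, brief, detailed) "
                                "VALUES (%s, %s, %s)", (job_id, brief, detailed))
        finally:
            conn.close()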

Furthermore, supervisors need to distinguish between an OQ job crash and
a normal completion (e.g. by way of a periodic process listing and
filtering on the supervised job pid):
    - normal successful termination: the OQ job process has terminated
      and its job status in the postgres database has been set to
      "succeeded"
    - crash/abnormal termination: the OQ job process has terminated but
      its job status in the postgres database is still "running"

In either case an appropriate message will be published
    - success: job.succeeded.<jobid>
    - crash: job.crashed.<jobid>

and the supervisor itself will terminate.
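
A sketch of that termination watch (the SQL, status values and routing
details follow this outline; connection parameters are assumptions):

    # Hypothetical sketch: tell a crash from a normal completion and publish
    # the corresponding message before the supervisor exits.
    import os
    import time

    import psycopg2
    from kombu import Connection, Exchange

    SIGNALLING = Exchange("signalling", type="topic")

    def job_status(job_id):
        conn = psycopg2.connect("dbname=openquake")
        try:
            with conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT status FROM oq_job WHERE id = %s", (job_id,))
                    return cur.fetchone()[0]
        finally:
            conn.close()

    def watch_until_done(job_id, openquake_pid, broker="amqp://localhost//"):
        # Periodic check of whether the supervised openquake process is gone.
        while True:
            try:
                os.kill(openquake_pid, 0)      # probe only
            except OSError:
                break                          # the process has terminated
            time.sleep(5)
        # The job status in postgres tells the two cases apart.
        outcome = "succeeded" if job_status(job_id) == "succeeded" else "crashed"
        with Connection(broker) as conn:
            conn.Producer().publish({"job_id": job_id}, exchange=SIGNALLING,
                                    routing_key="job.%s.%s" % (outcome, job_id),
                                    declare=[SIGNALLING])
        # ...after which the supervisor itself terminates.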

== References ==

[1] http://permalink.gmane.org/gmane.comp.python.amqp.celery.user/342
