Comment 1 for bug 703800

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: restart command fails to restart main process when pre-stop stanza exists

On further investigation, this does *not* affect jobs that block the 'stopping' event. So I believe this may be related more to the way job_process_run treates pre-stop. According to comments in init/job_process.c:

        /* We don't change the state if we're in post-start and there's
         * a post-start process running, or if we're in pre-stop and
         * there's a pre-stop process running; we wait for those to
         * finish instead.
         */
        if ((job->state == JOB_POST_START)
            && job->class->process[PROCESS_POST_START]
            && (job->pid[PROCESS_POST_START] > 0)) {
            state = FALSE;
        } else if ((job->state == JOB_PRE_STOP)
            && job->class->process[PROCESS_PRE_STOP]
            && (job->pid[PROCESS_PRE_STOP] > 0)) {
            state = FALSE;
        }

Later on, if (!state) calls job_change_goal(job, JOB_RESPAWN) ..

Digging into job_next_state a bit, I found that a goal of JOB_RESPAWN causes job_next_state to change the goal only if state is JOB_POST_START or JOB_PRE_STOP. JOB_POST_START changes the goal to JOB_START, which makes sense. But JOB_PRE_STOP changes the goal to JOB_START as well! This doesn't make much sense to me, and I thought the obvious fix would be to change it to be JOB_STOP, however this caused test_job to fail here:

...with post-start job and a goal of respawn
BAD: wrong value for job->goal, expected 1 got 0
 at tests/test_job.c:4129 (test_next_state).
/bin/bash: line 5: 29784 Aborted ${dir}$tst
FAIL: test_job

This was confusing until I realized that the test code is wrong:

    /* Check that the next state if we're respawning after a pre-stop
     * job is stopping with the goal changed.
     */
    TEST_FEATURE ("with post-start job and a goal of respawn");
    job->goal = JOB_RESPAWN;
    job->state = JOB_PRE_STOP;

    TEST_EQ (job_next_state (job), JOB_STOPPING);
    TEST_EQ (job->goal, JOB_START);

That should read 'with pre-stop job and a goal of respawn'.

So, this brings to the edge case that I think is causing this problem, and my perceived fix.

If you want a running job to be restarted, it should move from wherever it was, through stopping->pre-stop->killed->post-stop->starting-> ...

However, this doesn't happen. Instead the goal is changed to STOP, but job_change_goal() does not wait for that to happen, which its docblock would suggest.

Instead, it returns with the state set to JOB_PRE_STOP ..

This is my proposed fix.. though I'm not entirely confident this is the right way to handle it, and I haven't had a chance to test this yet.

=== modified file 'init/job.c'
--- init/job.c 2010-12-14 15:32:41 +0000
+++ init/job.c 2011-01-17 07:56:24 +0000
@@ -1265,6 +1265,12 @@
   nih_list_add (&job->blocking, &blocked->entry);

  job_change_goal (job, JOB_STOP);
+ if (job->state == JOB_PRE_STOP)
+ {
+ /* hack: should not happen but does */
+ job->goal = JOB_RESPAWN;
+ job_change_goal (job, JOB_STOP);
+ }
  job_change_goal (job, JOB_START);

  if (! wait)