CantStartEngineError in cell conductor during reschedule - get_host_availability_zone up-call

Bug #1781286 reported by Matthew Edmonds
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann
Milestone: (none)

Bug Description

In a stable/queens devstack environment with multiple PowerVM compute nodes, every time I see this in the <email address hidden> logs:

Jul 11 15:48:57 myhostname nova-conductor[3796]: DEBUG nova.conductor.manager [None req-af22375c-f920-4747-bd2f-0de80ee69465 admin admin] Rescheduling: True {{(pid=4108) build_instances /opt/stack/nova/nova/conductor/manager.py:571}}

it is shortly thereafter followed by:

Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server [None req-af22375c-f920-4747-bd2f-0de80ee69465 admin admin] Exception during message handling: CantStartEngineError: No sql_connection parameter is established
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/manager.py", line 652, in build_instances
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server host.service_host))
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/availability_zones.py", line 95, in get_host_availability_zone
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server key='availability_zone')
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server result = fn(cls, context, *args, **kwargs)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/aggregate.py", line 541, in get_by_host
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server _get_by_host_from_db(context, host, key=key)]
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 987, in wrapper
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server with self._transaction_scope(context):
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self.gen.next()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1037, in _transaction_scope
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server context=context) as resource:
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self.gen.next()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 640, in _session
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server bind=self.connection, mode=self.mode)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 404, in _create_session
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server self._start()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 491, in _start
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server engine_args, maker_args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 513, in _setup_for_connection
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server "No sql_connection parameter is established")
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server CantStartEngineError: No sql_connection parameter is established
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server

The nova_cell1.conf does have [database]connection set:

[database]
connection = mysql+pymysql://root:mysql@127.0.0.1/nova_cell1?charset=utf8

This may be related to https://bugs.launchpad.net/nova/+bug/1736946 , though that is supposedly fixed in stable/queens and the trace is different, hence the new defect.

Revision history for this message
Matt Riedemann (mriedem) wrote :

It's not the [database]/connection that matters; build requests are in the API DB, which isn't configured in the cell conductor in a default devstack.
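
For clarity, the distinction is between two different connection options (illustrative values only, mirroring the snippet in the bug description; not a suggestion that cell conductors should be pointed at the API DB):

[api_database]
connection = mysql+pymysql://root:mysql@127.0.0.1/nova_api?charset=utf8

[database]
connection = mysql+pymysql://root:mysql@127.0.0.1/nova_cell1?charset=utf8

[database]/connection only gets the conductor to its cell database; anything living in the API database (build requests, aggregates) needs [api_database]/connection, which a cell conductor intentionally does not have in the cells v2 layout.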

If you have a recreate, can you log the filter_properties when this fails? Because the logic is based on the num_attempts count from the retries:

        except Exception as exc:
            num_attempts = filter_properties.get(
                'retry', {}).get('num_attempts', 1)
            updates = {'vm_state': vm_states.ERROR, 'task_state': None}
            for instance in instances:
                self._set_vm_state_and_notify(
                    context, instance.uuid, 'build_instances', updates,
                    exc, request_spec)
                # If num_attempts > 1, we're in a reschedule and probably
                # either hit NoValidHost or MaxRetriesExceeded. Either way,
                # the build request should already be gone and we probably
                # can't reach the API DB from the cell conductor.
                if num_attempts <= 1:
                    try:
                        # If the BuildRequest stays around then instance
                        # show/lists will pull from it rather than the errored
                        # instance.
                        self._destroy_build_request(context, instance)

Revision history for this message
Matt Riedemann (mriedem) wrote :

I don't see how this could happen. When we initially schedule a server, we populate the retry field in the filter_properties and set num_attempts to 1:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L1208

We do the same here with what should be the same filter_properties dict passed from conductor -> compute -> conductor during the reschedule loop:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L563

That second call to populate_retry should increment num_attempts to 2:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L646

The only thing I can figure is maybe you have the max_attempts config option value set to 1 or you're forcing the host/node during the server create?

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L634

In that case we don't set the retry entry in filter_properties.
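
As a rough standalone sketch of that gating and increment behavior (simplified and hypothetical, not the actual populate_retry code; max_attempts, force_hosts and force_nodes are stand-ins for the config option and the forced host/node request):

def populate_retry(filter_properties, max_attempts, force_hosts, force_nodes):
    # Reschedules disabled: a single attempt or a forced host/node means
    # no 'retry' entry is added to filter_properties at all.
    if max_attempts == 1 or force_hosts or force_nodes:
        return
    retry = filter_properties.setdefault('retry', {'num_attempts': 0,
                                                   'hosts': []})
    # 1 on the initial schedule, 2+ on reschedules.
    retry['num_attempts'] += 1

props = {}
populate_retry(props, max_attempts=3, force_hosts=[], force_nodes=[])  # num_attempts == 1
populate_retry(props, max_attempts=3, force_hosts=[], force_nodes=[])  # num_attempts == 2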

tags: added: conductor
Revision history for this message
Matthew Edmonds (edmondsw) wrote :

The except block in comment 1 is not entered, because the exception doesn't get raised until after that, line 652 in stable/queens:

https://github.com/openstack/nova/blob/7ae2dc840a0bccb868122bb4b77e8958a0e842a7/nova/conductor/manager.py#L650-L652

The links in comment 2 all appear to be from master. I wouldn't be surprised if master is also affected, but I'm seeing this in stable/queens, so we probably want to look there first.

[scheduler]max_attempts is not changed from the default (3) and the create requests do not appear to be specifying a specific host/node.

Revision history for this message
Matt Riedemann (mriedem) wrote :

OK, looking at the stack trace I see it's not the '_destroy_build_request' call that's blowing up on reschedule; it's the up-call to get the availability zone for the next chosen host from the list of alternates:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L647

And if an AZ is not requested during server create, we are free to move the instance to another AZ during reschedule. So it seems we've fallen into a dreaded up-call hole here that needs to be tracked:

https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls
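
For context, this is roughly what nova.availability_zones.get_host_availability_zone() is doing (a simplified paraphrase, not a verbatim copy of the source); the AggregateList.get_by_host() lookup reads the aggregates table in the API database, which is exactly the part that fails from a cell conductor:

import nova.conf
from nova import objects

CONF = nova.conf.CONF

def get_host_availability_zone(context, host):
    # Host aggregate metadata lives in the API database.
    aggregates = objects.AggregateList.get_by_host(context, host,
                                                   key='availability_zone')
    if aggregates:
        return aggregates[0].metadata['availability_zone']
    # Host isn't in any AZ aggregate: fall back to the configured default.
    return CONF.default_availability_zone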

summary: - CantStartEngineError in cell conductor during rebuild
+ CantStartEngineError in cell conductor during reschedule -
+ get_host_availability_zone up-call
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: cells
Revision history for this message
Matthew Edmonds (edmondsw) wrote :

This appears to trace back to https://review.openstack.org/#/c/446053/

But it's unclear to me what needs to consider the AZ if the user didn't specify one... E.g. migrate doesn't care about the AZ in that case, and could move the instance to a different AZ (see https://review.openstack.org/#/c/567701/)

Revision history for this message
Matt Riedemann (mriedem) wrote :

One idea for fixing this would be to set the AZ for each host in the list of Selection objects that come back from the scheduler; that would happen at "the top" (the superconductor), where we have access to the API DB and thus the aggregates table.

We send the list of Selection objects down to the compute service which, during a reschedule, would pass them back to the cell conductor. Then, rather than calling get_host_availability_zone(), we can just get the AZ for the alternate host from the Selection object and avoid the "up-call".
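
Roughly, the reschedule path in the cell conductor could then do something like this (hypothetical sketch, not the eventual patch; _get_az_for_reschedule is a made-up helper name, 'selection' is the alternate-host Selection object, and availability_zone is the proposed new field on it):

from nova import availability_zones

def _get_az_for_reschedule(context, selection):
    if 'availability_zone' in selection:
        # Resolved "at the top" by the superconductor, which can reach the
        # API DB, and carried down inside the Selection object.
        return selection.availability_zone
    # Older Selection without the field: fall back to the up-call, which
    # only works if the conductor can reach the API DB.
    return availability_zones.get_host_availability_zone(
        context, selection.service_host)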

This could potentially also be used to fix bug 1497253, where we're doing boot from volume, nova-compute creates the volume, and [cinder]/cross_az_attach=False, so nova has to tell cinder to create the volume in the same AZ as the instance.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/581910

Revision history for this message
Matt Riedemann (mriedem) wrote :

I had to re-learn why this part of comment 2 isn't a problem:

"""
The only thing I can figure is maybe you have the max_attempts config option value set to 1 or you're forcing the host/node during the server create?

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L634

In that case we don't set the retry entry in filter_properties.
"""

That's because if CONF.scheduler.max_attempts=1, or force_hosts/force_nodes are set, we don't set the 'retry' entry in the filter_properties dict passed between conductor and compute:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/scheduler/utils.py#L638

And then if we hit a build failure in compute, we don't cast back to the cell conductor to reschedule; we just fail:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/compute/manager.py#L1825

So if we get here in conductor during a reschedule:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/conductor/manager.py#L585

The 'retry' key is going to be set and num_attempts won't be 1.
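
In other words, the compute-side decision boils down to something like this (rough illustrative sketch, not the actual compute manager code; decide_on_build_failure is a made-up name):

def decide_on_build_failure(filter_properties):
    if not filter_properties.get('retry'):
        # Reschedules are disabled (max_attempts=1 or a forced host/node):
        # abort the build locally; conductor's build_instances is never
        # re-entered, so it can't see num_attempts == 1 during a reschedule.
        return 'abort'
    # Otherwise cast back to the cell conductor to try an alternate host;
    # populate_retry bumps num_attempts to >= 2 on that pass.
    return 'reschedule'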

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/582412

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/582412
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=276130c6d1637f8e0f39ba274e2619a115a8fa1a
Submitter: Zuul
Branch: master

commit 276130c6d1637f8e0f39ba274e2619a115a8fa1a
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 12 18:25:07 2018 -0400

    Add note about reschedules and num_attempts in filter_properties

    The "retry" entry in filter_properties is not set if reschedules
    are disabled, which happens in these cases:

    1. [scheduler]/max_attempts=1
    2. The server is forced to a specific host and/or node.

    More times than I'd like to admit, I've had to re-learn that
    filter_properties['retry']['num_attempts'] will always be >1 in
    conductor build_instances during a reschedule because if
    reschedules are disabled, the compute service aborts the build
    on failure and we don't even get back to conductor.

    This change adds a note since it's hard to keep in your head how
    the retry logic is all tied together from the API, superconductor,
    compute and cell conductor during a reschedule scenario.

    Change-Id: I83536b179000f41f9618a4b6f2a16b4440fd61ba
    Related-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/581910
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1160ae7a6585c1567e2e91badb4c312533d15686
Submitter: Zuul
Branch: master

commit 1160ae7a6585c1567e2e91badb4c312533d15686
Author: Matt Riedemann <email address hidden>
Date: Wed Jul 11 19:21:44 2018 -0400

    Add another up-call to the cells v2 caveats list

    Due to change I8d426f2635232ffc4b510548a905794ca88d7f99 in Pike,
    which ironically was meant to avoid up-calls (I think), it
    introduced an up-call during reschedules for server create and
    resize to set the instance.availability_zone based on the
    alternate host selected during the reschedule.

    This adds the up-call to our list of known issues in the cells
    v2 docs so we can track the issue and make people aware of it.

    Change-Id: Id819f91477613a013b89b1fb0b2def3b0fd4b08c
    Related-Bug: #1781286

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note related bug 1781300.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this is also for reschedules during server create because of this call:

https://github.com/openstack/nova/blob/4c37ff72e5446c835a48d569dd5a1416fcd36c71/nova/conductor/manager.py#L657

build_instances in the cell conductor won't be able to hit aggregates in the API DB if the cell conductor isn't configured for the API DB.

Revision history for this message
Matthew Edmonds (edmondsw) wrote :

Yes Matt, I believe it was server create where I originally hit this.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yup, I hit it during server create as well (this is on a Train devstack with 2 cells and 3 computes, where 2 computes are in cell1, and that's where I got a reschedule on server create because of a port binding failure on the first compute attempted). Note also that the server is stuck in BUILD status due to this; it is not set to ERROR status:

stack@crosscell:~$ sudo journalctl -a -u <email address hidden> | grep req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d
Apr 02 23:29:17 crosscell nova-conductor[25503]: ERROR nova.scheduler.utils [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] [instance: 7a69a14e-6a00-426a-a35c-5340597c30af] Error from last host: crosscell2 (node crosscell2): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1950, in _do_build_and_run_instance\n filter_properties, request_spec)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2320, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance 7a69a14e-6a00-426a-a35c-5340597c30af was re-scheduled: Binding failed for port 6079221f-1de8-4406-a7f1-842bc44ed0fd, please check neutron logs for more information.\n']
Apr 02 23:29:18 crosscell nova-conductor[25503]: DEBUG nova.conductor.manager [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Rescheduling: True {{(pid=25503) build_instances /opt/stack/nova/nova/conductor/manager.py:618}}
Apr 02 23:29:18 crosscell nova-conductor[25503]: DEBUG nova.scheduler.utils [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Attempting to claim resources in the placement API for instance 7a69a14e-6a00-426a-a35c-5340597c30af {{(pid=25503) claim_resources /opt/stack/nova/nova/scheduler/utils.py:1002}}
Apr 02 23:29:18 crosscell nova-conductor[25503]: ERROR oslo_messaging.rpc.server [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Exception during message handling: CantStartEngineError: No sql_connection parameter is established

stack@crosscell:~$ openstack server list
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| ID                                   | Name         | Status | Networks | Image                    | Flavor  |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| 7a69a14e-6a00-426a-a35c-5340597c30af | server1cell1 | BUILD  |          | cirros-0.4.0-x86_64-disk | m1.tiny |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/685807

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note for backports: this problem goes back to Pike but we won't be able to backport the fix since it's going to require RPC API version changes.

no longer affects: nova/pike
no longer affects: nova/queens
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685997

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/685998

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/686017

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/686047

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/686050

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/686053

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/686226

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/685997
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=38fb7f82abd7fffc00ebc050ee5230f1137e76d8
Submitter: Zuul
Branch: master

commit 38fb7f82abd7fffc00ebc050ee5230f1137e76d8
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 12:03:17 2019 -0400

    Handle get_host_availability_zone error during reschedule

    If a build fails and reschedules to a cell conductor which does
    not have access to the API DB, the call to get_host_availability_zone
    will fail with a CantStartEngineError because it's trying to do an
    "up-call" to the API DB for host aggregate info. The reschedule
    fails and the instance is stuck in BUILD status without a fault
    injected for determining what went wrong.

    This change simply handles the failure and cleans up so the instance
    is put into a terminal (ERROR) state.

    Change-Id: I6bfa6fa767403fb936a6ae340b8687eb161732fc
    Partial-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/686264

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/686292

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/685998
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c332c4dd79f8bb5fdc5f6722a80c9a117ff52c4
Submitter: Zuul
Branch: master

commit 8c332c4dd79f8bb5fdc5f6722a80c9a117ff52c4
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 12:08:18 2019 -0400

    Add functional regression test for build part of bug 1781286

    This adds a functional regression test to recreate bug 1781286
    where rescheduling from a build failure where the cell conductor
    does not have access to the API DB fails when trying to get
    availability zone information about a selected alternate host.
    The test is a bit tricky in that it has to stub out the AZ query
    to fail but only after we hit the compute service and then remove
    that stub so we can use the API again.

    Note that change I6bfa6fa767403fb936a6ae340b8687eb161732fc handles
    the error in conductor and puts the instance into ERROR state. The
    bug is not yet resolved though since we should be able to avoid
    the up-call by stashing the alternate host availability zone on the
    Selection object during scheduling and use that later during the
    reschedule. Once that is working the functional test will be updated.

    Change-Id: I62179d6b93ea1a23c4906477ee19b422bfcb72a2
    Related-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/686017
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f07697ebff31daa5ee8c8b3a29b55403522ba445
Submitter: Zuul
Branch: master

commit f07697ebff31daa5ee8c8b3a29b55403522ba445
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 13:17:23 2019 -0400

    Add functional regression test for migrate part of bug 1781286

    This is similar to I62179d6b93ea1a23c4906477ee19b422bfcb72a2
    except it covers a reschedule during a cold migration rather
    than an initial server create.

    Change-Id: Ic6926eecda1f9dd7183d66c67f04f308f6a1799d
    Related-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/685807
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bcd4584a7c6a8d6b91a6db2fe8b38671494ff9c1
Submitter: Zuul
Branch: master

commit bcd4584a7c6a8d6b91a6db2fe8b38671494ff9c1
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 30 17:56:58 2019 -0400

    Add Selection.availability_zone field

    This adds an availability_zone field to the Selection object,
    using the same type and nullable value as the same field in the
    Instance object. This will be used to store the service_host
    AZ to pass from the superconductor layer to the compute and cell
    conductor layer to avoid an up-call to get the host AZ information
    from the API DB during a reschedule.

    Note that the field is nullable because a host may not be in an
    AZ and CONF.default_availability_zone can technically be set to
    None though it defaults to "nova".

    Change-Id: Ia50c5f4dd2204f1cafa669097d1e744479c4d8c8
    Related-Bug: #1781286
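
For readers unfamiliar with oslo.versionedobjects, adding such a field looks roughly like the following (illustrative sketch only, not the actual nova patch; the class body is trimmed down and the version numbers are made up):

from oslo_utils import versionutils
from oslo_versionedobjects import base as ovo_base
from oslo_versionedobjects import fields

@ovo_base.VersionedObjectRegistry.register
class Selection(ovo_base.VersionedObject):
    # Version 1.1: added availability_zone (illustrative version bump).
    VERSION = '1.1'

    fields = {
        'service_host': fields.StringField(),
        # Nullable: a host may not be in any AZ and
        # CONF.default_availability_zone can technically be None.
        'availability_zone': fields.StringField(nullable=True),
    }

    def obj_make_compatible(self, primitive, target_version):
        super(Selection, self).obj_make_compatible(primitive, target_version)
        target_version = versionutils.convert_version_to_tuple(target_version)
        if target_version < (1, 1):
            # Older consumers don't know about the new field; drop it.
            primitive.pop('availability_zone', None)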

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/686047
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f1ad0b13e8f8296739f75b11e257a6fef2cc538c
Submitter: Zuul
Branch: master

commit f1ad0b13e8f8296739f75b11e257a6fef2cc538c
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 16:52:46 2019 -0400

    Set Instance AZ from Selection AZ during build reschedule

    This builds on change Ia50c5f4dd2204f1cafa669097d1e744479c4d8c8
    to use the Selection.availability_zone value when rescheduling
    during initial server create so that the cell conductor does not
    have to make an up-call to the aggregates table in the API DB
    which will fail if the cell conductor is not configured to use
    the API DB.

    The functional test added in I62179d6b93ea1a23c4906477ee19b422bfcb72a2
    is updated to show the failure is gone and we get the AZ from the
    Selection object during the reschedule.

    For the case that the availability_zone field is not in the Selection
    object, test_build_reschedule_get_az_error still covers that.

    Change-Id: I1f1c25cb4de924a1d6c3a979b758efd736bdbff0
    Partial-Bug: #1781286

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/686050
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ac85b76178017b4bda30502a6ebb9c990435ec72
Submitter: Zuul
Branch: master

commit ac85b76178017b4bda30502a6ebb9c990435ec72
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 17:30:54 2019 -0400

    Set Instance AZ from Selection AZ during migrate reschedule

    This builds on change Ia50c5f4dd2204f1cafa669097d1e744479c4d8c8
    to use the Selection.availability_zone value when rescheduling
    during a resize or cold migrate so that the cell conductor does not
    have to make an up-call to the aggregates table in the API DB
    which will fail if the cell conductor is not configured to use
    the API DB.

    The functional test added in Ic6926eecda1f9dd7183d66c67f04f308f6a1799d
    is updated to show the failure is gone and we get the AZ from the
    Selection object during the reschedule.

    For the case that the availability_zone field is not in the Selection
    object, there are existing unit tests in
    nova.tests.unit.conductor.tasks.test_migrate which will make sure we
    are not unconditionally trying to access the Selection.availability_zone
    field.

    Change-Id: I103d5023d3a3a7c367c7eea7fb103cb8ec52accf
    Closes-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/686053
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a87fbdda9d74949fa149d73c8ad821399d817c50
Submitter: Zuul
Branch: master

commit a87fbdda9d74949fa149d73c8ad821399d817c50
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 17:45:08 2019 -0400

    Update cells v2 up-call caveats doc

    With the fix for bug 1781286 for reschedules during server
    create and resize/migrate, we can update the cells v2 docs
    saying the up-call issue for that bug is now fixed.

    Change-Id: I9ff116de8b63c0fbfb880008718b1386178b1d1a
    Related-Bug: #1781286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/686226
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b5e6c389d733d4dbd94380add7e3fa6c4d1e3fa8
Submitter: Zuul
Branch: stable/train

commit b5e6c389d733d4dbd94380add7e3fa6c4d1e3fa8
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 12:03:17 2019 -0400

    Handle get_host_availability_zone error during reschedule

    If a build fails and reschedules to a cell conductor which does
    not have access to the API DB, the call to get_host_availability_zone
    will fail with a CantStartEngineError because it's trying to do an
    "up-call" to the API DB for host aggregate info. The reschedule
    fails and the instance is stuck in BUILD status without a fault
    injected for determining what went wrong.

    This change simply handles the failure and cleans up so the instance
    is put into a terminal (ERROR) state.

    Change-Id: I6bfa6fa767403fb936a6ae340b8687eb161732fc
    Partial-Bug: #1781286
    (cherry picked from commit 38fb7f82abd7fffc00ebc050ee5230f1137e76d8)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/686264
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53bcf0b1eeb4d34cbbec6200026c9bac5921db97
Submitter: Zuul
Branch: stable/stein

commit 53bcf0b1eeb4d34cbbec6200026c9bac5921db97
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 12:03:17 2019 -0400

    Handle get_host_availability_zone error during reschedule

    If a build fails and reschedules to a cell conductor which does
    not have access to the API DB, the call to get_host_availability_zone
    will fail with a CantStartEngineError because it's trying to do an
    "up-call" to the API DB for host aggregate info. The reschedule
    fails and the instance is stuck in BUILD status without a fault
    injected for determining what went wrong.

    This change simply handles the failure and cleans up so the instance
    is put into a terminal (ERROR) state.

    NOTE(mriedem): The fill_provider_mapping mock on the unit test is
    removed since that method did not exist in Stein, it was introduced
    in Train: I76f777e4f354b92c55dbd52a20039e504434b3a1

    Change-Id: I6bfa6fa767403fb936a6ae340b8687eb161732fc
    Partial-Bug: #1781286
    (cherry picked from commit 38fb7f82abd7fffc00ebc050ee5230f1137e76d8)
    (cherry picked from commit b5e6c389d733d4dbd94380add7e3fa6c4d1e3fa8)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/686292
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a981356506dfc5e2fab34b0ae5a9ee624c4b264
Submitter: Zuul
Branch: stable/rocky

commit 5a981356506dfc5e2fab34b0ae5a9ee624c4b264
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 1 12:03:17 2019 -0400

    Handle get_host_availability_zone error during reschedule

    If a build fails and reschedules to a cell conductor which does
    not have access to the API DB, the call to get_host_availability_zone
    will fail with a CantStartEngineError because it's trying to do an
    "up-call" to the API DB for host aggregate info. The reschedule
    fails and the instance is stuck in BUILD status without a fault
    injected for determining what went wrong.

    This change simply handles the failure and cleans up so the instance
    is put into a terminal (ERROR) state.

    Conflicts:
          nova/tests/unit/conductor/test_conductor.py

    NOTE(mriedem): The conflict is due to not having change
    Ibfb0a6db5920d921c4fc7cabf3f4d2838ea7f421 in Rocky.
    Also note that the call to _cleanup_when_reschedule_fails does not
    pass a "legacy_request_spec" variable since change
    If8a13f74d2b3c99f05365eb49dcfa01d9042fefa is not in Rocky.

    Change-Id: I6bfa6fa767403fb936a6ae340b8687eb161732fc
    Partial-Bug: #1781286
    (cherry picked from commit 38fb7f82abd7fffc00ebc050ee5230f1137e76d8)
    (cherry picked from commit b5e6c389d733d4dbd94380add7e3fa6c4d1e3fa8)
    (cherry picked from commit 53bcf0b1eeb4d34cbbec6200026c9bac5921db97)

tags: added: in-stable-rocky