Cross-cell resize

Registered by Matt Riedemann on 2018-09-19

References:

* https://etherpad.openstack.org/p/nova-ptg-stein

* https://etherpad.openstack.org/p/nova-ptg-stein-cells

* http://lists.openstack.org/pipermail/openstack-dev/2018-August/thread.html#133693

High level notes:

* We are targeting resize because cells can be sharded by flavor, and resize is the only non-admin way for users to opt into migrating from a cell with an old flavor (gen1) to a new flavor (gen2) in a new cell. This makes it easier for admins/operators to drain old cells with old hardware.

* Currently a resize restricts the selected destination host to the existing cell; we'll add a policy rule to allow overriding that behavior in the scheduler so candidate target hosts are pulled from all cells. As part of this, we'll add a weigher which by default selects hosts from the current cell if possible to avoid unnecessary cross-cell migrations.

* We'll add a new task to conductor to orchestrate the cross-cell resize since it will be substantially different from the existing cold migrate / resize task.

* The conductor will perform pre-migration checks similar to those in the live migration task: the destination compute will be validated to make sure things like volumes and ports attached to the instance will continue to work on the destination host in the target cell.

* A cross-cell resize will leverage the existing shelve offload operation in the compute so that we shelve offload from the source host in cell1 and unshelve into the target host in cell2.

* The new conductor task will orchestrate creation of the instance and its related records (BDMs and tags) in the target cell and update the instance mapping to point at the new cell. Exactly when the instance is deleted from the source cell, and when the instance mapping record is updated (during or after the unshelve to the new cell), is TBD.

* The API will have to deal with the same instance living temporarily in multiple cells when listing instances and hide one of them based on the instance mapping (or simply the task_state/migration status). Given this should be a small window of time, at least for volume-backed instances, it may be left for a later bug to address/optimize.

* Cross-cell resize will support the same confirm/revert semantics as normal resize today. Reverting a cross-cell resize will delete the instance from the target cell and recreate it in the source cell (note that the original source host might change if the instance was offloaded).

* There are some shelve-related bugs that it would be in our best interest to fix before we build more functionality onto the shelve / unshelve workflow; those are linked to this blueprint.

* A formal design spec will follow once a proof of concept is written and basic testing has begun.
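
As a rough illustration of the weigher idea in the notes above, the following minimal sketch prefers candidate hosts in the instance's current cell and only falls back to another cell when no same-cell host is available. It is not the nova implementation: HostCandidate, cross_cell_weight, pick_host and the multiplier default are all hypothetical names and values chosen for this example, and no nova modules are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCandidate:
    """Hypothetical stand-in for a scheduler host candidate."""
    name: str
    cell_uuid: str


def cross_cell_weight(host, current_cell_uuid, multiplier=1000000.0):
    """Weigh a host: same-cell hosts get the multiplier, others get 0."""
    # Assumption: a large positive weight for same-cell hosts is enough
    # to outrank any host in another cell.
    if host.cell_uuid == current_cell_uuid:
        return multiplier
    return 0.0


def pick_host(candidates, current_cell_uuid):
    """Return the highest-weighted candidate.

    Hosts from all cells are considered, but a host in the current cell
    wins whenever one exists, avoiding unnecessary cross-cell moves.
    """
    if not candidates:
        raise ValueError("no candidate hosts")
    return max(
        candidates,
        key=lambda host: cross_cell_weight(host, current_cell_uuid),
    )


if __name__ == "__main__":
    hosts = [
        HostCandidate("host2", "cell2"),
        HostCandidate("host1", "cell1"),
    ]
    # The instance currently lives in cell1, so host1 is preferred.
    print(pick_host(hosts, "cell1").name)
```

If every candidate is in another cell (for example because the old cell is being drained and its hosts are filtered out), all weights are equal and the first candidate is returned, which results in a cross-cell move.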

Blueprint information

Status: Complete
Approver: Sylvain Bauza
Priority: Undefined
Drafter: Matt Riedemann
Direction: Approved
Assignee: Matt Riedemann
Definition: Approved
Series goal: Accepted for ussuri
Implementation: Implemented
Milestone target: None
Started by: Matt Riedemann on 2018-10-10
Completed by: Eric Fried on 2020-01-07

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/cross-cell-resize,n,z

Addressed by: https://review.openstack.org/603930
    WIP: Cross-cell resize

Addressed by: https://review.openstack.org/614012
    Add Migration.cross_cell_move and get_by_uuid

Addressed by: https://review.openstack.org/614035
    Add Destination.cross_cell_move field

Addressed by: https://review.openstack.org/614036
    Add InstanceAction/Event create() method

Addressed by: https://review.openstack.org/614037
    Change HostManager to allow scheduling to other cells

Addressed by: https://review.openstack.org/614353
    Add CrossCellWeigher

Addressed by: https://review.openstack.org/616037
    Spec for cross-cell resize

Addressed by: https://review.openstack.org/621310
    Isolate cell-targeting code in MigrationTask

Addressed by: https://review.openstack.org/621311
    Extract compute API _create_image to compute.utils

Addressed by: https://review.openstack.org/621312
    Extract shelve API logic to compute.utils

Addressed by: https://review.openstack.org/621313
    Add can_connect_volume() compute driver method

Addressed by: https://review.openstack.org/627890
    WIP: Add initial cross-cell resize tasks

Addressed by: https://review.openstack.org/627891
    WIP: Add snapshot task for cross-cell resize

Addressed by: https://review.openstack.org/627892
    WIP: Create instance data in target cell prior to resize

Spec was approved on 2019-01-07 for Stein. -- melwitt 20190109

Addressed by: https://review.openstack.org/631123
    Add Instance.hidden field

Addressed by: https://review.openstack.org/631581
    WIP: Add CrossCellMigrationTask

Addressed by: https://review.openstack.org/633293
    WIP: Add prep_snapshot_based_resize_at_dest compute method

Addressed by: https://review.openstack.org/633298
    Move resize.prep.start/end notifications to helper method

Addressed by: https://review.openstack.org/633853
    Execute TargetDBSetupTask

Addressed by: https://review.openstack.org/634831
    Move resize.(start|end) notification sending to helper method

Addressed by: https://review.openstack.org/634832
    Add prep_snapshot_based_resize_at_source compute method

Addressed by: https://review.openstack.org/635079
    Move finish_resize.(start|end) notifications to helper method

Addressed by: https://review.openstack.org/635080
    WIP: Add finish_snapshot_based_resize_at_dest compute method

Addressed by: https://review.openstack.org/635646
    WIP: Add FinishResizeAtDestTask

Addressed by: https://review.openstack.org/635668
    WIP: Execute CrossCellMigrationTask from MigrationTask

Addressed by: https://review.openstack.org/635684
    WIP: Plumb allow_cross_cell_resize into compute API resize()

Addressed by: https://review.openstack.org/636224
    WIP: Filter duplicates from compute API get_migrations_sorted()

Addressed by: https://review.openstack.org/636253
    WIP: Start functional testing for cross-cell resize

Addressed by: https://review.openstack.org/636410
    Make Claim._claim_test handle SchedulerLimits object

Addressed by: https://review.openstack.org/636411
    RT: improve logging in _update_usage_from_migration

Addressed by: https://review.openstack.org/636412
    Make move_allocations handle empty source allocations

Addressed by: https://review.openstack.org/636413
    Stub out port binding create/delete in NeutronFixture

Addressed by: https://review.openstack.org/637058
    WIP: Add confirm_snapshot_based_resize_at_source

Addressed by: https://review.openstack.org/637070
    WIP: Add ConfirmResizeTask

Addressed by: https://review.openstack.org/637075
    WIP: Add confirm_snapshot_based_resize conductor RPC method

Addressed by: https://review.openstack.org/637316
    WIP: Confirm cross-cell resize from the API

Addressed by: https://review.openstack.org/637605
    Add nova.compute.utils.delete_image

Addressed by: https://review.openstack.org/637630
    WIP: Add revert_snapshot_based_resize_at_dest compute method

Addressed by: https://review.openstack.org/637647
    WIP: Add finish_revert_snapshot_based_resize_at_source compute method

Addressed by: https://review.openstack.org/638046
    WIP: Add RevertResizeTask

Addressed by: https://review.openstack.org/638047
    WIP: Add revert_snapshot_based_resize conductor RPC method

Addressed by: https://review.openstack.org/638048
    WIP: Revert cross-cell resize from the API

Addressed by: https://review.openstack.org/638268
    Confirm cross-cell resize while deleting a server

Addressed by: https://review.openstack.org/638269
    Add cross-cell resize policy rule and enable in API

Addressed by: https://review.openstack.org/638314
    WIP: Fix the leak in the cross-cell revert resize code

Addressed by: https://review.openstack.org/639382
    Improve CinderFixtureNewAttachFlow

Addressed by: https://review.openstack.org/639453
    Deal with cross-cell resize in _remove_deleted_instances_allocations

I'm deferring this from Stein since we're two days from feature freeze and this has a long ways to go. Will re-propose for Train. -- mriedem 20190305

Addressed by: https://review.openstack.org/641176
    WIP: Fix RT usage issues in cross-cell resize functional tests

Addressed by: https://review.openstack.org/641179
    Fix ProviderUsageBaseTestCase._run_periodics for multi-cell

Addressed by: https://review.openstack.org/641521
    Add functional recreate test for bug 1818914

Addressed by: https://review.openstack.org/641792
    Remove unused context parameter from RT._get_instance_type

Addressed by: https://review.openstack.org/641806
    Update usage in RT.drop_move_claim during confirm resize

Addressed by: https://review.openstack.org/642183
    Refactor ComputeManager.remove_volume_connection

Addressed by: https://review.openstack.org/642590
    Add power_on kwarg to ComputeDriver.spawn() method

Addressed by: https://review.openstack.org/642591
    Add functional test for cross-cell migrate with target host

Addressed by: https://review.openstack.org/642592
    Validate image/create during cross-cell resize functional testing

Addressed by: https://review.openstack.org/642807
    Re-propose cross-cell-resize spec for Train

Addressed by: https://review.openstack.org/643450
    Add zones wrinkle to TestMultiCellMigrate

Addressed by: https://review.openstack.org/643451
    Add negative test for cross-cell finish_resize failing

Addressed by: https://review.openstack.org/643852
    Extract compute API _create_image to compute.utils

Addressed by: https://review.openstack.org/650984
    DNM: Add instance hard delete

Re-approved for Train. -- mriedem 20190410

Gerrit topic: https://review.openstack.org/#/q/topic:bp/cross-cell-resize

Addressed by: https://review.openstack.org/651650
    Add archive_deleted_rows wrinkle to cross-cell functional test

Addressed by: https://review.openstack.org/651653
    FUP for I68498afd481f7291a6102928d7999b4be49ded7a

Gerrit topic: https://review.opendev.org/#/q/topic:bp/cross-cell-resize

Addressed by: https://review.opendev.org/641179
    Fix ProviderUsageBaseTestCase._run_periodics for multi-cell

Addressed by: https://review.opendev.org/639382
    Improve CinderFixtureNewAttachFlow

Addressed by: https://review.opendev.org/641521
    Add functional recreate test for bug 1818914

Addressed by: https://review.opendev.org/641792
    Remove unused context parameter from RT._get_instance_type

Addressed by: https://review.opendev.org/641806
    Update usage in RT.drop_move_claim during confirm resize

Addressed by: https://review.opendev.org/614012
    Add Migration.cross_cell_move and get_by_uuid

Addressed by: https://review.opendev.org/614036
    Add InstanceAction/Event create() method

Addressed by: https://review.opendev.org/650984
    DNM: Add instance hard delete

Addressed by: https://review.opendev.org/631123
    Add Instance.hidden field

Addressed by: https://review.opendev.org/627892
    Add TargetDBSetupTask

Addressed by: https://review.opendev.org/631581
    Add CrossCellMigrationTask

Addressed by: https://review.opendev.org/633853
    Execute TargetDBSetupTask

Addressed by: https://review.opendev.org/621313
    Add can_connect_volume() compute driver method

Addressed by: https://review.opendev.org/633293
    Add prep_snapshot_based_resize_at_dest compute method

Addressed by: https://review.opendev.org/627890
    Add PrepResizeAtDestTask

Addressed by: https://review.opendev.org/634832
    Add prep_snapshot_based_resize_at_source compute method

Addressed by: https://review.opendev.org/637605
    Add nova.compute.utils.delete_image

Addressed by: https://review.opendev.org/627891
    Add PrepResizeAtSourceTask

Addressed by: https://review.opendev.org/642183
    Refactor ComputeManager.remove_volume_connection

Addressed by: https://review.opendev.org/642590
    Add power_on kwarg to ComputeDriver.spawn() method

Addressed by: https://review.opendev.org/635080
    Add finish_snapshot_based_resize_at_dest compute method

Addressed by: https://review.opendev.org/635646
    Add FinishResizeAtDestTask

Addressed by: https://review.opendev.org/614035
    Add Destination.allow_cross_cell_move field

Addressed by: https://review.opendev.org/635668
    Execute CrossCellMigrationTask from MigrationTask

Addressed by: https://review.opendev.org/635684
    Plumb allow_cross_cell_resize into compute API resize()

Addressed by: https://review.opendev.org/636224
    Filter duplicates from compute API get_migrations_sorted()

Addressed by: https://review.opendev.org/614037
    Change HostManager to allow scheduling to other cells

Addressed by: https://review.opendev.org/636253
    Start functional testing for cross-cell resize

Addressed by: https://review.opendev.org/642591
    Add functional test for cross-cell migrate with target host

Addressed by: https://review.opendev.org/642592
    Validate image/create during cross-cell resize functional testing

Addressed by: https://review.opendev.org/643450
    Add zones wrinkle to TestMultiCellMigrate

Addressed by: https://review.opendev.org/643451
    Add negative test for cross-cell finish_resize failing

Addressed by: https://review.opendev.org/637058
    WIP: Add confirm_snapshot_based_resize_at_source

Addressed by: https://review.opendev.org/637070
    WIP: Add ConfirmResizeTask

Addressed by: https://review.opendev.org/637075
    Add confirm_snapshot_based_resize conductor RPC method

Addressed by: https://review.opendev.org/637316
    Confirm cross-cell resize from the API

Addressed by: https://review.opendev.org/637630
    WIP: Add revert_snapshot_based_resize_at_dest compute method

Addressed by: https://review.opendev.org/639453
    Deal with cross-cell resize in _remove_deleted_instances_allocations

Addressed by: https://review.opendev.org/637647
    WIP: Add finish_revert_snapshot_based_resize_at_source compute method

Addressed by: https://review.opendev.org/638046
    WIP: Add RevertResizeTask

Addressed by: https://review.opendev.org/638047
    Add revert_snapshot_based_resize conductor RPC method

Addressed by: https://review.opendev.org/638048
    Revert cross-cell resize from the API

Addressed by: https://review.opendev.org/638268
    Confirm cross-cell resize while deleting a server

Addressed by: https://review.opendev.org/651650
    Add archive_deleted_rows wrinkle to cross-cell functional test

Addressed by: https://review.opendev.org/614353
    Add CrossCellWeigher

Addressed by: https://review.opendev.org/638269
    Add cross-cell resize policy rule and enable in API

Gerrit topic: https://review.opendev.org/#/q/topic:multi-cell-job

Addressed by: https://review.opendev.org/655222
    WIP: Add nova-multi-cell job

Addressed by: https://review.opendev.org/656656
    Enable cross-cell resize in the nova-multi-cell job

Addressed by: https://review.opendev.org/658478
    Support cross-cell moves in external_instance_event

Addressed by: https://review.opendev.org/658904
    Robustify attachment tracking in CinderFixtureNewAttachFlow

Addressed by: https://review.opendev.org/661398
    Fix hard-delete of instance with soft-deleted referential constraints

Addressed by: https://review.opendev.org/661859
    Add functional test for anti-affinity cross-cell migration

Addressed by: https://review.opendev.org/662833
    Handle lazy-load of Migration.cross_cell_move

Addressed by: https://review.opendev.org/669012
    Refresh instance in MigrationTask.execute Exception handler

Addressed by: https://review.opendev.org/669013
    Add negative test for prep_snapshot_based_resize_at_source failing

Addressed by: https://review.opendev.org/678951
    FUP for I66d8f06f19c5c631e33208580428aa843abb38d2

Deferring to Ussuri since we're 1 week from Train feature freeze and there is still a ton of code to land for this feature so I want to avoid this being a distraction for Train. Will re-propose the spec for Ussuri. -- mriedem 20190905

Addressed by: https://review.opendev.org/683002
    Re-propose cross-cell-resize spec for Ussuri

[efried 20190918] Fast approving per previously approved spec process http://specs.openstack.org/openstack/nova-specs/readme.html#previously-approved-specifications

Addressed by: https://review.opendev.org/676228
    FUP to I30916d8d10d70ce25523fa4961007cedbdfe8ad7

Addressed by: https://review.opendev.org/676231
    FUP to I4d181b44494f3b0b04537d5798537831c8fdf400

Addressed by: https://review.opendev.org/688832
    WIP: Add negative test to delete server during cross-cell resize claim

Addressed by: https://review.opendev.org/691991
    libvirt: flatten rbd image during cross-cell move spawn at dest

Addressed by: https://review.opendev.org/692689
    Pass exception through TaskBase.rollback

Addressed by: https://review.opendev.org/692856
    Follow up to I3e28c0163dc14dacf847c5a69730ba2e29650370

Addressed by: https://review.opendev.org/693936
    Remove unused CannotMigrateWithTargetHost

Addressed by: https://review.opendev.org/693937
    Make API always RPC cast to conductor for resize/migrate

Addressed by: https://review.opendev.org/695334
    Flesh out RevertResizeTask.rollback

Addressed by: https://review.opendev.org/695335
    Add functional cross-cell revert test with detached volume

Addressed by: https://review.opendev.org/695336
    Add test_resize_cross_cell_weigher_filtered_to_target_cell_by_spec

Addressed by: https://review.opendev.org/695337
    Simplify FinishResizeAtDestTask event handling

Addressed by: https://review.opendev.org/696197
    Amend cross-cell-resize spec

Addressed by: https://review.opendev.org/696212
    Flesh out docs for cross-cell resize/cold migrate

Addressed by: https://review.opendev.org/696213
    WIP: Implement reschedule logic for cross-cell resize/migrate

Addressed by: https://review.opendev.org/697162
    WIP: Implement cleanup_instance_network_on_host for neutron API

Addressed by: https://review.opendev.org/698028
    Follow up to I5b9d41ef34385689d8da9b3962a1eac759eddf6a

Addressed by: https://review.opendev.org/698051
    Add sequence diagrams for cross-cell-resize

Addressed by: https://review.opendev.org/698304
    DNM: debug cross-cell resize

Addressed by: https://review.opendev.org/698322
    Add cross-cell resize tests for _poll_unconfirmed_resizes

Addressed by: https://review.opendev.org/698787
    Refresh target cell instance after finish_snapshot_based_resize_at_dest

Addressed by: https://review.opendev.org/698935
    Fix accumulated non-docs nits for cross-cell-resize series

Addressed by: https://review.opendev.org/699237
    Plumb graceful_exit through to EventReporter

Addressed by: https://review.opendev.org/699238
    Use graceful_exit=True in ComputeTaskManager.revert_snapshot_based_resize

Addressed by: https://review.opendev.org/699259
    FUP for docs nits in cross-cell-resize series

Addressed by: https://review.opendev.org/700202
    FUP to Iff8194c868580facb1cc81b5567d66d4093c5274

[efried 20200107] Marking complete. Remaining patches in this bp as of right now:
- https://review.opendev.org/#/c/699259/ (docs FUP, merging)
- https://review.opendev.org/#/c/700202/ (trivial FUP, merging)
- https://review.opendev.org/#/c/688832/ (WIP extra testing, nice to have, not essential)
- https://review.opendev.org/#/c/696213/ (Reschedule logic. Lack of reschedule is a stated limitation. If needed, can be worked later, under a separate bp.)
