Just a note that I've finished investigating an issue that turned out to be the first item above, so it's a practical problem rather than a theoretical one.
We had a race (in Kilo, but with very similar code to what is in Liberty) between instances being migrated that are in the RESIZE_MIGRATED state (so the host/node have been updated but the numa_topology is stale) and the resource audit running on the destination.
The audit sees the instance and processes it in _update_usage_from_instances(), but using the stale instance.numa_topology, thus possibly accounting for the wrong host CPUs.
We've just submitted a local workaround that modifies _update_usage_from_instances() to ignore instances with a task_state of RESIZE_MIGRATED (so that they get handled by _update_usage_from_migrations() instead). So far it seems to help.
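In case it helps, here's a minimal sketch of the shape of that workaround. This is not the actual Nova code; the class and helper names below are stand-ins, and the real logic lives on the resource tracker. The idea is just to filter out mid-migration instances so the per-instance audit path never touches their stale numa_topology:

```python
# Hypothetical sketch of the workaround, not actual Nova code.
# Instances with task_state RESIZE_MIGRATED are excluded from the
# per-instance usage audit; the migration-based audit path
# (_update_usage_from_migrations() in the real code) accounts for
# them using the migration record instead.

RESIZE_MIGRATED = 'resize_migrated'


class Instance:
    """Stand-in for the Nova instance object, for illustration only."""
    def __init__(self, uuid, task_state=None):
        self.uuid = uuid
        self.task_state = task_state


def instances_to_audit(instances):
    """Return the instances the per-instance audit should account for.

    Instances that are mid-resize-migration are skipped, since their
    numa_topology may still describe the source host and would pin
    usage to the wrong host CPUs on the destination.
    """
    return [inst for inst in instances
            if inst.task_state != RESIZE_MIGRATED]


# Example: the mid-migration instance is left to the migration audit.
insts = [Instance('aaa-111'), Instance('bbb-222', task_state=RESIZE_MIGRATED)]
audited = instances_to_audit(insts)
```

With that filter in place, the destination-side audit only double-checks instances whose topology is settled, which is why it avoids the race window described above.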