Comment 1 for bug 1498126

Chris Friesen (cbf123) wrote :

Just thought I'd mention that I've finished investigating an issue that turned out to be the first item above, so it's a practical problem rather than a theoretical one.

We had a race (in kilo, but with very similar code to what is in liberty) between the resource audit running on the destination and instances being migrated that are in the RESIZE_MIGRATED state, where the host/node have been updated but the numa_topology is still stale.

The audit sees the instance and processes it in _update_usage_from_instances(), but it uses the stale instance.numa_topology and so may account for the wrong host CPUs.

We've just submitted a local workaround that modifies _update_usage_from_instances() to ignore instances with a task_state of RESIZE_MIGRATED, so that they get handled by _update_usage_from_migrations() instead. So far it seems to help.
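
For illustration only, here is a minimal standalone sketch of the idea behind the workaround. It is not the actual patch and does not use nova's real classes: Instance, instances_for_usage_audit() and pinned_host_cpus() are hypothetical stand-ins, pinned_cpus stands in for what would be derived from instance.numa_topology, and the RESIZE_MIGRATED string value is assumed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Set

# Assumed stand-in for nova's task_states.RESIZE_MIGRATED constant.
RESIZE_MIGRATED = "resize_migrated"


@dataclass
class Instance:
    """Hypothetical, simplified stand-in for a nova instance record."""
    uuid: str
    task_state: Optional[str]
    # Stand-in for the host CPUs implied by instance.numa_topology.
    pinned_cpus: FrozenSet[int]


def instances_for_usage_audit(instances: Iterable[Instance]) -> List[Instance]:
    """Drop instances whose numa_topology may be stale mid-resize.

    Skipped instances are expected to be accounted for through their
    migration record instead (not modelled in this sketch).
    """
    return [inst for inst in instances if inst.task_state != RESIZE_MIGRATED]


def pinned_host_cpus(instances: Iterable[Instance]) -> Set[int]:
    """Accumulate pinned host CPUs from the filtered instance list only."""
    used: Set[int] = set()
    for inst in instances_for_usage_audit(instances):
        used |= inst.pinned_cpus
    return used
```

With one normal instance pinned to CPUs {0, 1} and one RESIZE_MIGRATED instance still carrying a stale pinning of {2, 3}, pinned_host_cpus() counts only {0, 1}; the second instance is left for the migration-based accounting path.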