Swift ring overload concept

Registered by John Dickinson on 2015-02-01

From the commit message:

The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.

Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.

Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.

If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.

This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.

For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.

Overload only has an effect when replica-dispersion and device weights
come into conflict.

The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.

In the example above, imagine the operator sets an overload of 0.112
on his rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.

Blueprint information

Status:
Complete
Approver:
None
Priority:
Undefined
Drafter:
John Dickinson
Direction:
Approved
Assignee:
Samuel Merritt
Definition:
Approved
Series goal:
Accepted for kilo
Implementation:
Implemented
Milestone target:
milestone icon 2.2.2
Started by
John Dickinson on 2015-02-01
Completed by
John Dickinson on 2015-02-01

Related branches

Sprints

Whiteboard

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.