Support changing the ring size

Registered by David Hadas on 2013-02-14

This work follows previous efforts to allow changes to the ring size during the life of a cluster,
i.e. to allow growing the number of partitions in the Swift ring, which is set to 2^part_power.
Such a change allows a cluster to start small and grow as needed, as it removes the requirement to define the number of partitions in advance during initial cluster installation.

The basic idea behind such efforts is that one can double every entry in the ring without changing the placement
(i.e. such that the same keys will continue to be mapped to the same a/c/o servers).
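
The doubling idea can be illustrated with a minimal sketch. It assumes the partition is taken from the top part_power bits of the MD5 of the object path, and it ignores details such as the hash path suffix. The names get_part and double_ring are hypothetical and exist only for this illustration.

```python
import hashlib
import struct


def get_part(path, part_power):
    """Partition = top part_power bits of the first 32 bits of the MD5."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)


def double_ring(part2dev):
    """Build the table for part_power + 1 by repeating every entry twice.

    Old partition p becomes new partitions 2*p and 2*p + 1, both of which
    keep pointing at the device that held p.
    """
    doubled = []
    for dev in part2dev:
        doubled.extend([dev, dev])
    return doubled


# A toy ring with part_power 2 (4 partitions), one device id per partition.
old_table = [0, 1, 2, 3]
new_table = double_ring(old_table)

for name in ("/a/c/o1", "/a/c/o2", "/a/c/o3"):
    # The same key still lands on the same device after doubling.
    assert new_table[get_part(name, 3)] == old_table[get_part(name, 2)]
```

Because the new partition number only adds one more low-order bit to the old one, shifting it right by one bit always gives back the old partition.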

As explained in https://bugs.launchpad.net/swift/+bug/933803, doubling the ring introduces an additional challenge at the a/c/o servers. The a/c/o servers use the partition number as part of the path in which they store objects. Naively increasing the number of partitions will therefore keep the mapping between keys and the swift devices (server disks) constant, but will require restructuring the directory tree at each device.

This work seeks to resolve that challenge and design a solution that will not require restructuring the directory tree at each device.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
David Hadas
Direction:
Needs approval
Assignee:
David Hadas
Definition:
Drafting
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

Change 1: Enhance the ring to support partition doubling + a per device part_power

At present the ring is used to publish a per-cluster part_power. The ring partitions are then used both for placement and to define the directory structure of the a/c/o servers.
Under the suggested design, the ring will maintain a per-device part_power in addition to the ring part_power. The device part_power will be set equal to the ring part_power when the device is first added to the ring and will never change.
This change is backward compatible and, by itself, does not change the system behavior.

Here is what the ring device list looks like after the change, with a part_power of 10:

/tmp/tmp.builder, build version 6
1024 partitions, 3 replicas, 3 zones, 6 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 0
Devices:    id  zone  ip address  port  name  p_power  weight  partitions  balance  meta
             0     1     1.2.3.4  1234  sdb1       10  128.00         512     0.00
             1     1     1.2.3.5  1234  sdb1       10  128.00         512     0.00
             2     2     2.2.3.4  1234  sdb1       10  128.00         512     0.00
             3     2     2.2.3.5  1234  sdb1       10  128.00         512     0.00
             4     3     3.2.3.4  1234  sdb1       10  128.00         512     0.00
             5     3     3.2.3.5  1234  sdb1       10  128.00         512     0.00

The same ring after it was increased twice and a device added at each step:
4096 partitions, 3 replicas, 4 zones, 8 devices, 21.44 balance
The minimum number of hours before a partition can be reassigned is 0
Devices:    id  zone  ip address  port  name  p_power  weight  partitions  balance  meta
             0     1     1.2.3.4  1234  sdb1       10  128.00        1366     6.48
             1     1     1.2.3.5  1234  sdb1       10  128.00        1365     6.40
             2     2     2.2.3.4  1234  sdb1       10  128.00        1365     6.40
             3     2     2.2.3.5  1234  sdb1       10  128.00        1365     6.40
             4     3     3.2.3.4  1234  sdb1       10  128.00        1365     6.40
             5     3     3.2.3.5  1234  sdb1       10  128.00        1366     6.48
             6     4     4.2.3.4  1234  sdb1       11  228.00        2285    -0.01
             7     4     4.2.3.5  1234  sdb1       12  230.00        1811   -21.44
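
The p_power column in the two listings follows from the rule that a device records the ring part_power at the time it is added. A rough sketch of that bookkeeping, using a plain dict as a stand-in for the ring builder (the functions add_device and double_ring_power are hypothetical, not the builder's real API):

```python
def add_device(ring, dev_id):
    """Add a device, stamping it with the ring part_power of that moment."""
    dev = {"id": dev_id, "part_power": ring["part_power"]}
    ring["devs"].append(dev)
    return dev


def double_ring_power(ring):
    """Doubling raises only the ring part_power; existing devices keep theirs."""
    ring["part_power"] += 1


ring = {"part_power": 10, "devs": []}
for dev_id in range(6):
    add_device(ring, dev_id)      # devices 0..5 get part_power 10

double_ring_power(ring)
add_device(ring, 6)               # added at ring part_power 11

double_ring_power(ring)
add_device(ring, 7)               # added at ring part_power 12

print([d["part_power"] for d in ring["devs"]])
print(2 ** ring["part_power"], "partitions")
```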

Change 2: Fixate the directory structure at the a/c/o servers
At present the a/c/o servers construct the directory tree using the part number sent by the proxy – therefore changing the part power should be avoided.

Under the suggested design, the storage_directory function of common.utils is modified to accept the device name. The storage_directory function will then construct the path based on the device part_power (extracted from the ring during __init__). For backward compatibility, if the ring does not have a part_power for the device, the partition sent by the proxy is used instead.

Note that a gradual/partial upgrade is possible with this change, as it has no effect on the other servers or on the server directory tree structure. As long as the ring part_power is not changed, this patch should have no effect.

Note also that following that change, when the ring is doubled, two ring partitions will be stored at the same path of the a/c/o servers. I.e.
/srv/1/node/sdb1/objects/part/ will now host objects associated with 2*part and 2*part+1 of the ring.
If the ring is doubled again, the same path will host 4*part through 4*part+3 of the ring, and so on.
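
The mapping between ring partitions and on-disk directories can be sketched as a bit shift by the difference between the two part powers. This is only an illustration: the function names are hypothetical, and the datadir/partition/hash-suffix/hash layout is an assumption about the existing directory structure.

```python
import posixpath


def local_part(ring_part, ring_part_power, dev_part_power=None):
    """Convert a ring partition into the directory partition of a device.

    With no device part_power (old ring), the ring partition is used as is.
    """
    if dev_part_power is None:
        return ring_part
    shift = ring_part_power - dev_part_power
    if shift < 0:
        raise ValueError("device part_power exceeds ring part_power")
    return ring_part >> shift


def ring_parts_in_dir(dir_part, ring_part_power, dev_part_power):
    """All ring partitions that share one directory partition."""
    shift = ring_part_power - dev_part_power
    return range(dir_part << shift, (dir_part + 1) << shift)


def storage_directory_sketch(datadir, ring_part, name_hash,
                             ring_part_power, dev_part_power=None):
    """Hypothetical storage_directory variant using the device part_power."""
    part = local_part(ring_part, ring_part_power, dev_part_power)
    return posixpath.join(datadir, str(part), name_hash[-3:], name_hash)


# Device added at part_power 10, ring doubled once: 14 and 15 share dir 7.
print(local_part(14, 11, 10), local_part(15, 11, 10))
# Ring doubled twice: directory 7 hosts ring partitions 28..31.
print(list(ring_parts_in_dir(7, 12, 10)))
```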

Change 3: Fixing the db_replicator
Since the db replicator uses the storage_directory function to access its files, the only additional change required is to ensure that it determines the nodes to replicate to based on the hash extracted from the db file path, instead of the partition extracted from the same path.

This is a minor change and is backward compatible.
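
A minimal sketch of deriving the ring partition from the hash rather than from the directory name. It assumes the db file is named after the hex MD5 of the account or container path and that the partition is the top bits of that hash; the function name and the example path are made up for illustration.

```python
import posixpath


def ring_part_from_db_path(db_path, ring_part_power):
    """Recompute the ring partition from the hash embedded in the db file name."""
    name_hash = posixpath.splitext(posixpath.basename(db_path))[0]
    # The first 8 hex digits are the top 32 bits of the hash.
    return int(name_hash[:8], 16) >> (32 - ring_part_power)


name_hash = "80200000" + "0" * 24
# The directory partition (512) reflects a device added at part_power 10.
db_path = "/srv/node/sdb1/containers/512/000/%s/%s.db" % (name_hash, name_hash)

print(ring_part_from_db_path(db_path, 10))  # partition before doubling
print(ring_part_from_db_path(db_path, 11))  # partition after one doubling
```

After a doubling, the partition found in the directory name no longer identifies a single ring partition, whereas the hash still does.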

Change 4: Fixing the object replicator pickled hash files
Since each directory may now include multiple ring partitions, the pickled hash file needs to be extended. The suggested change is to store multiple lists of hashes inside the pickled hash file such that each ring partition would have its own list of hashes. For example, after doubling the ring once, the pickled hash file will include two lists of hashes, one for each of the ring partitions pointing to the directory. The pickled file will also include the ring_part_power used to create it such that hashes will be recalculated following a ring doubling event. At present, this seems to be an unavoidable price for ring doubling.

This change is backward compatible. During an upgrade, the pickled hash files can be seamlessly upgraded to the new format without a need to recalculate object hashes.
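
One possible shape for the extended pickled hash file is sketched below. The layout (a dict with "ring_part_power" and "parts" keys) and the function names are assumptions made for illustration; it also assumes the existing file is a plain mapping from suffix directory to hash.

```python
import pickle


def upgrade_hashes(old, dir_part, ring_part_power):
    """Wrap an old-format suffix -> hash mapping in the new layout.

    Before any doubling the directory holds a single ring partition whose
    number equals dir_part, so the existing hashes can be reused unchanged.
    """
    if "ring_part_power" in old:
        return old  # already in the new layout
    return {"ring_part_power": ring_part_power,
            "parts": {dir_part: dict(old)}}


def needs_rehash(hashes, current_ring_part_power):
    """Hashes must be recalculated once the ring has been doubled."""
    return hashes["ring_part_power"] != current_ring_part_power


old_format = {"abc": "d41d8cd9", "def": "9e107d9d"}
new_format = upgrade_hashes(old_format, 7, 10)

# The new layout survives a pickle round trip like the old one did.
restored = pickle.loads(pickle.dumps(new_format))
print(restored["parts"][7] == old_format)
print(needs_rehash(restored, 10), needs_rehash(restored, 11))
```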

Change 5: Fixing the object REPLICATE interface
The object REPLICATE interface is used to deliver information about partitions from one replicator to another. The receiving server deduces the local path from the incoming partition numbers.

Following a ring doubling event, deducing the local path requires converting the received ring partitions based on the difference between the ring part_power and the device part_power.

Under the suggested change, the object replicator would send a 'ring_part_power' header as part of the REPLICATE request alongside the partitions, to allow the receiver to deduce the local path.

This change is backward compatible, since not sending the header simply results in the partitions being used as received.
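
The receiver-side conversion could look roughly like the following sketch. Only the 'ring_part_power' header name comes from the proposal; the function name and the use of a plain dict for the headers are assumptions.

```python
def local_parts_from_request(headers, partitions, dev_part_power):
    """Map received ring partitions to local directory partitions.

    Without the 'ring_part_power' header (old sender), the partitions are
    used as received.
    """
    sender_power = headers.get("ring_part_power")
    if sender_power is None:
        return list(partitions)
    shift = int(sender_power) - dev_part_power
    if shift < 0:
        raise ValueError("ring part_power is below the device part_power")
    # Several ring partitions may collapse into one local directory.
    return sorted({part >> shift for part in partitions})


# Old sender: no header, partitions pass through unchanged.
print(local_parts_from_request({}, [14, 15], 10))
# New sender after two doublings, receiver device added at part_power 10.
print(local_parts_from_request({"ring_part_power": "12"}, [28, 31, 32], 10))
```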

--------------------

-- [gholt] -- Unfortunately, these blueprints suck for conversations, unless I'm missing something.. --

I've only had around fifteen minutes to look at this, sorry, but it does sound like you've found most of the corners that hide. I do wonder about the impact on those early devices of a huge cluster that started small, though; something to think through. Might be better to store the device ring power on the device itself; that way if the device fails and is replaced, it can use the new power.

--------------------

-- [glange] -- For change 2, what would storage_directory() look like? You'll add a device name parameter. How would that be used to get the device part_power? There is no ring passed into that function.

--------------------

-- [plocher]-- What is the long term impact of many doublings (say, from 1PB to 5 to 10 ... 100)?
In particular,
     > ...ring is doubled, two ring partitions will be stored at the same path...
If there is value in having many object/part subdirs, won't that value be diminished or lost after this change? If it doesn't matter, why maintain a structure that requires encoding a volatile config choice into the filesystem?

Gerrit topic: https://review.openstack.org/#q,topic:ring-split,n,z

Addressed by: https://review.openstack.org/21888
    Doubling the ring


Work Items
