Improve Downloading Images from Swift to IPA

Registered by Josh Gachnang on 2014-04-21

Having thousands of Ironic Python Agents (IPAs) download the same large file from Swift at the same time will result in terrible performance, both on the hard drives of the Swift nodes and in network throughput everywhere between Swift and the agents. Suggestions to improve performance so far include:

1) Request the image one part at a time to try to leverage caches in the machine. A major issue is synchronizing a group of IPAs so that they request the same parts of the image together.

2) Use a cache layer (e.g. Varnish). This requires no support in Ironic, but it is worth pointing out to deployers.

3) Deploy multiple copies of an image and choose one at random to download. The drawbacks are that this is messy and that support for it is unknown. It is worth pointing out to deployers as an option. It needs a way in Swift or Glance to bundle images, i.e. to say "these X images are all the same, choose one". Very little is needed to support this in Ironic: accept a list of images instead of a single image, and choose one at random.

4) Download via torrent. This can be implemented as a feature in Swift (possibly already there in some form) or in Ironic (the seed comes from the API). The agent would run a torrent client/server. Agents would need to be long-lived for this to improve performance, so that a small subset can download from Swift and then seed to the rest of the nodes. A newly booted agent first downloads a seed from Swift/Ironic, adds itself to a swarm, and starts leeching and then seeding to other nodes. Transfers between nodes that share a switch should be very fast and would not bog down inter-rack throughput. At the very least, this method avoids bogging down the throughput between Swift and the IPAs. Supporting multiple images would be a simple extension of the above process.
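The Ironic-side change described in 3) (one image becomes a list of images, choose one at random) can be sketched roughly as below. This is only an illustration: the function name `choose_image_url` and the example URLs are hypothetical and do not correspond to any existing Ironic, Swift, or Glance API.

```python
import random


def choose_image_url(image_urls, rng=random):
    """Pick one of several identical image copies uniformly at random.

    image_urls: hypothetical list of URLs that all point at copies of
    the same image. rng: any object with a choice() method, so a seeded
    random.Random can be passed in for repeatable tests.
    """
    urls = list(image_urls)
    if not urls:
        raise ValueError("at least one image URL is required")
    return rng.choice(urls)


# Hypothetical copies of the same image stored under different names.
copies = [
    "http://swift.example.com/v1/images/image-copy-1",
    "http://swift.example.com/v1/images/image-copy-2",
    "http://swift.example.com/v1/images/image-copy-3",
]
selected = choose_image_url(copies)
```

With each agent choosing independently, the load should spread roughly evenly across the copies, without any coordination between agents.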

The process for 4) would be as follows. The agent boots, possibly before getting a deploy command, and gets the seed either during lookup with Ironic or via command line/kernel args. It then checks the swarm and determines whether it needs to download the original image(s) directly from Swift or from the swarm. Ironic could provide some support to prevent thousands of nodes that are booted at the same time from all downloading from Swift anyway.
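The decision an agent makes after checking the swarm could look something like the sketch below. Everything here is an assumption made for illustration: the `SwarmStatus` fields, the `choose_source` function, and the thresholds `min_seeders` and `max_direct` are hypothetical, and the "wait" outcome is just one possible form of the Ironic-side support mentioned above for stopping every node from hitting Swift at once.

```python
from dataclasses import dataclass


@dataclass
class SwarmStatus:
    """Hypothetical snapshot of the swarm for one image."""
    seeders: int           # peers that already hold the full image
    direct_downloads: int  # agents currently pulling straight from Swift


def choose_source(status, min_seeders=1, max_direct=5):
    """Decide where a newly booted agent should fetch the image from.

    Returns "swarm" if enough peers can already serve the image,
    "swift" if the agent should become one of the few initial
    downloaders, and "wait" if enough agents are already downloading
    from Swift and this one should retry the swarm later.
    """
    if status.seeders >= min_seeders:
        return "swarm"
    if status.direct_downloads < max_direct:
        return "swift"
    return "wait"


# An empty swarm: the first few agents go to Swift, the rest hold off.
first_agent = choose_source(SwarmStatus(seeders=0, direct_downloads=0))
late_agent = choose_source(SwarmStatus(seeders=0, direct_downloads=5))
```

Capping the number of direct downloads keeps the load on Swift bounded no matter how many nodes boot together; the right cap would depend on the deployment.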

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: Josh Gachnang
Direction: Needs approval
Assignee: Josh Gachnang
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

I think 2 and 3 should go in the deployer docs as possibilities, and I'd like to implement 4. I think 4 is complex but the most scalable.
