Bring RBD support to libvirt_images_type

Registered by Sébastien Han

The main goal is to provide an alternative backend store for ephemeral disk images, controlled by the 'libvirt_images_type' flag. At the moment only raw, qcow2 and lvm are supported. In an RBD context, the mechanism is essentially the same as the boot-from-volume feature, where Nova needs to be able to attach a device from a Ceph cluster. The advantage of this method is that it is completely transparent from a client perspective, and it also makes live migration easier if the VM is already stored in Ceph (as with boot-from-volume).

The "libvirt_images_volume_group" option should be refactor as well to something more generic like "libvirt_images_store" (or something else, can find anything better now) in order to either specify a volume group when lvm is selected or a pool when rbd is selected.

Configuration flag example:

libvirt_images_type=rbd
libvirt_images_store=ephemeral # name of the pool in Ceph

A note about CephX (authentication): libvirt needs to be configured with a user and a secret. Hopefully this will be addressed by a key management system in the near future. In the meantime, the following flags can be used (if not deprecated with Cinder in Grizzly); see the sketch after this list:

- rbd_user
- rbd_secret_uuid
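
As a rough sketch (assuming the flag names above survive review; the pool name, user and UUID are placeholders), the relevant nova.conf section could look like this, with the secret UUID pointing at a libvirt secret created on each compute node (for example via 'virsh secret-define' and 'virsh secret-set-value', as is already done for the RBD Cinder volume driver):

    # nova.conf sketch -- flag names and values are illustrative
    libvirt_images_type=rbd
    libvirt_images_store=ephemeral   # Ceph pool holding the ephemeral disks
    rbd_user=nova                    # CephX user
    rbd_secret_uuid=<uuid-of-the-libvirt-secret>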

Ideally this new store will also implement snapshot functionality, putting all the snapshots into the Ceph pool specified in nova.conf.

Blueprint information

Status:
Complete
Approver:
Russell Bryant
Priority:
Medium
Drafter:
Sébastien Han
Direction:
Needs approval
Assignee:
Haomai Wang
Definition:
Approved
Series goal:
Accepted for havana
Implementation:
Implemented
Milestone target:
2013.2
Started by
Sébastien Han
Completed by
Russell Bryant

Related branches

Sprints

Whiteboard

Gerrit topic: https://review.openstack.org/#q,topic:bp/bring-rbd-support-libvirt-images-type,n,z

Addressed by: https://review.openstack.org/36042
    Add RBD supporting to libvirt for creating local volume

It'd be useful to add some specialised ceph options, like the striping options: http://ceph.com/docs/next/man/8/rbd/#striping
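
For illustration, those striping options look roughly like this when creating an image by hand with the rbd CLI (pool/image name and values are only examples):

    # illustrative only: 64 KB stripe unit, 16 objects per stripe set
    # (non-default striping may require format 2 images, depending on the Ceph release)
    rbd create --image-format 2 --size 10240 --stripe-unit 65536 --stripe-count 16 ephemeral/test-disk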

[Xiaoxi]
    Are there any cons to using boot-from-volume? You mention some pros for boot-from-volume, but it seems you are not satisfied with that approach; can I learn more about the reason?
    Basically I am -1 on this idea because:
    1. You can do it with boot-from-volume if you really want a Ceph-backed local driver.
    2. If this BP gets approved and merged, there will be Sheepdog support, SAN support, EMC support... We would be reinventing the wheel (Cinder) and making Nova too complex.
    3. Thanks to the expectations set by AWS, users usually expect high performance, low latency (similar to a physical hard disk) and low durability from a local drive, and usually use it for swap/cache/tempfiles/logs. Is it a good idea to use Ceph, which has good HA but lower performance (compared to a physical disk), to address that expectation and usage?

[leseb]

Answers inline @Xiaoxi:

    1. You can do it with boot-from-volume if you really want a Ceph-backed local driver.
This involves creating a volume from an image, then booting it, and so forth… Most customers will want to quickly spin up a bunch of VMs in parallel.

Please have a look at the article too: http://www.sebastien-han.fr/blog/2013/06/24/what-i-think-about-cephfs-in-openstack/
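
For reference, the per-VM boot-from-volume workflow being described is roughly the following (the IDs are placeholders and the exact block-device-mapping syntax varies by release):

    # create a bootable volume from a Glance image, then boot from it
    cinder create --image-id <image-uuid> --display-name vm1-root 10
    nova boot --flavor m1.small --block-device-mapping vda=<volume-uuid>:::0 vm1

Doing this once is fine; doing it for tens of VMs in parallel is where it becomes painful.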

    2. If this BP gets approved and merged, there will be Sheepdog support, SAN support, EMC support... We would be reinventing the wheel (Cinder) and making Nova too complex.

I don't think this makes Nova "that" complex; I just see these as plugins for image disks (but I may be wrong, so feel free to correct me). But yes, you're right that this could lead to more implementations.

    3. Thanks to the expectations set by AWS, users usually expect high performance, low latency (similar to a physical hard disk) and low durability from a local drive, and usually use it for swap/cache/tempfiles/logs. Is it a good idea to use Ceph, which has good HA but lower performance (compared to a physical disk), to address that expectation and usage?

I don't think 'Ceph is slower than a physical disk' is a good argument, since that much is obvious: a distributed system can hardly match the performance of a local one unless you put enough money into it. That said, I have run a lot of tests with several Ceph configurations and managed to get better performance than a physical disk, but that's a different subject.

You were mentioning AWS, so I assume you're referring to a public cloud platform. Not everyone is building a public cloud; most of the OpenStack interest is around private clouds, which often implies HA of the VMs. A lot of users would happily trade some performance for more HA and more reliable VMs. That's, by the way, one of the first questions I get asked by customers: 'what about the HA of my VMs?'.
Moreover, an lvm driver already exists for libvirt_images_type; it can be backed either by local disks or by LUNs mapped to the compute nodes, and with the latter you easily end up in the same situation you described earlier (poor performance compared to a local disk). And trust me, I have customers with such setups, since it's currently the only transparent way to achieve decent HA of the VMs (and no, I don't want to talk about the ugly NFS solution with /var/lib/nova/instances). In the end, there is a use case for everything, and I truly believe this BP _must_ be implemented in order to satisfy most OpenStack users.

[Xiaoxi]
Basically, before we go into any detail, we have to agree that everything this BP wants to do can be done with existing pieces. For a root volume, you can create a volume from Cinder and then boot from it. For an ephemeral volume, you can create a volume from Cinder and attach it. I totally agree that customers need something easier to use, and this seems to be what your company is working on. The debate is about the right way to hide that complexity from the user.

My point is that you should do this outside Nova, using a wrapper in the dashboard to hide the complexity.

1. As said in sebastien-han's blog, you can always hide such complexity inside your API/dashboard; it will not stop you from fast-booting a lot of VMs. Amazon is a good example: when you boot a VM, the boot volume actually comes from EBS, and AWS hides that complexity from the user.
2. Imagine this is the right way to go: OK, then every Cinder backend will want a Nova plugin with everything it can do for Cinder (snapshot, create_from_volume, ...). After that, since you have quite a lot of backends, you may need a scheduler. Oh, don't you want to back up your local volume? Yes? Then let's have a backup service. OK, cool, you have already "reinvented" Cinder.

3. You are right that HA is important, but please note:
    a. HA for a VM is much more about the root volume. HA aims to ensure your VM can be migrated to another node during a failure, NOT to protect the user data on a local volume; if you want good HA for your data, why not use a volume?
    b. Sorry, but I must talk about the ugly NFS solution: please try a live-migration test in OpenStack. Without shared NFS for /var/lib/nova/instances, your live migration will not succeed. Using boot-from-volume, or a Ceph backend for local storage as this BP proposes, cannot address this problem. (Please correct me, since I did this experiment several months ago.)

Lastly, I agree there is a use case for everything, but not every use case should be addressed upstream unless it is a COMMON USE CASE. You are always free to fork a UnitedStack(tm) version of OpenStack and address your custom requirements.

[Haomai]
Answers inline @Xiaoxi:

All your answers suggest that this BP is only needed for a special use case. I want to emphasize that integrating Ceph with Nova is needed by almost all OpenStack users who use Ceph as a Cinder backend. You have underestimated the complexity of storage capacity allocation.

The really complicated problem with OpenStack is deployment, not the code itself. Doing more in the code rather than making users do more during deployment is what OpenStack developers have done in the past and will continue to do in the future.

Lastly, this is not a UnitedStack use case but a problem for all Ceph users.

[leseb]
Answers inline @Xiaoxi:

>Basically, before we go into any detail, we have to agree that everything this BP wants to do can be done with existing pieces. For a root volume, you can create a volume from Cinder and then boot from it. For an ephemeral volume, you can create a volume from Cinder and attach it. I totally agree that customers need something easier to use, and this seems to be what your company is working on. The debate is about the right way to hide that complexity from the user.

It's not really that they need something easy to use; it's more that they need a transparent way to boot VMs in an HA fashion.

>My point is that you should do this outside Nova, using a wrapper in the dashboard to hide the complexity.

Not everyone uses the dashboard, so what about the API?

>1. As said in sebastien-han's blog, you can always hide such complexity inside your API/dashboard; it will not stop you from fast-booting a lot of VMs. Amazon is a good example: when you boot a VM, the boot volume actually comes from EBS, and AWS hides that complexity from the user.

I also mentioned that API modification brings a lot of compatibility issues. So OK for the dashboard, but the API remains a huge issue, since customers often plug in their own applications and make REST calls to the OpenStack APIs.

>2. Imagine this is the right way to go: OK, then every Cinder backend will want a Nova plugin with everything it can do for Cinder (snapshot, create_from_volume, ...). After that, since you have quite a lot of backends, you may need a scheduler. Oh, don't you want to back up your local volume? Yes? Then let's have a backup service. OK, cool, you have already "reinvented" Cinder.

Not sure that we want to go that far… Don't forget that this is only an operator choice; users will never notice that the VM runs on top of a Ceph block device. It's also really convenient for operators who need to respect their SLAs. Why don't we just allow this first layer, nothing less and nothing more? But I agree that we must put up some barriers.

>3. You are right that HA is important, but please note:
>    a. HA for a VM is much more about the root volume. HA aims to ensure your VM can be migrated to another node during a failure, NOT to protect the user data on a local volume; if you want good HA for your data, why not use a volume?

The main points of the HA brought by Ceph (or another distributed system):

* What's your block device without a KVM process? If the compute node fails, you can't use your volume until you recover the VM and re-create a new one, which is not acceptable. This is why we need fast recovery and failover. In a Ceph context, you just 'nova evacuate' the VM to another node, so only the KVM process moves. This method brings good data HA and good VM HA too.
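
As a sketch (the hostname and ID are placeholders), the recovery path with shared, Ceph-backed instance storage is a single call:

    # rebuild the failed instance on another compute node, reusing the shared disk
    nova evacuate --on-shared-storage <instance-uuid> <target-compute-host>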

>    b. Sorry, but I must talk about the ugly NFS solution: please try a live-migration test in OpenStack. Without shared NFS for /var/lib/nova/instances, your live migration will not succeed. Using boot-from-volume, or a Ceph backend for local storage as this BP proposes, cannot address this problem. (Please correct me, since I did this experiment several months ago.)

* If you want to migrate a VM without shared storage, use block migration; it will work. This moves the disk as well as the workload, so it's obviously slower than a live migration since you also have to migrate the storage. You can't call that fast recovery. I don't want to talk about NFS, simply because its centralised design doesn't scale at all, but I agree that the solution works.
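
For comparison, the two operations referred to here (hostname and ID are placeholders):

    # live migration: requires shared instance storage
    nova live-migration <instance-uuid> <target-host>
    # block migration: also copies the disks, so no shared storage is needed, but it is slower
    nova live-migration --block-migrate <instance-uuid> <target-host>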

>Lastly, I agree there is a use case for everything, but not every use case should be addressed upstream unless it is a COMMON USE CASE. You are always free to fork a UnitedStack(tm) version of OpenStack and address your custom requirements.

Well, I believe providing basic and transparent HA for the VM is a _really_ common private cloud use case.

I also agree with Haomai that transparency and automation in the code are better than automation left to the user.

[Xiaoxi]
@leseb, it seems we don't fully understand each other, maybe due to my language issues.

* If you want to migrate a VM without shared storage, use block migration; it will work. This moves the disk as well as the workload, so it's obviously slower than a live migration since you also have to migrate the storage. You can't call that fast recovery. I don't want to talk about NFS, simply because its centralised design doesn't scale at all, but I agree that the solution works.

What I want to say is that even when the VM is booted from a volume provided by Ceph, without shared NFS for /var/lib/nova/instances you still cannot do a live migration. I tried this by creating a volume from Ceph, booting a VM from that volume and running nova live-migration; it failed because the new machine could not get the VM metadata (the libvirt XML and so on, not the volume and data, which are already in Ceph). Yes, I did this test several months ago, so I am not sure whether it works now.

*I also mentioned that API modification brings a lot of compatibility issues. So OK for the dashboard, but the API remains a huge issue, since customers often plug in their own applications and make REST calls to the OpenStack APIs.

Since such transparent HA can be achieved with a combination of existing APIs, how about having an API extension (create_vm_from_volume) in Nova as a wrapper? It would not cause any issues for customers: they can use the API extension if they want it, and the original create-VM API if they don't need HA. Since it is also a RESTful API, any user application can work with it.

*Not sure that we want to go that far… Don't forget that this is only an operator choice; users will never notice that the VM runs on top of a Ceph block device. It's also really convenient for operators who need to respect their SLAs. Why don't we just allow this first layer, nothing less and nothing more? But I agree that we must put up some barriers.

Of course you don't want to go that far, but what are the barriers? Can the barriers you have in mind be accepted by the community? The barriers are the most important question this BP should answer at the very beginning.

And of course using Ceph for the root volume is an operator choice. But the problem is, if we go the way proposed in this BP, OK, you get Ceph support from this BP; some days later someone else wants Sheepdog for root volumes, so they do the same. As you said, there are quite a lot of use cases, so one day, once the Nova local volume module supports enough backends, nothing will stop it from becoming a reinvented Cinder, since every use case and requirement for a Cinder volume is also a use case for a local volume.

This is the main reason I think an API extension would be better: it hands the complexity of volume-related work off to Cinder. That way we don't need separate support for every storage type; whatever works with Cinder can then be used for the root volume without any extra code.

[Haomai]
@Xiaoxi

I have mentioned again and again that this is not only about the user API; unified storage gives a very clean deployment. I don't know whether you have experience deploying an OpenStack cloud and writing an automation flow to expand it. There is a real shift in putting all storage requirements into one storage pool, and users will benefit greatly from that change. You should focus on deployment; what OpenStack solves is letting users deploy a cloud simply.

Real production environments have a growing need for central storage. The user API and the rest are just side issues. This BP is mainly about giving OpenStack a way to make storage easier for users.

[Xiaoxi]
@Haomai,
Thanks a lot, but I have deployed OpenStack clouds quite a few times, and I definitely believe we run at a much larger scale than you do.

I am saying very clearly that you CAN have unified storage even WITHOUT this BP. What is unified storage? It's putting all your storage (volumes, boot volumes, images, and maybe object storage) on the same storage stack, here Ceph. Please tell me if you think there is anything you cannot do WITHOUT this BP, and I will tell you how to do it with the existing API.

Also, in Han's blog he agreed that if we modify the API we can get this done; his concern is that he wants the API to stay compatible, so the solution points to API extensions (in api/contrib).

And yes, let's talk about unified storage. Following your logic, if I use Sheepdog for my volumes, then I should definitely have a BP to "Bring Sheepdog support to libvirt_images_type"; someone else may use an HP SAN, and another BP "Bring HP SAN support to libvirt_images_type" appears... Please answer the question directly: how can we prevent this part from becoming as complex as Cinder?

The only reason to support the approach you are taking is when you want to play some tricks with Ceph, for example deploying Ceph alongside OpenStack Compute and building a local pool that always places the primary copy on the local node. That is really hard to do if Cinder manages the volumes, but much easier with the approach you are implementing.

[Haomai]
@Xiaoxi

I think the debate comes down to whether this BP is only useful for me or my company. Could we let others vote on it? I maintain that what users need is what OpenStack should do.

[jdurgin]
Many deployers are interested in using Ceph for all disks transparently, i.e. without making users worry about booting from volumes or using a particular API. As I noted on the review, I think this is a good short-term implementation of that. It would be even better if it could do copy-on-write clones from Glance, but that can be added in a later patch.
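
For context, a rough sketch of what such copy-on-write cloning looks like at the rbd level (pool and image names are illustrative; a later patch would presumably do the equivalent through librbd rather than the CLI):

    # snapshot and protect the Glance image stored in RBD, then clone it as an instance disk
    rbd snap create images/<image-uuid>@snap
    rbd snap protect images/<image-uuid>@snap
    rbd clone images/<image-uuid>@snap ephemeral/<instance-uuid>_disk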

[Weiguo]
@Xiaoxi

We have tried live migration with a Ceph/RBD boot volume and without sharing /var/lib/nova/instances: it does work, and a new local copy of libvirt.xml is created on the target host. However, we do end up sharing /var/lib/nova/instances in production, as libvirt.xml seems to be required for the "nova evacuate" function. This is understandable from how qemu/libvirt works: live migration requires qemu and libvirtd running concurrently on both the source and target nodes, so libvirtd can pass all the VM metadata to the libvirtd on the remote host.

@leseb

How do you get "nova evacuate" work without any sharing? Assuming the source node crashes and is in down state, nova on the target node has no way to fully reconstruct the libvirt.xml file? When we tested this in late May, it simply won't be able to find the xml file and fail to get VM started. From what we observed, "nova evacuate" only does VM registration for the target node but nothing else.

@All

I think what is really missing in Cinder is the ability for a local disk (under LVM or not) to be managed by Cinder and attached to a Nova instance as a volume without going through iSCSI (the lvmiscsi driver). Clearly, with the existing code base you can get fast provisioning using a Cinder boot volume and the cloning feature while still treating the instance as ephemeral, but there is no way (as far as I can tell) for a VM instance to leverage a low-latency local disk once it has been booted from a Cinder volume.

<jpretorius>
I see that this has merged into the H code base. Excellent work to all involved! What would the effort be to backport this into grizzly-stable? I see this as a really big win for us!
</jpretorius>


Work Items