Improved large sparse files (virtual disks) - Eric Blake

Registered by Eric Blake on 2012-04-28


Virtual disk images have proven to be an interesting example of large sparse files. They often contain several gigabytes of data, interspersed with potentially large holes, and are often accessed over a network, for example via NFS or iSCSI. In managing these files on the host, applications such as libvirt have run into scenarios where performance can be improved. For instance, it should be possible to quickly identify and alter the sparse sections of a file, to perform a one-time copy of a disk image without polluting the host's file system cache, to expose any ability of the underlying storage devices to do copy-on-write cloning, and to conditionally probe file characteristics without hanging if the file is no longer accessible due to a network outage.

Sparse file handling has improved with the recent lseek(SEEK_HOLE) additions, as well as new interfaces to punch holes into existing files (such as the FALLOC_FL_PUNCH_HOLE flag to fallocate()), although this support is not yet available on all file systems. Right now, the only way to avoid polluting the host's file system cache is to use O_DIRECT, but this makes life much more difficult for the application, which must then conform to strict alignment and I/O pattern requirements. The posix_fadvise() interface appears to provide the framework for asking the kernel to perform uncached operations without making the application worry about strict alignment, but for this to be useful, the kernel would need to advertise (perhaps in sysfs) which guarantees are possible through posix_fadvise(). Btrfs and other file systems are starting to provide copy-on-write file cloning at the file system level, and there are also storage devices with varying controls to achieve the same effect at the device level; coming up with a unified interface to request this feature would be handy. Finally, the proposed xstat() kernel interface, which would let the caller control how much data is retrieved when querying file metadata, would make it easier to work with large files that live across potentially unreliable network connections.
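As a rough illustration of the lseek(SEEK_HOLE) support mentioned above, the following Python sketch walks a file and lists the regions the kernel reports as data. The function name data_extents is hypothetical and chosen only for this example. It assumes a Linux host with Python 3.3 or later; on file systems without real support, the kernel is expected to report the whole file as a single data region, so the result is only a hint about sparseness.

```python
import errno
import os


def data_extents(path):
    """Return a list of (offset, length) pairs for regions reported as data.

    Hypothetical helper for illustration only; holes are whatever lies
    between the returned extents.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Find the next data region at or after offset.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # only a trailing hole remains
                raise
            # Find where that data region ends; end of file counts as a hole.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A caller such as a disk image copier could use the returned extents to read only the data regions and skip the holes, rather than scanning gigabytes of zeros.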

Discussion will focus on the interaction between the kernel, file systems, and user space applications, to determine how to improve existing interfaces (such as posix_fadvise) or add new interfaces (such as xstat) that can be used to improve the performance of operations on large files. While my background for this proposal stems from handling of virtual disk images, there are probably several other file types that will benefit from improvements in this area.
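To make the posix_fadvise discussion concrete, here is a minimal Python sketch of a one-time copy that uses the existing advice flags instead of O_DIRECT. The function name copy_with_cache_hints and the 1 MiB chunk size are assumptions made for this example. The calls are advisory only: the kernel is free to ignore them, and nothing here tells the application whether the cache was actually left unpolluted, which is the gap described in this proposal. The sketch also does not preserve holes in the source file.

```python
import os

CHUNK = 1024 * 1024  # arbitrary chunk size chosen for illustration


def copy_with_cache_hints(src, dst):
    """Copy src to dst while hinting that cached pages will not be reused.

    Hypothetical helper for illustration only. Returns the bytes copied.
    """
    in_fd = os.open(src, os.O_RDONLY)
    try:
        out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            # Hint that the source will be read once, front to back.
            os.posix_fadvise(in_fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            offset = 0
            while True:
                buf = os.read(in_fd, CHUNK)
                if not buf:
                    break
                view = memoryview(buf)
                while view:
                    written = os.write(out_fd, view)
                    view = view[written:]
                # Dirty pages cannot be dropped, so flush before advising.
                os.fsync(out_fd)
                os.posix_fadvise(in_fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
                os.posix_fadvise(out_fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
                offset += len(buf)
        finally:
            os.close(out_fd)
    finally:
        os.close(in_fd)
    return offset
```

Unlike O_DIRECT, this approach places no alignment requirements on the buffer or the transfer size, at the cost of having no guarantee about what the kernel actually does with the advice.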

Eric Blake (Red Hat) is currently a primary contributor to the libvirt project, which presents a unified management interface into multiple virtualization technologies, such as KVM, LXC, and Xen. He is also active in the Austin Group for developing POSIX interfaces, as well as a contributor to the gnulib project for providing ports of POSIX and other interfaces to a large variety of platforms.
