fsck.ext3: Device or resource busy while trying to open /dev/xvda2

Bug #898373 reported by Scott Moser
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Fix Released
High
Unassigned

Bug Description

I'm opening this against cloud-init, but I do not actually think that is valid.
In an ec2 instance for test of alpha-1 images [1], we saw errors on boot like:

  Loading, please wait...
  [1342606.840604] udevd[81]: starting version 175
  Begin: Loading essential drivers ... done.
  Begin: Running /scripts/init-premount ... done.
  Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
  Begin: Running /scripts/local-premount ... done.
  [1342607.148903] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
  Begin: Running /scripts/local-bottom ... done.
  done.
  Begin: Running /scripts/init-bottom ... done.
  lxcmount stop/pre-start, process 178
  fsck from util-linux 2.19.1
  fsck from util-linux 2.19.1
  fsck: fsck.swap: not found
  fsck: Error 2 while executing fsck.swap for /dev/xvda3
  fsck.ext3: Device or resource busy while trying to open /dev/xvda2
  Filesystem mounted or opened exclusively by another program?
  cloud-init start-local running: Wed, 30 Nov 2011 19:25:20 +0000. up 3.98 seconds
  no instance data found in start-local

The swap errors are due to bug 898365.
The cloud-config given to the instance included:
  mounts:
   - [ ephemeral0, /opt , auto, "defaults,noexec" ]
   - [ swap, null ]

And given that bug, the /etc/fstab during this reboot probably had:
  /dev/xvda2 /opt auto defaults,noexec,comment=cloudconfig 0 2
  /dev/xvda3 None auto defaults,nobootwait,comment=cloudconfig 0 2

The thing that confuses me, and the reason I do not think this is cloud-init's fault is that this happens prior to cloud-init start-local. So, yes, there is a broken fstab, but not for /dev/xvda2.

The errors listed above are on first reboot after cloud-init wrote those fstab entries, and then issued 'mount -a' from a upstart job (cloud-config.conf) launched at:
   start on (filesystem and started rsyslog)

--
[1] https://jenkins.qa.ubuntu.com/job/precise-server-ec2/ARCH=i386,REGION=us-west-1,STORAGE=instance-store,TEST=cloud-config,label=ubuntu-server-ec2-testing/1/artifact/None/i386/m1.small/instance-store/i-835a3dc4/uec2-20111130-1920-5d6ee3bc489541-terminated.console.txt

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: cloud-init 0.6.2-0ubuntu1 [modified: usr/share/pyshared/cloudinit/CloudConfig/cc_mounts.py]
ProcVersionSignature: User Name 3.2.0-2.5-virtual 3.2.0-rc3
Uname: Linux 3.2.0-2-virtual i686
ApportVersion: 1.90-0ubuntu1
Architecture: i386
Date: Wed Nov 30 21:44:25 2011
Ec2AMI: ami-a7cc07ce
Ec2AMIManifest: ubuntu-us-east-1/images-testing/ubuntu-precise-daily-i386-server-20111130.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.small
Ec2Kernel: aki-805ea7e9
Ec2Ramdisk: unavailable
PackageArchitecture: all
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Related bugs:
 * bug 928990: fsck / dirty filesystem on instance is death

Related branches

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

Just a note, this did not occur on all instances with this user-data.

Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

after looking at the console pointed at in comment 3, I thought i might be able to trigger this by issuing 'reboot-instances' on a running node. I was not successful in doing so.

In a instance of:
 eu-west-1 ami-738fb007 ubuntu-precise-daily-i386-server-20120119

I issued reboot via 'euca-reboot-instances' 10 times and colllected log output (the low number is really because it takes 5m to get a console output). None of these resulted in the fsck issue.

Revision history for this message
Scott Moser (smoser) wrote :

after doing this, i caught a fsck device busy error:

args="--region=eu-west-1 i-087e0b41'
for((i=12;i<50;i++)); do
  echo $(date): i=$i; euca-reboot-instances $args && sleep 1m &&
  euca-get-console-output $args > /tmp/console.hard-reboot$i; done

the 37th boot (attaching) showed this error.

Revision history for this message
Scott Moser (smoser) wrote :
Scott Moser (smoser)
description: updated
Revision history for this message
Stefan Bader (smb) wrote :

Hm, when I look at /etc/init/cloud-init-local.conf and */cloud-init.conf they seem to say start on mounted MOUNTPOINT=/. So I'd somehow expect mountall to be finished with / at that point. Still the output of cloud-init and mountall seem to intermix often. As if they are executed around the same time...

Revision history for this message
Stefan Bader (smb) wrote :

Ok, / is non-sense as it is certainly mounted and also the failing fsck is fot /mnt. But generally it looks like mountall and cloud-init running concurrently and anything that opens the device may block a rw access for the fsck. So any blkid could do that. The window is not very big but who knows...

Revision history for this message
Scott Moser (smoser) wrote :

I think there are 2 things that could be causing this from cloud-init:
 a.) cloudinit/CloudConfig/cc_resize.py [1]
   This uses blkid, but almost certainly only passes blkid a device node that represents '/'. It basically does a stat on /, gets major and minor, creates a device node that has the same major, minor and calls blkid on that like:
    blkid -c /dev/null -sTYPE -ovalue /tmp/THATDEV
   It does this to find the filesystem type of /.

 b.) cloudinit/DataSourceOVF.py [2]
   This would run early in the boot, and possibly race with fsck. It is trying to determine if each device in /dev matching '^(sr[0-9]+|hd[a-z]|xvd.*)' is a cdrom. Notable in that list is 'xvd.*'. If that a device matching that is not mounted, then we do a read(512) in an attempt to determine if there is media in the CD-ROM drive, and if *that* succeeds, it will try to mount read-only and look for some files, then unmount.

It seems to me that 'b' is a potential source of the issue, as either the 'read()' or the mount, check, umount could cause the resource busy option.

2 things that could address that:
 a.) restrict devices being checked to 'xvd[a-z]' (eliminating the possible mount of /dev/xvda[0-9] entirely. This could still cause issue on other devices like /dev/xvdb.
 b.) mount with 'ro,noload' (per 'man mount')

I'm *guessing* that fsck is trying to open the device in read-only and the mount in 'b' is not truely read-only.

Any thoughts on how I can essentially poll these devices looking for "ovf data" and avoid this race?
Woudl a badly timed 'read()' interfere with fsck ? (ie, if it was an open RO, not RW?)

One thing that would filter out this device would be to filter out non-CD ROM devices, but I actually like the fact that the device does not have to be a CDROM.

--
[1] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-init/precise/view/head:/cloudinit/CloudConfig/cc_resizefs.py
[2] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-init/precise/view/head:/cloudinit/DataSourceOVF.py

Revision history for this message
Scott Moser (smoser) wrote :

Note, assuming our theory on the circumstances of the race is correct above (that it is 'read' or mount of that device), I was unable to reproduce with a loop device, which I thought would have caused a similar issue:

1. truncate --size 100m /tmp/my.img
2. mkfs -F -t ext3 /tmp/my.img
3. LODEV=$(sudo losetup --show --find /tmp/my.img)
4. sudo python -c 'import time, sys; print sys.argv[1]; fp=open(sys.argv[1]); time.sleep(30); fp.close();' /dev/loop1 &
   # inside the 30 seconds that that will have an open RO file handle, do step 5
5. sudo fsck.ext3 -f /dev/loop1 # '-f' forces fsck even if not marked dirty
   or
    sudo fsck.ext3 /dev/loop1

Other things I've found:
 * using 'open(sys.argv[1],"rw")' does not change anything, fsck will still continue.
 * changing step 4 to 'mount -o ro /dev/loop1 /mnt' *will* cause fsck.ext3 to complain that the filesystem is mounted and print a similar error message.

So, I'm suspecting that it is indeed the OVF code's mount -o,ro that is causing this problem.

I'm not sure how to resolve it, other than limiting the mounts checked to FS of iso9660 only.

Revision history for this message
Stefan Bader (smb) wrote :

Hm weird. I would have expected at least a rw open on the block device should fail, let alone a ro (since fsck probably wants to modify the fs). Though I'd need to check exactly what access mode mount -oro uses (maybe its some ro exclusive). Actually it is not really a read-only open on the block device (that much I know) because this still will cause changes to the superblock (mount time and so on) if that is possible and the block device itself is not read-only.
Beside that you could restrict the devices touched by looking at /sys/block/?/device/type which is 5 (cdrom) only for cd type devices.
The other option would be, if there is an event that mountall emits as the final one (done with automounting), that would be a safer time to start with the cloud-init, wouldn't it?

Revision history for this message
Scott Moser (smoser) wrote :

I believe this is fixed in cloud-init revision 515
http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/revision/515

Note, we could also restrict the device type searched by checking /sys/block/devname/device/type as stefan suggested.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3~bzr519-0ubuntu1

---------------
cloud-init (0.6.3~bzr519-0ubuntu1) precise; urgency=low

  * New upstream snapshot.
    - [Mike Milner] add support for managing CA Certificates (LP: #915232)
    - in ci-info lines, use '.' to for empty field for easier machine reading
    - support empty lines in "#include" files (LP: #923043)
    - [Jef Baeur] support configuration of salt minions Bauer) (LP: #927795)
    - DataSourceOVF: only search for OVF data on ISO9660 filesystems (LP: #898373)
    - DataSourceConfigDrive: support getting data from openstack config drive
      (LP: #857378)
    - [Juerg Haefliger] formating and pylint cleanups
  * increase timeouts for initial config check for metadata service
    to address slow metadata service in openstack
  * add awareness of ConfigDrive data source
 -- Scott Moser <email address hidden> Thu, 16 Feb 2012 17:27:05 -0500

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.