cloud-init fails to fetch metadata with OpenStack

Bug #894279 reported by Juerg Haefliger
This bug affects 3 people
Affects              Status        Importance  Assigned to   Milestone
cloud-init           Fix Released  High        Scott Moser
cloud-init (Ubuntu)  Fix Released  Low         Unassigned

Bug Description

Tested the UEC Lucid, Maverick, Natty and Oneiric images with OpenStack Diablo. When launching multiple (20+) instances at the same time, the OpenStack metadata server becomes quite slow and cloud-init fails to fetch the metadata due to its short timeout of 2 seconds (3 for Oneiric) in DataSourceEc2.py. This is an unreasonably small timeout value. I'd suggest a timeout of 60 secs and 5 retries as the default.
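The suggested behavior (a per-request timeout plus a small retry budget) can be sketched roughly as follows. This is an illustrative sketch, not cloud-init's actual code; the function name, the injected `fetch` callable, and the defaults are assumptions:

```python
import time

def fetch_metadata(fetch, retries=5, timeout=60, sleep=1):
    """Try `fetch` up to `retries` times, giving each request up to
    `timeout` seconds (the values suggested in this report).
    `fetch` is any callable that raises OSError on failure."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(timeout=timeout)
        except OSError as err:
            print("[%d/%d]: url error [%s]" % (attempt, retries, err))
            if attempt < retries:
                time.sleep(sleep)
    return None  # gave up: no metadata service found
```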

init: cloud-init-local main process (217) terminated with status 1
init: cloud-init-nonet main process (290) killed by TERM signal
init: udev-fallback-graphics main process (317) terminated with status 1
init: plymouth main process (59) killed by SEGV signal
init: plymouth-splash main process (345) terminated with status 2
cloud-init start running: Fri, 30 Sep 2011 12:01:48 +0000. up 2.17 seconds
2011-09-30 12:01:50,874 - DataSourceEc2.py[WARNING]: waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id
2011-09-30 12:01:50,875 - DataSourceEc2.py[WARNING]: 12:01:50 [ 1/30]: url error [timed out]
2011-09-30 12:01:53,879 - DataSourceEc2.py[WARNING]: 12:01:53 [ 2/30]: url error [timed out]
2011-09-30 12:01:54,883 - DataSourceEc2.py[WARNING]: 12:01:54 [ 3/30]: url error [[Errno 113] No route to host]
2011-09-30 12:01:57,887 - DataSourceEc2.py[WARNING]: 12:01:57 [ 4/30]: url error [timed out]
2011-09-30 12:01:58,893 - DataSourceEc2.py[WARNING]: 12:01:58 [ 5/30]: url error [[Errno 113] No route to host]
2011-09-30 12:02:01,897 - DataSourceEc2.py[WARNING]: 12:02:01 [ 6/30]: url error [timed out]
2011-09-30 12:02:05,902 - DataSourceEc2.py[WARNING]: 12:02:05 [ 7/30]: url error [timed out]
2011-09-30 12:02:09,907 - DataSourceEc2.py[WARNING]: 12:02:09 [ 8/30]: url error [timed out]
2011-09-30 12:02:13,913 - DataSourceEc2.py[WARNING]: 12:02:13 [ 9/30]: url error [timed out]
2011-09-30 12:02:17,918 - DataSourceEc2.py[WARNING]: 12:02:17 [10/30]: url error [timed out]
2011-09-30 12:02:21,923 - DataSourceEc2.py[WARNING]: 12:02:21 [11/30]: url error [timed out]
2011-09-30 12:02:26,773 - DataSourceEc2.py[WARNING]: 12:02:26 [12/30]: url error [[Errno 113] No route to host]
2011-09-30 12:02:31,778 - DataSourceEc2.py[WARNING]: 12:02:31 [13/30]: url error [timed out]
2011-09-30 12:02:36,785 - DataSourceEc2.py[WARNING]: 12:02:36 [14/30]: url error [timed out]
2011-09-30 12:02:41,791 - DataSourceEc2.py[WARNING]: 12:02:41 [15/30]: url error [timed out]
2011-09-30 12:02:46,797 - DataSourceEc2.py[WARNING]: 12:02:46 [16/30]: url error [timed out]
2011-09-30 12:02:52,013 - DataSourceEc2.py[WARNING]: 12:02:52 [17/30]: url error [[Errno 113] No route to host]
2011-09-30 12:02:58,019 - DataSourceEc2.py[WARNING]: 12:02:58 [18/30]: url error [timed out]
2011-09-30 12:03:04,026 - DataSourceEc2.py[WARNING]: 12:03:04 [19/30]: url error [timed out]
2011-09-30 12:03:10,033 - DataSourceEc2.py[WARNING]: 12:03:10 [20/30]: url error [timed out]
2011-09-30 12:03:16,040 - DataSourceEc2.py[WARNING]: 12:03:16 [21/30]: url error [timed out]
2011-09-30 12:03:23,023 - DataSourceEc2.py[WARNING]: 12:03:23 [22/30]: url error [[Errno 113] No route to host]
2011-09-30 12:03:30,030 - DataSourceEc2.py[WARNING]: 12:03:30 [23/30]: url error [timed out]
2011-09-30 12:03:37,038 - DataSourceEc2.py[WARNING]: 12:03:37 [24/30]: url error [timed out]
2011-09-30 12:03:44,046 - DataSourceEc2.py[WARNING]: 12:03:44 [25/30]: url error [timed out]
2011-09-30 12:03:51,054 - DataSourceEc2.py[WARNING]: 12:03:51 [26/30]: url error [timed out]
2011-09-30 12:03:59,063 - DataSourceEc2.py[WARNING]: 12:03:59 [27/30]: url error [timed out]
2011-09-30 12:04:07,071 - DataSourceEc2.py[WARNING]: 12:04:07 [28/30]: url error [timed out]
2011-09-30 12:04:15,082 - DataSourceEc2.py[WARNING]: 12:04:15 [29/30]: url error [timed out]
2011-09-30 12:04:23,091 - DataSourceEc2.py[WARNING]: 12:04:23 [30/30]: url error [timed out]
2011-09-30 12:04:29,097 - DataSourceEc2.py[CRITICAL]: giving up on md after 160 seconds

no instance data found in start


Juerg Haefliger (juergh)
affects: cloud-utils (Ubuntu) → cloud-init (Ubuntu)
Revision history for this message
Scott Moser (smoser) wrote :

2 seconds might have been low, I agree; however, the following works fine:
 $ for s in cnn.com brickies.net slashdot.org espn.com; do
     python -c 'import urllib2, sys
req = urllib2.Request(sys.argv[1])
resp = urllib2.urlopen(req, timeout=float(sys.argv[2]))
resp.read()' "http://$s" .1
   done

That does the same urllib2.urlopen, with a timeout, that cloud-init does.
Note the '.1' second timeout for all of those sites, which are doing loads
more than serving tens of bytes of text (ok, brickies.net is not, but it
is also on shared hosting). I had to set that to .08 to get failures on
any of the listed sites. 3 seconds (the default in the code) should be
plenty.

But... we can up it, and it probably was arbitrarily low.

We had to set a timeout because otherwise there is none, and because
cloud-init retries the MD again and again (giving it time to come up,
due to bugs/bad design in EC2 and Eucalyptus). If the address
169.254.169.254 has been firewalled off, you could potentially hang
indefinitely trying to connect (in the case where you're not running
in a cloud).

I urge anyone interested in this to help fix this bug in OpenStack.
There is a proposed solution at http://pad.lv/851159, although I have
not tested it. Additionally, it is suggested (by vish and others) that
you can reduce this slowness by running an API server on each of the
nodes. I don't know how feasible that is. It is simply absurd for a
metadata service to be unable to respond to requests in less than 3
seconds (or 5, or 10).

The issue we were really trying to protect against was a "non-cloud"
installation of cloud-init. I.e., in 10.04 or maybe Maverick, installing
cloud-init on your laptop and rebooting would block boot for something
like 1500 seconds. No fun.

Regarding your suggestion of a 60-second timeout and 5 retries: that
would also be insufficient. There is a case where the metadata service
isn't ready yet (this occurred on Eucalyptus, and we've also seen it on
EC2, although not recently on either). In that case, the connections
would immediately get "connection refused", so the 5 retries would occur
within 5 seconds and never find the metadata service.

Revision history for this message
Juerg Haefliger (juergh) wrote :

I agree that Nova needs to be fixed, but we need a cloud-init fix for the current Nova as well. As for running an API server on each compute node, I don't think that's a viable solution for our environment. We've modified memcached to cache the metadata requests, and that seems to help make things faster.

I see your point about the metadata service not being ready in time. Do you know how long it can take for the service to become ready? How about a smarter sleeptime that always increases (linearly) from 1 to 20 secs (or some configurable upper limit)?
sleeptime = int(19 * x / (sleeps - 1)) + 1
With that, the 5 retries would happen over a time of 51 secs (1, 5, 10, 15, 20 sec sleeps in between the 5 requests).
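Plugging sleeps = 5 into that formula yields exactly the schedule described; a quick check:

```python
# Juerg's proposed linear backoff: sleep times grow from 1 to 20 seconds
# across the configured number of retries.
sleeps = 5
schedule = [int(19 * x / (sleeps - 1)) + 1 for x in range(sleeps)]
print(schedule)       # [1, 5, 10, 15, 20]
print(sum(schedule))  # 51
```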

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I'll mark this confirmed, as I think it's agreed there is a problem, just not agreement on what to do about it.

Changed in cloud-init (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :

I'll get to this early next week.
But one more comment: the initial poll that is timing out for OpenStack is just a check to see if the metadata service is there (i.e., "is this EC2?"). Then, immediately after that, cloud-init crawls the metadata service. So we make 1 call until it succeeds, and then maybe 18 calls or so. So if it's a 10-second wait, that blocks boot for 180 seconds during the crawl :-(
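The worst-case arithmetic in that comment, a crawl of roughly 18 requests each waiting out a hypothetical 10-second timeout (the request count is Scott's rough estimate, not an exact figure):

```python
crawl_requests = 18       # rough number of metadata URLs crawled after the probe
per_request_timeout = 10  # seconds, the hypothetical larger timeout
print(crawl_requests * per_request_timeout)  # 180 seconds of blocked boot
```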

Revision history for this message
Scott Moser (smoser) wrote :

@Juerg,
  Could you test out the current cloud-init trunk and see if it fixes your problem?
  The commit at http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/revision/488 hopefully has a readable description of the change.

  To build an upstream cloud-init, you can just run:
     ./tools/bddeb
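The general shape of the fix, an overall deadline combined with a per-request timeout (so fast "connection refused" failures don't exhaust a fixed retry budget), can be sketched as below. This is a rough sketch of the idea under assumed names and defaults, not the actual code from revision 488:

```python
import time

def wait_for_url(fetch, max_wait=120, timeout=10, sleep=1):
    """Retry `fetch` until an overall deadline (`max_wait` seconds)
    expires, giving each individual request its own `timeout`.
    Immediate failures (connection refused) keep retrying until the
    deadline, while a slow service still gets `timeout` seconds per try.
    Hypothetical sketch; not cloud-init's actual implementation."""
    start = time.time()
    while time.time() - start < max_wait:
        try:
            return fetch(timeout=timeout)
        except OSError:
            time.sleep(sleep)
    return None  # deadline expired without reaching the service
```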

Changed in cloud-init:
assignee: nobody → Scott Moser (smoser)
importance: Undecided → High
status: New → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3~bzr497-0ubuntu1

---------------
cloud-init (0.6.3~bzr497-0ubuntu1) precise; urgency=low

  * New upstream snapshot.
    - cloud-config support for configuring apt-proxy
    - selection of local mirror based on presence of 'ubuntu-mirror' dns
      entry in local domain. (LP: #897688)
    - DataSourceEc2: more resilient to slow metadata service (LP: #894279)
    - close stdin in all programs launched by cloud-init (LP: #903993)
    - revert management of /etc/hosts to 0.6.1 style (LP: #890501, LP: #871966)
    - write full ssh keys to console for easy machine consumption (LP: #893400)
    - put INSTANCE_ID environment variable in bootcmd scripts
    - add 'cloud-init-per' script for easily running things with a given freq
      (this replaced cloud-init-run-module)
    - support configuration of landscape-client via cloud-config (LP: #857366)
    - part-handlers now get base64 decoded content rather than 2xbase64 encoded
      in the payload parameter. (LP: #874342)
 -- Scott Moser <email address hidden> Thu, 22 Dec 2011 04:07:38 -0500

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Scott Moser (smoser)
Changed in cloud-init:
status: Fix Committed → Fix Released
Revision history for this message
Craig Sanders (cas) wrote :

Hi, it's great that this is fixed in precise... but has it been back-ported to lucid?

There's going to be a lot of lucid instances running for a long time to come, even after precise is released.

I've built myself a lucid development VM and tried to build the latest version of cloud-init from precise. It won't build, for a number of reasons, starting with lucid's Python 2.6 being too old and there being no python-mocker package. I've spent several hours on it so far and wanted to check whether I'm just duplicating work that's already been done before continuing any further.

thanks,

craig
