Some builds started to fail with "unknown user: jenkins-build"

Bug #941784 reported by Paul Sokolovsky
This bug affects 1 person
Affects: Linaro Android Infrastructure
Status: Fix Released
Importance: Critical
Assigned to: Paul Sokolovsky
Milestone: 2012.03

Bug Description

Well, it's not that they just "started": we had such cases previously, but it has become a more or less common pattern lately, with at least 3-4 cases in the last 2 weeks.

From build log:

++ sudo -E -H -u jenkins-build bash -es TUFOSUZFU1RfUkVQTz1naXQ6Ly9hbmRyb2lkLmdpdC5saW5hcm8ub3JnL3BsYXRmb3JtL21hbmlmZXN0LmdpdApNQU5JRkVTVF9CUkFOQ0g9bGluYXJvX2FuZHJvaWRfNC4wLjMKTUFOSUZFU1RfRklMRU5BTUU9ZGVmYXVsdC54bWwKVEFSR0VUX1BST0RVQ1Q9cGFuZGFib2FyZApUQVJHRVRfU0lNVUxBVE9SPWZhbHNlClRPT0xDSEFJTl9VUkw9aHR0cDovL3NuYXBzaG90cy5saW5hcm8ub3JnL2FuZHJvaWQvfmxpbmFyby1hbmRyb2lkL3Rvb2xjaGFpbi00LjYtMjAxMi4wMi8xL2FuZHJvaWQtdG9vbGNoYWluLWVhYmktbGluYXJvLTQuNi0yMDEyLjAyLTEtMjAxMi0wMi0xMF8wMC0xMy0wMy1saW51eC14ODYudGFyLmJ6MgpUT09MQ0hBSU5fVFJJUExFVD1hcm0tbGludXgtYW5kcm9pZGVhYmkKVEFSR0VUX05PX0hBUkRXQVJFR0ZYPTEKUkVQT19TRUVEX1VSTD1odHRwOi8vYW5kcm9pZC1idWlsZC5saW5hcm8ub3JnL3NlZWQvdW5pc2VlZC50YXIuZ3oKTEFWQV9TVUJNSVQ9MQpMQVZBX1NVQk1JVF9GQVRBTD0wCg==
sudo: unknown user: jenkins-build

I.e. it happens when the initial bootstrap build script switches from root to start the actual build-specific script.
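
For reference, the long argument after "bash -es" is just the build configuration passed as a base64-encoded block of environment variables (MANIFEST_REPO, TARGET_PRODUCT, TOOLCHAIN_URL, etc.); when debugging such a failure it can be decoded locally, e.g. (a sketch, with ENCODED_PARAMS standing in for the blob copied from the log):

 $ ENCODED_PARAMS='TUFOSUZFU1RfUkVQTz1naXQ6...'   # paste the full blob from the log here
 $ echo "$ENCODED_PARAMS" | base64 -d             # prints MANIFEST_REPO=..., MANIFEST_BRANCH=..., etc.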

Changed in linaro-android-infrastructure:
importance: Undecided → Medium
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

From the corresponding slave init log (slave-i-6942a70d.log):

Get:51 http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/ natty/main uuid-dev amd64 2.17.2-9.1ubuntu4 [28.3 kB]
Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/pool/main/libx/libxdmcp/libxdmcp-dev_1.1.0-1ubuntu1_amd64.deb Size mismatch
Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/pool/main/x/x11proto-input/x11proto-input-dev_2.0.1-1ubuntu1_all.deb Size mismatch
Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/pool/main/x/x11proto-kb/x11proto-kb-dev_1.0.5-1_all.deb Size mismatch
Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/pool/main/libx/libxcb/libxcb1-dev_1.7-2ubuntu2_amd64.deb Size mismatch
Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/pool/main/libx/libx11/libx11-dev_1.4.2-1ubuntu3_amd64.deb Size mismatch
Fetched 95.3 MB in 26s (3,646 kB/s)
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Command exited with non-zero status 100
14.69user 10.20system 1:09.70elapsed 35%CPU (0avgtext+0avgdata 103344maxresident)k
33304inputs+1299864outputs (331major+1445498minor)pagefaults 0swaps
Verifying that java exists
java full version "1.6.0_26-b03"
Copying slave.jar
Launching slave agent

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, from that log 2 things are visible:

1. The new Ubuntu S3 mirror does have some (non-deterministic?) issues.
2. Jenkins ignores a non-zero exit status from the slave init script and continues to use the slave for builds (as long as it can start java on it).

What happens overall is that we run that slave init script with set -xe, and "adduser --system jenkins-build" is the last command in it, so if any previous command fails, the script is aborted and the user is not created, but the slave is still used by Jenkins for builds.
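
To illustrate the failure mode, here is a minimal sketch of the slave init script structure described above (not the actual script; the package list is hypothetical):

 #!/bin/bash
 # -e aborts on the first failing command, -x traces every command
 # into the slave init log.
 set -xe

 apt-get update
 apt-get install -y build-essential git-core openjdk-6-jdk   # any fetch failure here...

 # ...means this line is never reached: the user is never created, yet
 # Jenkins still attaches the slave as long as it can start java on it.
 adduser --system jenkins-build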

summary: - Some build starte dto fail with "unknown user: jenkins-build"
+ Some builds started to fail with "unknown user: jenkins-build"
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Reported to lp:932088

There are 9 cases like that so far:

ubuntu@ip-10-243-34-224:/var/lib/jenkins$ grep -l -a -E '^Failed to fetch.+s3' slave-i-*
slave-i-2aa5af4f.log
slave-i-30425a55.log
slave-i-6942a70d.log
slave-i-908792f5.log
slave-i-c6f0e9a3.log.1
slave-i-ddff1bb9.log
slave-i-e1ca2d85.log
slave-i-e4e0ea81.log
slave-i-f62e2393.log

Changed in linaro-android-infrastructure:
status: New → Triaged
assignee: nobody → Paul Sokolovsky (pfalcon)
milestone: none → 2012.03
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

11 total cases as of today.

Changed in linaro-android-infrastructure:
importance: Medium → High
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, a workaround (creating the user first, installing packages later) was deployed, but it led to the effect I was wary of: it "fixed" most of the issues (we had only one failure in the nightlies), but that one showed really weird build errors, because the unlucky package which wasn't installed was the toolchain: https://android-build.linaro.org/jenkins/job/linaro-android_toolchain-4.6-bzr/192/console

So this won't do. What really needs to be done, as suggested by James, is to retry the operation in case of failure. Let's go for it.
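
A retry wrapper along these lines would do it (a hedged sketch only; the actual deployed helper and its exact messages may differ, the "apt-get failed" / "aborting" strings here just mirror what the grep commands below look for):

 # Retry an apt-get invocation a few times before giving up.
 stubborn_apt_get() {
     local attempt
     for attempt in 1 2 3 4 5; do
         apt-get "$@" && return 0
         echo "apt-get failed (attempt $attempt), retrying in 30s"
         sleep 30
     done
     echo "apt-get failed repeatedly, aborting"
     return 1
 }

 stubborn_apt_get update
 stubborn_apt_get install -y build-essential git-core openjdk-6-jdk   # hypothetical package list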

Changed in linaro-android-infrastructure:
importance: High → Critical
status: Triaged → In Progress
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Deployed a "stubborn apt-get" fix; it will get well tested overnight.

Changed in linaro-android-infrastructure:
status: In Progress → Fix Committed
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

After deployment, there were 18 cases where retries were used:

$ grep -l -a "apt-get failed" /var/lib/jenkins/slave-i-* | wc -l
18

Of them,

$ grep -l -a -E "apt-get failed.+aborting" /var/lib/jenkins/slave-i-* | wc -l
10

still failed at the end.

Final failure errors were like:

W: Failed to fetch http://us-east-1.ec2.archive.ubuntu.com.s3.amazonaws.com/ubuntu/dists/natty-updates/universe/binary-amd64/Packages 403 Forbidden

I.e. it's the same infamous 403 error we tried to get rid of by switching to the S3 mirror.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

On the other hand, all 10 of those "fatal" errors happened in "apt-get update" (not "apt-get install"), so they might not be truly random, but caused by maintenance, for example.

Those really shouldn't be fatal; apt-get itself says:

E: Some index files failed to download. They have been ignored, or old ones used instead.
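
So a reasonable refinement (again just a sketch, reusing the hypothetical stubborn_apt_get wrapper from the earlier comment) is to keep retrying both operations, but only treat an install failure as fatal:

 # "apt-get update" failures are retried but ultimately tolerated:
 # stale index files only mean slightly older package versions.
 stubborn_apt_get update || echo "apt-get update kept failing, continuing with existing indexes"

 # "apt-get install" failures stay fatal, otherwise a build runs with a
 # missing dependency (e.g. the toolchain case above).
 stubborn_apt_get install -y build-essential git-core openjdk-6-jdk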

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Also, 18 instances with retries seems like quite a lot. Q: how many instances were created since the deployment in total?

A: 51 since midnight 2012-03-02, so 18 is actually not that bad.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

I don't know if these are helpful:
http://aws.amazon.com/articles/1109#04
https://forums.aws.amazon.com/thread.jspa?threadID=21514

It looks like posting on the AWS forums may get us some support on the matter. Since we have tried both an internal and an external apt source, the problem would seem to be with the EC2 instances themselves, so posting on the forums would seem to be the way to go.

Would it be easy to deploy a test instance somewhere else, like http://www.bigv.io/ or http://www.rackspace.co.uk/cloud-hosting/cloud-products/cloud-servers/ or your local machine, and run a set of builds to see if we have problems outside Amazon's cloud?

Changed in linaro-android-infrastructure:
status: Fix Committed → In Progress
Revision history for this message
Paul Sokolovsky (pfalcon) wrote : Re: [Bug 941784] Re: Some builds started to fail with "unknown user: jenkins-build"

On Mon, 05 Mar 2012 12:22:41 -0000
James Tunnicliffe <email address hidden> wrote:

> I don't know if these are helpful:
> http://aws.amazon.com/articles/1109#04
> https://forums.aws.amazon.com/thread.jspa?threadID=21514

First link doesn't open for me (connection hangs); the 2nd is from
2008, with vague responses from AWS folks like "The issue may have
been caused by an unusually long propagation delay." Both symbolize
the state of EC2 issues very well ;-).

> It looks like posting on the AWS forums may get some support on the
> matter. Since we have tried an internal and external apt source,

Not sure what you mean here. The initial mirror was on plain EC2
instance(s), with files served by Apache; the new mirror is on S3,
with files served by AWS's specialized file service. In the end, they
both have the same issues (with the very first being misreporting of
errors).

> it
> would seem to be a problem with the EC2 instance and posting on the
> forums would seem to be the way to go.
>
> Would it be easy to deploy a test instance somewhere else, like
> http://www.bigv.io/ or http://www.rackspace.co.uk/cloud-hosting/cloud-
> products/cloud-servers/ or your local machine and run a set of builds
> to test to see if we have problems outside Amazon's cloud?

That sounds like a good (if long) plan, but I'm not sure we're the
right team to execute it. For example, it might be a good thing for
the Ubuntu EC2 team to do within the scope of
https://bugs.launchpad.net/bugs/932088 . Speaking of us, it would
certainly require a separate ticket, and based on the other issues we
have, it might get Low priority. That said, I'd like to register on
their forum and post about this issue, unless I get carried away by
other more urgent issues.

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Revision history for this message
Данило Шеган (danilo) wrote :

There were some fixes to the S3 mirroring yesterday (after I approached Ben and Scott, who are working on this), so I wonder if anything has improved since. I see no new failures of this kind using

 $ grep -l -a -E "apt-get failed.+aborting" /var/lib/jenkins/slave-i-*

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Well, there have been no random slave init failures for the last 2 days, so something definitely improved. I'd warn, however, against considering an improvement to be the permanent fix - that 2008 thread proves otherwise. But today the final fix to minimize the effects of possible EC2 instability was deployed; it should give us the needed robustness, supported on multiple levels.

Changed in linaro-android-infrastructure:
status: In Progress → Fix Released