Builds regularly hang at the very end of build process

Bug #940226 reported by Paul Sokolovsky
This bug affects 1 person
Affects: Linaro Android Infrastructure
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-ics-gcc46-armlt-stable-open/101/ is a build which ran (well, hung) for more than a day! And that's one of 3-4 such builds I have seen in the last month. We recently deployed the Build Timeout plugin, which was expected to guard against builds hanging in the build (compilation) phase proper. But this lock-up (and the others I saw) happened during the SSH transfer phase.

Actually, on second thought, SSH transfer is just one build step, and the entire build process should be covered by the Build Timeout plugin, so the plugin doesn't work that reliably (I tested that it does work in the general sense, of course). Also, the SSH plugin has its own timeouts for networking operations; they don't help either (or the hang is not in network access).

So, issues should be reported upstream to both plugins. However, it's clear that we need another, last-resort stop-gap measure to kill runaway build slaves. Such a measure was requested back in May 2011, and its implementation has been slipping ever since, because the issues come in irregular waves: it hits and we're concerned; it subsides and we think we fixed it, and other pressing projects take over. Well, we can't fix it: it's all a stream of random errors rooted in the complexity of the systems we use. The system has many layers, so we should fight errors on many levels to make it robust.

Suggestion: create BP for this, schedule for immediate execution (12.03).


Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

+ /mnt/jenkins/workspace/linaro-android_vexpress-ics-gcc46-armlt-stable-open/build-tools/build-scripts/post-build-lava.py
Don't know how to test this board. Skip testing.
SSH: Connecting from host [ip-10-243-34-224]
SSH: Connecting with configuration [snapshots.linaro.org] ...
SSH: Disconnecting configuration [snapshots.linaro.org] ...
SSH: Transferred 0 file(s)
SSH: Connecting from host [ip-10-243-34-224]
SSH: Connecting with configuration [snapshots.linaro.org file-move] ...
SSH: EXEC: STDOUT/STDERR from command [reshuffle-files linaro-android_vexpress-ics-gcc46-armlt-stable-open/101] ...
WARNING: Expected directory /home/android-build-linaro/android/.tmp/linaro-android_vexpress-ics-gcc46-armlt-stable-open/101 does not exist
SSH: EXEC: completed after 201 ms
SSH: Disconnecting configuration [snapshots.linaro.org file-move] ...
SSH: Transferred 0 file(s)

So, it happened here on the 2nd batch of transfers, when we transfer lava-job-info. That transfer completed; 0 files were transferred because we still don't have support for the vexpress LAVA config in android-build. But it never moved on to the next step, calling reshuffle-files.

Changed in linaro-android-infrastructure:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Nothing bad is shown for that EC2 instance in the AWS console, but it shows that the instance has been running for 37h by now, i.e. it did quite a few builds before this fatal one. It's as if it wore out and got stuck, like an old engine.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

The instance is fully reachable over SSH, loadavg is 0.10, 0.04, 0.05, and top doesn't show anything like java hogging the CPU (but it does show java running and alive).

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

I tried to abort #130 above. The stacktrace mentions fingerprinting, but even "Recording fingerprints" appeared only after I clicked "abort", so there's still a good chance it hangs in SSH.

https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-ics-gcc46-armlt-stable-open/130/console :

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

We now have a very crude, spammy script to monitor runaway slaves, and I act on its output manually. That's of course not scalable at all, and needs further elaboration.
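
For illustration only, the core check of such a monitor could look like the sketch below. It is not the script mentioned above. The 12-hour threshold, the function name and the shape of the build records are assumptions; the `building` and `timestamp` (milliseconds since the epoch) fields are modeled on the per-build JSON that Jenkins exposes, and fetching that data is left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the report only says one build hung for more than a day.
MAX_BUILD_AGE = timedelta(hours=12)


def find_runaway_builds(builds, now, max_age=MAX_BUILD_AGE):
    """Return (job, number, age) tuples for builds still running past max_age.

    Each build is a dict with 'job', 'number', 'building' (bool) and
    'timestamp' (start time in milliseconds since the Unix epoch).
    """
    runaways = []
    for build in builds:
        if not build["building"]:
            continue  # finished builds cannot be runaways
        started = datetime.fromtimestamp(build["timestamp"] / 1000, tz=timezone.utc)
        age = now - started
        if age > max_age:
            runaways.append((build["job"], build["number"], age))
    # Report the oldest builds first.
    runaways.sort(key=lambda item: item[2], reverse=True)
    return runaways


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    day_and_a_half_ago = int((now - timedelta(hours=37)).timestamp() * 1000)
    sample = [
        {"job": "example-job", "number": 101, "building": True,
         "timestamp": day_and_a_half_ago},
    ]
    for job, number, age in find_runaway_builds(sample, now):
        print(f"runaway: {job} #{number}, running for {age}")
```

Keeping the check as a pure function over already-fetched build records makes it easy to test, and leaves the decision of what to do with a runaway (notify, abort, terminate the slave) to the caller.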

Changed in linaro-android-infrastructure:
status: Confirmed → In Progress
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, the "babysitting" aspect of this was handled in https://blueprints.launchpad.net/linaro-android-infrastructure/+spec/ec2-android-infra-monitoring . Let's retarget this bug to figuring out why builds hang from time to time.

summary: - Need external build slave babysitting now!
+ Build regularly hang at the very end of process
summary: - Build regularly hang at the very end of process
+ Build regularly hang at the very end of build process
Changed in linaro-android-infrastructure:
assignee: nobody → Paul Sokolovsky (pfalcon)
summary: - Build regularly hang at the very end of build process
+ Builds regularly hang at the very end of build process
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, the idea for finding where exactly builds hang (in the SSH plugin, or after it finishes) is to add another, dummy last step to the build sequence, which will for example print a message. After that, we'll be sure where to look for bugs (though based on current evidence it seems the build hangs after the SSH plugin finishes, but let's prove that).

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

The migration to add a dummy last step to the build sequence has been performed. So, now if hung builds get stuck after the 'echo "Build finished"' line, we know the hang happens in Jenkins core code that executes after the build steps finish; otherwise, it is in the SSH plugin.
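
The decision rule above can be written down as a small sketch that takes the console text of a hung build and reports which side of the marker the hang is on. The function name and the returned labels are made up for illustration; only the "Build finished" marker text comes from the build step described here.

```python
# Text printed by the dummy last build step described above.
MARKER = "Build finished"


def locate_hang(console_text, marker=MARKER):
    """Classify a hung build by whether the marker reached the console.

    Returns "post-build" if the marker is present (the user's build steps
    completed, so the hang is in code that runs afterwards), and
    "build-steps" otherwise (the hang is in one of the steps, e.g. SSH).
    """
    for line in console_text.splitlines():
        if marker in line:
            return "post-build"
    return "build-steps"


if __name__ == "__main__":
    sample = "SSH: Transferred 0 file(s)\n+ echo 'Build finished'\nBuild finished\n"
    print(locate_hang(sample))
```

Note that if the shell step runs with command tracing, the marker can show up in the trace line just before the echo itself runs; either way, its presence means the preceding SSH step has returned.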

Changed in linaro-android-infrastructure:
assignee: Paul Sokolovsky (pfalcon) → nobody
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, so we got the first case after adding the extra build step: https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-ics-gcc47-armlt-tracking-open/59/console

The build hangs after "echo Build finished", so the culprit is not the build steps (incl. SSHing), but the Jenkins code which runs after the user's build sequence failed.

Again, we never saw such cases on ci.linaro.org, and comparing the a-b config with the ci.* one, a-b has artifact fingerprinting enabled. Fingerprinting has already caused a few issues (lp:887657), so it seems it can be the next suspect.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

https://android-build.linaro.org/jenkins/job/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/85/console is different from the other problematic builds; it hung at:

INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/opt.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/parseutils.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/pixdesc.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/pixfmt.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/random_seed.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/rational.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/samplefmt.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/sha.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/timecode.h
INSTALL mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/external/ffmpeg/libavutil/timestamp.h
INSTALL libavutil/avconfig.h
INSTALL libavutil/libavutil.pc
INSTALL install-progs-yes
INSTALL ffmpeg
INSTALL ffprobe
INSTALL ffserver
make[1]: Leaving directory `/mnt/jenkins/workspace/linaro-android_vexpress-rtsm-ics-gcc47-armlt-stable-open/build/out/target/product/vexpress_rtsm/obj/ffmpeg'
Build step 'Execute shell and set build status' changed build result to FAILURE
Build step 'Execute shell and set build status' marked build as failure

But this is probably just the result of a build failure followed by a lock-up.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ever since we started using recent LTS Jenkins versions, I see this issue rarely, if at all. Lowering the priority and will keep watching.

Changed in linaro-android-infrastructure:
status: In Progress → Confirmed
importance: High → Medium
Revision history for this message
Alan Bennett (akbennett) wrote :

Due to the age of this issue, we are acknowledging that it will likely not be fixed. If this issue is still important, please add details and reopen it.

Changed in linaro-android-infrastructure:
status: Confirmed → Won't Fix