snapshots.l.o very high cpu usage, causing http timeouts

Bug #1183411 reported by Ben Copeland
This bug affects 1 person
Affects                    Status        Importance  Assigned to      Milestone
Linaro Android             Fix Released  Undecided   Unassigned
linaro-license-protection  Fix Released  High        Milo Casagrande

Bug Description

snapshots.l.o quite often gets stuck at 100% CPU load. top reports the apache2 processes together using about 100% CPU, and after a while the load average climbs to over 40.00.

Example:

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  4487 www-data 20 0 937m 69m 7644 S 33.9 1.9 10:15.45 apache2
  6533 www-data 20 0 935m 45m 6024 S 32.9 1.2 0:35.55 apache2
  4459 www-data 20 0 938m 70m 7640 S 32.5 1.9 6:30.87 apache2
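
As a quick check of the sample above, the per-process figures can be added up. This is only a sketch: it assumes the three rows shown are the whole picture, and the function name `total_cpu` is made up for illustration.

```python
# Sum the %CPU column for apache2 rows in the `top` sample quoted above.
TOP_SAMPLE = """\
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
  4487 www-data 20 0 937m 69m 7644 S 33.9 1.9 10:15.45 apache2
  6533 www-data 20 0 935m 45m 6024 S 32.9 1.2 0:35.55 apache2
  4459 www-data 20 0 938m 70m 7640 S 32.5 1.9 6:30.87 apache2
"""


def total_cpu(sample, command="apache2"):
    """Return the summed %CPU of all rows whose COMMAND equals `command`."""
    lines = sample.strip().splitlines()
    header = lines[0].split()
    cpu_idx = header.index("%CPU")
    cmd_idx = header.index("COMMAND")
    total = 0.0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > cmd_idx and fields[cmd_idx] == command:
            total += float(fields[cpu_idx])
    return round(total, 1)


print(total_cpu(TOP_SAMPLE))  # 99.3
```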

I have included a two-week screenshot from CloudWatch showing the CPU usage. I have had a look at the Apache logs at the time of the CPU spikes, and there is no abnormal activity to report.

The problem is that when the server gets stuck in this state, it often hangs and causes HTTP timeouts. Restarting Apache does not bring the load down. The CPU can stay at 100% for a few hours before it drops again.

Ben Copeland (bcc) wrote :
Fathi Boudra (fboudra)
Changed in linaro-license-protection:
importance: Undecided → High
status: New → Confirmed
milestone: none → 2013.05
vishal (vishalbhoj) wrote :

The download links on the Android build page do not work when the load on snapshots is high.

Fathi Boudra (fboudra) wrote :

I suspect this bug has some nasty side effects on ci.linaro.org and android-build.linaro.org:
- I observed random failures when publishing build artifacts (https://bugs.launchpad.net/linaro-ci/+bug/1180669).
  The latest one is https://ci.linaro.org/jenkins/view/engineering-builds/job/package-and-publish-linux-linaro/hwpack=vexpress64,label=precise_hwpack_cloud/134/console
- Page rendering fetches data from snapshots, and the page gets stuck.
  See https://android-build.linaro.org/builds/~linaro-android-restricted/test-private-test-suite/#build=48
  The downloads link isn't accessible.

Fathi Boudra (fboudra)
Changed in linaro-android:
milestone: none → 13.05
Milo Casagrande (milo)
Changed in linaro-license-protection:
assignee: nobody → Milo Casagrande (milo)
Milo Casagrande (milo) wrote :

I have been monitoring the server-side situation since yesterday.

We are getting errors in the logs of the form:

[Wed May 29 07:11:59 2013] [error] [client ] mod_wsgi (pid=11135): Exception occurred processing WSGI script '/srv/snapshots.linaro.org/configs/wsgi/wsgi_snapshots.py'.
[Wed May 29 07:11:59 2013] [error] [client ] IOError: failed to write data
[Wed May 29 07:16:06 2013] [error] [client ] mod_wsgi (pid=14291): Exception occurred processing WSGI script '/srv/snapshots.linaro.org/configs/wsgi/wsgi_snapshots.py'.
[Wed May 29 07:16:06 2013] [error] [client ] IOError: failed to write data
[Wed May 29 07:24:38 2013] [error] [client ] mod_wsgi (pid=10407): Exception occurred processing WSGI script '/srv/snapshots.linaro.org/configs/wsgi/wsgi_snapshots.py'.
[Wed May 29 07:24:38 2013] [error] [client ] IOError: failed to write data
[Wed May 29 07:35:16 2013] [error] [client ] mod_wsgi (pid=10571): Exception occurred processing WSGI script '/srv/snapshots.linaro.org/configs/wsgi/wsgi_snapshots.py'.
[Wed May 29 07:35:16 2013] [error] [client ] IOError: failed to write data

It might be interesting to add some more output to the wsgi script to really see what is happening.
Another thing I noticed is that accessing this page:

http://snapshots.linaro.org/openembedded/sources

takes a long time, and it can end up in a "server error" page. Loading that page also drives Apache up to 100% CPU usage.
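
One possible way to get more output out of the WSGI layer is a small wrapper around the application callable. This is a rough sketch, not the contents of wsgi_snapshots.py: the class name `LogFailedWrites` and the logger name are invented here. It relies only on standard WSGI behaviour, namely that the server calls close() on the response iterable when it stops early, for example after a failed write to the client.

```python
import logging
import time

# Hypothetical logger name; pick whatever fits the deployment.
log = logging.getLogger("wsgi_snapshots_debug")


class LogFailedWrites(object):
    """WSGI middleware that logs which requests did not finish streaming."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        started = time.time()
        sent = 0
        completed = False
        result = self.app(environ, start_response)
        try:
            for chunk in result:
                sent += len(chunk)
                yield chunk
            completed = True
        finally:
            # Runs on normal completion and when the server closes the
            # generator early (GeneratorExit is raised at the yield).
            if hasattr(result, "close"):
                result.close()
            level = logging.INFO if completed else logging.WARNING
            log.log(
                level,
                "path=%s client=%s sent=%d completed=%s seconds=%.2f",
                environ.get("PATH_INFO", ""),
                environ.get("REMOTE_ADDR", ""),
                sent,
                completed,
                time.time() - started,
            )
```

Assuming the script exposes the usual mod_wsgi `application` callable, it could be wrapped with `application = LogFailedWrites(application)`. Interrupted responses would then show up as WARNING lines carrying the path and client address.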

Milo Casagrande (milo) wrote :

Further investigations:
- Ran strace on the Apache PIDs: when Apache's CPU usage shoots up, it is serving files from the www/openembedded/sources/ directory, often the same file many times over
- Ran netstat: when Apache is swamped in that way, all the requests for files in www/openembedded/sources/ come from the same IP, using some sort of wget client
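
The findings above came from strace and netstat. A different way to cross-check them would be to count requests per client and path in the Apache access log. The sketch below assumes the common/combined log format. The sample lines are made up, with documentation-range IP addresses, and `top_clients` is an illustrative name.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def top_clients(lines, prefix, limit=5):
    """Count (client IP, path) pairs for requests under `prefix`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(prefix):
            hits[(match.group("ip"), match.group("path"))] += 1
    return hits.most_common(limit)


# Made-up sample data, not taken from the real server logs.
SAMPLE = [
    '198.51.100.7 - - [29/May/2013:07:11:59 +0000] '
    '"GET /openembedded/sources/a.tar.gz HTTP/1.1" 200 1024 "-" "Wget/1.13.4"',
    '198.51.100.7 - - [29/May/2013:07:12:01 +0000] '
    '"GET /openembedded/sources/a.tar.gz HTTP/1.1" 200 1024 "-" "Wget/1.13.4"',
    '203.0.113.9 - - [29/May/2013:07:12:03 +0000] '
    '"GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for (ip, path), count in top_clients(SAMPLE, "/openembedded/sources/"):
    print(ip, path, count)
```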

Milo Casagrande (milo) wrote :

Spoke with Ben: we pre-emptively blocked the offending IP address that was swamping snapshots.l.o.
We should monitor the situation and see if it changes, and if somebody complains about not being able to access snapshots.l.o.

Milo Casagrande (milo)
Changed in linaro-license-protection:
status: Confirmed → In Progress
Fathi Boudra (fboudra) wrote :

I've seen this issue before. BUILD-INFO.txt should use a wildcard instead of listing all the files individually.
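
To illustrate the wildcard point in general terms: one glob pattern can stand in for a long list of per-file entries. The snippet uses Python's fnmatch and invented file names. It is not the BUILD-INFO.txt parser, and the real file format is not shown in this report.

```python
from fnmatch import fnmatch


def matching_patterns(filename, patterns):
    """Return the patterns from `patterns` that match `filename`."""
    return [p for p in patterns if fnmatch(filename, p)]


# Hypothetical file names, for illustration only.
files = ["source-%03d.tar.gz" % i for i in range(1000)]

explicit = list(files)    # one entry per file
wildcard = ["*.tar.gz"]   # a single pattern covering all of them

# Both lists cover every file, but the wildcard list has a single entry.
assert all(matching_patterns(f, explicit) == [f] for f in files)
assert all(matching_patterns(f, wildcard) == ["*.tar.gz"] for f in files)
print(len(explicit), len(wildcard))  # 1000 1
```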

Fathi Boudra (fboudra)
Changed in linaro-android:
status: New → Fix Released
Milo Casagrande (milo)
Changed in linaro-license-protection:
status: In Progress → Fix Released