Scalability and deployment strategies for LAVA

Registered by Paul Larson

Right now we're using a single server and deploying packages that line up with our monthly releases. As LAVA is quickly growing and becoming increasingly important within Linaro, we should explore how to make it more scalable and how to support a more agile cycle of development, testing, and deployment of LAVA components.

Session Notes:
Current performance is "decent"
* LMC (linaro-media-create) uses flock to serialize runs (see the sketch below)
 - ACTION: investigate doing this on a ramdisk
 - offloading this to another machine that is not handling other dispatcher activities would be better
 - offload Jenkins too
 - the main host should just be for interactive things
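
A minimal sketch of the flock serialization described above, assuming Python; the lock path and the example command are placeholders, not the actual LMC code.

    import fcntl
    import subprocess

    # Hypothetical lock file path; any shared, writable location works.
    LOCK_PATH = "/var/lock/lmc.lock"

    def run_serialized(cmd):
        """Run cmd while holding an exclusive flock, so only one run happens at a time."""
        with open(LOCK_PATH, "w") as lock_file:
            # Blocks here until any other holder releases the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                return subprocess.call(cmd)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)

    # Illustrative call only -- not the real LMC invocation:
    # run_serialized(["linaro-media-create", "--help"])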

Celery is needed to help us spawn workers (see the sketch below)
 - maybe can
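
A minimal sketch of pushing that kind of work through Celery so it runs on a separate worker host, assuming a recent Celery API; the broker URL and task are illustrative, not LAVA's actual setup.

    import subprocess

    from celery import Celery

    # Hypothetical broker; a real deployment would point at its own RabbitMQ/Redis.
    app = Celery("lava_tasks", broker="amqp://guest@localhost//")

    @app.task
    def create_image(args):
        """Run an image-creation command on whichever worker picks up the task."""
        return subprocess.call(args)

    # Submitted from the main host, executed asynchronously on a worker:
    # create_image.delay(["linaro-media-create", "--help"])

Workers started on the offload machine (e.g. celery -A lava_tasks worker) would pick these up.
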
Using the cloud
 - some worker tasks with low data-transfer requirements could be moved there right away, if we have any such tasks
 - eventually, deploying web server nodes, the database, etc. to the cloud would be possible while keeping the dispatcher local
     - use Juju for easy deployment
Measurements
 - collectd is running, but not terribly useful
 - database transactions/min would be useful
 - a Google Analytics-style hit rate counter
 - ACTION: investigate graphite and statsd (see the sketch after this list)
 - Sentry monitoring for Django apps
 - should we run in the Canonical datacenter for things other than the dispatcher?
  - deployment might be an issue there
  - changes require submitting RTs
  - could we experiment with running a secondary staging server there?
 - ACTION: Launchpad has a script that measures how long transactions run; this should help us avoid making database transactions that take too long
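
A minimal sketch of what the statsd/graphite action could look like, assuming the Python statsd client; the daemon address and metric names are made up for illustration.

    import statsd

    # Hypothetical statsd daemon; graphite charts whatever arrives there.
    client = statsd.StatsClient("localhost", 8125)

    def record_page_hit(view_name):
        # Google-Analytics-style hit counter, one bucket per view.
        client.incr("lava.hits.%s" % view_name)

    def record_db_transaction(elapsed_ms):
        # Feeds transactions/min and latency graphs in graphite.
        client.timing("lava.db.transaction", elapsed_ms)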

Postgres schema migrations are disruptive; a distributed Postgres setup might have issues with this
Many processes have to talk to the database (scheduler, dashboard, ...)
 - could they talk to it through the queue instead?
Web server performance (responsiveness)
Database performance
System load (with a notification)
 - investigate using Nagios or New Relic for this
Memory/swap usage
uWSGI has a feature (e.g. the harakiri timeout) to notify you and/or take action if a request takes longer than a given threshold
ACTION: Make sure we're making use of this when we do the new deployment
ACTION: Does Postgres have a slow-queries log?
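
As a complement to the uWSGI timeout noted above, a minimal sketch of logging slow requests from inside Django itself (old-style middleware; the threshold and logger name are arbitrary choices for illustration).

    import logging
    import time

    logger = logging.getLogger("lava.slow_requests")
    SLOW_REQUEST_SECONDS = 2.0  # illustrative threshold

    class SlowRequestLogMiddleware(object):
        """Log any request whose wall-clock time exceeds the threshold."""

        def process_request(self, request):
            request._start_time = time.time()

        def process_response(self, request, response):
            start = getattr(request, "_start_time", None)
            if start is not None:
                elapsed = time.time() - start
                if elapsed > SLOW_REQUEST_SECONDS:
                    logger.warning("slow request: %s took %.2fs", request.path, elapsed)
            return response
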
Caching
* global enablement is not going to work
* enabling it globally for anonymous users and dropping the timeout to a few minutes would be better (see the sketch after this list)
  - take measurements first
* cache reports at the API level
* tests need to be cache-aware because they will see stale data; need to figure out how to turn caching off when testing
* wall-clock time is an OK way to check whether this is improving
* is there a way to measure the cache hit/miss rate?
 - memcached could probably help measure this
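
A minimal sketch of the anonymous-only, short-timeout proposal, assuming a Django 1.x-era settings.py with memcached; the location and timeout values are placeholders.

    # settings.py fragment (illustrative values)
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # Site-wide cache middleware: UpdateCache must be first, FetchFromCache last.
    MIDDLEWARE_CLASSES = (
        "django.middleware.cache.UpdateCacheMiddleware",
        # ... existing middleware stack ...
        "django.middleware.cache.FetchFromCacheMiddleware",
    )

    CACHE_MIDDLEWARE_SECONDS = 180          # "a few minutes"
    CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True  # skip the cache for logged-in users

For the hit/miss question, memcached itself exposes get_hits and get_misses counters (python-memcached's get_stats() returns them), so a rough hit rate can be computed from those.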

Sentry could be used to track lots of different kinds of errors, including deploy failures
ACTION: Investigate Sentry (see the sketch below)
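
A minimal sketch of wiring the Django apps to Sentry, assuming the raven client; the app path varies between client versions and the DSN is a placeholder issued by the Sentry server.

    # settings.py fragment (illustrative)
    INSTALLED_APPS = (
        # ... existing apps ...
        "raven.contrib.django.raven_compat",
    )

    RAVEN_CONFIG = {
        # Placeholder DSN for the project configured in Sentry.
        "dsn": "https://public_key:secret_key@sentry.example.com/1",
    }

With this in place, unhandled exceptions from the web apps are reported automatically; deploy failures would need an explicit report from the deployment scripts.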

Zyga: look at getting caching enabled, Celery, Sentry
Michael: statsd/graphite
Dave/Paul: other monitoring things

Blueprint information

Status: Complete
Approver: Paul Larson
Priority: Undefined
Drafter: None
Direction: Needs approval
Assignee: None
Definition: Obsolete
Series goal: Accepted for linaro-11.11
Implementation: Unknown
Milestone target: None
Completed by: Andy Doan

Whiteboard

We have Celery deployed for things like this now.

