Scalability and deployment strategies for LAVA

Registered by Paul Larson

Right now we're using a single server and deploying packages that line up with our monthly releases. As LAVA is quickly growing and becoming increasingly important within Linaro, we should explore how to make it more scalable and how to support a more agile cycle of development, testing, and deployment of LAVA components.

Session Notes:
Current performance is "decent"
* LMC (linaro-media-create) uses flock to serialize runs (see the sketch below)
 - ACTION: investigate doing this on a ramdisk
 - offloading this to another machine that is not handling other dispatcher activities would be better
 - offload Jenkins too
 - the main host should just be for interactive things
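
A minimal sketch of the flock serialization described above, assuming Python; the lock path and the example command are placeholders, not the actual LMC code.

    import fcntl
    import subprocess

    # Hypothetical lock file path; any shared, writable location works.
    LOCK_PATH = "/var/lock/lmc.lock"

    def run_serialized(cmd):
        """Run cmd while holding an exclusive flock, so only one run happens at a time."""
        with open(LOCK_PATH, "w") as lock_file:
            # Blocks here until any other holder releases the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                return subprocess.call(cmd)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)

    # Illustrative call only -- not the real LMC invocation:
    # run_serialized(["linaro-media-create", "--help"])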

Celery is needed to help us spawn workers (see the sketch below)
 - maybe can
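
A minimal sketch of pushing that kind of work through Celery so it runs on a separate worker host, assuming a recent Celery API; the broker URL and task are illustrative, not LAVA's actual setup.

    import subprocess

    from celery import Celery

    # Hypothetical broker; a real deployment would point at its own RabbitMQ/Redis.
    app = Celery("lava_tasks", broker="amqp://guest@localhost//")

    @app.task
    def create_image(args):
        """Run an image-creation command on whichever worker picks up the task."""
        return subprocess.call(args)

    # Submitted from the main host, executed asynchronously on a worker:
    # create_image.delay(["linaro-media-create", "--help"])

Workers started on the offload machine (e.g. celery -A lava_tasks worker) would pick these up.
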
Using the cloud
 - some worker tasks with low data-transfer requirements could be moved there right away, if we have any such tasks
 - eventually, deploying web server nodes, the database, etc. to the cloud would be possible while keeping the dispatcher local
     - use Juju for easy deployment
Measurements
 - collectd is running, but not terribly useful
 - database transactions/min would be useful
 - a Google Analytics-style hit rate counter
 - ACTION: investigate graphite and statsd (see the sketch after this list)
 - Sentry monitoring for Django apps
 - should we run in the Canonical datacenter for things other than the dispatcher?
  - deployment might be an issue there
  - changes require submitting RTs
  - could we experiment with running a secondary staging server there?
 - ACTION: Launchpad has a script that measures how long transactions run; this should help us avoid making database transactions that take too long
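
A minimal sketch of what the statsd/graphite action could look like, assuming the Python statsd client; the daemon address and metric names are made up for illustration.

    import statsd

    # Hypothetical statsd daemon; graphite charts whatever arrives there.
    client = statsd.StatsClient("localhost", 8125)

    def record_page_hit(view_name):
        # Google-Analytics-style hit counter, one bucket per view.
        client.incr("lava.hits.%s" % view_name)

    def record_db_transaction(elapsed_ms):
        # Feeds transactions/min and latency graphs in graphite.
        client.timing("lava.db.transaction", elapsed_ms)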

Postgres schema migrations are disruptive; a distributed Postgres setup might have issues with this
Many processes have to talk to the database (scheduler, dashboard, ...)
 - could they talk to it through the queue instead?
Web server performance (responsiveness)
Database performance
System load (with a notification)
 - investigate using Nagios or New Relic for this
Memory/swap usage
uWSGI has a feature (e.g. the harakiri timeout) to notify you and/or take action if a request takes longer than a given threshold
ACTION: Make sure we're making use of this when we do the new deployment
ACTION: Does Postgres have a slow-queries log?
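
As a complement to the uWSGI timeout noted above, a minimal sketch of logging slow requests from inside Django itself (old-style middleware; the threshold and logger name are arbitrary choices for illustration).

    import logging
    import time

    logger = logging.getLogger("lava.slow_requests")
    SLOW_REQUEST_SECONDS = 2.0  # illustrative threshold

    class SlowRequestLogMiddleware(object):
        """Log any request whose wall-clock time exceeds the threshold."""

        def process_request(self, request):
            request._start_time = time.time()

        def process_response(self, request, response):
            start = getattr(request, "_start_time", None)
            if start is not None:
                elapsed = time.time() - start
                if elapsed > SLOW_REQUEST_SECONDS:
                    logger.warning("slow request: %s took %.2fs", request.path, elapsed)
            return response
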
Caching
* global enablement is not going to work
* enabling it globally for anonymous users and dropping the timeout to a few minutes would be better (see the sketch after this list)
  - take measurements first
* cache reports at the API level
* tests need to be cache-aware because they will see stale data; need to figure out how to turn caching off when testing
* wall-clock time is an OK way to check whether this is improving
* is there a way to measure the cache hit/miss rate?
 - memcached could probably help measure this
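
A minimal sketch of the anonymous-only, short-timeout proposal, assuming a Django 1.x-era settings.py with memcached; the location and timeout values are placeholders.

    # settings.py fragment (illustrative values)
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # Site-wide cache middleware: UpdateCache must be first, FetchFromCache last.
    MIDDLEWARE_CLASSES = (
        "django.middleware.cache.UpdateCacheMiddleware",
        # ... existing middleware stack ...
        "django.middleware.cache.FetchFromCacheMiddleware",
    )

    CACHE_MIDDLEWARE_SECONDS = 180          # "a few minutes"
    CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True  # skip the cache for logged-in users

For the hit/miss question, memcached itself exposes get_hits and get_misses counters (python-memcached's get_stats() returns them), so a rough hit rate can be computed from those.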

Sentry could be used to track lots of different kinds of errors, including deploy failures
ACTION: Investigate Sentry (see the sketch below)
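
A minimal sketch of wiring the Django apps to Sentry, assuming the raven client; the app path varies between client versions and the DSN is a placeholder issued by the Sentry server.

    # settings.py fragment (illustrative)
    INSTALLED_APPS = (
        # ... existing apps ...
        "raven.contrib.django.raven_compat",
    )

    RAVEN_CONFIG = {
        # Placeholder DSN for the project configured in Sentry.
        "dsn": "https://public_key:secret_key@sentry.example.com/1",
    }

With this in place, unhandled exceptions from the web apps are reported automatically; deploy failures would need an explicit report from the deployment scripts.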

Zyga: look at getting caching enabled, Celery, Sentry
Michael: statsd/graphite
Dave/Paul: other monitoring things

Blueprint information

Status: Complete
Approver: Paul Larson
Priority: Undefined
Drafter: None
Direction: Needs approval
Assignee: None
Definition: Obsolete
Series goal: Accepted for linaro-11.11
Implementation: Unknown
Milestone target: None
Completed by: Andy Doan

Whiteboard

We have Celery deployed for things like this now.

