Manage logs taking the available free space into consideration

Registered by Miroslav Anashkin on 2014-02-10

It was discovered that long-lived, large production environments have the following issues with logs:

1. Log files keep growing and may not fit on the existing root partition of the master node. In large environments of 100+ nodes, 100 GiB of free space allows keeping at most 1 GiB of logs per node.

2. The default size of the root partition on all nodes may not be sufficient to keep the logs.

3. Diagnostic snapshot creation takes additional space on the master node and stalls if there is not enough free space. Moreover, it becomes impossible to create a new diagnostic snapshot until the stalled task is removed manually from the Nailgun database.

4. When we delete environments, the related logs remain on the master node.

5. The diagnostic snapshot script checks the available nodes one by one and considers a node offline only after a timeout. Suppose the timeout is 1 minute. Then, for 100 nodes, it may take up to 100 minutes just to check whether these nodes are online.
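
The figures in issues 1 and 5 follow from simple arithmetic, sketched below. The function names are illustrative only and are not part of Fuel or Nailgun; the sketch assumes free space is shared evenly between nodes and that every node hits the timeout in the worst case.

```python
def per_node_log_budget_gib(free_space_gib, node_count):
    """Average log space per node when free space is shared evenly."""
    return free_space_gib / node_count


def worst_case_sequential_check_minutes(node_count, timeout_minutes):
    """Upper bound for a one-by-one check if every node hits the timeout."""
    return node_count * timeout_minutes


print(per_node_log_budget_gib(100, 100))            # 1.0
print(worst_case_sequential_check_minutes(100, 1))  # 100
```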

I propose the following to address the issues above:

1. Keep all remote logs on the master node compressed to save space. Additionally, if all these logs are already compressed, we may exclude the long compression task from diagnostic snapshot creation and use background compression with low priority. Consider using 7-zip/LZMA for background compression, since it produces smaller files for regular log structures.

2. Make it possible to download a diagnostic snapshot for a selected node (or nodes) only, plus the common part of the diagnostic snapshot.

3. Add a warning to the Fuel UI when the master node is running out of free space on /var (and on /).

4. Add a master node capacity planning section to the docs to describe master node free space requirements and to help with deployment planning.

5. Add a free space check on the master node prior to diagnostic snapshot creation. Warn the user if there does not seem to be enough free space, but allow snapshot creation anyway, since we use compression and the actual snapshot size may be smaller than estimated.

6. Add a status indicator to each environment in the Fuel UI. Show the full status when the indicator is clicked. Include the free space estimate and other requirements in this status.

7. Add an overall status area to the Fuel UI.

8. Modify the diagnostic snapshot creation script so that it deletes or finishes a failed snapshot task in the database; otherwise the Generate Snapshot button remains inaccessible.

9. Move the logs related to deleted nodes/environments to a separate trash directory. Compress them if necessary. Add a button to clean up the trash folder on demand.

10. Include the trash folder size in the master node status info.

11. Do not include the trash folder in the diagnostic snapshot, but make it available for download on a per-environment basis.

12. Add a new logrotate job driven by the remaining free space on the /var partition. The settings of this job should allow reducing the number of log backups and/or reducing the current log size in other ways. The job should do its best to keep at least 1 GB of free space on the master node. This free 1 GB is vital for RabbitMQ.

13. Make the offline node check in the diagnostic snapshot script a multi-threaded task. The number of threads may be calculated from the available CPU power.
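
Proposals 5 and 13 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual snapshot script: the function names, the `probe` callable, and the "4 threads per CPU" sizing rule are all hypothetical. The `probe` callable is supplied by the caller and is expected to enforce its own timeout.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def has_enough_free_space(path, estimated_bytes):
    """Return True if the filesystem holding `path` has at least
    `estimated_bytes` free. The caller should only warn on False,
    since compression may make the real snapshot smaller."""
    return shutil.disk_usage(path).free >= estimated_bytes


def check_nodes_online(nodes, probe, max_workers=None):
    """Probe all nodes concurrently and return {node: is_online}.

    `probe(node)` is a hypothetical callable returning True for a
    reachable node; any exception it raises counts as offline.
    """
    nodes = list(nodes)
    if not nodes:
        return {}
    if max_workers is None:
        # Assumption: probes are I/O bound, so use several threads per CPU.
        max_workers = min(len(nodes), (os.cpu_count() or 1) * 4)

    def safe_probe(node):
        try:
            return bool(probe(node))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with nodes.
        results = list(pool.map(safe_probe, nodes))
    return dict(zip(nodes, results))
```

With 100 nodes, a 1 minute timeout, and 20 worker threads, the worst case for the online check drops from about 100 minutes to about 5 minutes, because the nodes are probed in roughly 5 batches of 20.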
