Fuel for OpenStack

Manage logs taking the available free space into consideration

Registered by Miroslav Anashkin on 2014-02-10

It was discovered that long-lived, large production environments have the following issues with logs:

1. Log files keep growing and may not fit in the existing root partition on the master node. For large environments of 100+ nodes, 100 GiB of free space allows only 1 GiB of logs per node.

2. The default size of the root partition on all nodes may not be sufficient to keep the logs.

3. Diagnostic snapshot creation takes additional space on the master node and stalls if there is not enough free space. Moreover, it becomes impossible to create a new diagnostic snapshot until the stalled task is removed manually from the Nailgun database.

4. When environments are deleted, the related logs remain on the master node.

5. The diagnostic snapshot script checks the available nodes one by one and considers a node offline only after a timeout. Suppose the timeout is 1 minute. Then, for 100 nodes, it may take up to 100 minutes just to check whether these nodes are online.
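
The two estimates in items 1 and 5 can be reproduced with a few lines of arithmetic. This is only a sketch for capacity planning: the function names are hypothetical, and it assumes free space is split evenly between nodes and that every node probe runs sequentially and hits the full timeout.

```python
def log_budget_per_node_gib(free_space_gib, node_count):
    """Free space on the master node split evenly between nodes (assumption)."""
    return free_space_gib / node_count


def worst_case_check_minutes(node_count, timeout_minutes):
    """Upper bound for a sequential check where every probe times out."""
    return node_count * timeout_minutes


# The figures quoted in items 1 and 5 above.
print(log_budget_per_node_gib(100, 100))   # GiB of logs per node
print(worst_case_check_minutes(100, 1))    # minutes to check 100 nodes
```

Real environments will rarely hit the worst case, but the bound shows how both problems scale linearly with the number of nodes.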

I propose the following to get rid of the issues above:

1. Keep all remote logs on the master node compressed. This should save space. Additionally, if all these logs are already compressed, we can exclude the long compression task from diagnostic snapshot creation and use background compression with low priority. Consider using 7-zip/LZMA for background compression, since it produces smaller files for regularly structured logs.

2. Make it possible to download the diagnostic snapshot for a selected node (or nodes) only, plus the common part of the diagnostic snapshot.

3. Add a warning to the Fuel UI when the master node is running out of free space on /var (and on /).

4. Add a master node capacity planning section to the docs, to describe master node free space requirements and to help with deployment planning.

5. Add a free space check on the master node prior to diagnostic snapshot creation. Warn the user if there does not seem to be enough free space, but allow snapshot creation anyway: we use compression, so the actual snapshot size may be smaller than estimated.

6. Add a status indicator to each environment in the Fuel UI. Show the full status when the indicator is clicked. Include the free space estimate and other requirements in this status.

7. Add an overall status area to the Fuel UI.

8. Modify the diagnostic snapshot creation script to force it to delete or finish a failed snapshot task in the database; otherwise the Generate Snapshot button remains inaccessible.

9. Move the logs related to deleted nodes/environments to a separate trash directory. Compress them if necessary. Add a button to clean up the trash folder on demand.

10. Include the trash folder size in the master node status info.

11. Do not include the trash folder in the diagnostic snapshot, but make it available for download on a per-environment basis.

12. Add a new logrotate job driven by the remaining free space on the /var partition. The settings of this job should allow reducing the number of log backups and/or otherwise reducing the current log size. The job should do its best to keep at least 1 GB of free space on the master node. This free 1 GB is vital for RabbitMQ.

13. Make the offline node check in the diagnostic snapshot script a multi-threaded task. The number of threads may be calculated from the available CPU power.
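
The free space check from items 5 and 12 could look roughly like the sketch below. It is not existing Fuel code: the function name, the message text and the `free_bytes` override (useful for testing) are hypothetical, and the 1 GB reserve is simply the figure from item 12. Only `shutil.disk_usage` is a real standard library call.

```python
import shutil

# Assumption: reserve the 1 GB that item 12 calls vital for RabbitMQ.
RESERVE_BYTES = 1 * 1024 ** 3


def free_space_warning(path, estimated_snapshot_bytes,
                       reserve_bytes=RESERVE_BYTES, free_bytes=None):
    """Return a warning string if the snapshot may not fit, else None.

    The caller is expected to show the warning and still let the user
    proceed, because the compressed snapshot may be smaller than estimated.
    """
    if free_bytes is None:
        # Real free space on the partition that holds `path`.
        free_bytes = shutil.disk_usage(path).free
    remaining = free_bytes - estimated_snapshot_bytes
    if remaining < reserve_bytes:
        return (
            f"{path}: snapshot is estimated at {estimated_snapshot_bytes} bytes "
            f"but only {free_bytes} bytes are free; less than "
            f"{reserve_bytes} bytes would remain"
        )
    return None
```

The function only reports; it never blocks snapshot creation, which matches the "warn but allow" behaviour proposed in item 5.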
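
Item 13 can be illustrated with a thread pool, so that slow or offline nodes wait out their timeouts in parallel instead of one after another. This is a minimal sketch, not the actual snapshot script: `is_reachable`, `check_nodes`, the TCP probe on port 22 and the "4 threads per CPU" heuristic are all assumptions made for illustration.

```python
import os
import socket
from concurrent.futures import ThreadPoolExecutor


def is_reachable(host, port=22, timeout=60.0):
    """Assumption: a node is online if a TCP connection to `port` succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, probe=is_reachable, max_workers=None):
    """Probe all hosts concurrently and return {host: online?}."""
    hosts = list(hosts)
    if not hosts:
        return {}
    if max_workers is None:
        # Hypothetical heuristic: probes are I/O bound, so allow a few
        # threads per CPU, but never more threads than hosts.
        max_workers = min(len(hosts), (os.cpu_count() or 1) * 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with hosts.
        return dict(zip(hosts, pool.map(probe, hosts)))
```

With enough workers, the worst-case check time approaches a single timeout rather than the sum of all timeouts. Threads fit here because the probes spend nearly all their time waiting on the network.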

Blueprint information

Status: Not started

Approver: Mike Scherbakov

Priority: Medium

Drafter: Miroslav Anashkin

Direction: Approved

Assignee: Bogdan Dobrelya

Definition: Drafting

Series goal: None

Implementation: Not started

Milestone target: next

Related bugs

Bug #1318517: /var/log/ on nodes is being polluted over time, and it’s root partition	Invalid
Bug #1328879: [customer-bp] Shotgun should ensure enough disk space for diagnostic snapshot	Won't Fix
Bug #1371757: Deployment Fails when /var is Full.	Fix Released
Bug #1376209: Logs on controllers are not rotated: /var/log/murano/ directory has insecure permissions	Fix Released
Bug #1378327: [fuel-library] Incorrect logrotate configuration leads to lack of free disk space on master node	Fix Released
Bug #1394864: default partitioning doing bad on small HDD/SSD/Other drives	Fix Released
Bug #1543491: Generating a diagnostic snapshot triggers to an error "exit code:1 stderr:"	Fix Committed
Bug #1546023: snapshot dump timeout with /var/log/ - 16G	Fix Committed

Subscribers

Dave Johnston

Fabrizio Soppelsa

Georgy Kibardin