granular multistep configuration change

Registered by bjolo

Problem Description

CONTEXT:
Production private/public cloud; hundreds of compute nodes; hundreds of projects; tens of thousands of VMs/containers; thousands of VMs created and deleted per day; thousands of dependent users. Cloud downtime cost is estimated at millions of dollars per hour.

In this context, no change that goes into the production cloud can be made half-heartedly. The change management process and tooling need to adhere to best practices. There are of course different views on what that means, but it will include some or most of the following requirements:
 - Tested in dev/staging
 - Version controlled
 - Traceable
 - Trackable
 - Granular
 - Ability to roll back

Current solutions
Currently kolla provides two workflows for configuration changes, as described below. However, neither of them is quite sufficient, for various reasons.

Kolla-ansible reconfigure:
  - The operator makes changes in /etc/kolla/config, then runs kolla-ansible reconfigure.
  - New config is generated and pushed out. Containers are either restarted or recreated depending on config_strategy.
  - Pros:
     - Fully automated end to end. This is, however, also the problem, and the background to why this blueprint is submitted.
  - Cons:
   - Takes a really long time for large clusters
   - No way to inspect, verify or approve the config changes before they are pushed out and activated
   - No granularity. All nodes are reconfigured
    - Does not do a full stop on node failure, i.e. it follows normal Ansible behavior where all target nodes are executed in parallel. Is this desired? If one DB node fails to reconfigure, do we really want to continue with the other nodes?

 Kolla-ansible genconfig
  - Not really an officially supported kolla workflow, since it was added for kolla-kubernetes.
  - Changes are made to /etc/kolla/config.
  - Kolla-ansible genconfig generates the new config files, and pushes them out to /etc/kolla on target nodes
  - The operator needs to manually restart affected containers, i.e. it requires config_strategy: COPY_ALWAYS.
  - Pros
    - Fast to generate new configs, and operators can inspect that the result is as intended.
    - Granular, in the sense that admins restart containers manually at will, or can choose not to activate the new config.
  - Cons:
   - Only works for config_strategy: COPY_ALWAYS
    - Not the officially supported way to do config changes
    - No central source of truth, in the sense that config files on target nodes can be changed manually

Proposed Solution
As stated by many, kolla is not a product but a project. Given this, it is debatable where to draw the line. Kolla should not enforce or implement a rigorous CM process, but it should provide the mechanisms needed to construct one. Following the kolla core design principle, it should work out of the box with no massive configuration, while giving operators the tools to override and build out to suit their needs. Below is one idea, but it needs to be discussed.

Performing a configuration change is a three-step process: generate config, propagate config, and instantiate config. This could be implemented as a three-step pipeline, though it could be argued that steps 2 and 3 should be a single step.

 1. Generate config
 - Command: kolla <cloud> generate-config
 - All config files are generated and output on the kolla master node in a structure that represents the deployed cloud.
  ○ /etc/kolla/config --> /etc/kolla/genconfig/<node>
 - Operators can review the generated config and approve or scrap it. The config and genconfig dirs can be under git control.
 - Optional config setting and command to push changes to a central git server.
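The generate + review step could look like the following hedged sketch. Since the proposed `kolla <cloud> generate-config` command does not exist yet, generation is stubbed out with plain file writes; the paths, node name (compute01), and file contents are illustrative only.

```shell
# Sketch: review a regenerated config under git before approving it.
workdir=$(mktemp -d)
genconfig="$workdir/genconfig"          # stands in for /etc/kolla/genconfig
mkdir -p "$genconfig/compute01"

cd "$workdir"
git init -q .
git config user.email "ops@example.com"
git config user.name "ops"

# First generation run: baseline config for node compute01
printf 'debug = False\n' > "$genconfig/compute01/nova.conf"
git add -A
git commit -qm "baseline generated config"

# Second generation run: the change the operator wants to review
printf 'debug = True\n' > "$genconfig/compute01/nova.conf"

# The operator inspects the pending change before approving or scrapping it
git diff --stat
```

With the genconfig tree under git, "approve" becomes a commit and "scrap" becomes a checkout of the previous state.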

 2. Propagate config
 - Command: kolla <cloud> propagate-config [node | node-group | service ]
 - Pushes the generated config to the destination nodes.
 - Default is to push the files located in /etc/kolla/genconfig/<node> to /etc/kolla on the target node.
 - The source can be overridden to support a different URL and git reference instead.
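The default propagate behaviour described above can be sketched as a simple copy; this simulation uses local temp directories in place of the master node and a target node, whereas a real implementation would copy over SSH or Ansible.

```shell
# Sketch: default propagate step, /etc/kolla/genconfig/<node> -> <node>:/etc/kolla
master=$(mktemp -d)     # stands in for the kolla master node
target=$(mktemp -d)     # stands in for /etc/kolla on a target node

mkdir -p "$master/genconfig/compute01"
printf 'debug = True\n' > "$master/genconfig/compute01/nova.conf"

# Push the approved files for node compute01 to that node's config dir
cp -r "$master/genconfig/compute01/." "$target/"
```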

 3. Instantiate config
 - Command: kolla <cloud> instantiate [latest | <git commit id> ] <service> [ <node> | <node group> ]
 - Affected containers are restarted/recreated to instantiate the new config.
 - Depending on the orchestration tool and command, this could be done:
  ○ Manually or automatically
  ○ Per service and/or host/node group
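As a dry-run sketch, the instantiate step boils down to computing which containers belong to the requested service and restarting each one. The container names below are illustrative; a real implementation would run `docker restart` (or recreate the container) on each matching node rather than just printing a plan.

```shell
# Sketch: build the restart plan for one service (dry run, nothing restarted)
service="nova"
containers="nova_api nova_scheduler nova_conductor"   # illustrative names

plan=""
for c in $containers; do
    plan="$plan restart:$c"
done
echo "instantiate plan for $service:$plan"
```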

Example Workflows
Based on the 3 step approach above, operators/product developers can construct their own workflows. Below are some ideas:

Out of the box:
 - Kolla-ansible reconfigure is basically just calling the three steps above in one command, i.e. the same functionality as today, but implemented differently.
 - Config changes are generated on the master node, pushed to target nodes, and containers are restarted/recreated as needed.

Version controlled:
 - The operator experiments with new config settings. Once satisfied with the output in /etc/kolla/genconfig, the change is saved and committed to local git.
 - The change is pushed to target nodes, referenced by its git commit id.
 - The change is instantiated.
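The version-controlled workflow could be sketched as below: the operator commits the generated config locally and captures the commit id that the propagate and instantiate steps would then reference. The paths and config contents are illustrative.

```shell
# Sketch: commit generated config locally and record the commit id
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.email "ops@example.com"
git config user.name "ops"

mkdir -p genconfig/compute01
printf 'workers = 4\n' > genconfig/compute01/nova.conf
git add -A
git commit -qm "tune nova workers"

# This id is what propagate/instantiate would take as their git reference
commit=$(git rev-parse --short HEAD)
echo "propagate-config would reference commit $commit"
```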

Git review:
 - The operator experiments with new config settings. Once satisfied with the output in /etc/kolla/genconfig, the change is saved and pushed to gerrit for review.
  ○ The change is reviewed and approved by peers and merged to prod.
 - The source URL and git reference are overridden in the config; target nodes pull from the gerrit server.
 - The git commit is pulled and instantiated.
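The review workflow can be sketched with a local bare repository standing in for the gerrit server; real gerrit would receive the push on its `refs/for/*` review refs instead of a plain branch, and the approved commit would only land after review.

```shell
# Sketch: push a config change "for review", then pull it on a target node.
review=$(mktemp -d)                     # stands in for the gerrit server
git init -q --bare "$review"
git -C "$review" symbolic-ref HEAD refs/heads/master

work=$(mktemp -d)                       # operator's working copy
cd "$work"
git init -q .
git config user.email "ops@example.com"
git config user.name "ops"
printf 'debug = True\n' > nova.conf
git add -A
git commit -qm "enable debug logging"

# With real gerrit this would be: git push <gerrit> HEAD:refs/for/master
git push -q "$review" HEAD:refs/heads/master

# A target node pulls the approved change from the review server
node=$(mktemp -d)
git clone -q "$review" "$node/kolla"
```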

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
bjolo
Direction:
Needs approval
Assignee:
None
Definition:
Approved
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

Nice ideas. This will be a big change, and we may need a BP spec for it. Or just push the description in one patch, which would be helpful for making comments.
Normally, /etc/kolla/config under version control is enough, and we can support the --check and --diff parameters in ansible to show the changed lines for auditing. /etc/kolla/genconfig is not good.


Work Items
