Security group rules for devices RPC call refactoring

Registered by Miguel Angel Ajo on 2014-07-02

The security_group_rules_for_devices RPC call doesn't scale well: every security group rule that references a group is expanded into one entry per IP address in that group (see [1]). This happens with default rules like:

allow all from 'default' group

This leads to:
* very big messages (I've seen >20 MB, up to 600 MB)
* very long processing times on the neutron-server side once we have lots of instances under the same tenant/security group.
* lockups: when an RPC message times out, the same security_group_rules_for_devices call is issued to neutron again.

For a more detailed insight:

 * security_group_rules_for_devices is an RPC call from the openvswitch agent to
     neutron-server, see [1]
      - The call takes a list of device_ids as its argument; each device_id corresponds to a port.
      - Neutron builds the security group rules and returns them as a list
        per device_id
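
The shape of that call can be sketched roughly as follows. This is only an illustration of "list of device_ids in, rule list per device_id out": the field names, the RULES_BY_DEVICE table and the port ids are assumptions made up for the example, not neutron's actual payload (see [1] for the real messages).

```python
from typing import Dict, List

# Hypothetical stand-in for what neutron-server would load from its DB:
# device_id -> list of already expanded rule dicts.
RULES_BY_DEVICE: Dict[str, List[dict]] = {
    "port-a": [{"direction": "egress", "ethertype": "IPv4"}],
    "port-b": [{"direction": "egress", "ethertype": "IPv4"}],
}


def security_group_rules_for_devices(device_ids: List[str]) -> Dict[str, dict]:
    """Return one entry per requested device, each carrying its own rule list."""
    return {
        device_id: {
            "device": device_id,
            # Every device gets its own copy of the rules, even when the
            # rules are identical to those of its neighbours.
            "security_group_rules": list(RULES_BY_DEVICE.get(device_id, [])),
        }
        for device_id in device_ids
    }


reply = security_group_rules_for_devices(["port-a", "port-b"])
```

Note that nothing is shared between devices in the reply, which is what makes the expansion described below so expensive.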

  Now let's look at the default security group, which has 4 rules:
  - [IPv6, egress all]
  - [IPv6, ingress from default security group]
  - [IPv4, egress all]
  - [IPv4, ingress from default security group]

  This means two things:
  - Machines can initiate traffic to anywhere.
  - Machines can be reached by any other machine in the same security
    group.
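
As a rough sketch, the four default rules and their two consequences can be written down as data plus two checks. The dict keys ("remote_group" and so on) and the helper names are assumptions for illustration, not neutron's schema.

```python
DEFAULT_GROUP = "default"  # placeholder for the group's real id

# The four default rules listed above, as plain dicts.
DEFAULT_RULES = [
    {"ethertype": "IPv6", "direction": "egress"},
    {"ethertype": "IPv6", "direction": "ingress", "remote_group": DEFAULT_GROUP},
    {"ethertype": "IPv4", "direction": "egress"},
    {"ethertype": "IPv4", "direction": "ingress", "remote_group": DEFAULT_GROUP},
]


def egress_allowed(rules, ethertype):
    """True if some egress rule for this ethertype exists (egress all)."""
    return any(
        r["direction"] == "egress" and r["ethertype"] == ethertype for r in rules
    )


def ingress_allowed(rules, ethertype, sender_groups):
    """True if some ingress rule admits a sender that belongs to sender_groups."""
    for rule in rules:
        if rule["direction"] != "ingress" or rule["ethertype"] != ethertype:
            continue
        remote = rule.get("remote_group")
        # A rule without a remote group would admit any sender.
        if remote is None or remote in sender_groups:
            return True
    return False
```

With these rules, a sender in the 'default' group is admitted, while a sender that only belongs to some other group is not.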

  So, what happens:

      1) As we add instances, they join the 'default' security
         group unless we explicitly choose another one.

      2) The openvswitch agents on the compute nodes then get a
         notification that they must refresh the security group rules for
         the devices in the updated security group (I'm partly guessing the
         logic here, but this is more or less what happens).

      3) As a result, every device on a node gets an explicit
         rule for each other device IP in that security group.

      4) Those rules are translated into iptables rules [2] (see line 97:
         neutron-openvswi-i013859e0-e holds the specific rules for INPUT (i)
         at port 013859e0)

         (see https://blueprints.launchpad.net/neutron/+spec/add-ipset-to-security
          for this part)

      5) *** The RPC message size grows as VMs_on_hypervisor * VMs_on_security_group:

           1) neutron-server has a hard time (high load) rendering those
              messages: gathering data from the DB, building the JSON response,
              and sending it over AMQP.

           2) AMQP suffers: long message transmission times, timeouts, etc.

           3) When one of those big replies times out, it is requested again
              and again, ending up in a situation that can't be recovered from.
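
The growth in steps 3) and 5) can be made concrete with a small sketch. It assumes an IPv4-only simplification of the default rules, made-up port names and 10.0.x.y addresses, and it does not bother to exclude a device's own IP from its rule list; none of this is neutron code.

```python
import json


def expand_rules(member_ips):
    """Expand 'ingress from default group' into one rule per member IP,
    plus a single 'egress all' rule (IPv4 only, for brevity)."""
    rules = [{"direction": "egress", "ethertype": "IPv4"}]
    rules += [
        {"direction": "ingress", "ethertype": "IPv4", "source_ip_prefix": ip + "/32"}
        for ip in member_ips
    ]
    return rules


def build_reply(local_devices, member_ips):
    """Old-style reply: every local device carries its own expanded copy."""
    return {
        dev: {"security_group_rules": expand_rules(member_ips)}
        for dev in local_devices
    }


def count_rules(reply):
    """Total number of rule entries carried by one reply."""
    return sum(len(entry["security_group_rules"]) for entry in reply.values())


def reply_size(vms_on_hypervisor, vms_on_security_group):
    """Size in bytes of the JSON-encoded reply for one hypervisor."""
    member_ips = [
        "10.0.%d.%d" % (i // 250, i % 250 + 1) for i in range(vms_on_security_group)
    ]
    devices = ["port-%d" % i for i in range(vms_on_hypervisor)]
    return len(json.dumps(build_reply(devices, member_ips)))
```

In this sketch a hypervisor with 10 devices in a group of 100 members receives 10 * (100 + 1) = 1010 rule entries in a single reply, and doubling either factor roughly doubles the encoded size.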

This problem gets worse as we have bigger compute nodes (capable of hosting
more instances) or as we move to denser clouds based on Docker containers.

 [1] Logged and pretty-printed RPC messages: http://www.fpaste.org/104401/14008522/
 [2] Resulting iptables rules, see line 97: http://www.fpaste.org/104431/40085672/

Blueprint information

Status:
Complete
Approver:
Miguel Angel Ajo
Priority:
High
Drafter:
Miguel Angel Ajo
Direction:
Approved
Assignee:
Miguel Angel Ajo
Definition:
Approved
Series goal:
Accepted for juno
Implementation:
Implemented
Milestone target:
2014.2
Started by
Kyle Mestery on 2014-08-08
Completed by
Kyle Mestery on 2014-09-04

Whiteboard

20-July (mestery): Juno-3 as medium.

Gerrit topic: https://review.openstack.org/#q,topic:bp/security-group-rules-for-devices-rpc-call-refactor,n,z

Addressed by: https://review.openstack.org/104522
    Refactor the security_group_rules_for_devices_rpc call result

Addressed by: https://review.openstack.org/111876
    Refactor security group rpc call

Gerrit topic: https://review.openstack.org/#q,topic:bp/add-ipset-to-security,n,z

Addressed by: https://review.openstack.org/112010
    Framework to start/stop neutron services for functional testing.

Addressed by: https://review.openstack.org/115575
    Add test to compare security_group_info_for_devices with old rpc
