Check PXC/Galera-specific replication latency

Registered by Daniel Nichter

This applies to tools such as pt-osc and pt-table-checksum, which check whether any slaves are behind and throttle themselves accordingly.

For Galera there are a few things that are important:

- Galera uses 'flow control' as a replication lag feedback loop. If the replication queue gets too large on any node, it will use flow control to slow down writes. This causes write-stalls (by design). These tools should avoid that.
- The default queue limit (gcs.fc_limit, measured in pending transactions) is 16, and the effective value varies somewhat by default depending on how many nodes you have. This can be tuned up to several hundred. Typically, any queue size > 0 may indicate some amount of lag on the slaves.
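
A tool needs to know the configured limit before it can judge how close the queue is to triggering flow control. The sketch below assumes the limit is read from the wsrep_provider_options variable, whose value is a single string of "name = value" pairs separated by semicolons. The function names and the sample string are hypothetical, and the fallback of 16 is simply the default quoted above.

```python
def parse_provider_options(options: str) -> dict:
    """Split a wsrep_provider_options string into a {name: value} dict."""
    parsed = {}
    for item in options.split(";"):
        if "=" not in item:
            continue  # skip empty or malformed segments, e.g. after a trailing ';'
        name, _, value = item.partition("=")
        parsed[name.strip()] = value.strip()
    return parsed


def fc_limit(options: str, default: int = 16) -> int:
    """Return gcs.fc_limit as an int, or the default if it is not listed."""
    value = parse_provider_options(options).get("gcs.fc_limit")
    return int(value) if value else default


# Illustrative sample only; not captured from a real server.
sample = "gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = no; "
print(fc_limit(sample))
```

This only returns the configured value. Any adjustment the provider makes for cluster size is not modelled here.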

There are several status variables that should be useful here:
- wsrep_flow_control_paused -- fraction of time (between 0 and 1) that flow control was in effect since the last SHOW GLOBAL STATUS
- wsrep_flow_control_sent -- FC messages SENT by a node (indicates the node that is laggy). This might be better since it's a global counter, but you'd need to check all nodes for this.
- wsrep_flow_control_recv -- FC messages received (from anywhere in the cluster) -- just checking the local node for this should be sufficient.
- wsrep_local_recv_queue -- current size of the recv queue
- wsrep_local_recv_queue_avg -- average queue size since last SHOW GLOBAL STATUS
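
One way these variables could be combined into a throttling check is sketched below. It assumes the tool takes two samples of SHOW GLOBAL STATUS on the local node and passes them in as dicts of strings. The function name, the thresholds, and the choice to treat any increase in the flow control counters as lag are illustrative assumptions, not a settled design for the tools.

```python
def lag_reasons(prev: dict, curr: dict,
                max_queue: int = 0, max_paused: float = 0.0) -> list:
    """Return the reasons a tool should pause; an empty list means keep going.

    prev and curr are two samples of SHOW GLOBAL STATUS as {name: value}
    dicts with string values, taken some interval apart.
    """
    reasons = []
    # These counters normally only grow, so a positive delta means flow
    # control messages were exchanged between the two samples.
    for counter in ("wsrep_flow_control_sent", "wsrep_flow_control_recv"):
        delta = int(curr[counter]) - int(prev[counter])
        if delta > 0:
            reasons.append(f"{counter} rose by {delta}")
    # Current receive queue depth on this node.
    queue = int(curr["wsrep_local_recv_queue"])
    if queue > max_queue:
        reasons.append(f"wsrep_local_recv_queue is {queue}")
    # Fraction of time (0 to 1) that flow control was in effect.
    paused = float(curr["wsrep_flow_control_paused"])
    if paused > max_paused:
        reasons.append(f"wsrep_flow_control_paused is {paused}")
    return reasons


# Hypothetical samples for illustration.
before = {"wsrep_flow_control_sent": "0", "wsrep_flow_control_recv": "3",
          "wsrep_local_recv_queue": "0", "wsrep_flow_control_paused": "0.0"}
after = {"wsrep_flow_control_sent": "0", "wsrep_flow_control_recv": "5",
         "wsrep_local_recv_queue": "12", "wsrep_flow_control_paused": "0.0"}
print(lag_reasons(before, after))
```

Checking only the local node covers wsrep_flow_control_recv, as noted above. Identifying which node is lagging through wsrep_flow_control_sent would need the same samples from every node.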

Blueprint information

Status: Not started
Approver: None
Priority: Medium
Drafter: None
Direction: Needs approval
Assignee: None
Definition: New
Series goal: Accepted for 2.2
Implementation: Not started
Milestone target: None
