FTWRL should only run when safe to do so

Registered by Alexey Kopytov

Converting bug #1100141 to a blueprint.

Blueprint information

Status:
Complete
Approver:
Alexey Kopytov
Priority:
High
Drafter:
Sergei Glushchenko
Direction:
Approved
Assignee:
Sergei Glushchenko
Definition:
Approved
Series goal:
Accepted for 2.1
Implementation:
Implemented
Milestone target:
2.1.4
Started by
Sergei Glushchenko
Completed by
Alexey Kopytov

Whiteboard

FLUSH TABLES WITH READ LOCK (FTWRL) can be issued even when a query has already been running for hours. In that case everything locks up in the "Waiting for table flush" or "Waiting for master to send event" states. Killing the FTWRL statement does not correct the issue either. The only way to get the server operating normally again is to kill the long-running SELECTs that blocked it to begin with.

With the above in mind we should make FTWRL safer to prevent production downtime. To do this I suggest the following:

A flush-time setting: how long innobackupex will wait before issuing FTWRL (default 1800 seconds? configurable). During this time innobackupex waits for running queries to finish. It polls the process list and, once there are no actively running queries, issues the FTWRL. If the --rsync option is set, it should still run rsync prior to the FTWRL.

Once FTWRL has been issued, innobackupex should start another process that checks that the FTWRL isn't blocked by anything (for example, a query that started just as FTWRL was issued). If anything is blocking it at this point, that query should be killed immediately so that FTWRL can finish successfully and the backup can complete. Logging what was killed would be nice.

========================================================================
May 14, 2013 by Sergei

The goal of this work is to minimize the amount of time MySQL operates in read-only mode.

If there are long-running queries, FTWRL can get stuck, leaving the server in read-only mode while it waits for those queries to complete.

To prevent this, two things are implemented:

- innobackupex can wait for a good moment to issue the global lock.
- innobackupex can kill queries which prevent the global lock from being acquired.

A good moment to issue the global lock is one when no long queries are running. Of course, we cannot predict the time a specific query needs to complete. We assume that queries which have already been running for a long time are unlikely to complete soon, and that queries which have been running only briefly are likely to complete soon. innobackupex uses the value of the --lock-wait-threshold option to decide whether a query is long-running and therefore likely to block the global lock for a while.

We cannot wait for a good moment forever. The --lock-wait-timeout option limits the waiting time. If a good moment does not occur within this time, innobackupex gives up and bails out with an error message, and no backup is taken. A value of zero, which is the default, turns the feature off.
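The waiting logic described above can be sketched roughly as follows. This is an illustration only, not innobackupex's actual code: the function name wait_for_good_moment, the injected fetch_processlist callable, and the poll_interval parameter are all hypothetical. The row keys "Command" and "Time" are assumed to mirror the columns of SHOW PROCESSLIST. The caller is assumed to skip this step entirely when the timeout option is zero.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_good_moment(
    fetch_processlist: Callable[[], Iterable[Mapping]],
    threshold: int,
    timeout: int,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no query has been running for `threshold` seconds or more.

    Returns True when such a moment is found, False if `timeout` seconds
    pass first (the caller would then abort the backup with an error).
    """
    deadline = clock() + timeout
    while True:
        # A query counts as long-running once its age reaches the threshold.
        long_running = [
            row for row in fetch_processlist()
            if row["Command"] == "Query" and row["Time"] >= threshold
        ]
        if not long_running:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Passing the clock and sleep functions in as parameters is only a convenience for testing the loop without real delays.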

The second thing is to kill all queries which prevent the global lock from being acquired. All queries which have been running longer than FTWRL are possible blockers, so we should just kill them all. One thing to take care of is giving short-running queries a chance to complete. This can be specified with the --kill-long-queries-timeout option: it is the time we give queries to complete after FTWRL is issued, and once it expires all remaining blockers are killed. The default value is zero, which turns this feature off.

One might not want to wait for a good moment at all, but just kill every blocking query. This is possible: specify only --kill-long-queries-timeout.

Another possibility is to wait only for a moment when no long UPDATEs are running, and then kill all the SELECTs, as well as any UPDATEs which were not detected before. To achieve this, use --lock-wait-query-type=update. The --lock-wait-query-type option specifies which queries we should avoid when we issue FTWRL. Possible values are {all|update}. Use all to avoid all queries, and update to avoid only UPDATE/ALTER/REPLACE/INSERT etc. queries.
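One way to picture the all/update distinction is a filter on the leading keyword of each statement. The sketch below is an assumption about how such a check could look, not the tool's real implementation. The names UPDATE_KEYWORDS, is_update_query and must_wait_for are hypothetical, and the keyword tuple contains only the four statements named in the text (the "etc." is left open).

```python
import re

# Only the statement types named in the text; the real list may be longer.
UPDATE_KEYWORDS = ("UPDATE", "ALTER", "REPLACE", "INSERT")


def is_update_query(sql):
    """True if the statement starts with one of the update-type keywords."""
    match = re.match(r"\s*(\w+)", sql or "")
    return bool(match) and match.group(1).upper() in UPDATE_KEYWORDS


def must_wait_for(sql, query_type):
    """Decide whether a long-running query should delay FTWRL.

    query_type is "all" or "update", matching the option values above.
    """
    if query_type == "all":
        return True
    if query_type == "update":
        return is_update_query(sql)
    raise ValueError("query_type must be 'all' or 'update'")
```

With query_type set to update, a long SELECT would not delay the lock, whereas a long INSERT would.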

One more option tells innobackupex which queries it should kill if FTWRL is stuck for longer than allowed by --kill-long-queries-timeout: --kill-long-query-type={all|select}. It allows killing either all queries or only SELECT queries.
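The selection of queries to kill might be sketched as below, assuming process-list rows shaped like SHOW PROCESSLIST output (keys "Id", "Command", "Time", "Info"). The function queries_to_kill and its parameters are hypothetical names for illustration. A query is treated as a possible blocker when it has been running at least as long as the FTWRL statement; the use of >= rather than > is a choice made here because the Time column has one-second granularity.

```python
import re


def queries_to_kill(processlist, ftwrl_id, kill_type):
    """Return the ids of queries that may be blocking FTWRL.

    kill_type is "all" or "select", matching the {all|select} choice above.
    """
    if kill_type not in ("all", "select"):
        raise ValueError("kill_type must be 'all' or 'select'")
    rows = list(processlist)
    # Age of the FTWRL statement itself, in seconds.
    ftwrl_time = next(row["Time"] for row in rows if row["Id"] == ftwrl_id)
    victims = []
    for row in rows:
        if row["Id"] == ftwrl_id or row["Command"] != "Query":
            continue
        if row["Time"] < ftwrl_time:
            continue  # started after FTWRL, so it is not a blocker
        is_select = re.match(r"\s*select\b", row["Info"] or "", re.IGNORECASE)
        if kill_type == "all" or is_select:
            victims.append(row["Id"])
    return victims
```

The caller would then issue KILL for each returned id and log what was killed, as the original request suggests.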

Options summary:

--lock-wait-timeout=N (seconds) - how long to wait for a good moment. Default is 0 (do not wait).

--lock-wait-query-type={all|update} - which long queries should finish before we issue FLUSH TABLES WITH READ LOCK. Default is all.

--lock-wait-threshold=N (seconds) - how long a query must have been running before we consider it long-running and a potential blocker of the global lock.

--kill-long-queries-timeout=N (seconds) - how much time we give queries to complete after FTWRL is issued before starting to kill them. Default is 0 (do not kill).

--kill-long-query-type={all|select} - which queries should be killed once kill-long-queries-timeout has expired.
