FTWRL should only run when safe to do so
Converting bug #1100141 to a blueprint.
Blueprint information
- Status: Complete
- Approver: Alexey Kopytov
- Priority: High
- Drafter: Sergei Glushchenko
- Direction: Approved
- Assignee: Sergei Glushchenko
- Definition: Approved
- Series goal: Accepted for 2.1
- Implementation: Implemented
- Milestone target: 2.1.4
- Started by: Sergei Glushchenko
- Completed by: Alexey Kopytov
Whiteboard
FLUSH TABLES WITH READ LOCK (FTWRL) can run even though there may be a running query that has been executing for hours. In that case everything locks up in the "Waiting for table flush" or "Waiting for master to send event" state. Killing the FLUSH TABLES WITH READ LOCK does not correct the issue either; the only way to get the server operating normally again is to kill off the long-running SELECTs that blocked it in the first place.
With the above in mind, we should make FTWRL safer to prevent production downtime. To do this I suggest the following:
Add a configurable flush-wait time (default 1800 seconds?) during which innobackupex will wait for running queries to finish before issuing FTWRL. It will poll the process list, and once there are no actively running queries it will issue the FTWRL. If the --rsync option is set, it should still run rsync prior to the FTWRL.
Once FTWRL has been issued, innobackupex should start another process that checks that FTWRL is not blocked by anything (for example, a query that started just as FTWRL was issued). If anything is blocking it at this point, innobackupex should immediately kill that query so that FTWRL can finish successfully and the backup can complete. Logging what it killed would be nice.
=======
May 14, 2013 by Sergei
The goal of this work is to minimize the amount of time during which MySQL operates in read-only mode.
If there are long-running queries, FTWRL can get stuck, leaving the server in read-only mode while waiting for those queries to complete.
To prevent this, two things are implemented:
- innobackupex can wait for a good moment to issue the global lock.
- innobackupex can kill queries that prevent the global lock from being acquired
A good moment to issue the global lock is a moment when no long-running queries are active. Of course, we cannot predict how long a specific query will take to complete. We assume that queries which have already been running for a long time are unlikely to finish shortly, while queries which have so far been running only briefly probably will. innobackupex uses the value of the --lock-wait-threshold option to decide which queries count as long-running.
We cannot wait for a good moment forever. The --lock-wait-timeout option limits the waiting time: if a good moment does not occur within this time, innobackupex gives up and bails out with an error message, and no backup is taken. A zero value of this option (the default) turns the feature off.
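The waiting logic described above can be sketched roughly as follows. This is a simplified Python illustration, not innobackupex's actual Perl code; the function and field names are invented, but the threshold/timeout behaviour mirrors the two options just described:

```python
import time

def long_queries(processlist, threshold):
    """Return the queries that have already been running for at least
    `threshold` seconds (the --lock-wait-threshold idea)."""
    return [q for q in processlist
            if q["command"] == "Query" and q["time"] >= threshold]

def wait_for_good_moment(get_processlist, threshold, timeout, poll_interval=1):
    """Poll the process list until no long-running queries remain.

    Returns True when a good moment to issue FLUSH TABLES WITH READ LOCK
    was found within `timeout` seconds, False when we gave up (the point
    at which innobackupex would bail out with an error).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not long_queries(get_processlist(), threshold):
            return True          # safe to issue the global lock now
        time.sleep(poll_interval)
    return False
```

In the real tool the process list comes from the server (e.g. SHOW PROCESSLIST); here `get_processlist` is just a callable so the logic can be exercised in isolation.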
The second thing is to kill all queries that prevent the global lock from being acquired. Any query that has been running for longer than FTWRL is a potential blocker, so we simply kill them all. One thing to take care of is giving short-running queries a chance to complete; this grace period is specified by the --kill-long-queries-timeout option.
One might prefer not to wait for a good moment at all and simply kill every blocking query. That is possible: just specify --kill-long-queries-timeout and leave --lock-wait-timeout at its default of zero.
Another possibility is to wait only for a moment when no long UPDATEs are running, and then kill all SELECTs along with any UPDATEs that were not detected before. To achieve this, use --lock-wait-query-type=update together with --kill-long-query-type=all.
One more option, --kill-long-query-type, tells innobackupex which queries it should kill if FTWRL stays blocked for longer than allowed by --kill-long-queries-timeout.
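The kill step can be illustrated in the same simplified style (again an invented sketch, not the real implementation): any query that started before FTWRL was issued is treated as a potential blocker, and the all/select choice decides whether non-SELECT statements are spared.

```python
def queries_to_kill(processlist, ftwrl_start, now, query_type="all"):
    """Pick the connections blocking FTWRL.

    A query is a potential blocker if it started before FTWRL was issued.
    `query_type` mirrors the all/select choice described above: "all" kills
    every blocker, "select" spares everything but SELECT statements.
    """
    victims = []
    for q in processlist:
        started_at = now - q["time"]    # "time" is seconds the query has run
        if started_at > ftwrl_start:
            continue                    # started after FTWRL: not a blocker
        if query_type == "select" and \
                not q["info"].lstrip().lower().startswith("select"):
            continue                    # spare non-SELECT statements
        victims.append(q)               # the real tool would KILL and log these
    return victims
```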
Options summary:
- --lock-wait-timeout: how long to wait for a good moment to issue FTWRL; zero (the default) disables waiting
- --lock-wait-threshold: queries running at least this long are considered long-running while waiting for a good moment
- --lock-wait-query-type: which query types to wait for (all queries, or only updates)
- --kill-long-queries-timeout: how long queries blocking FTWRL are allowed to run before being killed
- --kill-long-query-type: which blocking queries to kill (all queries, or only selects)