Percona XtraBackup moved to https://jira.percona.com/projects/PXB

FTWRL should only run when safe to do so

Bug #1100141 reported by Ryan Huddleston on 2013-01-16

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Percona XtraBackup moved to https://jira.percona.com/projects/PXB	Invalid	Undecided	Unassigned

Bug Description

Flush tables with read lock (FTWRL) can run even though there may be a running query that has been executing for hours. In this case everything will be locked up in "Waiting for table flush" or "Waiting for master to send event" states. Killing the "flush tables with read lock" does not correct the issue either. The only way to get the server operating normally again is to kill off the long-running selects that blocked it to begin with.

With the above in mind we should make FTWRL safer to prevent production downtime. To do this I suggest the following:

Add a flush-time option: the time innobackupex will wait before issuing a FTWRL (default 1800 seconds? configurable). During this time innobackupex will wait for running processes to finish. It will poll the process list and once there are no actively running queries it will issue the FTWRL. If the --rsync option is set it should still run rsync prior to the FTWRL.
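
The waiting step proposed here could be sketched roughly as follows. This is only an illustration under stated assumptions, not innobackupex code: `wait_for_idle`, `active_queries`, `fetch_processlist`, and the parameter names are hypothetical, and the rows are assumed to be dictionaries shaped like `SHOW FULL PROCESSLIST` output (`Id`, `Command`, `Time`, `Info`).

```python
import time


def active_queries(rows, own_id=None):
    """Return processlist rows that are executing a statement right now.

    own_id is the connection id of the polling session, which always shows
    up as a running query and therefore has to be skipped.
    """
    return [
        r for r in rows
        if r["Command"] == "Query" and r["Id"] != own_id
    ]


def wait_for_idle(fetch_processlist, flush_time=1800, poll_interval=1.0,
                  own_id=None, clock=time.monotonic, sleep=time.sleep):
    """Poll until no statement is running or flush_time seconds elapse.

    fetch_processlist is a hypothetical callable that returns rows shaped
    like SHOW FULL PROCESSLIST output. Returns True if an idle moment was
    observed, False if the wait timed out.
    """
    deadline = clock() + flush_time
    while True:
        if not active_queries(fetch_processlist(), own_id):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A `True` result only means that one poll saw no running statements. A new query can still start between that poll and the FTWRL itself, so this narrows the window without closing it.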

Once FTWRL has been issued, innobackupex should start another process that checks that the FTWRL isn't blocked by anything (for example, a query that started just as FTWRL was issued). If anything is blocking it at this point, that query should be killed immediately so that FTWRL can finish successfully and the backup can complete. Logging what was killed would be nice.
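
A minimal sketch of such a watchdog check, assuming processlist rows shaped like `SHOW FULL PROCESSLIST` output. The function names are hypothetical, and the rule used to pick blockers (a statement that is still executing and has been running at least as long as the waiting FTWRL) is an assumption made for illustration, not behavior documented by MySQL or XtraBackup. The "Waiting for table flush" state is taken from the report above.

```python
import logging

log = logging.getLogger("ftwrl_watchdog")

FTWRL_TEXT = "flush tables with read lock"
WAIT_STATE = "Waiting for table flush"


def find_blockers(rows, own_id=None):
    """Return rows of statements that were running before the FTWRL began.

    Assumption: a blocker is any executing statement whose Time is at least
    the Time of the waiting FTWRL and that is not itself queued behind it.
    """
    ftwrl = next(
        (r for r in rows
         if r["Command"] == "Query"
         and (r["Info"] or "").strip().lower().startswith(FTWRL_TEXT)
         and r["State"] == WAIT_STATE),
        None,
    )
    if ftwrl is None:
        return []  # FTWRL is not waiting, so there is nothing to kill
    return [
        r for r in rows
        if r["Command"] == "Query"
        and r["Id"] not in (ftwrl["Id"], own_id)
        and r["Time"] >= ftwrl["Time"]
        and r["State"] != WAIT_STATE
    ]


def kill_statements(rows, own_id=None):
    """Build KILL QUERY statements for the blockers and log each one."""
    statements = []
    for r in find_blockers(rows, own_id):
        log.warning("killing query %s (running %ss): %s",
                    r["Id"], r["Time"], r["Info"])
        statements.append("KILL QUERY %d" % r["Id"])
    return statements
```

The functions only build the statements. Executing them, and deciding how long to let FTWRL wait first, is left to the caller.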

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-01-16:

The problem with using a flush-time is that there is no way to
prevent newer queries from starting (unless there is a write lock
on all tables or something like that). So, it is possible that
innobackupex will wait forever. It is also not possible to do "It
will poll the process list and once there is no actively running
queries it will issue the FTWRL." since it is possible that in
the time between polls another query has sneaked in.

FTWRL acts as a barrier of sorts: all queries issued after it
(while FTWRL is in the 'Waiting for table flush' state) will
queue up in FIFO order and will complete after FTWRL (so only read
queries will complete; writes/updates will still wait).

What can be done (as the last paragraph of the description suggests) is to run FTWRL and, if it waits too long for the table flush (due to a bug or bad queries), kill the offending queries after a configurable timeout. But this can be unsafe too. Note that this won't be subject to race conditions like the polling approach, since MySQL ensures that queries issued after FTWRL queue up behind it.

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-01-16:

It is also possible to issue FTWRL per table rather than globally, but it may not help in this case and/or may lead to an inconsistent backup (or may be possible only for InnoDB tables).

Ryan Huddleston (rshuddleston) wrote on 2013-01-16:

I still think it's important that FTWRL not start if we know a long-running query is currently running. Many environments have hundreds of new connections per second coming in, so we cannot afford for FTWRL to take longer than a second or so. For example, if we know a query has been running for 3 hours we should not kick off FTWRL, as it will be unclear when that query will finish and the databases will be completely locked during that time. So we have two choices in this case:

1) wait for the query to finish for a period of time
2) immediately kill off the query that is preventing us from starting FTWRL

I suggest making it configurable.
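
One way the configurable choice between the two options could look, as a rough sketch only. The function name `plan_before_ftwrl`, the `policy` values, and the `long_query_time` threshold are all hypothetical and are not existing innobackupex options. Rows are assumed to be shaped like `SHOW FULL PROCESSLIST` output.

```python
def plan_before_ftwrl(rows, long_query_time=60, policy="wait", own_id=None):
    """Decide what to do about long-running statements before FTWRL.

    Returns an (action, ids) tuple. action is "proceed" when no statement
    has run for long_query_time seconds or more; otherwise it is the
    configured policy ("wait" or "kill") together with the ids of the
    long-running statements.
    """
    if policy not in ("wait", "kill"):
        raise ValueError("policy must be 'wait' or 'kill'")
    long_running = [
        r["Id"] for r in rows
        if r["Command"] == "Query"
        and r["Id"] != own_id
        and r["Time"] >= long_query_time
    ]
    if not long_running:
        return ("proceed", [])
    return (policy, long_running)
```

Short-running queries are deliberately ignored here: the check only looks at statements that are already known to be long-running when FTWRL is about to start.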

It's not important that we avoid every race condition, as we aren't trying to guard against every short-running query. We are trying to prevent long-running queries that are already in progress from blocking FTWRL.

In addition, once FTWRL has started, the behavior I would like is to kill anything that is preventing it from going through quickly. This can be configurable, but for customers with hundreds of new connections per second we want to ensure it's done in a timely manner and should kill anything that is causing a delay.

Alexey Kopytov (akopytov) wrote on 2013-01-17:

Converted this feature request to a blueprint: https://blueprints.launchpad.net/percona-xtrabackup/+spec/safe-ftwrl

We should look into implementing this for 2.1.