Redesign pt-table-checksum

Registered by Daniel Nichter on 2011-08-05

pt-table-checksum needs to be simplified and focused on automatic, self-tuning operation for checking replication integrity.

1. Remove chunking and introduce self-adjusting nibbling.

Remove the ability to "chunk" tables by dividing them into rows based on calculated boundaries. Instead use the "nibble" algorithm that some other tools use. Add a --chunk-time option to specify the target time for the chunk. This should default to 0.5 seconds. For each table, the tool should begin with an initial chunk size, in number of rows, and then adjust the number of rows as it processes the table, trying to achieve the targeted time per chunk. For the first table processed, the target value is --chunk-size, which should default to 1000 rows. For each subsequent table, the target value is determined by dividing the total rows checksummed so far in all tables, by the total time taken for the checksum queries to run, and then multiplied by --chunk-time. As the tool progresses through the table, it should adjust its chunk size by a weighted average of the rows per second achieved in the table, disregarding the speed of other tables previously checksummed. The rows-per-second of the most recent chunk should be weighted as 1, the previous chunk as 0.75, the one before that as 0.75 squared, and so on. This should allow the tool to adjust quickly to the correct target value, and then remain fairly stable.

For clarification, each chunk's size is determined as follows:

* If it's Chunk 1 of the first table: use the --chunk-size option. If it's not the first table, then...
* Chunk 1: multiply the whole server's rows-per-second by the desired chunk time
* Chunk 2: multiply the first chunk's rows-per-second by the desired chunk time.
* Chunk 3 and greater: multiply the weighted average rows-per-second from this table by the desired chunk time.

The goal is for the tool to adjust chunk sizes based on the speed of checksumming, treating each table differently (but using a reasonable starting value) and changing the speed in response to changing conditions, such as the server experiencing more load, or moving from "hot" to "cold" data as the tool progresses through the table, or perhaps some rows being bigger than others, or any other change in conditions. The exponentially decaying weighted average is intended to make the tool responsive to changes, but not unstable.

Exception: if --chunk-size is given explicitly on the command line (e.g. if $o->got('chunk-size')), then use only that chunk size; disable dynamic chunk sizing.

Do not use ORDER BY in the nibble SQL. The checksum query will have exactly one row, so there is no need for ORDER BY, and we should omit it lest it actually cause the optimizer to choose a bad plan for some reason. -- ORDER BY is required for the other statements: boundaries, etc. (

2. Remove multi-threading.

The tool should run only in --replicate mode against a single server, doing nibble queries. No CHECKSUM TABLE or parallel checksums across many servers.

3. Remove diversity of checksum algorithms.

The tool should use only the BIT_XOR() algorithm.

4. Change the output formatting.

The tool should print out one line per table, beginning with a header line, like the following. Note that the ROWS, CHUNKS, and TIME columns are wider than the beta code committed recently: 9 digits wide for ROWS, 9 for CHUNKS, and 8 for TIME. The TS column is new.

09-22T05:15:45 0 0 4 1 0 0.101 db.tbl1
09-22T05:15:46 0 0 19 1 0 0.012 db.tbl2
09-22T05:15:46 0 0 6 1 0 0.012 db.tbl3

The meaning of the columns is:

* TS is the abbreviated date and time of day.

* ERRORS is the number of chunks that had an error or warning during checksumming or during updating the checksum table with the master's CRC and count.

* DIFFS is the number of distinct chunks with a difference from the master, when using --replicate-check. The tool should recurse to replicas as normal and run the difference query against each one, select the chunk IDs that differ from the master, and then combine and unique-ify the list of chunk IDs from all replicas. The value in the column is the cardinality of that list.

* ROWS is the sum of the rows in the chunks.

* CHUNKS is the number of chunks.

* SKIPPED is the number of chunks skipped due to too many retries or other reasons.

* TIME is the total time consumed (in checksum queries only) for all chunks.

* TABLE is the db.table name.

5. Handling errors/warnings during checksumming.

Errors during checksumming will occur for a variety of reasons. For example, queries will time out, or we will run pt-kill to snipe pt-table-checksum queries if they run too long. The tool must continue and handle these things as normal events. Errors and warnings should be captured and handled as follows:

  * Lock wait timeout, or query fails because it is killed: retry the query silently. Print an error and skip the chunk after one retry. The query was running too long or causing too much blocking, and it is possible that it is cached enough that it will run quickly the second time, but if not, then it shouldn't be tried a third time.
  * Connection fails with "server has gone away": print error, reconnect, choose the right database, set the --set-vars, set the transaction isolation level, set the binlog_format, and retry. If unable to reconnect, it is a fatal error.
  * Connections to slaves (for checking delay, etc) can time out and fail, or be killed, or the slave can be shut down. These should be reopened just like the main $dbh.
  * When a fatal error occurs, explicitly disconnect all $DBH connections.
  * Errors and warnings should be printed, except errors that will be retried should be silent the first time.
  * Warnings such as "WHERE clause too large" should not be repeated for a single table (only print once per table).
  * Change "WHERE clause too large for boundaries column; pt-table-sync may fail" to "WHERE clause too large for boundaries"
  * All errors and warnings should be printed with a timestamp in front (just time of day, no date needed).
  * If an error is retried and succeeds, such as a lock wait timeout, don't increment the ERRORS column. This is part of normal operation and is nothing to alarm the user about.
  * Errors and warnings both count towards the ERRORS column.
  * No other error reporting is needed by default; MKDEBUG should be enough.
  * When there is a CTRL-C signal, the tool should finish the current chunk, print the report line for the table, and exit.
  * Don't change binlog_format after the initial connection creation.

6. Tables and schema.

The tool should work through one database and one table at a time, not accumulating any memory as it goes. The tool must have low, stable memory usage for servers with millions of tables and billions of rows per table. The mysql.general_log and mysql.slow_log tables should be skipped (on all tools, not just this tool). The checksum table itself should also be skipped.

7. Progress indication.

Progress should be reported for each table individually. The estimate of the number of rows to process should come from EXPLAIN SELECT COUNT(*) from the table, and progress should be measured by rows checksummed. The EXPLAIN SELECT COUNT(*) should be issued once before beginning to checksum the table, and not updated afterwards.

$tbl->{progress} needs to be paused while waiting for $replica_lag_pr or $sys_load_pr else it may report weird results because time passes for both Progress but really no progress is being made on $tbl->{progress}.

8. Checksum only when replicas are caught up.

Between each chunk, the tool should check all replicas for replication delay and pause until the delay is less than --max-lag, as it used to in Maatkit. When the tool is waiting for a replica to catch up, it should print "Replica lag is $N seconds. Waiting" When it is still waiting, it should print the lag too, as well as the amount of time it's been waiting. The time reported should be the longest time observed on any replica. If possible, the hostname (no port etc is needed) would also be helpful, like "Replica lag is $N seconds on $hostname. Waiting." This will allow the user to see how far behind the slowest replica is. Between sleeps while checking for replication lag, issue a "SELECT 'pt-table-checksum keepalive'" query to the main $dbh. This will help keep it from timing out and disconnecting.

When any replica is not running (Seconds_behind_master is NULL or other error), pause checksumming and report "Replica $name is stopped. Waiting." Report this immediately when first seen, then report at intervals like usual.

9. Finding replicas.

The tool should support a --recursion-method=dsn=<DSN> method that specifies a table of DSNs it should select from instead of doing recursion through the processlist or SHOW SLAVE HOSTS. The dsn table should have an id column and a dsn column. The SELECT from that table should be ordered by ID.

10. Emptying results from previous runs.

The tool should support an option --empty-replicate-table that is enabled by default. This option causes the tool to issue the following query before checksumming db1.tbl1 (for example):

    DELETE FROM $checksum_table WHERE db='db1' AND tbl='tbl1';

This DELETE must not happen when the first table is being processed if using --resume.

11. Checking results on replicas.

The tool must support --replicate and --replicate-check at the same time, and--replicate-check should be enabled by default. When both options are given, the tool should pause after checksumming all chunks for a table, and wait for the checksums to execute on the replicas. The tool can detect when this is true by a query such as the following:

   SELECT MAX(chunk) FROM $checksum_table WHERE db=$db AND tbl=$tbl AND master_crc IS NOT NULL

This will return a number less than the max chunk the tool just executed, until the replicas are all caught up. This should report progress and ETA by the chunk number of the replica that's furthest behind. The progress indicator should be initialized to the max chunk number of the table, and the message should be "Replicated $db.$tbl to chunk $chunk of $maxchunk on $host ($secs lag). Waiting.\n". This must be done before printing the table's results. If any tables have differences, the tool should exit nonzero.

Requiring "AND master_crc IS NOT NULL" avoids a race condition when the system is fast but replication is slow. In such cases, we can select on the slave before the update for master_crc replicates; this causes a false-positive diff.

12. Store more information in the --replicate table.

The --replicate table should have three columns to indicate the boundaries:
  * nibble_index: The name of the index that is used for nibbling; VARCHAR(200).
  * lower_boundary: The set of values for the columns in this index, for the lower boundary; should be a TEXT column.
  * upper_boundary: Ditto, for the upper boundary
Suppose the table has this structure:
  CREATE TABLE t(a int, b int, KEY foo (a,b));
The --replicate table might have the following: | foo | 5,10 | 5,100 | We can quote values and use the quote-matching regular expressions we use in query parsing to pull them apart again. The --replicate table should also have the following columns: the chunk time (float), and error/warning count. The error/warning count should be only for the checksum query, not for the query that follows to update the other columns in the --replicate table (don't add any more queries/updates to this table than we do now).

See also 21. Extended checksum table columns

13. Resuming.

The tool should not have the feature to resume from a file anymore. Instead, it should resume from the most recently inserted row in the --checksum table. We should add an index to the CREATE TABLE statement, on the 'ts' column, to enable this. When --resume is enabled, the tool should execute queries like this:

  SELECT MAX(ts) FROM $checksum_tbl WHERE master_crc IS NOT NULL

  SELECT db, tbl, chunk, nibble_index, lower_boundary, upper_boundary... FROM $checksum_tbl WHERE ts <= $max_ts ORDER BY db DESC, tbl DESC, chunk DESC LIMIT 1;

The second query should provide the ability to re-populate all of the necessary data to resume checksumming. It is not important to re-populate all of the context for adjusting chunk sizes when resuming. The resumed checksumming must not start at chunk 0, because that would overwrite earlier entries in the table.

When resuming, print "Resuming from $db.$tbl at chunk $chunk, timestamp $ts"

When resuming, the tool should not delete existing entries from the --replicate table for the partially done table (see above). It is assumed that the tool previously deleted the entries in the table, and the chunk from which it is resuming is the maximum chunk for that table. The tool should verify this with a query such as the following:

  SELECT MAX(chunk) FROM $checksum_tbl WHERE db=$db AND tbl=$tbl AND master_crc IS NOT NULL

If max(chunk) doesn't match the chunk selected from the previous query, warn "Not resuming from max chunk ($N != $M); resuming may not work correctly"

The insert into the --replicate table is only a single row per chunk, so the chunk either got checksummed, or not; there is no half-done. However, if the checksum query completed, but the subsequent updates to fill in other columns didn't (i.e. master_crc and master_cnt done via $update_sth), then master_crc will be NULL and the queries above won't select it, so it will start from the 2nd to last chunk and its first checksum will overwrite/complete the last checksum that was interrupted before master_crc|cnt were updated.

Init the $tbl->{progress} jobsize to the estimated number of rows in the table ($nibble_iter->row_estimate()) - the numbers of rows already checksummed. This latter value is obtained by:

   SELECT SUM(master_cnt) FROM $checksum_tbl WHERE db=$db AND tbl=$tbl AND master_crc IS NOT NULL

14. Features to be removed.

Remove the following: --algorithm, --chunk-column, --chunk-range, --count, --crc, --lock, --modulo, --offset, --optimize-xor, --probability, --resume-replicate, --since, --save-since, --schema, since-column, --single-chunk, --verify, --wait, --zero-chunk, --arg-table, --explain-hosts, --checksum, --slave-lag, --tab, --unchunkable-tables,--sleep-coef, --sleep

15. Avoiding infinite loops and bad query plans while nibbling.

When the table has no primary key or unique index, the tool should choose the index with the highest cardinality, which is probably the one with the most columns. Therefore the tool should prefer indexes with multiple columns over single-column indexes when there is no unique/primary index. This can help avoid oversized chunks.

When the tool gets the next upper boundary with a LIMIT/OFFSET query, it should return 2 rows, and examine them. If the values are identical, then the upper boundary of the rows is not unique, and the chunk could be oversized. In this case, the tool should examine the EXPLAIN rows estimate, and apply --chunk-size-limit to avoid too large chunks. Otherwise, if those rows differ, then the tool is guaranteed to size the chunk exactly right, and it should not pay attention to EXPLAIN's rows estimate. This will reduce false-positive oversized chunks that will be skipped otherwise. The decision whether to check for oversized chunks should be based only on the 2 uppermost rows, and not on whether the index is UNIQUE.

The EXPLAIN must be run regardless of whether the chunk is oversized, to ensure that the query will use the correct index. If the EXPLAIN doesn't say that MySQL is choosing the index that the Nibble Iterator chose, then the tool must skip that chunk. The skipped chunk number should not be reused, that is, there should be a gap in the sequence of chunk numbers in the --replicate table.

The tool must EXPLAIN the next-upper-boundary query before running it. If MySQL decides not to use an index for that query, it could impact the server very badly. If the tool can't run the next-upper-boundary query, it should skip the rest of the table and warn so the user is aware: "Aborting $db.$tbl because $chunkno cannot be nibbled safely."

When there is no suitable index at all, if the EXPLAIN rows estimate is not too high, the tool should checksum with 1=1 WHERE clause instead of throwing an error about no good index for nibbling. If the table is oversized, then the tool should print a warning and skip the table: "Skipping $table because there is no good index and the table is oversized."

The tool must compare the lower boundary of the current nibble to the previous nibble's lower boundary. If they are identical, it is an infinite loop, and the tool should print a warning and skip the table: "Skipping $table because an infinite loop was detected at chunk $chunkno."

16. Output control

The tool should support --quiet to suppress --progress messages and tables that don't have diffs, and --quiet --quiet to suppress everything but errors. --quiet is used for cases where you only want to know what's important: diffs, warnings, and errors. All warnings, and especially all errors, are important. They signal something out of the ordinary that you should investigate and fix. --quiet --quiet is used for cases where only the tool's exit status is evaluated. In this case, errors are still printed because errors should be extremely rare and demand attention: code bugs, checksum query causing MySQL error, etc.--things that shouldn't happen.

17. Avoiding activities that might block the server or overload a slave

a) The tool should set innodb_lock_wait_timeout=1 automatically to avoid blocking other queries if it is blocked itself.

b) The tool should have a --max-load parameter, which is an array of STATUS_VAR[=VALUE] values that defaults to Threads_running=25. If an explicit VALUE is given, use that as the max value, else issue a SHOW GLOBAL STATUS when it first connects to the server, observe the value of the variable, and set its thresholds to 20% higher than the observed value. So if it sees Threads_connected=23, for example, then it will multiply 23 * 1.2 = 27.6 and set the threshold for Threads_connected to 27. Then, after every checksum query, the tool should issue SHOW GLOBAL STATUS and pause until the specified values are lower than the thresholds. This should be similar to the way replication lag checking is done: do it silently first, then after a timeout, print "Pausing because $variable=$value, $variable2=$value2". This will help throttle checksumming based on server load. If the load is high because of checksum queries, the pause will allow the server to recover between chunks. If the user specifies an empty value for the option (i.e. --max-load ""), this pausing behavior should be disabled.

c) Here is a pathological worst case: some table has 0 rows on master, many rows on replica; replicates in single 1=1 chunk, takes for ever to finish. To avoid this, when a table will be done as 1=1, we should check the table on all of the replicas and see if it has too many rows. Skip it if it's too big on any replica.

18. Add back --where

The tool should add the specified clause to all queries against all tables. If it results in a syntax error "ERROR 1064 (42000): You have an error in your SQL syntax", the tool should print the error and exit. If it results in an unknown column error "ERROR 1054 (42S22): Unknown column '$column' in 'where clause'", then the tool should not add the clause for this table, and it is neither an error nor a warning.

19. Remove the ability to name more than one host on the command-line.

If more than one is specified, exit with an error. If no host is specified, behave as the mysql command-line clients would: attempt to connect to an empty host or localhost. We probably figured out the correct behavior for this in some other tools; we should copy what we did for those.

20. Repurpose options

a) Re-implement (and possibly rename) --recheck We need the original functionality: look for chunks that differ, and re-checksum them. This is so that we can avoid re-checksumming the whole table when we're doing iterations of checksumming and syncing. The functionality that is bound to --recheck right now needs a new name. We need to ensure that --recheck doesn't cause a lot of memory to be consumed for some reason, however: when we are doing --recheck, we need to set the next-table/chunk iterator to "get the next chunk with differences".

24. Check for rows past first and last nibble boundaries

Question: Does pt-table-checksum miss any rows on replicas that don't exist on masters? If the master has rows 1..1000, and the replica also has 1000..2000, do they get checksummed in a open-ended last chunk? Is there an open-ended first chunk too?

Answer: No, all chunks are closed on both ends. In our effort to have a single nibble query, it was necessary to have close-ended, inclusive chunks like WHERE id >= 1 AND id <= 10. An open-ended chunk on either end (the high or low end) requires a 2nd nibble query.

Solution: The open-ended query doesn't need to checksum the data but simply count the rows. For example, on the master with 1k rows, the final "checksum" query simply does SELECT NULL AS crc, COUNT(*) AS cnt FROM tbl WHERE id>1000. On the master, this will result in <NULL, 0> but on the slave it will result in <NULL, 1000>.

Blueprint information

Baron Schwartz
Baron Schwartz
Daniel Nichter
Series goal:
Milestone target:
milestone icon 2.0.1
Started by
Daniel Nichter
Completed by
Daniel Nichter


Status of spec implementations:

1. Done
2. Done
3. Done
4. Done
5. Done
6. Done
7. Needs to be fixed (pause $tbl->{progress} while waiting elsewhere)
8. Done
9. Done
10. Done
11. Needs to be fixed (new progress message)
12. Done
13. Needs to be fixed ($tbl->{progress} jobsize)
14. Done
15. Done
16. Done
17a. Done
17b. Done
17c. Done
18. Partial (handle errors caused by --where)
19. Done
20a. Done
24. Partial (doesn't print boundaries when --explain --explain is given)

[Baron's note: Let's implement --max-load and leave the other items for later. I need --max-load for some customers. I believe that Deva needs --filter for some customers too, but that might not have to go into this release, and we can discuss it with him.]


Work Items

This blueprint contains Public information 
Everyone can see this information.