pt-table-checksum TODO

Registered by Daniel Nichter on 2011-12-29

These features were not implemented in 2.0.1. They are features we plan to implement in a future version of the tool.

Developers: once we decide on a version in which to do some of these features, create a new blueprint like pt-table-checksum-<version> and cut and paste the items from this blueprint into it. Leave this blueprint open as a running list of TODO items and other ideas.

### 1. Remove --check-interval

We don't need --check-interval anymore. 1 second is too long when we have potentially targeted a very short chunk time. We can just make the tool sleep for --chunk-time when replicas are lagging.
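The proposed behavior can be sketched as a simple wait loop. This is only an illustration in Python, not the tool's Perl code; `wait_for_replicas`, `get_max_lag`, and the injectable `sleep` parameter are hypothetical names, and `max_lag` and `chunk_time` merely mirror the --max-lag and --chunk-time options.

```python
import time


def wait_for_replicas(get_max_lag, max_lag, chunk_time, sleep=time.sleep):
    """Sleep chunk_time seconds between lag checks until replicas catch up.

    get_max_lag is a hypothetical callable returning the worst replica lag
    in seconds. Returns how many times the loop slept.
    """
    sleeps = 0
    while get_max_lag() > max_lag:
        # No separate check interval: the sleep is tied to the chunk time.
        sleep(chunk_time)
        sleeps += 1
    return sleeps
```

With a chunk time of 0.05 seconds, the loop rechecks lag twenty times per second instead of once per second.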

### 2. Extended checksum table columns

Consider adding skipped tables to the checksum table, with no checksums, no rows, no chunks, and skipped=1, so we can find skipped tables with a query on skipped>chunks.

### 3. Add --filter to replace some removed features like --modulo and --offset

The filter is Perl code that is compiled into a function, similar to the way pt-query-digest handles its --filter option. The function receives a hashref named $chunk with information about each chunk, in the following keys:

- db, the database name
- tbl, the table name
- chunk, the chunk number

If the filter returns 1, the chunk is checksummed. If it returns 0, the chunk is skipped, and the SKIPPED column is incremented. If the filter throws an error, the whole table is skipped, and the error is printed.

An example filter to do approximately 1/7th of the table every day:

  --filter '($chunk->{chunk} % 7) == (sprintf("%d", time/86400) % 7)'

An example filter to skip a table for some reason:

  --filter 'die "Skipping table $chunk->{db}.$chunk->{tbl} because I said so"'
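The semantics described above (a true result checksums the chunk, a false result skips it and counts it, an error skips the rest of the table) can be modeled roughly as follows. This is a Python sketch for illustration only; real filters would be Perl, and `apply_filter`, `one_seventh_filter`, and `skip_table_filter` are hypothetical names.

```python
def apply_filter(chunk_filter, db, tbl, chunk_numbers):
    """Run chunk_filter over one table's chunks.

    Returns (checksummed_chunks, skipped_count, error_message_or_None).
    """
    checksummed, skipped = [], 0
    for number in chunk_numbers:
        chunk = {"db": db, "tbl": tbl, "chunk": number}
        try:
            keep = chunk_filter(chunk)
        except Exception as err:
            # A filter error stops work on this table and reports the error.
            return checksummed, skipped, str(err)
        if keep:
            checksummed.append(number)
        else:
            skipped += 1  # corresponds to incrementing the SKIPPED column
    return checksummed, skipped, None


def one_seventh_filter(now):
    """Mirror of the 1/7th example; now is seconds since the epoch."""
    day = int(now / 86400)
    return lambda chunk: chunk["chunk"] % 7 == day % 7


def skip_table_filter(chunk):
    """Mirror of the 'die' example: always raises."""
    raise RuntimeError(
        f"Skipping table {chunk['db']}.{chunk['tbl']} because I said so")
```

Passing the timestamp in explicitly, rather than reading the clock inside the filter, keeps the sketch deterministic; the Perl example reads `time` directly.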

### 4. Automatically avoid false positives

a. Automatically use --float-precision to avoid false positives. Set --float-precision to a default value of 12. This is TBD. Or, perhaps, if the tool is checking --replicate-check (as it should by default), it could notice out-of-sync chunks and decrease its float precision when the table has a float/double column. But that seems silly: why not just use a lower value to begin with? Let's defer this item and return to it later.

### 5. Add safety checks

a. Replication filter

Add replication filter checks similarly to how pt-table-sync does it: if there are any binlog_{do,ignore}_db or replicate-* filters on any server, refuse to run unless the option to check for this is disabled. This feature is billable to issue 19429.

b. Table existence

Before checksumming any table, check for its existence on all replicas. If it doesn't exist, skip it with the message "Skipping $db.$tbl because it doesn't exist on $host." This feature is billable to issue 19429.

c. binlog_format on replicas

When recursing to replicas and checking for filters and so forth, also check for binlog_format=ROW or MIXED; if found, warn and abort.

d. Timezone

Check the time zone of every connection the tool opens. If any of them differs from the others, stop with an error message indicating that you can solve it by setting --set-vars time_zone=foo. See also bug 912470.

e. read_only

Detect whether the server the tool is connected to is read_only. If so, the tool may have been pointed at a replica by accident; warn and stop.
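Two of the checks above, table existence (b) and time zone (d), reduce to comparing per-host data that the tool would already have gathered. The following Python sketch illustrates that logic only; the function names and the dictionary inputs are assumptions, not part of the tool, and it opens no database connections.

```python
def missing_table_messages(db, tbl, tables_by_host):
    """Return one skip message per host that lacks db.tbl.

    tables_by_host maps a host name to a set of (db, tbl) pairs.
    """
    messages = []
    for host, tables in tables_by_host.items():
        if (db, tbl) not in tables:
            messages.append(
                f"Skipping {db}.{tbl} because it doesn't exist on {host}.")
    return messages


def check_time_zones(zone_by_connection):
    """Raise if the connections do not all report the same time zone."""
    if len(set(zone_by_connection.values())) > 1:
        detail = ", ".join(
            f"{name}={zone}"
            for name, zone in sorted(zone_by_connection.items()))
        raise RuntimeError(
            f"Time zones differ ({detail}); "
            "try setting --set-vars time_zone=foo")
```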

### 6. Add --recheck

This feature needs to be implemented as it was in v1.0, where it looks at the replicate table and re-checksums any chunks found to be different on one or more replicas.
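As a rough illustration of the selection step, the sketch below picks out the chunks that differ on at least one replica. It is written in Python over plain dictionaries; the `chunks_to_recheck` name and the row keys (`master_crc`, `this_crc`, `master_cnt`, `this_cnt`) are assumptions about what each replica's copy of the replicate table exposes, not a description of the v1.0 implementation.

```python
def chunks_to_recheck(rows):
    """Return sorted, de-duplicated (db, tbl, chunk) keys that differ.

    rows holds one dictionary per chunk per replica. A chunk differs when
    its checksum or its row count does not match the master's.
    """
    differing = set()
    for row in rows:
        crc_differs = row["master_crc"] != row["this_crc"]
        cnt_differs = row["master_cnt"] != row["this_cnt"]
        if crc_differs or cnt_differs:
            differing.add((row["db"], row["tbl"], row["chunk"]))
    return sorted(differing)
```

De-duplicating by chunk means a chunk that differs on several replicas is re-checksummed only once.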

### 7. Make the checksum queries use LOW_PRIORITY hint

I think pt-archiver has a --low-priority option or something; see if we can emulate that too. This should be enabled by default, just as innodb_lock_wait_timeout=1 is set by default.

Blueprint information

Not started
Baron Schwartz
Baron Schwartz