Redesign pt-table-sync

Registered by Baron Schwartz on 2012-01-04

1. Use nibbling.

Remove all of the chunking, streaming, etc and just use plain nibbling like pt-table-checksum does.

2. Use optimistic locking/waiting

Don't lock, don't wait. Optimize for the common case: most servers have only a small amount of changing data. For each table, start out by assuming that both servers' copy of the table is static, and checksum in chunks until something is found to be different, then use the normal algorithms (example: lock on master, wait for replica to catch up) to ensure that there are no artifacts caused by changes in data while the checksumming is happening. This will make the tool much faster because it won't have to wait for replication to catch up all the time. When differences are found, exponentially increase a variable that ranges from 0..1 which specifies whether we should switch to pessimistic mode. If it's over 0.8 or something like that, be pessimistic rather than having to re-checksum every chunk. If chunks are found not to differ for a while, the variable should decrease and we should revert back to optimistic mode.

3. Handle special cases efficiently

Handle these:

- one table is empty. In this case, bulk-insert or bulk-delete instead of doing it row by row.
- a chunk doesn't exist on either server / is empty. Same thing

4. More flexible, less confusing host specs

It's really confusing to figure out which server is considered the authoritative copy of the data, and which server is going to get updated. Sometimes you run it on one server and it treats another as authoritative, but makes the updates on yet another... all of this is very confusing and not flexible enough for a lot of cases. I'll attach a patch to illustrate something I had to modify to make it do what I needed. In general, I think we need to allow the user to specify: this server is the master copy, and this server is where changes should be made (independently).

5. Add safety and sanity checks

Look at what we did in pt-table-checksum and ensure we have good coverage of scary things.

6. Avoid false positives

Sometimes people run pt-table-checksum and it finds differences, but then pt-table-sync changes nothing. Sometimes this is because of CRC32 collisions. When we know there are differences, at least the row-by-row checksums should use MD5() or another function that's stronger than CRC32().

7. Add features to save the data

Allow the user to save a copy of all modified data into a) a file b) a table with a similar name. Example: if I'm syncing 'users', then save a copy of the original value of all updated/deleted rows in the users table into __users_saved or something, with an extra __action column that says I/U/D, and an extra __timestamp column, to keep a record of what was done.

8. Consider whether we can resolve PK diffs

Sometimes the problem is that the PK is conflicting on two servers -- i.e. the same row exists in two places, with different PKs. If there is another unique index, we can use that instead of the PK as the authoritative "identity" of the row. For example, this happens a lot when we have an auto_increment ID but the real "identity" of the user is the email address or username or something like that. The way to fix the data conflicts is often to reconcile the conflicting IDs while leaving the rest of the row as-is. Currently the tool can't do this. How hard is it to add? Investigate.

9. Get rid of quoting madness

Use prepared statements, period :)

Blueprint information

Status:
Not started
Approver:
None
Priority:
Essential
Drafter:
None
Direction:
Needs approval
Assignee:
Daniel Nichter
Definition:
New
Series goal:
None
Implementation:
Not started
Milestone target:
None

Related branches

Sprints

Whiteboard

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.