pt-archiver: PK index is not always efficient

Registered by Muhammad Irfan on 2016-01-20

pt-archiver probably requires unique index to be able to ascend and archive the rows in "chunks". pt-archiver always picks primary key regardless If other unique keys in source table is efficient enough. Due to this, it will slow down entire job process. Below is test case:

CREATE TABLE archive_test (
ID INT UNSIGNED NOT NULL AUTO_INCREMENT,
UID INT UNSIGNED NOT NULL DEFAULT '0',
NAME CHAR(10) NOT NULL,
PHONE INT UNSIGNED DEFAULT '0',
PRIMARY KEY (`ID`),
UNIQUE KEY `UID_IDX` (`UID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

I executed pt-archiver with --dry-run option to print queries.
pt-archiver --user=irfan --password=****** --source h=localhost,D=test,A=utf8,t=archive_test --file '/tmp/%Y-%m-%d-tabname' \
--where "UID IN (6508962,367622,2315962,7190511,3874710)" \
--no-delete --limit=10000 \
--statistics --dry-run

SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`PRIMARY`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710)) AND (`id` < '108211')
ORDER BY `id` LIMIT 10000
SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`PRIMARY`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710)) AND (`id` < '108211') AND ((`id` > ?))
ORDER BY `id` LIMIT 10000

And below is equivalent EXPLAIN of above SELECT & it picked PRIMARY key with range scan.
mysql> EXPLAIN SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`PRIMARY`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710)) AND (`id` < '108211')
ORDER BY `id` LIMIT 10000;
+----+-------------+---------------+-------+---------------+---------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+------+-------+-------------+
| 1 | SIMPLE | archive_test | range | PRIMARY | PRIMARY | 4 | NULL | 53547 | Using where |
+----+-------------+---------------+-------+---------------+---------+---------+------+-------+-------------+
1 row in set (0.00 sec)

Below, I forced pt-archiver to use other unique key in table (UID_IDX) as per --where criteria.
pt-archiver --user=irfan --password=****** --source h=localhost,D=test,A=utf8,t=archive_test,i=UID_IDX --file '/tmp/%Y-%m-%d-tabname' \
--where "UID IN (6508962,367622,2315962,7190511,3874710)" \
--no-delete --limit=10000 \
--statistics --dry-run

SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`uid_idx`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710))
ORDER BY `uid` LIMIT 10000
SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`uid_idx`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710)) AND ((`uid` > ?))
ORDER BY `uid` LIMIT 10000

And here is equivalent EXPLAIN output of above produced SELECT for archival. And as per EXPLAIN it can be verified that UID_IDX unique index here is far efficient then PRIMARY key.
mysql> EXPLAIN SELECT /*!40001 SQL_NO_CACHE */ `id`,`uid`,`name`,`phone` FROM `test`.`archive_test` FORCE INDEX(`uid_idx`)
WHERE (UID IN (6508962,367622,2315962,7190511,3874710))
ORDER BY `uid` LIMIT 10000;
+----+-------------+---------------+-------+---------------+---------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+------+------+-----------------------+
| 1 | SIMPLE | archive_test | range | UID_IDX | UID_IDX | 4 | NULL | 5 | Using index condition |
+----+-------------+---------------+-------+---------------+---------+---------+------+------+-----------------------+
1 row in set (0.00 sec)

Either pt-archiver should be smart enough to pick right unique index as PK is not always efficient as in this case as per query execution plan or there should be some option to pt-archiver to ignore PK so tool can pick other unique index.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
Muhammad Irfan
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.