Write changed page bitmap in XtraDB

Registered by Laurynas Biveinis

Operation and architecture overview

The InnoDB/XtraDB changed page tracking is done by a new thread (log0online.h, log0online.c) that reads and parses the (space; page) pairs out of the written log data. The tracking is controlled by a new read-only server variable --innodb-track-changed-pages=[TRUE|FALSE]. To hold the part of the tracking state that is shared with the rest of the log system, the log_sys_t struct is expanded with the new 'tracked_lsn' field.

The 'tracked_lsn' field contains an LSN up to which all the changes have been parsed. There is a maximum limit for the (current LSN - tracked LSN) value, violation of which will cause server operation to stop until the tracking catches up. This limit is equal to the maximum checkpoint age.

For better concurrency, the tracked_lsn field is not protected by the log_sys mutex or any other mutex. It is accessed using atomic operation primitives. InnoDB in 5.1 does not have the 64-bit primitives, so they are backported from InnoDB in 5.6. On platforms lacking 64-bit atomics, a fallback implementation protects the field with the log_sys mutex.

On server startup, the log reader thread opens the last tracked bitmap file, truncates it to a multiple of the bitmap block length, and reads the last block to find the last LSN tracked in that file. If the last block fails its checksum check or does not have the last block flag set, the file is read backwards one block at a time until a block satisfying both conditions is found. This LSN is then compared with the server start LSN. If they are not equal, there is a hole in the tracked LSN interval, e.g. due to a crash or a srv_fast_shutdown=2 shutdown. The hole is then either closed by immediately reading and parsing the untracked log data or, if part of the required log data has already been overwritten, diagnosed. In the latter case the changed page bitmap data is usable only from the latest LSN. The log reader thread (srv_redo_log_follow_thread) then enters a loop of waiting for srv_checkpoint_completed_event and reading and parsing the log data.
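The startup scan can be sketched as follows, in Python rather than the server's C. The field offsets follow the File format section; the big-endian byte order, the CRC-32 checksum and the names make_block and find_last_tracked_lsn are assumptions made for illustration only, not the actual log0online.c code.

```python
import struct
import zlib

BLOCK_SIZE = 4096  # bitmap block length from the File format section


def block_checksum(block):
    # Stand-in checksum: CRC-32 over everything except the last 4 bytes.
    # The real checksum algorithm is not specified in this document.
    return zlib.crc32(block[:BLOCK_SIZE - 4]) & 0xFFFFFFFF


def make_block(is_last, start_lsn, end_lsn):
    # Helper for building test blocks: flag, start LSN, end LSN, zero padding,
    # then the checksum in the final 4 bytes (big-endian is an assumption).
    body = struct.pack(">IQQ", is_last, start_lsn, end_lsn)
    body = body.ljust(BLOCK_SIZE - 4, b"\0")
    return body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)


def find_last_tracked_lsn(data):
    """Return the end LSN of the last valid 'last block', or None."""
    usable = len(data) - len(data) % BLOCK_SIZE  # drop a partial trailing block
    offset = usable - BLOCK_SIZE
    while offset >= 0:
        block = data[offset:offset + BLOCK_SIZE]
        is_last, _start_lsn, end_lsn = struct.unpack_from(">IQQ", block, 0)
        (stored,) = struct.unpack_from(">I", block, BLOCK_SIZE - 4)
        if stored == block_checksum(block) and is_last == 1:
            return end_lsn
        offset -= BLOCK_SIZE  # step backwards one block
    return None


def has_tracking_hole(last_tracked_lsn, server_start_lsn):
    # A hole exists when tracking did not stop exactly at the start LSN.
    return last_tracked_lsn != server_start_lsn
```

A file with nothing valid in it yields None, which a caller would presumably treat as "start tracking from the current LSN".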

The log-writing thread behaviour is adjusted as follows. First, to ensure that the maximum tracked LSN age limit holds, all pending writes check it and delay log write operations if necessary (log_reserve_and_open, log_check_margins). After a certain number of retries the log write operation is allowed to proceed even if that causes loss of changed page information (such a situation is diagnosed with a warning in the error log), as we prioritize core server operation over changed page tracking. At the completion of each checkpoint (log_io_complete_checkpoint) the srv_checkpoint_completed_event is signaled, waking up the log reader thread.
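A rough sketch of the bounded-retry delay, assuming a hypothetical retry limit and a callable that returns the current tracked LSN. The function name, the retry count and the sleep interval are illustrative and are not taken from log_reserve_and_open or log_check_margins.

```python
import time

MAX_RETRIES = 5  # hypothetical; the text only says "a certain number of retries"


def wait_for_tracking(current_lsn, get_tracked_lsn, max_tracked_age,
                      max_retries=MAX_RETRIES, sleep_s=0.0):
    """Return True if the write can proceed without losing tracking data.

    Returns False when the retries are exhausted; the caller then proceeds
    anyway and would log a warning about lost changed page information.
    """
    for _ in range(max_retries):
        if current_lsn - get_tracked_lsn() <= max_tracked_age:
            return True
        time.sleep(sleep_s)  # give the log reader thread time to catch up
    return False
```

The write is never blocked indefinitely: after max_retries checks it goes ahead regardless, which mirrors the stated priority of server operation over tracking.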

On a slow server shutdown, logs_empty_and_mark_files_at_shutdown is made to loop until the log reader thread has completely caught up with the written log.

Whenever the log reader thread wakes up, it reads and parses the log data (log_online_follow_redo_log) in a way similar to what log recovery does. First, the last checkpoint LSN is copied out of log_sys and checked for having advanced. Then the required data interval is rounded to OS_FILE_LOG_BLOCK boundaries and read into the read buffer in 2^14-byte chunks. Since the LSN rounding can cause the same log data to be read more than once, already-read data is skipped and the remaining data is appended to the parse buffer in OS_FILE_LOG_BLOCK chunks, skipping over log block headers and trailers. The parse buffer is then parsed one record at a time (using recv_parse_log_rec from the recovery subsystem). On success, the (space; page) pairs are added to the bitmap. Incomplete records, i.e. those that cross log block boundaries, cause a parse error and are handled by shifting the unparsed data to the start of the parse buffer and retrying after the next read.
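The rounding, header/trailer stripping and carry-over of incomplete records can be illustrated with a toy model. The block, header and trailer sizes below (512, 12 and 4 bytes) are assumed values, and the length-prefixed record format is invented for the example; it is not the real redo log record format that recv_parse_log_rec handles.

```python
LOG_BLOCK_SIZE = 512  # assumed log block size
LOG_BLOCK_HDR = 12    # assumed log block header size
LOG_BLOCK_TRL = 4     # assumed log block trailer size


def round_interval(start_lsn, end_lsn):
    """Widen [start_lsn, end_lsn) to log block boundaries."""
    start = start_lsn - start_lsn % LOG_BLOCK_SIZE
    end = -(-end_lsn // LOG_BLOCK_SIZE) * LOG_BLOCK_SIZE  # round up
    return start, end


def strip_blocks(raw):
    """Drop the header and trailer of every whole log block in raw."""
    out = bytearray()
    for off in range(0, len(raw) - LOG_BLOCK_SIZE + 1, LOG_BLOCK_SIZE):
        block = raw[off:off + LOG_BLOCK_SIZE]
        out += block[LOG_BLOCK_HDR:LOG_BLOCK_SIZE - LOG_BLOCK_TRL]
    return bytes(out)


def parse_available(parse_buf):
    """Toy parser: each record is a 1-byte length followed by the payload.

    Returns the complete records and the unparsed tail, which the caller
    keeps at the start of the parse buffer until more data has been read.
    """
    records, pos = [], 0
    while pos < len(parse_buf):
        length = parse_buf[pos]
        if pos + 1 + length > len(parse_buf):
            break  # record continues in data that has not been read yet
        records.append(bytes(parse_buf[pos + 1:pos + 1 + length]))
        pos += 1 + length
    return records, bytes(parse_buf[pos:])
```

For instance, parsing b"\x02ab\x03cd" yields one record and leaves b"\x03cd" behind; once the next read appends b"e", the second record parses in full.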

The in-memory changed page bitmap structure is an InnoDB red-black tree (ut0rb) of bitmap blocks. Each block is identified by the (space id, 1st page id in this block) pair, where the 1st page id is only allowed to be a multiple of the number of pages covered by one bitmap block. When the tree data is written to disk, its nodes are recycled into a free list. They are never released back to the heap, in order to prevent heap fragmentation.
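A simplified model of this structure, using a Python dict sorted at flush time as a stand-in for the ut0rb red-black tree. The class and method names are hypothetical, and the mapping of "bit 0" to the least significant bit of a byte is an assumption.

```python
BITMAP_BYTES = 4056                 # bitmap payload per block (see File format)
PAGES_PER_BLOCK = BITMAP_BYTES * 8  # 32448 pages covered by one block


class ChangedPageMap:
    def __init__(self):
        self.blocks = {}     # (space_id, first_page_id) -> bytearray
        self.free_list = []  # recycled buffers, never returned to the heap

    def mark(self, space_id, page_id):
        """Set the bit for one changed page."""
        first = page_id - page_id % PAGES_PER_BLOCK
        key = (space_id, first)
        block = self.blocks.get(key)
        if block is None:
            # Reuse a recycled buffer when one is available.
            block = self.free_list.pop() if self.free_list else bytearray(BITMAP_BYTES)
            self.blocks[key] = block
        bit = page_id - first
        block[bit // 8] |= 1 << (bit % 8)

    def flush(self):
        """Return all blocks in key order, then recycle their buffers."""
        out = [(key, bytes(self.blocks[key])) for key in sorted(self.blocks)]
        for block in self.blocks.values():
            block[:] = bytes(BITMAP_BYTES)  # zero the buffer before reuse
            self.free_list.append(block)
        self.blocks.clear()
        return out
```

Keeping flushed buffers on a free list means a steady-state workload stops allocating after the first few checkpoints, which is the point of the recycling described above.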

TODO: missing implementation items: 1) bitmap file rotate; 2) bitmap file rotate on user request; 3) INFORMATION_SCHEMA.CHANGED_PAGES table; 4) I/O stats.

Additional information in SHOW ENGINE INNODB STATUS

When log tracking is enabled, the following additional fields are displayed in the LOG section of the SHOW ENGINE INNODB STATUS output:

- "Log tracked up to:" displays the LSN up to which all changes have been parsed and stored as a bitmap on disk by the log tracking thread.
- "Max tracked LSN age:" displays the maximum limit on how far behind the log tracking thread may be.

File format

The changed page bitmap consists of 4K blocks that form variable-length runs. Each run has complete tracking information for a certain LSN interval, and each block has the following fields, given as byte offset (width in bytes):
- 0 (4): Last block flag. 1 if the current block is the last one in the current run, 0 otherwise.
- 4 (8): Starting tracked LSN of the current run. Equal for all blocks in the same run.
- 12 (8): Last tracked LSN of the current run. Equal for all blocks in the same run.
- 20 (4): Space ID of the tracked pages in the current block.
- 24 (4): Page ID of the first tracked page in the current block.
- 28 (4): Unused space to align the start of bitmap data at 8 bytes.
- 32 (4056): The changed page bitmap.
- 4088 (4): Unused space to align the end of bitmap data at 8 bytes.
- 4092 (4): The checksum of the current block.
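The layout above maps directly onto a fixed-size record. The following sketch uses Python's struct module; the big-endian byte order is an assumption (the list does not state one), and the BitmapBlock type and the helper names are hypothetical.

```python
import struct
from collections import namedtuple

# I = 4-byte unsigned, Q = 8-byte unsigned, 4x = 4 padding bytes,
# 4056s = raw bitmap bytes. ">" selects big-endian with no implicit padding.
BLOCK_FORMAT = ">IQQII4x4056s4xI"

BitmapBlock = namedtuple(
    "BitmapBlock",
    "is_last start_lsn end_lsn space_id first_page_id bitmap checksum",
)


def pack_block(block):
    """Serialize a BitmapBlock into a 4096-byte on-disk block."""
    return struct.pack(BLOCK_FORMAT, *block)


def unpack_block(raw):
    """Parse a 4096-byte on-disk block into a BitmapBlock."""
    return BitmapBlock(*struct.unpack(BLOCK_FORMAT, raw))
```

struct.calcsize(BLOCK_FORMAT) evaluates to 4096, which is a quick check that the offsets and widths in the list add up to one block.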

The bitmap representation is a straightforward uncompressed bitmap: byte 0, bit 0 of the bitmap corresponds to page 0, bit 1 to page 1, byte 1, bit 0 to page 8, etc., all relative to the first tracked page of the block. A single block holds 4056 bytes = 32448 bits of bitmap data. No bitmap compression is currently used. However, because each block stores the page id of its 1st tracked page, blocks for page ranges without changes need not be stored, which limits the sparseness of the bitmaps somewhat, especially if only pages with high ids are being changed.
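The bit addressing can be written out as a short sketch. It assumes that "bit 0" means the least significant bit of a byte, which the text does not state explicitly; the function names are illustrative.

```python
BITMAP_BYTES = 4056
PAGES_PER_BLOCK = BITMAP_BYTES * 8  # 32448


def locate(page_id):
    """Return (first_page_id, byte_index, bit_index) for a page id."""
    first_page_id = page_id - page_id % PAGES_PER_BLOCK
    offset = page_id - first_page_id
    return first_page_id, offset // 8, offset % 8


def changed_pages(first_page_id, bitmap):
    """Yield the ids of all pages whose bit is set in one block's bitmap."""
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                yield first_page_id + byte_index * 8 + bit
```

A page with id 40000, for example, falls into the block starting at page 32448, at byte 944, bit 0.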

XtraBackup consumption

https://blueprints.launchpad.net/percona-xtrabackup/+spec/changed-page-bmp-inc-backups
Instead of iterating over all data files to find the pages whose last modification LSN is greater than the LSN of the last full backup, XtraBackup reads the bitmap data to find the same set of pages.
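A consumer along these lines might select pages as sketched below. This is only an illustration of the idea, not XtraBackup code: blocks are modelled as plain tuples, the rule for skipping runs that end at or before the backup LSN is an assumption, and bit 0 is taken to be the least significant bit.

```python
def pages_to_copy(blocks, backup_lsn):
    """Collect (space_id, page_id) pairs changed after backup_lsn.

    blocks: iterable of (start_lsn, end_lsn, space_id, first_page_id, bitmap).
    """
    pages = set()
    for _start_lsn, end_lsn, space_id, first_page_id, bitmap in blocks:
        if end_lsn <= backup_lsn:
            continue  # the whole run precedes the backup
        for i, byte in enumerate(bitmap):
            for bit in range(8):
                if (byte >> bit) & 1:
                    pages.add((space_id, first_page_id + i * 8 + bit))
    return sorted(pages)  # each page is listed once, ordered by space id
```

A run that straddles backup_lsn is included in full, so the result may contain a few pages changed just before the backup; that is conservative rather than incorrect.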

Original description in bug 742162 (note that current implementation has deviated):

This is a proposal from Peter

Current incremental backups are a pain for large databases because they require a complete scan. The idea is to add a feature that tracks changes
in the database so that data is copied only if it was changed. To support this, the server needs to be modified to maintain
a log of changed pages, enabled by the option innodb_modified_pages_log=<file_prefix>

InnoDB will then create a series of log files ib_modified_log.000001 (and so on with increasing numbers), with the number increasing on each MySQL restart
or on reaching a certain size (for example 1GB) (in the future we might add a feature to rotate them)

The log file will contain records consisting of TIMESTAMP, LSN_FROM, LSN_TO, <LIST OF PAGES+TABLESPACES FLUSHED>. Each block should have its length and
checksum at the start of the block, so that a partial block written during a crash is detected.

When MySQL is about to write a series of pages to disk (i.e. when they are picked for the doublewrite buffer), we store the list of updated pages and
the LSN, and fsync() before the pages are written to their final locations on disk.

We store both LSN_FROM and LSN_TO as checkpoint LSNs to be able to catch the case where data was corrupted in some way. For example, if we temporarily disabled
this functionality by mistake and then enabled it again, there will be a gap in the ranges (the next LSN_FROM will not match the LSN_TO of the previous record), which
means the log will be unusable.
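The continuity check implied by storing both LSNs can be sketched in a few lines. Records are modelled here as (LSN_FROM, LSN_TO) tuples, and the function name is hypothetical.

```python
def find_gaps(records):
    """Return (prev_lsn_to, next_lsn_from) for every break in the chain.

    records: list of (lsn_from, lsn_to) tuples in file order.
    """
    gaps = []
    for (_, prev_to), (next_from, _) in zip(records, records[1:]):
        if next_from != prev_to:
            gaps.append((prev_to, next_from))
    return gaps
```

An empty result means the records form one uninterrupted LSN range; any entry marks a point from which the log cannot be trusted for incremental backups.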

Integration With Xtrabackup:

Xtrabackup will have an option to read this set of log files. It will check the first record in each log file to determine which log file to start from, and then
identify the last LSN_FROM that is smaller than the supplied argument. Then it will scan the log files to build the list of pages that need to be copied, sorting it by
tablespace number. Many pages will be seen multiple times in the log files, but they still need to be copied only once.

Xtrabackup will not need to enable or disable anything on the server, so multiple backup processes can continue to operate completely independently.

Size Calculation:

Assume we are flushing 100MB/sec (over 8TB/day), which is 6400 pages per second. If they are flushed in 100-page blocks (doublewrite), we will need to write:

64*(8+4+16+100*8) = ~ 53KB/sec or about 4.5GB per day. It also contains about 1/200 of data written from buffer pool to the disk which I consider acceptable overhead.
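The arithmetic for the write rate can be reproduced as follows. The 16KB page size is an assumption (it is what makes 100MB/sec equal 6400 pages per second), and the 8+4+16 header bytes are taken as given since their per-field meaning is not spelled out above.

```python
def tracking_log_rate(flush_bytes_per_sec=100 * 1024 * 1024,
                      page_size=16 * 1024,   # assumed InnoDB page size
                      pages_per_record=100,  # one doublewrite batch
                      header_bytes=8 + 4 + 16,
                      bytes_per_page_entry=8):
    """Return (bytes per second, bytes per day) written to the tracking log."""
    pages_per_sec = flush_bytes_per_sec // page_size      # 6400
    records_per_sec = pages_per_sec // pages_per_record   # 64
    record_bytes = header_bytes + pages_per_record * bytes_per_page_entry  # 828
    per_sec = records_per_sec * record_bytes
    return per_sec, per_sec * 86400


per_sec, per_day = tracking_log_rate()
print(per_sec, per_day)  # 52992 4578508800
```

That is roughly 53KB/sec and about 4.5GB per day, matching the figures in the formula.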

If we consider a more typical example, a 1TB database with about 10GB of data changed per day: 10GB of changes will require some 60MB of tracking data, which, if
incremental backups are done daily over a week, will amount to less than 500MB in total, or 0.05% of the total database size.

Blueprint information

Status: Complete
Approver: Alexey Kopytov
Priority: High
Drafter: Laurynas Biveinis
Direction: Approved
Assignee: Laurynas Biveinis
Definition: Approved
Series goal: Accepted for 5.1
Implementation: Implemented
Milestone target: 5.1.65-14.0
Started by: Laurynas Biveinis
Completed by: Alexey Kopytov

Whiteboard

I am wondering if we should diagnose the maximum untracked log age violations as follows. Since these violations result in holes in an otherwise continuous tracked LSN range, save the start LSN of the current (last) uninterrupted tracking range. Include this value in the SHOW ENGINE INNODB STATUS output. On a maximum age violation, save the last tracked LSN. When tracking resumes, print the hole interval to the error log. This way a DBA can diagnose when the bitmaps become partly unusable due to tracked LSN holes and can also verify that tracking was uninterrupted from the InnoDB status output.


Work Items

Work items:
[hrvojem] Documentation: TODO
