Limit UNDO space growth
There exists a need to place some limits on old or rogue transactions which would consume UNDO log space up to the point of running out of actual filesystem space.
Blueprint information
- Status:
- Not started
- Approver:
- None
- Priority:
- Low
- Drafter:
- George Ormond Lorch III
- Direction:
- Needs approval
- Assignee:
- George Ormond Lorch III
- Definition:
- Drafting
- Series goal:
- None
- Implementation:
- Unknown
- Milestone target:
- None
- Started by
- Completed by
Related branches
Related bugs
Sprints
Whiteboard
Limiting UNDO space poses some not so obvious challenges.
First, there is no global or master ‘count’ of the current size or utilization. It must be manually calculated by:
Walking the system list of REDO segments (rseg)
Obtaining the rseg mutex
Walking the rsegs 4 different UNDO segment (useg) lists which are the used_insert, used_update, unused_insert and unused_update lists.
Adding up the number of pages into various values for total, used and unused.
This can become quite time consuming to perform for every insert, update/delete operation and would put unnecessary pressure on the rseg mutexes, thus a running counter must be implemented to solve this.
Second, due to the depth of the call stack where the actual detection of ‘too big’ would occur and numerous locks obtained in the process, there is no practical way to do anything other than fail the attempt to obtain more UNDO records. There is no reasonable way to try to perform an in place ‘clean up’ of old transactions in order to free up UNDO space for the current transaction to continue. This would also potentially stall the current transaction until there was enough space freed up to continue.
Third, purge interaction. One of the potential causes for UNDO growth is a phenomenon known as ‘purge lag’. There has been much written about this and some attempts to correct it. If the system is experiencing moderate to severe purge lag, there is very little that can be done to prevent run away UNDO growth without impacting current user transactions. It is strongly suggested that before attempting to limit UNDO via any other means, the purge lag must be first diagnosed and addressed. References:
http://
http://
http://
Solution:
There may betwo or three components to the total solution with two of those components operating somewhat independently and may be used together or individually:
1. Required: Implement some global counter values to accurately track the total number of allocated pages and the 4 different types of use within the allocated size as described above.
2. Optional: UNDO soft limit
2.1. Create a new InnoDB variable ‘innodb_
2.2. A background worker thread will be introduced:
2.2.1. When the used UNDO page count reaches the innodb_
2.2.2 The worker thread will then begin looking through the system transaction list, from oldest to newest for transactions that may qualify as undesirable. These include:
2.2.2.1. Old (definition of 'old' TBD) transactions that have performed some DML operations which have introduced UNDO records and have not been committed or rolled back yet.
2.2.2.2. Old (definition of 'old' TBD) transactions that are read-only with a consistent view that are holding UNDO records hostage (and also preventing the purge thread from running).
2.2.3. If purge lag is occurring, the worker thread will only target older read-only transactions which are holding UNDO records and not transactions with active DML operations pending as this would only make purge lag worse and potentially impact real, running transactions.
2.2.4. When an undesirable transaction has been discovered, it will be reported to the logging system and will be terminated.
2.2.5. This will continue until the combined UNDO utilization counters fall to or below the innodb_
2.3. If the innodb_
3. Required: UNDO hard limit
3.1. Create a new InnoDB variable ‘innodb_
3.2. When UNDO runs out of cached pages and needs to obtain more pages to grow, it will compare the total number of allocated pages described above with the innodb_
3.3. If the innodb_
3.4. If at some point the server is restarted with a lower value for innodb_
When used together with proper purge management, these features can offer both a soft control to gently try to keep the UNDO space constrained and a hard control to stop excessive growth in its tracks and possibly allow purging to keep up (at the expense of application usability). Each system will need to find the proper balance between these values, innodb_