Limit UNDO space growth

Registered by George Ormond Lorch III on 2013-07-26

There exists a need to place some limits on old or rogue transactions which would consume UNDO log space up to the point of running out of actual filesystem space.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Low
Drafter:
George Ormond Lorch III
Direction:
Needs approval
Assignee:
George Ormond Lorch III
Definition:
Drafting
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

Limiting UNDO space poses some not so obvious challenges.

First, there is no global or master ‘count’ of the current size or utilization. It must be manually calculated by:
    Walking the system list of REDO segments (rseg)
    Obtaining the rseg mutex
    Walking the rsegs 4 different UNDO segment (useg) lists which are the used_insert, used_update, unused_insert and unused_update lists.
    Adding up the number of pages into various values for total, used and unused.

This can become quite time consuming to perform for every insert, update/delete operation and would put unnecessary pressure on the rseg mutexes, thus a running counter must be implemented to solve this.

Second, due to the depth of the call stack where the actual detection of ‘too big’ would occur and numerous locks obtained in the process, there is no practical way to do anything other than fail the attempt to obtain more UNDO records. There is no reasonable way to try to perform an in place ‘clean up’ of old transactions in order to free up UNDO space for the current transaction to continue. This would also potentially stall the current transaction until there was enough space freed up to continue.

Third, purge interaction. One of the potential causes for UNDO growth is a phenomenon known as ‘purge lag’. There has been much written about this and some attempts to correct it. If the system is experiencing moderate to severe purge lag, there is very little that can be done to prevent run away UNDO growth without impacting current user transactions. It is strongly suggested that before attempting to limit UNDO via any other means, the purge lag must be first diagnosed and addressed. References:
    http://dimitrik.free.fr/blog/archives/06-01-2009_06-30-2009.html#61
    http://www.mysqlperformanceblog.com/2010/06/10/reasons-for-run-away-main-innodb-tablespace
    http://www.mysqlperformanceblog.com/2010/06/10/purge-thread-spira-of-death

Solution:
There may betwo or three components to the total solution with two of those components operating somewhat independently and may be used together or individually:

1. Required: Implement some global counter values to accurately track the total number of allocated pages and the 4 different types of use within the allocated size as described above.
2. Optional: UNDO soft limit
 2.1. Create a new InnoDB variable ‘innodb_undo_soft_limit’, measured in pages, which defines an upper threshold for used (not allocated) UNDO pages.
 2.2. A background worker thread will be introduced:
  2.2.1. When the used UNDO page count reaches the innodb_undo_soft_limit, the worker thread will analyze the system for the presence of purge lag (Algorihm TBD).
  2.2.2 The worker thread will then begin looking through the system transaction list, from oldest to newest for transactions that may qualify as undesirable. These include:
   2.2.2.1. Old (definition of 'old' TBD) transactions that have performed some DML operations which have introduced UNDO records and have not been committed or rolled back yet.
   2.2.2.2. Old (definition of 'old' TBD) transactions that are read-only with a consistent view that are holding UNDO records hostage (and also preventing the purge thread from running).
  2.2.3. If purge lag is occurring, the worker thread will only target older read-only transactions which are holding UNDO records and not transactions with active DML operations pending as this would only make purge lag worse and potentially impact real, running transactions.
   2.2.4. When an undesirable transaction has been discovered, it will be reported to the logging system and will be terminated.
   2.2.5. This will continue until the combined UNDO utilization counters fall to or below the innodb_undo_soft_limit or no more transactions can be identified.
 2.3. If the innodb_undo_soft_limit is 0, this function is disabled. Default = 0.
3. Required: UNDO hard limit
 3.1. Create a new InnoDB variable ‘innodb_undo_hard_limit’, measured in pages, which defines a maximum upper threshold for all allocated UNDO pages.
 3.2. When UNDO runs out of cached pages and needs to obtain more pages to grow, it will compare the total number of allocated pages described above with the innodb_undo_hard_limit. If the innodb_undo_hard_limit has been reached, the attempt to obtain any new pages will fail and an error will be propagated all the way back out to the client.
 3.3. If the innodb_undo_hard_limit is 0, this function is disabled.
 3.4. If at some point the server is restarted with a lower value for innodb_undo_hard_limit than before or is changed dynamically, the exact same logic described above will be used. UNDO space WILL NOT be compacted or reduced in any way, this is simply a constraint on new growth, not an active management function. Default = 0.

When used together with proper purge management, these features can offer both a soft control to gently try to keep the UNDO space constrained and a hard control to stop excessive growth in its tracks and possibly allow purging to keep up (at the expense of application usability). Each system will need to find the proper balance between these values, innodb_max_purge_lag, innodb_use_purge_thread the available resources and the transaction load on the system. A possible starting place would be to define what absolute maximum you want to apply to the innodb_undo_hard_limit, then make innodb_undo_soft_limit some percentage (80%?) of that value. This will allow temporary spikes in usage to occur up to the hard limit while still maintaining a reasonable ability to process user transactions as old transactions are cleared away and purging catches up.

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.

Subscribers

No subscribers.