Checkpointing Update

Registered by Jon Hill

A new checkpointing method:
 - Write the current dump number to the checkpointed FLML
 - Restart output from this number (this overwrites the last output with the initial conditions of the restarted run, but the two
   should be identical)
 - Do not rename the checkpointed output; just continue from the last output
 - Reopen the statfile, rewind it to the correct time and restart from that point

The result is a set of output files that is indistinguishable from a run that wasn't checkpointed.

Blueprint information

Status:
Not started
Approver:
None
Priority:
Undefined
Drafter:
Jon Hill
Direction:
Needs approval
Assignee:
None
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

The current checkpointing algorithm is:

 - At predetermined times, based on the number of outputs, write a checkpoint
 - The checkpoint consists of the VTUs necessary to recreate the run and an FLML file which contains the options
 - The checkpointed FLML appends _checkpoint to the output name
 - The rename_checkpoint script can be used to standardise names back to the original

Whilst this approach works, it is cumbersome. For example, HECToR imposes a 12 hour limit on jobs, small or large,
so several checkpoints may be necessary to finish a full simulation, leading to the
name original_checkpoint_checkpoint_checkpoint_checkpoint_checkpoint_..._checkpoint. The rename_checkpoint script must then be run in
reverse several times to get the VTUs in order. The stat file is not amalgamated (though see the rename_checkpoint branch for
a fix for this).
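
To illustrate how the name grows with each restart, here is a minimal Python sketch. It is not the rename_checkpoint script itself; the helper names restarted_name and strip_checkpoint_suffixes are hypothetical and exist only for this example.

```python
CHECKPOINT_SUFFIX = "_checkpoint"


def restarted_name(simulation_name):
    """Name given to the output of a run restarted from a checkpoint."""
    return simulation_name + CHECKPOINT_SUFFIX


def strip_checkpoint_suffixes(simulation_name):
    """Remove every trailing _checkpoint suffix.

    Returns the original name and the number of suffixes removed,
    i.e. the number of restarts the name has been through.
    """
    restarts = 0
    while simulation_name.endswith(CHECKPOINT_SUFFIX):
        simulation_name = simulation_name[: -len(CHECKPOINT_SUFFIX)]
        restarts += 1
    return simulation_name, restarts


# Five 12 hour jobs chained together by checkpoints
name = "original"
for _ in range(5):
    name = restarted_name(name)
original, restarts = strip_checkpoint_suffixes(name)
```

Each restart adds one more suffix, so getting the files back under one name takes one renaming pass per restart.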

Another method of checkpointing might be:
 - Write the current dump number to the checkpointed FLML
 - Restart output from this number (this overwrites the last output with the initial conditions of the restarted run, but the two
   should be identical)
 - Do not rename the checkpointed output; just continue from the last output
 - Reopen the statfile, rewind it to the correct time and restart from that point

The result is a set of output files that is indistinguishable from a run that wasn't checkpointed.
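
The statfile step could be sketched as follows. This is only an illustration under stated assumptions, not an existing implementation: it assumes a plain-text stat file whose header lines start with "<" and whose data rows are whitespace-separated numbers with the elapsed time in the first column. The function name rewind_statfile and its parameters are hypothetical.

```python
def rewind_statfile(path, restart_time, time_column=0):
    """Truncate a text stat file after the last row at or before restart_time.

    Assumed layout (not taken from the blueprint): header lines start
    with "<", data rows are whitespace-separated numbers.
    Returns the number of data rows kept.
    """
    kept_rows = 0
    keep_until = 0  # byte offset of the end of the last line to keep
    with open(path, "r+b") as statfile:
        while True:
            line = statfile.readline()
            if not line or not line.endswith(b"\n"):
                # End of file, or a partial row left by a killed job
                break
            stripped = line.strip()
            if not stripped or stripped.startswith(b"<"):
                # Header or blank line: always kept
                keep_until = statfile.tell()
                continue
            if float(stripped.split()[time_column]) > restart_time:
                # First row written after the checkpoint: drop it and the rest
                break
            kept_rows += 1
            keep_until = statfile.tell()
        statfile.truncate(keep_until)
    return kept_rows
```

The restarted run would then open the file in append mode and carry on writing rows. Whether the row at exactly the restart time is kept or rewritten is a design choice the list above leaves open; this sketch keeps it.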

[Aside, but a nice feature: binary stat files, as the stat files from long simulations are likely to be large. Given we are
rewinding the statfile, and rewinding a binary file is possibly harder, it might be worth implementing binary stat files
before the revised checkpointing scheme]
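
How hard the binary rewind is depends on the format chosen. As a rough sketch, assuming a hypothetical layout with no header and fixed-width records of little-endian 64-bit floats (none of which is specified in this blueprint), the rewind reduces to counting whole records:

```python
import struct


def rewind_binary_statfile(path, restart_time, n_columns, time_column=0):
    """Truncate a fixed-width binary stat file after restart_time.

    Assumed layout (hypothetical): no header, each record holds
    n_columns little-endian doubles, one of which is the elapsed time.
    Returns the number of records kept.
    """
    record = struct.Struct("<%dd" % n_columns)
    kept = 0
    with open(path, "r+b") as statfile:
        while True:
            raw = statfile.read(record.size)
            if len(raw) < record.size:
                # End of file, or a partial record left by a killed job
                break
            if record.unpack(raw)[time_column] > restart_time:
                break
            kept += 1
        # Every kept record has the same size, so the cut point is known
        statfile.truncate(kept * record.size)
    return kept
```

A format with a variable-length header or variable-width records would need more bookkeeping, which may be the difficulty the aside has in mind.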


Work Items
