Checkpointing Update
A new checkpointing method:
- Write the current dump number to the checkpointed FLML
- Restart output from this number (this overwrites the last output with the initial conditions of the restarted run, but the two should be identical)
- Do not rename the checkpointed output; just continue from the last output
- Reopen the statfile, rewind it to the correct time and restart from that point
The result is a set of output files that is indistinguishable from a run that wasn't checkpointed.
Blueprint information
- Status: Not started
- Approver: None
- Priority: Undefined
- Drafter: Jon Hill
- Direction: Needs approval
- Assignee: None
- Definition: New
- Series goal: None
- Implementation: Unknown
- Milestone target: None
Whiteboard
The current checkpointing algorithm is:
- At predetermined times, based on the number of outputs, write a checkpoint
- The checkpoint consists of the VTUs necessary to recreate the run and an FLML file which contains the options
- The FLML renames the output to append _checkpoint to the output name
- The rename_checkpoint script can be used to standardise names back to the original
Whilst this approach works, it is cumbersome. For example, HECToR's 12 hour limit on small
or large jobs means that several checkpoints may be necessary to finish a full simulation, leading to the
name original_
reverse several times to get the VTUs in order. The stat file is not amalgamated (though see the rename_checkpoint branch for
a fix for this).
Another method of checkpointing might be:
- Write the current dump number to the checkpointed FLML
- Restart output from this number (this overwrites the last output with the initial conditions of the restarted run, but the two should be identical)
- Do not rename the checkpointed output; just continue from the last output
- Reopen the statfile, rewind it to the correct time and restart from that point
The result is a set of output files that is indistinguishable from a run that wasn't checkpointed.
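The statfile rewind step could look something like the sketch below. It is only an illustration, not the Fluidity implementation: it assumes a plain-text stat file with one row per timestep, the simulation time in the first whitespace-separated column, and header lines starting with "#". The function name rewind_statfile and the tol parameter are hypothetical. Rows up to and including the restart time are kept; whether the row at the restart time itself should be kept depends on whether the restarted run writes it again.

```python
def rewind_statfile(path, restart_time, tol=1e-12):
    """Truncate a text stat file after the last row with time <= restart_time.

    Assumed (hypothetical) layout: header lines start with '#', and the first
    column of every data row is the simulation time.
    Returns the number of data rows kept.
    """
    kept_rows = 0
    keep_until = 0  # byte offset at which the file will be cut
    with open(path, "rb+") as f:
        while True:
            line = f.readline()
            if not line:
                break  # reached the end of the file, nothing to discard
            stripped = line.strip()
            if stripped and not stripped.startswith(b"#"):
                time = float(stripped.split()[0])
                if time > restart_time + tol:
                    break  # this row and everything after it is discarded
                kept_rows += 1
            keep_until = f.tell()  # end of the last line we are keeping
        f.truncate(keep_until)
    return kept_rows
```

After the call the file can be reopened in append mode, so the restarted run continues the same stat file rather than starting a new one.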
[Aside, but a nice feature: binary stat files, as the stat files from long simulations are likely to be large. Given that we are
rewinding the statfile, and rewinding a binary file is possibly harder, it might be worth implementing binary stat files before
the revised checkpointing scheme]