# Checkpoints to restore a simulation state
## Issue
Currently there is no simple way to restart a simulation from a given checkpoint. HDF5 I/O makes it possible to dump and load field data for given topology states, but the user still has to manually load all fields and reconfigure the simulation parameters (`t`, `dt`, `niter`) and the other dynamic parameters of the simulation. We should implement something more user-friendly and less error-prone.
## Checkpoints
The general idea is to automatically dump the simulation state, parameters and discrete field data to some kind of blob file, to be able to continue a simulation from a given point. As some operators accumulate or store data in custom data structures (i.e. not only in parameters and discrete fields), we should also provide a way to export and restore operator state (see the sketch below).

We assume that a given problem generates the same graph of operators, or at least that the input and output variables of a given problem have the same topology states (local transposition state and memory ordering). Under this assumption, we can load and store data without even taking topology states into account.
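For operator state, a pair of symmetric export/import hooks would be enough. Below is a minimal sketch, assuming hypothetical `save_checkpoint`/`load_checkpoint` method names and a zarr-like data group; none of this is the final API:

```python
# Hypothetical operator-state hooks; the method names and the datagroup
# interface are illustrative assumptions, not the final hysop API.
class MyCustomOperator(object):
    def __init__(self):
        # Custom state that lives neither in parameters nor in discrete fields.
        self._accumulator = 0.0

    def save_checkpoint(self, datagroup):
        # Export custom operator state into the checkpoint blob.
        datagroup.attrs['accumulator'] = self._accumulator

    def load_checkpoint(self, datagroup):
        # Restore custom operator state from a previously saved checkpoint.
        self._accumulator = datagroup.attrs['accumulator']
```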
Checkpointing should provide the following features:
- Simple to use with existing problems.
- Exported data should consist of a single file.
- Array and parameter data should be stored compressed.
- MPI support.
- I/O should be collective and lock-free when multiple processes are spawned.
- Allow reloading a checkpoint with relaxed constraints, so that datatype, boundary conditions, ghost count and process distribution can change between subsequent runs.
The best way to achieve this seems to be the `zarr` module, which depends on `numcodecs` for compression.
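As an illustration, a checkpoint dump with zarr could look like the following minimal sketch (the group layout, field names, shapes and compressor settings are assumptions, not the final on-disk format):

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Open the checkpoint store; the resulting directory can later be packed
# into the single tar file handed to --save-checkpoint.
root = zarr.open_group('/tmp/hysop/checkpoint.zarr', mode='w')

# Simulation state and parameters go into group attributes.
root.attrs['t'] = 0.5
root.attrs['dt'] = 1e-3
root.attrs['niter'] = 500

# Discrete field data is stored compressed, chunked along the axis used for
# the process distribution so that each MPI process can write its own
# chunks collectively, without locking.
compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
fields = root.create_group('fields')
vx = fields.create_dataset('velocity_x',
                           shape=(256, 256, 256),
                           chunks=(64, 256, 256),  # one slab per process
                           dtype=np.float32,
                           compressor=compressor)
vx[0:64, ...] = np.zeros((64, 256, 256), dtype=np.float32)  # local slab
```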
When the underlying operator implementations are close enough, it should be possible to launch a simulation in single precision on a single GPU and finish it in double precision on CPU on a cluster with 32 processes.
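Reloading with relaxed constraints then amounts to casting stored arrays to whatever the new run requires; a sketch under the same assumed layout as above:

```python
import numpy as np
import zarr

# Reopen the checkpoint written above and cast the stored single precision
# data to the double precision requested by the new run.
root = zarr.open_group('/tmp/hysop/checkpoint.zarr', mode='r')
stored = root['fields/velocity_x']        # float32 from the GPU run
data = stored[...].astype(np.float64)     # float64 for the CPU run
t, dt = root.attrs['t'], root.attrs['dt']  # restored simulation state
```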
## Usage
It should be as simple as:

```python
problem.solve(simu, checkpoint_handler=args.checkpoint_handler)
```
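For scripts that do not go through the example argument parser, the handler can be built by hand. This is a hedged sketch: the `CheckpointHandler` import path and constructor arguments below are assumptions mirroring the command-line options that follow, not a confirmed signature:

```python
# Assumed import path and constructor arguments; the real signature may differ.
from hysop.core.checkpoints import CheckpointHandler

checkpoint_handler = CheckpointHandler(
    load_checkpoint_path='checkpoint.tar',  # resume from this file, or None
    save_checkpoint_path='checkpoint.tar',  # where to dump new checkpoints
    relax_constraints=True)                 # allow dtype/topology changes

problem.solve(simu, checkpoint_handler=checkpoint_handler)
```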
The example argument interface is extended with:
- `-L [LOAD_CHECKPOINT], --load-checkpoint [LOAD_CHECKPOINT]`: begin the simulation from this checkpoint.
- `-S [SAVE_CHECKPOINT], --save-checkpoint [SAVE_CHECKPOINT]`: save simulation checkpoints to this tar file.
- `--checkpoint-dump-dir`: custom input/output directory for checkpoints (defaults to `dump_dir`).
- `--checkpoint-dump-freq`: custom dump frequency for checkpoints.
- `--checkpoint-dump-period`: custom dump period for checkpoints.
- `--checkpoint-dump-times`: custom dump times for checkpoints.
- `--checkpoint-dump-tstart`: custom start time for checkpoints.
- `--checkpoint-dump-tend`: custom end time for checkpoints.
- `--checkpoint-dump-last`: enable checkpoint dump on the last iteration.
- `--checkpoint-compression-method`: manually specify the compression method.
- `--checkpoint-compression-level`: manually specify the compression level.
- `--checkpoint-relax-constraints`: allow a change of datatype, boundary conditions, ghost count and topology shape when reloading a checkpoint. Useful to continue a simulation with a different precision, compute backend, or on another cluster. Note that changing the boundary conditions will not always work, as they affect the effective global grid size.
Practical example:

```bash
python2.7 ./hysop_examples/examples/taylor_green/taylor_green.py --dump-dir /tmp/hysop -S checkpoint.tar --tend 1.0 -d32 --checkpoint-dump-last
python2.7 ./hysop_examples/examples/taylor_green/taylor_green.py --dump-dir /tmp/hysop -L checkpoint.tar --tend 2.0 -d32
```
## Miscellaneous
- changed how the MPI Cart topology splits the global grid, to better match the default FFTW MPI splitting
- updated `requirements.txt` with the additional modules `zarr` and `numcodecs`, and added the missing `jsonpickle` module.
- changed the default `io_params` of operators to `io_params=False` to avoid unnecessary directory creation (as a result, `operator.dump_inputs`/`dump_outputs` may now require the specification of `io_params`).
- also removed the automatic generation of the `generated_kernel` directory to clean up dumps.
- added a `dtype` parameter to all timestep criteria so that auto-generated parameters can match `dt.dtype`.
- added `mpi_params` to Problems (relaxed the strict `mpi_params` check for ComputationalGraphs).
- added argument `dump_last` to set the `io_params.with_last` parameter (dump on the last iteration).
- added argument `hdf5_disable_slicing` to disable multiprocess local dumping (<=16 process slabs).
- removed time tracking in HDF5 dumps, to be able to generate reproducible checksums
- removed time tracking in debug dumps for the exact same reason
- removed annoying f2py build warnings
- added a verbosity parameter to the Scales `init_advection_solver`
## Bug fixes
- Fixed an extra `io_params` argument group in `example_utils.py`
- Fixed `io_params` that were not correctly generated in the graph builder
- Fixed all examples to take into account global times of interest, the `debug_dumper` and checkpoints
- Fixed missing calls to `_setup_parameters` in some example argparsers, which led to a missing `args.io_params`
- Fixed a time-of-interest tolerance issue
- Fixed bionic Dockerfile dependencies: openmpi, hdf5, pyopencl, python-flint and jsonpickle
- Fixed CMake FindPython to only match Python 2.7 in `hysop/cmake/FindPythonFull.cmake`.
## Continuous integration
- added an Ubuntu focal docker image (20.04 LTS); this is the new reference image for CI
- updated `ci/docker_images/ubuntu/bionic/Dockerfile` to include the new dependencies
- updated docker images to use a statically built FFTW (for pyfftw and the hysop Fortran backend)
- updated the experimental Intel OpenCL platform (oclcpuexp-2020.10.6.0.4 and TBB v2020.3)
- the CMake option `-DFFTW_DIR` in `hysop/cmake/FindFFTW.cmake` becomes `-DFFTW_ROOT`, to reflect CMake policy CMP0074
- docker images now define `MPI_ROOT` and `FFTW_ROOT` as environment variables (CMP0074)
## TODO
- Command-line arguments
- Checkpoint export
- Checkpoint import
- Relaxed data import
- Simulation
- Parameters
- Discrete fields
- Operators
- MPI support (assuming all processes share array data)
- add tests
- update CI
- cleanup