# Checkpoints to restore a simulation state
## Issue
Currently there is no simple way to restart a simulation from a given checkpoint. HDF5 I/O makes it possible to dump and load field data for given topology states, but the user still has to manually load all fields and reconfigure the simulation parameters (`t`, `dt`, `niter`) and the other dynamic parameters of the simulation. We should implement something more user-friendly and less error-prone.
## Checkpoints
The general idea is to automatically dump the simulation state, parameters and discrete field data to some kind of blob file, to be able to continue a simulation from a given point. As some operators accumulate or store data in custom data structures (i.e. not only in parameters and discrete fields), we should also provide a way to export and restore operator state (see the sketch below).

We assume that a given problem generates the same graph of operators, or at least that the input and output variables of a given problem have the same topology states (local transposition state and memory ordering). Under this assumption, we can load and store data without even taking topology states into account.
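For operator state, a pair of symmetric export/import hooks would be enough. Below is a minimal sketch, assuming hypothetical `save_checkpoint`/`load_checkpoint` method names and a zarr-like data group; none of this is the final API:

```python
# Hypothetical operator-state hooks; the method names and the datagroup
# interface are illustrative assumptions, not the final hysop API.
class MyCustomOperator(object):
    def __init__(self):
        # Custom state that lives neither in parameters nor in discrete fields.
        self._accumulator = 0.0

    def save_checkpoint(self, datagroup):
        # Export custom operator state into the checkpoint blob.
        datagroup.attrs['accumulator'] = self._accumulator

    def load_checkpoint(self, datagroup):
        # Restore custom operator state from a previously saved checkpoint.
        self._accumulator = datagroup.attrs['accumulator']
```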
Checkpointing should provide the following features:
- Simple to use with existing problems.
- Exported data should consist of a single file.
- Array and parameter data should be stored compressed.
- MPI support.
- I/O should be collective and lock-free when multiple processes are spawned.
- Allow reloading a checkpoint with relaxed constraints, so that datatype, boundary conditions, ghost count and process distribution can change between subsequent runs.
The best way to achieve this seems to be the `zarr` module, which depends on `numcodecs` for compression.
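As an illustration, a checkpoint dump with zarr could look like the following minimal sketch (the group layout, field names, shapes and compressor settings are assumptions, not the final on-disk format):

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Open the checkpoint store; the resulting directory can later be packed
# into the single tar file handed to --save-checkpoint.
root = zarr.open_group('/tmp/hysop/checkpoint.zarr', mode='w')

# Simulation state and parameters go into group attributes.
root.attrs['t'] = 0.5
root.attrs['dt'] = 1e-3
root.attrs['niter'] = 500

# Discrete field data is stored compressed, chunked along the axis used for
# the process distribution so that each MPI process can write its own
# chunks collectively, without locking.
compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
fields = root.create_group('fields')
vx = fields.create_dataset('velocity_x',
                           shape=(256, 256, 256),
                           chunks=(64, 256, 256),  # one slab per process
                           dtype=np.float32,
                           compressor=compressor)
vx[0:64, ...] = np.zeros((64, 256, 256), dtype=np.float32)  # local slab
```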
When the underlying operator implementations are close enough, it should be possible to launch a simulation in single precision on a single GPU and finish it in double precision on CPU on a cluster with 32 processes.
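Reloading with relaxed constraints then amounts to casting stored arrays to whatever the new run requires; a sketch under the same assumed layout as above:

```python
import numpy as np
import zarr

# Reopen the checkpoint written above and cast the stored single precision
# data to the double precision requested by the new run.
root = zarr.open_group('/tmp/hysop/checkpoint.zarr', mode='r')
stored = root['fields/velocity_x']        # float32 from the GPU run
data = stored[...].astype(np.float64)     # float64 for the CPU run
t, dt = root.attrs['t'], root.attrs['dt']  # restored simulation state
```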
## Usage
It should be as simple as:

```python
problem.solve(simu, checkpoint_handler=args.checkpoint_handler)
```
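For scripts that do not go through the example argument parser, the handler can be built by hand. This is a hedged sketch: the `CheckpointHandler` import path and constructor arguments below are assumptions mirroring the command-line options that follow, not a confirmed signature:

```python
# Assumed import path and constructor arguments; the real signature may differ.
from hysop.core.checkpoints import CheckpointHandler

checkpoint_handler = CheckpointHandler(
    load_checkpoint_path='checkpoint.tar',  # resume from this file, or None
    save_checkpoint_path='checkpoint.tar',  # where to dump new checkpoints
    relax_constraints=True)                 # allow dtype/topology changes

problem.solve(simu, checkpoint_handler=checkpoint_handler)
```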
The example argument interface is extended with:
- `-L [LOAD_CHECKPOINT], --load-checkpoint [LOAD_CHECKPOINT]`: begin the simulation from this checkpoint.
- `-S [SAVE_CHECKPOINT], --save-checkpoint [SAVE_CHECKPOINT]`: save simulation checkpoints to this tar file.
- `--checkpoint-dump-dir`: custom input/output directory for checkpoints (defaults to `dump_dir`).
- `--checkpoint-dump-freq`: custom dump frequency for checkpoints.
- `--checkpoint-dump-period`: custom dump period for checkpoints.
- `--checkpoint-dump-times`: custom dump times for checkpoints.
- `--checkpoint-dump-tstart`: custom start time for checkpoints.
- `--checkpoint-dump-tend`: custom end time for checkpoints.
- `--checkpoint-dump-last`: enable checkpoint dump on the last iteration.
- `--checkpoint-compression-method`: manually specify the compression method.
- `--checkpoint-compression-level`: manually specify the compression level.
- `--checkpoint-relax-constraints`: allow a change of datatype, boundary conditions, ghost count and topology shape when reloading a checkpoint. Useful to continue a simulation with a different precision, compute backend, or on another cluster. Note that changing the boundary conditions will not always work, as they affect the effective global grid size.
Practical example:

```bash
python2.7 ./hysop_examples/examples/taylor_green/taylor_green.py --dump-dir /tmp/hysop -S checkpoint.tar --tend 1.0 -d32 --checkpoint-dump-last
python2.7 ./hysop_examples/examples/taylor_green/taylor_green.py --dump-dir /tmp/hysop -L checkpoint.tar --tend 2.0 -d32
```
## Miscellaneous
- changed how the MPI Cart topology splits the global grid, to better match the default FFTW MPI splitting
- updated `requirements.txt` with the additional modules `zarr` and `numcodecs`, and added the missing `jsonpickle` module.
- changed the default `io_params` of operators to `io_params=False` to avoid unnecessary directory creation (as a result, `operator.dump_inputs`/`dump_outputs` may now require the specification of `io_params`).
- also removed the automatic generation of the `generated_kernel` directory to clean up dumps.
- added a `dtype` parameter to all timestep criteria so that auto-generated parameters can match `dt.dtype`.
- added `mpi_params` to Problems (relaxed the strict `mpi_params` check for ComputationalGraphs).
- added argument `dump_last` to set the `io_params.with_last` parameter (dump on the last iteration).
- added argument `hdf5_disable_slicing` to disable multiprocess local dumping (<=16 process slabs).
- removed time tracking in HDF5 dumps, to be able to generate reproducible checksums
- removed time tracking in debug dumps for the exact same reason
- removed annoying f2py build warnings
- added a verbosity parameter to the Scales `init_advection_solver`
## Bug fixes
- Fixed an extra `io_params` argument group in `example_utils.py`
- Fixed `io_params` that were not correctly generated in the graph builder
- Fixed all examples to take into account global times of interest, the `debug_dumper` and checkpoints
- Fixed missing calls to `_setup_parameters` in some example argparsers, which led to a missing `args.io_params`
- Fixed a time-of-interest tolerance issue
- Fixed bionic Dockerfile dependencies: openmpi, hdf5, pyopencl, python-flint and jsonpickle
- Fixed CMake FindPython to only match Python 2.7 in `hysop/cmake/FindPythonFull.cmake`.
## Continuous integration
- added an Ubuntu focal docker image (20.04 LTS); this is the new reference image for CI
- updated `ci/docker_images/ubuntu/bionic/Dockerfile` to include the new dependencies
- updated docker images to use a statically built FFTW (for pyfftw and the hysop Fortran backend)
- updated the experimental Intel OpenCL platform (oclcpuexp-2020.10.6.0.4 and TBB v2020.3)
- the CMake option `-DFFTW_DIR` in `hysop/cmake/FindFFTW.cmake` becomes `-DFFTW_ROOT`, to reflect CMake policy CMP0074
- docker images now define `MPI_ROOT` and `FFTW_ROOT` as environment variables (CMP0074)
## TODO
- Command-line arguments
- Checkpoint export
- Checkpoint import
- Relaxed data import
- Simulation
- Parameters
- Discrete fields
- Operators
- MPI support (assuming all processes share array data)
- add tests
- update CI
- cleanup