# Finished CUDA port

Merge request against https://gricad-gitlab.univ-grenoble-alpes.fr/schaeffn/xshells by EXT Máté Ferenc Nagy-Egri, 2021-12-04.

This MR enhances the CUDA support of XSHELLS to a point that production runs can benefit from end-to-end device calculation with minimal host intervention.
## High-level overview
- GPU types and code paths have been put into namespaces, under `xs::gpu::`. Where applicable, host-side functions received overloads that take device-side storage as input, e.g.:
```c++
class LinOp3l : public _LinOp3l<LinOp3l> {
...
public:
void apply(xs_array2d<cplx> x, xs_array2d<cplx> y, int lmstart, int lmend);
#ifdef XS_GPU
void apply(xs::gpu::proxy2d<xs::gpu::cplx> x, xs::gpu::proxy2d<xs::gpu::cplx> y, int lmstart, int lmend);
#endif
};
```
- To remain understandable to an ordinary host compiler, such functions immediately invoke the device-side implementation, which is CUDA/HIP code.
```c++
#ifdef XS_GPU
void LinOp3ld::apply(xs::gpu::proxy2d<xs::gpu::cplx> x, xs::gpu::proxy2d<xs::gpu::cplx> y, int lms, int lme)
{
xs::gpu::const_proxy2d<double> M{(int)nelem, dev_Mr(0) + 2};
xs::gpu::LinOp3ld::apply(dev_li, M, x, y, irs, ire, lms, lme, xs::gpu::streamDefault);
}
#endif
```
- Some of the rationale for using namespaces:
- Shedding the library prefix from types and functions that otherwise need it for disambiguation: `xs_array` vs. `xs::array`, where in selected scopes the latter can be referred to simply as `array` via `using namespace xs;`, or by the calling function being in namespace `xs::` to begin with.
- A prevalent example is the pair `cplx` and `xs::gpu::cplx`, which refer to `std::complex<double>` and `double2` respectively. Host-side code most often wants the STL type, and device code the CUDA vector type. There is no need to introduce yet another name for complex: host-side code automatically picks up the STL type, while device-side code (functions residing in `xs::gpu::`) saying `cplx` unqualified picks up the name from the nearest enclosing namespace, which is `xs::gpu::cplx`. No `gpu_cplx` type had to be introduced to disambiguate the two.
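This lookup behavior can be illustrated with a minimal, host-only sketch (the `double2_t` stand-in and `real_part` helpers are illustrative, not actual xshells code):

```c++
#include <complex>

// Host-side alias, mirroring the description above.
using cplx = std::complex<double>;

namespace xs { namespace gpu {
// Stand-in for CUDA's double2 so this sketch compiles without nvcc.
struct double2_t { double x, y; };
using cplx = double2_t;

// Inside xs::gpu, unqualified `cplx` resolves to xs::gpu::cplx.
inline double real_part(cplx z) { return z.x; }
}} // namespace xs::gpu

// Outside, unqualified `cplx` resolves to the global std::complex alias.
inline double real_part(cplx z) { return z.real(); }
```

Both functions spell the type the same way; only the enclosing namespace decides which `cplx` is meant.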
- New [RAII](https://en.cppreference.com/w/cpp/language/raii) types have been introduced to store device-side data.
- The new type `xs::gpu::array2d` models `xs_array2d_mem`.
- The new type `xs::gpu::proxy2d` models `xs_array2d`.
- Contrary to the original classes following an `is-a` relationship (`xs_array2d_mem` is an `xs_array2d` via inheritance), the GPU storage types are distinct, and `xs::gpu::array2d` can cast itself to `xs::gpu::proxy2d`. Detailed rationale can be found at the beginning of the header; quoting here for simplicity.
```c++
/// A simple, lightweight, generic 2D array class that allows efficient access with [i][j], with a possible shift for i.
/// during construction. Requires C++11 compiler. There is no bound check. Mimics the behaviour of T**.
///
/// NOTE: The xs::gpu::array2d class mimics xs_array2d but follows slightly different semantics:
/// - xs::gpu::array2d knows its size (contrary to strides only as in xs_array2d) so it can
/// copy and manage itself without providing external context.
/// - Feeding complex (read: non-built-in) structures containing device pointers to kernels requires
/// passing-by-value to kernels. However xs::gpu::array2d wishes to follow idiomatic RAII-based resource
/// management which conflicts with value semantics and device execution. (Invoking for eg.
/// xs::gpu::kernels::laplP<<<>>>(li, ...) will cause a copy to be made on host-side too, causing the
/// destructor to run.) To fix this a helper class is introduced, xs::gpu::proxy2d which exposes the
/// kernel-side interface but doesn't manage the underlying memory.
/// - Because of this duality, and already having xs_array2d resembling host-side storage, xs::gpu::array2d
/// omits operator[] altogether. Host-side access to device contents in the current design require involving
/// xs_array2d.
/// - xs_array2d supports referencing non-contiguous storage. In contrast xs::gpu::array2d only supports
/// contiguous storage, but does provide a CTOR to consume non-contiguously stored xs_array2d instances.
```
- An alternative approach could have kept inheritance as the basis; it involves [object slicing](https://en.wikipedia.org/wiki/Object_slicing), but that is a dicey part of the language and I wasn't sure that nvcc would properly slice objects at the kernel call-site.
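The owner/proxy split described in the header comment can be sketched host-only as follows (member names, and `std::vector` standing in for a `cudaMalloc`'d buffer, are assumptions made so the pattern is testable without a GPU):

```c++
#include <cstddef>
#include <vector>

namespace xs { namespace gpu {

template <typename T>
struct proxy2d {
    // Non-owning view: cheap to pass by value to a kernel, and destroying
    // the by-value copy frees nothing, avoiding the copy-then-destroy pitfall.
    T* data = nullptr;
    std::size_t stride = 0;
    T* row(std::size_t i) const { return data + i * stride; }
};

template <typename T>
class array2d {
    std::vector<T> storage_;   // the RAII-managed (device) allocation
    std::size_t stride_;
public:
    array2d(std::size_t rows, std::size_t cols)
        : storage_(rows * cols), stride_(cols) {}
    // The owner converts itself to the lightweight, non-owning proxy,
    // mirroring the cast described in the header comment above.
    operator proxy2d<T>() { return proxy2d<T>{storage_.data(), stride_}; }
};

}} // namespace xs::gpu
```

A kernel launch then takes `proxy2d<T>` by value while the `array2d<T>` owner never crosses the host/device boundary.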
- Device-side storage is, as much as possible, anchored in the same place as host-side storage. E.g. `PolTor` has:
```c++
class PolTor final : public Spectral
{
public:
xs_array2d<cplx> Pol;
xs_array2d<cplx> Tor;
#ifdef XS_GPU
xs::gpu::array2d<xs::gpu::cplx> dev_Pol;
xs::gpu::array2d<xs::gpu::cplx> dev_Tor;
#endif
...
};
```
- One notable exception is the base class `Spectral`, which is the only class that owns its data; however, the device-side storage was not written in a manner to store multiple components. This feature proved non-essential: aside from the sole usage in `NL.Su` and `NL.Sb`, which are never used elsewhere but rather function as temporaries, this capability was not implemented. Quote from the `Spectral` source code:
```c++
class Spectral
{
...
xs_array2d<cplx> data; ///< the actual data is stored in an xs_array2d object.
// Mate TODO: consider promoting the scope-local `mem` from Spectral::alloc()
// to a member to own the memory. Currently data is likely leaked and freed
// only upon app termination.
// NOTE: dev_data does not exist, as the xs::gpu::array2d class was not designed to hold multiple components.
// Device data on its own doesn't participate in the utilities provided by this class. Owning the
// device allocated data is handed down to child classes, allocation will happen at init.
```
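The TODO above could be resolved along these lines (a sketch with assumed names; `view2d` stands in for `xs_array2d`):

```c++
#include <complex>
#include <cstddef>
#include <memory>

using cplx = std::complex<double>;

// Minimal non-owning 2D view standing in for xs_array2d.
struct view2d {
    cplx* ptr = nullptr;
    std::size_t stride = 0;
    cplx* operator[](std::size_t i) const { return ptr + i * stride; }
};

class Spectral {
    std::unique_ptr<cplx[]> mem_;  // the promoted owner from the TODO
    view2d data_;                  // non-owning view, as before
public:
    void alloc(std::size_t rows, std::size_t stride) {
        mem_ = std::make_unique<cplx[]>(rows * stride);
        data_ = view2d{mem_.get(), stride};
        // mem_ frees the storage when Spectral is destroyed; nothing leaks.
    }
    const view2d& data() const { return data_; }
};
```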
- The bulk of the GPU code invocation stems from `explicit_terms_gpu`.
- Because GPU device code needed host-side type definitions, some parts of the series of `.cpp` inclusions (more on this later) had to be broken off into headers, e.g. `grid.hpp`'s `OpCurlLapl`, the `IN_RANGE_INCLUSIVE` utility macro, etc. Ideally all function prototypes and type definitions should be in headers with include guards, but reshaping the entire library was outside the scope of the project.
- I did much more than what was minimally needed, for the sake of easier code navigation. Files like `xshells_spectral` and `xshells_linop` have been thoroughly separated into headers and sources.
- The inclusion scheme was kept intact, but the source structure now more closely resembles a library with no dependence on include order.
- Host-side code is kept as `.cpp` files and only the thinnest possible layer has been offloaded into `.cu` files, which do nothing more than kernel launches.
- For every XSHELLS module, for example spectral ops, the following files were created from `xshells_spectral.cpp`:
- `xshells_spectral.hpp` containing all related type definitions and function prototypes
- `xshells_spectral.cpp` containing all function definitions
- `xshells_gpu_spectral.hpp` containing function prototypes of kernel launches. This file is included by `.cu` files, and by `.cpp` files too when `XS_GPU` is defined.
- `xshells_gpu_spectral.kernel.cu` containing the actual kernels.
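The split can be pictured in a single file like this (a sketch: `radsph_launch` is an invented name, and the kernel body is mocked as a plain copy loop so it compiles without nvcc):

```c++
// --- xshells_gpu_spectral.hpp (sketch): launcher prototypes only, valid
// for both host .cpp translation units and device .cu ones.
namespace xs { namespace gpu {
void radsph_launch(const double* in, double* out, int n);
}}

// --- xshells_gpu_spectral.kernel.cu (sketch): the actual kernel plus its
// launcher. Mocked here as a serial loop standing in for a <<<...>>> launch.
namespace xs { namespace gpu {
void radsph_launch(const double* in, double* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i];  // placeholder kernel body
}
}}

// --- xshells_spectral.cpp (sketch): host code includes the .hpp and, when
// XS_GPU is defined, calls the launcher instead of the CPU path.
```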
## GPU kernels
- All inputs-outputs are const-correct.
- GPU kernels logically stay true to their host-side counterparts. Some things which consistently differ:
- All operations are implemented to work in batches. Instead of `PolTor::RadSph` taking a single radial coordinate as input, it takes a range of radial coordinates which it processes.
- This was done to reduce the number of kernel launches and in preparation for multi-GPU implementations.
- GPU kernels are typically indexed as `<<<dim3(ire-irs+1,1), dim3(1,256),...>>>`, where every block processes one radial coordinate (`blockIdx.x`) and `threadIdx.y` identifies a thread within the block processing the `NLM-1` indices.
- GPU kernels use lambda functions instead of macro utilities to declare loops over various directions (`IN_RANGE_INCLUSIVE` may be an exception).
- Boundary conditions are handled in separate kernels to help with code deduplication (deduplication of the inner parts, that is, not of the boundaries).
- Some auxiliary arrays (`r_1` and `l2`) were eliminated from device code to save memory reads and are instead calculated on the fly; others (`li` and `l`) may be eliminated too.
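The launch geometry above can be checked with a quick host-side simulation (a sketch; `visits` is an invented helper):

```c++
#include <cstddef>
#include <vector>

// Host-side simulation of the <<<dim3(ire-irs+1,1), dim3(1,256)>>> geometry:
// every block handles one radial index ir, and the threads in a block's
// y-dimension stride over the lm indices.
std::vector<int> visits(int irs, int ire, int nlm, int blockDimY = 256) {
    const int gridDimX = ire - irs + 1;          // one block per radial shell
    std::vector<int> count(static_cast<std::size_t>(gridDimX) * nlm, 0);
    for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX) {
        const int ir = irs + blockIdxX;          // blockIdx.x -> radial index
        for (int threadIdxY = 0; threadIdxY < blockDimY; ++threadIdxY)
            // stride loop over lm, as the kernel threads would do
            for (int lm = threadIdxY; lm < nlm; lm += blockDimY)
                count[static_cast<std::size_t>(ir - irs) * nlm + lm]++;
    }
    return count;
}
```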
- All kernel launch functions take an `xs::gpu::stream` instance as their last argument, which defaults to `xs::gpu::streamDefault`. Because integration faced far too many challenges, we erred on the side of caution and omitted multi-stream, async kernel handling.
- Asynchrony can be reintroduced simply by returning an event associated with the last operation within. Currently all kernel launch functions return `void` and do not use their return type.
- Some facilities to aid this were left in the code, such as `std::array<std::array<xs::gpu::event, 4>, MAXFIELD> generate_field_events()`. _(See later in notes on having more functions.)_ Should kernel ops initialize the events before kernel launch, this function would not be needed and instead be turned into a clean-up function to release the events, as kernel launches could just assign their return value to a similar array.
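The event-returning idea could look roughly like this (a sketch; the mock `event`/`stream` types and `launch_with_event` are invented, not the actual `xs::gpu` definitions):

```c++
#include <functional>

namespace xs { namespace gpu {
struct event { bool recorded = false; };
struct stream {};

// A launcher that records and returns an event instead of returning void,
// so callers can synchronize on exactly the work they depend on.
inline event launch_with_event(stream& /*s*/,
                               const std::function<void()>& kernel) {
    kernel();               // stands in for the <<<...>>> launch on the stream
    return event{true};     // stands in for cudaEventRecord on the stream
}
}} // namespace xs::gpu
```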
## Build system support
### CMake
CMake build definitions are provided for both SHTns and XSHELLS. This was done early on during development to accelerate the porting work, as IDEs make extensive use of information provided by CMake to light up semantically correct code colorization and code navigation. (For more information refer to [CMake File API](https://cmake.org/cmake/help/latest/manual/cmake-file-api.7.html#manual:cmake-file-api(7)))
- They both use the same set of configuration-time options to enable/disable back-ends and targets for build (last value is the default)
- `option(BUILD_BENCHMARK "Build benchmarks" OFF)`
- `option(BUILD_TESTING "Build unit tests" ON)`
- `option(BUILD_DOCUMENTATION "Build docs" OFF)`
- `option(USE_CUDA "Enable CUDA support" OFF)`
- `option(USE_HIP "Enable HIP support" OFF)`
- `option(USE_OPENMP "Enable OpenMP support" OFF)`
- SHTns provides the following configuration-time options unique to it:
- `option(BUILD_FORTRAN "Build the Fortran77 API" OFF)`
- `option(USE_MAGIC "I need the transforms compatible with the MagIC code, to speed it up!" OFF)`
- `option(USE_ISHIOKA "Enable the new recurrence proposed by Ishioka (2018) see https://doi.org/10.2151/jmsj.2018-019 (faster)" ON)`
- `set(SHT_VERBOSE 0 CACHE STRING "0:no output, 1:output info to stdout, 2:more output (debug info), 3:also print fftw plans.]")`
- XSHELLS provides the following configuration-time options unique to it:
- `option(BUILD_EXAMPLE "Build examples" ON)`
- `option(USE_SIMD "Enable explicit SIMD support" OFF)`
- `option(USE_MPI "Enable MPI support" OFF)`
- For the moment the minimum CMake version required to build is 3.17 for the CUDA back-end, but this can be lowered to CMake 3.8 if it causes trouble.
- The main reason was using `find_package(CUDAToolkit)`, but that isn't strictly necessary.
- The CMake minimum version required for compiling the HIP port is 3.21 (brand new) as proper CMake HIP language support was [just recently merged](https://cmake.org/cmake/help/latest/release/3.21.html#languages).
- The CMake build scripts recreate the same configuration logic as the autoconf-based build and also define build steps for all generated source code with proper interdependence.
- All example problems show as CMake build targets without needing to copy any files (as opposed to the original build).
### Autoconf
- The original autoconf/GNU make based build definition was kept intact.
## HIP support
AMD's single-source GPGPU API named HIP is a shameless knockoff of CUDA, primarily to ease porting to it. With OpenCL failing to gain market share for AMD (although it is now being [reinvigorated](https://www.khronos.org/news/press/khronos-group-releases-opencl-3.0)), they decided to try breaking the CUDA hegemony by creating an API that is as source-compatible with CUDA as reasonably possible.
- SHTns received the shameless "define macro" port from CUDA to HIP. Including one extra header does it all.
```c++
#pragma once
#if defined(HAVE_HIP)
#define cudaAddressModeClamp hipAddressModeClamp
#define cudaBindTexture hipBindTexture
```
- The benefit of this is that code can be written against one API (typically pure CUDA) and not need any rewrite.
- The downside is that because it isn't strictly conforming HIP code, it loses [HIP-CPU](https://github.com/ROCm-Developer-Tools/HIP-CPU) support, a pure host-side (ISO C++17 + OpenMP) implementation of HIP which can be compiled by any conforming host compiler (GCC, Clang, Apple-Clang and MSVC supported). Thus one can use ordinary CPU debuggers to debug kernel code as well as run them on the host without having to cook up a fallback CPU implementation. The GPU code acts as a "poor man's" multi-core host-side implementation.
- The implementation is quite nifty: it uses Parallel STL to launch CUDA blocks, and OpenMP `simd` loops to vectorize within a block across SSE/AVX lanes.
- One can even use it to [throw structured exceptions from device code](https://github.com/ROCm-Developer-Tools/HIP-CPU/issues/1#issuecomment-744556441) to track down NaNs (something which would've been mighty useful when debugging the port, but structured exceptions are an MSVC-only feature).
- Note that this doesn't result in strictly conforming HIP code, as it still results in `<<<>>>` being the kernel launch operator as opposed to HIP's `hipLaunchKernelGGL` macro.
- While HIP code supports being compiled using `nvcc` to produce functioning CUDA code (in this case `hipLaunchKernelGGL` evaporates into `<<<>>>`), we did not use this capability of HIP and the CUDA port has nothing to do with any HIP code.
- XSHELLS received a different type of treatment: there is a macro-free thin API wrapper for both CUDA/HIP, found in `xshells_gpu_runtime.hpp`, and the code is written against this thin wrapper. There is a separate header for the kernel launch operator only, because host-side `.cpp` code needs the API function definitions but won't understand `<<<>>>`; `xshells_gpu_kernel_launch.hpp` is only included in `.cu`/`.hip` files.
- HIP support with these preparations is as simple as single-line source files, e.g. `xshells_gpu_spectral.hip` is a one-liner: `#include "xshells_gpu_spectral.cu"`. This file is only needed so CMake can easily pick up its language based on the extension, and for tooling to light up IntelliSense (code colorization and navigation).
- If these files are irritating, they can be removed and build scripts take care of forcing language dialect (compile `.cu` as HIP source code) via compiler flags.
- Changing between the macro approach and the thin API wrapper is a "find and replace all" type of change. I wanted to demonstrate both so that one can be picked (or kept as is).
- _Important note:_ Production runs using HIP cannot be conducted for the time being, because SHTns does not support WARPSIZE == 64, which is the case for most AMD compute GPUs. (The latest consumer hardware using the RDNA/RDNA2 architectures (RX 5000 and RX 6000 series product-wise, gfx1030 and gfx1031 binary-wise) uses WARPSIZE 32, just like NV GPUs.)
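The thin-wrapper approach can be sketched like so (function names such as `malloc_dev` are assumptions; a third, host-only mock branch is added so the sketch compiles and runs without any GPU toolkit, whereas the real header would contain only the CUDA and HIP branches):

```c++
#include <cstddef>
#include <cstdlib>

namespace xs { namespace gpu {
#if defined(XS_CUDA)
using stream = cudaStream_t;
inline int malloc_dev(void** p, std::size_t n) { return (int)cudaMalloc(p, n); }
inline int free_dev(void* p) { return (int)cudaFree(p); }
#elif defined(XS_HIP)
using stream = hipStream_t;
inline int malloc_dev(void** p, std::size_t n) { return (int)hipMalloc(p, n); }
inline int free_dev(void* p) { return (int)hipFree(p); }
#else  // host mock, for illustration/testing only
using stream = void*;
inline int malloc_dev(void** p, std::size_t n) { *p = std::malloc(n); return *p ? 0 : 1; }
inline int free_dev(void* p) { std::free(p); return 0; }
#endif
}} // namespace xs::gpu
```

Code written against `xs::gpu::malloc_dev` and friends never mentions either vendor API directly, which is what makes the back-end swap a compile-time decision.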
## Suggestions of the contributor
XSHELLS is hands down blazing fast and feature-packed, both from a domain expert's POV as well as from an HPC standpoint. This initial CUDA port doesn't have nearly as much finesse as the hand-vectorized host-side version.
I would propose a list of comments which may be worth considering to improve the developer experience of future contributors:
- XSHELLS really is a library. Much of its functionality could work outside the console application it is today. I would suggest trying to break off the console application from the core simulation logic.
- The include-order-dependent series of `.cpp` files including one another is very hard to understand as an external developer. While the original author may know off the top of their head which type/function is visible in which parts of the code, it is definitely unknown to an external developer. Being able to untangle the code was the primary reason the code reorganization went beyond just getting kernel code to see the types it must.
- Having _a lot_ more functions would increase clarity. Currently there are functions which are multiple hundred lines long, turning into a soup of `#ifdef`s. I understand how difficult it is to avoid this in an application the size of XSHELLS, especially when it is MPI-aware (which is an invasive feature and dissolves into the deepest bowels of the code), but most of the other parts could be cleaned up through functions.
- If code was hard to write, it is hard to understand. Code fragments such as the following illustrate this:
- The part where an array of structs is reinterpreted as an array of its constituents is hard to understand, and moreover undefined behavior.
```c++
struct OpCurlLapl {
double Wl,Wd,Wu,r_1; // lower, diagonal, upper, 1/r
double Ll,Ld,Lu,r_2;
};
...
// This is invoking undefined behavior (invalid pointer arithmetic)
//
// Note: one cannot obtain a pointer to any object, perform pointer arithmetic on it
// and expect to arrive at another object. (Going from CurlLapl_.Wl to any other
// named member of it.) One should treat all objects to reside in their own
// private memory namespace. An address of one object has nothing to do with the
// address of any other object.
_laplP(ir, &CURL_LAPL(ir).Ll, l2, Pol, Tlm, lms, lme); // UB
```
- The only reason this works (as anecdotally explained by LLVM devs at an LLVM summer school in Paris) is that this idiom, although explicitly forbidden by the language, is prevalent in the Linux kernel, and compilers _have_ to play along as long as they wish to have real-world relevance. However, because compilers (especially optimizers) assume the absence of UB during compilation, I would argue that this should be fixed.
- The array `CurlLapl_` could be turned into a struct of two arrays, one containing the first 4 named members and the other the rest. That way one need not reinterpret a member's pointer to avoid loading the entire object onto the cache line. Moreover, reads would have no strides.
- Porting this was further hindered by the fact that `_laplP` correctly slices the struct starting at the second row of members, but variable naming inside the function continues to use `Wl`:
```c++
inline
void _laplP(int ir, const double* Mw, const double* l2_, const xs_array2d<cplx> &Pol, cplx*__restrict T, int lms, int lme) {
...
s2d Wl = vdup(Mw[0]); s2d Wd = vdup(Mw[1]); s2d Wu = vdup(Mw[2]); s2d ri_2 = vdup(Mw[3]);
}
```
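A sketch of that split (the `Curl4`/`Lapl4` names and layout are proposals, not existing xshells code):

```c++
#include <vector>

// The first four members and the last four become two separate, contiguous
// arrays; no cross-member pointer arithmetic (and hence no UB) is needed to
// hand _laplP only the Laplacian coefficients.
struct Curl4 { double Wl, Wd, Wu, r_1; };
struct Lapl4 { double Ll, Ld, Lu, r_2; };

struct CurlLaplSoA {
    std::vector<Curl4> curl;  // one entry per radial index
    std::vector<Lapl4> lapl;  // likewise; reads have no strides
};

// A _laplP-style consumer can now name its coefficients honestly (Ll, not Wl).
inline double apply_lapl(const Lapl4& m, double lo, double mid, double hi) {
    return m.Ll * lo + m.Ld * mid + m.Lu * hi;
}
```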
- This piece of code
```c++
void LinOpR::_alloc(int ir0, int ir1, int _nelem) {
...
data -= ir_bci*nelem; // shift pointer for faster access.
}
```
has caused an immeasurable amount of pain, as it is highly counterintuitive to have a pointer named `data` which points nowhere and cannot be fed to `cudaMemcpy()` et al. Moreover, this pointer cannot even be freed; it is actually leaked and only reclaimed upon program termination.
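One possible remedy (a sketch, not existing xshells code): keep the owning allocation and the shift as separate fields, computing the shifted address only inside the accessor. The base pointer stays valid for `cudaMemcpy`-style APIs and for freeing, and no out-of-range pointer is ever materialized.

```c++
#include <cstddef>
#include <vector>

struct ShiftedRows {
    std::vector<double> base;  // owns the storage; base.data() stays valid
    int ir0;                   // first valid radial index (the shift)
    int nelem;                 // elements per radial index
    ShiftedRows(int ir_first, int ir_last, int nelem_)
        : base(static_cast<std::size_t>(ir_last - ir_first + 1) * nelem_),
          ir0(ir_first), nelem(nelem_) {}
    // Fast access at shifted indices, without a dangling `data` pointer.
    double* row(int ir) {
        return base.data() + static_cast<std::size_t>(ir - ir0) * nelem;
    }
};
```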
- Having most variables be global is nice when values diffuse into all contexts, but makes code hard to understand.
- It also conflicts with RAII of device storage, because their destructors run _after_ `main` has terminated and the CUDA DLL has been unloaded from memory. When `libcuda` is linked dynamically, all `cudaFree()` calls return errors; when linked statically, the errors vanish.
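One way out (a sketch; `dev_buffer`, `acquire()` and `release()` are invented names, and a counter mocks the device free) is an explicit release at the end of `main`, leaving the global's destructor a harmless no-op:

```c++
struct dev_buffer {
    void* p = nullptr;
    static int frees;                 // counts frees; stands in for cudaFree
    void acquire() { p = &frees; }    // stands in for cudaMalloc
    void release() { if (p) { ++frees; p = nullptr; } }
    ~dev_buffer() { release(); }      // no-op after an explicit release
};
int dev_buffer::frees = 0;

dev_buffer g_field;  // a global device-owning object, as in the MR's scenario
```

Calling `g_field.release()` as the last statement of `main` frees device memory while the runtime is still loaded, regardless of how `libcuda` is linked.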