
Finished CUDA port

EXT Máté Ferenc Nagy-Egri requested to merge hipify-cherry into cuda-gpu

This MR enhances the CUDA support of XSHELLS to the point where production runs can benefit from end-to-end device calculation with minimal host intervention.

High-level overview

  • GPU types and code paths have been put into namespaces under xs::gpu::. Where applicable, host-side functions received overloads that take device-side storage as input, e.g.
    class LinOp3l : public _LinOp3l<LinOp3l> {
    ...
    public:
      void apply(xs_array2d<cplx> x, xs_array2d<cplx> y, int lmstart, int lmend);
      #ifdef XS_GPU
      void apply(xs::gpu::proxy2d<xs::gpu::cplx> x, xs::gpu::proxy2d<xs::gpu::cplx> y, int lmstart, int lmend);
      #endif
    };
    • To keep the implementation of such functions understandable to an ordinary host compiler, they immediately forward to the actual implementation, which is CUDA/HIP code.
      #ifdef XS_GPU
      void LinOp3ld::apply(xs::gpu::proxy2d<xs::gpu::cplx> x, xs::gpu::proxy2d<xs::gpu::cplx> y, int lms, int lme)
      {
        xs::gpu::const_proxy2d<double> M{(int)nelem, dev_Mr(0) + 2};
        xs::gpu::LinOp3ld::apply(dev_li, M, x, y, irs, ire, lms, lme, xs::gpu::streamDefault);
      }
      #endif
    • Some of the rationale for using namespaces:
      • Library prefixes can be shed from types and functions that otherwise need them for disambiguation: xs_array vs. xs::array, where in selected scopes the latter can be referred to simply as array, either via using namespace xs; or because the calling function lives in namespace xs:: to begin with.
      • A prevalent example is the pair cplx and xs::gpu::cplx, which refer to std::complex<double> and double2 respectively. Host-side code most often wants the STL type, while device code wants the CUDA vector type. No additional name for complex needs to be introduced: host-side code automatically picks up the STL type, and device-side code (functions residing in xs::gpu::) that says cplx unqualified picks up the name from the nearest enclosing namespace, which is xs::gpu::cplx. We did not have to introduce a gpu_cplx type to disambiguate the two. (A sketch of this lookup, together with the owning/non-owning storage pair described in the next item, follows after this list.)
  • New RAII types have been introduced to store device-side data.
    • The new type xs::gpu::array2d models xs_array2d_mem.
    • The new type xs::gpu::proxy2d models xs_array2d.
    • Contrary to the original classes, which follow an is-a relationship (xs_array2d_mem is an xs_array2d via inheritance), the GPU storage types are distinct and xs::gpu::array2d can convert itself to xs::gpu::proxy2d. Detailed rationale can be found at the beginning of the header; quoting here for simplicity.
    /// A simple, lightweight, generic 2D array class that allows efficient access with [i][j], with a possible shift for i
    /// during construction. Requires C++11 compiler. There is no bound check. Mimics the behaviour of T**.
    ///
    /// NOTE: The xs::gpu::array2d class mimics xs_array2d but follows slightly different semantics:
    ///           - xs::gpu::array2d knows its size (contrary to strides only as in xs_array2d) so it can
    ///             copy and manage itself without providing external context.
    ///           - Feeding complex (read: non-built-in) structures containing device pointers to kernels requires
    ///             passing-by-value to kernels. However xs::gpu::array2d wishes to follow idiomatic RAII-based resource
    ///             management which conflicts with value semantics and device execution. (Invoking for eg.
    ///             xs::gpu::kernels::laplP<<<>>>(li, ...) will cause a copy to be made on host-side too, causing the
    ///             destructor to run.) To fix this a helper class is introduced, xs::gpu::proxy2d which exposes the
    ///             kernel-side interface but doesn't manage the underlying memory.
    ///           - Because of this duality, and already having xs_array2d resembling host-side storage, xs::gpu::array2d
    ///             omits operator[] altogether. Host-side access to device contents in the current design requires involving
    ///             xs_array2d.
    ///           - xs_array2d supports referencing non-contiguous storage. In contrast xs::gpu::array2d only supports
    ///             contiguous storage, but does provide a CTOR to consume non-contiguously stored xs_array2d instances.
    • An alternative approach could have kept inheritance as the basis; it involves object slicing, but that is a dicey part of the language and I wasn't sure that nvcc would properly slice objects at the kernel call-site.
  • Device-side storage is, as much as possible, anchored in the same place as the host-side storage. For example, PolTor has:
    class PolTor final : public Spectral
    {
    public:
      xs_array2d<cplx> Pol;
      xs_array2d<cplx> Tor;
    #ifdef XS_GPU
      xs::gpu::array2d<xs::gpu::cplx> dev_Pol;
      xs::gpu::array2d<xs::gpu::cplx> dev_Tor;
    #endif
    ...
    };
    • One notable exception is the base class Spectral, which is the only class that owns its data; however, the device-side storage was not written to hold multiple components. This feature proved non-essential: aside from the sole usage NL.Su and NL.Sb, which are never used elsewhere and merely function as temporaries, the capability was not implemented. Quote from the Spectral source code:
    class Spectral
    {
    ...
    xs_array2d<cplx> data;   ///< the actual data is stored in an xs_array2d object.
                             //   Mate TODO: consider promoting the scope-local `mem` from Spectral::alloc()
                             //   to a member to own the memory. Currently data is likely leaked and freed
                             //   only upon app termination.
    
    // NOTE: dev_data does not exist, as the xs::gpu::array2d class was not designed to hold multiple components.
    //       Device data on its own doesn't participate in the utilities provided by this class. Owning the
    //       device allocated data is handed down to child classes, allocation will happen at init.
  • The bulk of the GPU code invocation stems from explicit_terms_gpu.
  • Because GPU device code needed host-side type definitions, some parts of the series of .cpp inclusions (more on this later) needed to be broken off into headers, for example grid.hpp's OpCurlLapl, the IN_RANGE_INCLUSIVE utility macro, etc. Ideally all function prototypes and type definitions should be in headers with include guards, but reshaping the entire library was outside the scope of the project.
    • I did more than was minimally needed, for the sake of easier code navigation. Files like xshells_spectral and xshells_linop have been thoroughly separated into headers and sources.
    • The inclusion scheme was kept intact, but the source structure now more closely resembles that of a library with no dependence on include order.
  • Host-side code is kept as .cpp files and only the thinnest possible layer has been offloaded into .cu files, which do nothing more than kernel launches.
    • For every XSHELLS module, for example spectral ops, the following files were created from xshells_spectral.cpp:
      • xshells_spectral.hpp containing all related type definitions and function prototypes
      • xshells_spectral.cpp containing all function definitions
      • xshells_gpu_spectral.hpp containing function prototypes of kernel launches. This file is included by .cu files, and by .cpp files too when XS_GPU is defined.
      • xshells_gpu_spectral.kernel.cu containing the actual kernels.
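
To make the namespace lookup and the owning/non-owning storage pair above concrete, here is a minimal sketch. It is not the actual XSHELLS code: the member names, the scale kernel and the exact layout are simplified illustrations of the design described above.

    #include <complex>
    #include <cuda_runtime.h>

    namespace xs {
      using cplx = std::complex<double>;          // host-side complex type

      namespace gpu {
        using cplx = double2;                     // device-side complex type

        // Non-owning view: cheap to pass by value to kernels, no destructor side effects.
        template <typename T>
        struct proxy2d {
          T*  data;
          int stride;
          __device__ T* operator[](int i) const { return data + i * stride; }
        };

        // Owning RAII storage: knows its size, releases device memory on destruction.
        template <typename T>
        class array2d {
          T*  data_ = nullptr;
          int rows_ = 0, cols_ = 0;
        public:
          array2d() = default;
          array2d(int rows, int cols) : rows_(rows), cols_(cols) {
            cudaMalloc(&data_, sizeof(T) * rows * cols);
          }
          ~array2d() { cudaFree(data_); }
          array2d(const array2d&) = delete;       // no accidental copies at kernel call-sites
          array2d& operator=(const array2d&) = delete;

          // Conversion to the kernel-facing view.
          operator proxy2d<T>() const { return proxy2d<T>{data_, cols_}; }
        };

        // Inside xs::gpu an unqualified `cplx` names xs::gpu::cplx (double2)...
        __global__ void scale(proxy2d<cplx> y, double a) {
          cplx& v = y[blockIdx.x][threadIdx.x];
          v.x *= a;
          v.y *= a;
        }
      } // namespace gpu

      // ...while on the host an unqualified `cplx` names std::complex<double>.
      inline cplx half(cplx z) { return z * 0.5; }
    } // namespace xs

In this sketch an xs::gpu::array2d passed to scale<<<...>>> decays into the proxy2d view through the conversion operator, so no host-side copy of the owning object (and no premature destructor run) takes place.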

GPU kernels

  • All inputs and outputs are const-correct.
  • GPU kernels logically stay true to their host-side counterparts. Some things which consistently differ:
    • All operations are implemented to work in batches. Instead of taking a single radial coordinate as input, PolTor::RadSph takes a range of radial coordinates which it processes.
      • This was done to reduce the number of kernel launches and in preparation for multi-GPU implementations.
    • GPU kernels are typically launched as <<<dim3(ire-irs+1,1), dim3(1,256),...>>>, where every block processes one radial coordinate (blockIdx.x) and threadIdx.y identifies the thread within the block which processes the NLM-1 indices. (A minimal launch sketch follows after this list.)
    • GPU kernels use lambda functions instead of macro utilities to declare loops over the various directions (IN_RANGE_INCLUSIVE may be an exception).
  • Boundary conditions are handled in separate kernels to help with code deduplication (of the inner parts, not of the boundaries).
  • Some auxiliary arrays (r_1 and l2) were eliminated from device code to save memory reads and are instead calculated on the fly; others (li and l) may be eliminated too.
  • All kernel launch functions take an xs::gpu::stream instance as their last argument, which defaults to xs::gpu::streamDefault. Because integration faced far too many challenges, we erred on the safe side and omitted multi-stream, asynchronous kernel handling.
    • Asynchrony can be reintroduced simply by returning an event associated with the last operation within a launch function. Currently all kernel launch functions return void and their return value is unused.
    • Some facilities to aid this were left in the code, such as std::array<std::array<xs::gpu::event, 4>, MAXFIELD> generate_field_events(). (See the later note on having more functions.) Should kernel ops initialize the events before kernel launch, this function would not be needed and would instead be turned into a clean-up function to release the events, as kernel launches could then just assign their return values to a similar array.
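
A minimal sketch of the launch convention referenced above, assuming the xs::gpu::proxy2d, xs::gpu::cplx, xs::gpu::stream and xs::gpu::streamDefault facilities described earlier are in scope and that xs::gpu::stream converts to cudaStream_t; the kernel name laplP_batch and its trivial body are purely illustrative, not the actual XSHELLS kernels:

    // Hypothetical batched kernel: one block per radial index, threadIdx.y walks the lm indices.
    __global__ void laplP_batch(xs::gpu::proxy2d<xs::gpu::cplx> Pol,
                                xs::gpu::proxy2d<xs::gpu::cplx> out,
                                int irs, int nlm)
    {
      const int ir = irs + blockIdx.x;                      // radial coordinate handled by this block
      for (int lm = threadIdx.y; lm < nlm; lm += blockDim.y) {
        out[ir][lm] = Pol[ir][lm];                          // placeholder for the real operator
      }
    }

    // Launch wrapper in the style described above: processes the whole [irs, ire] range in one
    // launch and takes a stream as its last argument, defaulting to xs::gpu::streamDefault.
    void launch_laplP(xs::gpu::proxy2d<xs::gpu::cplx> Pol,
                      xs::gpu::proxy2d<xs::gpu::cplx> out,
                      int irs, int ire, int nlm,
                      xs::gpu::stream s = xs::gpu::streamDefault)
    {
      laplP_batch<<<dim3(ire - irs + 1, 1), dim3(1, 256), 0, s>>>(Pol, out, irs, nlm);
    }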

Build system support

CMake

CMake build definitions are provided for both SHTns and XSHELLS. This was done early during development to accelerate the porting work, as IDEs make extensive use of the information provided by CMake to light up semantically correct code colorization and navigation. (For more information, refer to the CMake File API.)

  • They both use the same set of configuration-time options to enable/disable back-ends and targets for the build (the last value in each option is its default):
    • option(BUILD_BENCHMARK "Build benchmarks" OFF)
    • option(BUILD_TESTING "Build unit tests" ON)
    • option(BUILD_DOCUMENTATION "Build docs" OFF)
    • option(USE_CUDA "Enable CUDA support" OFF)
    • option(USE_HIP "Enable HIP support" OFF)
    • option(USE_OPENMP "Enable OpenMP support" OFF)
  • SHTns provides the following configuration-time options unique to it:
    • option(BUILD_FORTRAN "Build the Fortran77 API" OFF)
    • option(USE_MAGIC "I need the transforms compatible with the MagIC code, to speed it up!" OFF)
    • option(USE_ISHIOKA "Enable the new recurrence proposed by Ishioka (2018) see https://doi.org/10.2151/jmsj.2018-019 (faster)" ON)
    • set(SHT_VERBOSE 0 CACHE STRING "0:no output, 1:output info to stdout, 2:more output (debug info), 3:also print fftw plans.]")
  • XSHELLS provides the following configuration-time options unique to it:
    • option(BUILD_EXAMPLE "Build examples" ON)
    • option(USE_SIMD "Enable explicit SIMD support" OFF)
    • option(USE_MPI "Enable MPI support" OFF)
  • For the moment, the minimum CMake version required to build with the CUDA back-end is 3.17, but this can be retrofitted down to CMake 3.8 if it causes trouble.
    • The main reason is the use of find_package(CUDAToolkit), which isn't strictly necessary.
  • The minimum CMake version required for compiling the HIP port is 3.21 (brand new), as proper CMake HIP language support was only recently merged.
  • The CMake build scripts recreate the same configuration logic as the autoconf-based build and also define build steps for all generated source code with proper interdependence.
  • All example problems show up as CMake build targets without needing to copy any files (as opposed to the original build).

Autoconf

  • The original autoconf/GNU make based build definition was kept intact.

HIP support

AMD's single-source GPGPU API named HIP is a shameless knockoff of CUDA, primarily to ease porting to it. With OpenCL failing to gain market share for AMD (although it is now being reinvigorated), they decided to try breaking the CUDA hegemony by creating an API that is as source-compatible with CUDA as reasonably possible.

  • SHTns received the shameless "define macro" port from CUDA to HIP. Including one extra header does it all:
    #pragma once
    
    #if defined(HAVE_HIP)
    #define cudaAddressModeClamp hipAddressModeClamp
    #define cudaBindTexture hipBindTexture
    • The benefit of this is that code can be written against one API (typically pure CUDA) and not need any rewrite.
    • The downside is that, because the result isn't strictly conforming HIP code, it loses HIP-CPU support, a pure host-side (ISO C++17 + OpenMP) implementation of HIP which can be compiled by any conforming host compiler (GCC, Clang, Apple Clang and MSVC are supported). With HIP-CPU one can use ordinary CPU debuggers to debug kernel code, as well as run kernels on the host without having to cook up a fallback CPU implementation. The GPU code acts as a "poor man's" multi-core host-side implementation.
      • The implementation is quite nifty: it uses Parallel STL to launch CUDA blocks, and uses OpenMP simd loops to vectorize within a block for SSE/AVX lanes.
      • One can even use it to throw structured exceptions from device code to track down NaNs (something which would have been mighty useful when debugging the port, but structured exceptions are an MSVC-only feature).
    • Note that this doesn't result in strictly conforming HIP code, as <<<>>> remains the kernel launch operator, as opposed to HIP's hipLaunchKernelGGL macro.
      • While HIP code supports being compiled with nvcc to produce functioning CUDA code (in that case hipLaunchKernelGGL evaporates into <<<>>>), we did not use this capability of HIP, and the CUDA port has nothing to do with any HIP code.
  • XSHELLS received a different type of treatment: there is a macro-free thin API wrapper for both CUDA and HIP, found in xshells_gpu_runtime.hpp, and the code is written against this thin wrapper. There is a separate header for the kernel launch operator only, because host-side .cpp code needs the API function definitions but won't understand <<<>>>; xshells_gpu_kernel_launch.hpp is only included in .cu/.hip files. (A sketch of such a wrapper follows after this list.)
  • HIP support with these preparations is as simple as single-line source files, e.g. xshells_gpu_spectral.hip is a one-liner: #include "xshells_gpu_spectral.cu". This file is only needed so CMake can easily pick up its language based on the extension, and for tooling to light up IntelliSense (code colorization and navigation).
    • If these files are irritating, they can be removed, and the build scripts will take care of forcing the language dialect (compiling .cu as HIP source) via compiler flags.
    • Changing between the macro approach and the thin API wrapper is a "find and replace all" type of change. I wanted to demonstrate both so that one can be picked (or both kept as is).
  • Important note: production runs using HIP cannot be conducted for the time being, because SHTns does not support WARPSIZE == 64, which is the case for most AMD compute GPUs. (The latest consumer hardware using the RDNA/RDNA2 architectures (RX 5000 and RX 6000 series product-wise, gfx1030 and gfx1031 binary-wise) uses WARPSIZE 32, just like NVIDIA GPUs.)
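
For reference, one possible shape of the macro-free wrapper mentioned above; the function names (malloc_dev, free_dev, copy) and the XS_HIP switch are assumptions for this sketch and need not match the actual xshells_gpu_runtime.hpp:

    #pragma once
    #include <cstddef>

    #if defined(XS_HIP)
      #include <hip/hip_runtime.h>
    #else
      #include <cuda_runtime.h>
    #endif

    namespace xs { namespace gpu {
    #if defined(XS_HIP)
      using stream = hipStream_t;
      using error  = hipError_t;
      inline error malloc_dev(void** p, std::size_t bytes) { return hipMalloc(p, bytes); }
      inline error free_dev(void* p)                       { return hipFree(p); }
      inline error copy(void* dst, const void* src, std::size_t bytes)
      { return hipMemcpy(dst, src, bytes, hipMemcpyDefault); }
    #else
      using stream = cudaStream_t;
      using error  = cudaError_t;
      inline error malloc_dev(void** p, std::size_t bytes) { return cudaMalloc(p, bytes); }
      inline error free_dev(void* p)                       { return cudaFree(p); }
      inline error copy(void* dst, const void* src, std::size_t bytes)
      { return cudaMemcpy(dst, src, bytes, cudaMemcpyDefault); }
    #endif
    }} // namespace xs::gpu

Because the back-end switch happens in one header, .cpp files never see a CUDA- or HIP-specific symbol, and only the .cu/.hip translation units (via xshells_gpu_kernel_launch.hpp) need to understand <<<>>>.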

Suggestions of the contributor

XSHELLS is hands down blazing fast and feature packed, both from a domain expert's POV and from an HPC standpoint. This initial CUDA port doesn't have nearly as much finesse as the hand-vectorized host-side version.

I would propose a list of comments which may be worth considering to improve the developer experience of future contributors:

  • XSHELLS really is a library. Much of its functionality could work outside the console application it is today. I would suggest trying to break the console application off from the core simulation logic.
  • The include-order-dependent series of .cpp files including one another is very hard to understand as an external developer. While the original author may know off the top of their head which type/function is visible in which parts of the code, it is definitely unknown to an external developer. Being able to untangle the code was the primary reason why the code reorganization went beyond just getting kernel code to see the types it must.
  • Having a lot more functions would increase clarity. Currently there are functions which are multiple hundred lines long, turning into a soup of #ifdefs. I understand how difficult it is to avoid this with an application the size of XSHELLS, especially when it is MPI-aware (an invasive feature which dissolves into the deepest bowels of the code), but most of the other parts could be cleaned up through functions.
  • If code was hard to write, it is hard to understand. Some code fragments that illustrate this:
    • The part where an array of structs is reinterpreted as an array of its constituents is hard to understand, and moreover undefined behavior:
    struct OpCurlLapl {
      double Wl,Wd,Wu,r_1;           // lower, diagonal, upper, 1/r
      double Ll,Ld,Lu,r_2;
    };
    
    ...
    
    // This is invoking undefined behavior (invalid pointer arithmetic)
    //
    // Note: one cannot obtain a pointer to any object, perform pointer arithmetic on it
    //       and expect to arrive at another object. (Going from CurlLapl_.Wl to any other
    //       named member of it.) One should treat all objects to reside in their own
    //       private memory namespace. An address of one object has nothing to do with the
    //       address of any other object.
    _laplP(ir, &CURL_LAPL(ir).Ll, l2, Pol, Tlm, lms, lme); // UB
    • The only reason this works (as anecdotally explained by LLVM devs at an LLVM summer school in Paris) is that this idiom, although explicitly forbidden by the language, is prevalent in the Linux kernel, and compilers have to play along as long as they wish to have real-world relevance. However, because compilers (especially optimizers) assume the absence of UB during compilation, I would argue that this should be fixed.
      • The array CurlLapl_ could be turned into a struct of two arrays, one containing the first 4 named members and the other the rest. That way one need not reinterpret the pointer of a member in order to avoid loading the entire object onto the cache line; moreover, reads would have no strides. (See the sketch after this list.)
    • Porting this was further hindered by the fact that _laplP correctly slices the struct starting at the second row of members, but the variable naming inside the function continues to use Wl:
    inline
    void _laplP(int ir, const double* Mw, const double* l2_, const xs_array2d<cplx> &Pol, cplx*__restrict T, int lms, int lme) {
      ...
      s2d Wl = vdup(Mw[0]);		s2d Wd = vdup(Mw[1]);		s2d Wu = vdup(Mw[2]);		s2d ri_2 = vdup(Mw[3]);
    }
  • This piece of code
    void LinOpR::_alloc(int ir0, int ir1, int _nelem) {
      ...
      data -= ir_bci*nelem;		// shift pointer for faster access.
    }
    has caused an immeasurable amount of pain, as it is highly counterintuitive to have a pointer named data which points nowhere and cannot be fed to cudaMemcpy() et al. Moreover, this pointer cannot even be freed; it is actually leaked and only released upon program termination. (A possible shape of a fix is sketched after this list.)
  • Having most variables be global is nice when values diffuse into all contexts, but it makes code hard to understand.
    • It also conflicts with RAII of device storage, because the destructors run after main has terminated and the CUDA DLL has been unloaded from memory. When libcuda is linked dynamically, all cudaFree() calls return errors; when linked statically, the errors vanish.
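
To make the OpCurlLapl suggestion above concrete, here is one possible shape of the split; OpCurl, OpLapl and the trimmed helper signature are hypothetical names for this sketch, not the existing code:

    // Two parallel arrays instead of one array of 8-double structs, so helpers
    // receive a pointer/reference to a real object and no member pointer is reinterpreted.
    struct OpCurl { double Wl, Wd, Wu, r_1; };   // lower, diagonal, upper, 1/r
    struct OpLapl { double Ll, Ld, Lu, r_2; };

    struct CurlLaplOps {
      OpCurl* curl;   // indexed by ir
      OpLapl* lapl;   // indexed by ir
    };

    // The Laplacian helper now takes the OpLapl row directly: no pointer arithmetic
    // across struct members (no UB), no full 8-double object pulled onto the cache line,
    // and the variable names inside the function can finally match what they hold.
    inline void laplP_row(int ir, const OpLapl& M) {
      const double Ll = M.Ll, Ld = M.Ld, Lu = M.Lu, r_2 = M.r_2;
      // ... same arithmetic as the original _laplP ...
      (void)ir; (void)Ll; (void)Ld; (void)Lu; (void)r_2;
    }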
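
Similarly, for the shifted data pointer, a sketch of keeping the owning base pointer next to the shifted view; the member names mem and row() are made up here, echoing the TODO quoted in the Spectral snippet above:

    #include <cstddef>
    #include <cstdlib>

    class LinOpR_sketch {
      double* mem  = nullptr;   // owning pointer: what was allocated, what gets freed or fed to memcpy/cudaMemcpy
      double* data = nullptr;   // shifted alias kept for the existing fast-access indexing
      int ir_bci = 0, nelem = 0;
    public:
      void alloc(int ir0, int ir1, int nelem_) {
        ir_bci = ir0;
        nelem  = nelem_;
        const std::size_t count = std::size_t(ir1 - ir0 + 1) * std::size_t(nelem_);
        mem  = static_cast<double*>(std::malloc(count * sizeof(double)));
        data = mem - std::ptrdiff_t(ir_bci) * nelem_;        // same shift as before, call sites unchanged
      }
      double* row(int ir) { return data + std::ptrdiff_t(ir) * nelem; }
      ~LinOpR_sketch() { std::free(mem); }                   // the allocation is now actually released
    };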