Interface for many-core architectures

Registered by Garth Wells on 2010-07-13

Investigate how UFC can be adapted to support multi- and many-core processors.

Blueprint information

Not started
Needs approval
Series goal:
Milestone target:

Related branches


GNW: Initial thoughts:

First step is to allow tabulate_tensor to compute an arbitrary number of element tensors. This would be good for OpenMP and CUDA/OpenCL.
  - Need to pass multiple 'cells' (cell coordinates to cells). Would be good to flatten member arrays in ufc::cell.

GRM: Further thoughts (based on some notes I'd made beforehand, and what we talked about on Tue 13th)

 - Perhaps making this change to ufc::cell to support multiple cells provides sufficient flexibility throughout most of the interface for supporting CUDA. The member functions where execution occurs on the GPU can contain a call to lauch a kernel implementing the operation.
 - Adding an extra function to ufc that allows the assembler to enquire about how many cells to pass at a time might be a good idea - this will allow it to pass relatively few cells at a time if the generated code is parallelised using OpenMP (e.g. 8 or 16), whereas the assembler should pass a much larger number of cells for a CUDA/OpenCL-on-GPU implementation (e.g. 16000).
 - Since execution on the GPU needs to be relatively "uniform", all the cells that are passed in one go will probably need to have the same type of element (polynomial order of base, cell_shape) etc. (Is this assumption already true, or could a function space have different order polynomials in it?)
 - The row-major ordering is not conducive to fast execution on CUDA, as it prevents coalescing. Arrays may need to be passed in column-major order. Maybe another function could be added to enquire whether the arrays are row- or column-major should be added? (It might also be possible to make CUDA code that works well with the row-major ordering, but I think that the transformation might be difficult to do in some cases).

Florian (2010-07-21): Further thoughts about a CUDA implementation

1) General considerations
  - transfers host -> GPU are expensive, as little communication and as much data kept on the GPU as possible should be the goal
  - an operation on a cell in UFC would need to be launced for all cells at once on the GPU
  - all input data needed for this operation must reside in GPU memory before the operation is launched (or computed on the fly, which might be problematic, see below)
  - data that should reside on the GPU for assembly should be
    * tabulated element tensors, needing as input
      * coordinates
      * evaluated coefficients
    * dofmap
  - the GPU has serious restrictions when it comes to memory access, they should be coalesced (i.e. adjacent thread read adjacent memory words) and indirection avoided at all cost
  - this implies that for tensor tabulation and later assembly data must be laid out for all vertices of all cells of the mesh consecutively
  - to obtain coalesced access it should be SoA, not AoS

2) Some background on CUDA to get a better feel for the restrictions and needs

CUDA distinguishes 3 kinds of functions:
__host__ functions called from the host and executed on the host
__global__ function called from the host and executed on the GPU (kernel)
__device__ function called from the GPU and executed on the GPU (will be inlined)

CUDA imposes a number of restrictions that would need to be respected by UFC and the code generator:
- kernels don't support function calls. It is possible to call __device__ functions (these calls will be inlined), but they must be defined in the same compilation unit
- kernels do not support reference arguments, which means data can only be passed by pointer
- pointer to pointer is evil in a CUDA kernel since it would mean 2 loads from global memory and will most certainly kill coalescing -> all arrays should be flattened
- further restrictions can be found in section B 1.4 of the CUDA programming guide:
  * __device__ and __global__ functions do not support recursion.
  * __device__ and __global__ functions cannot declare static variables inside their body.
  * __device__ and __global__ functions cannot have a variable number of arguments.
  * __device__ functions cannot have their address taken; function pointers to __global__ functions, on the other hand, are supported.
  * The __global__ and __host__ qualifiers cannot be used together.
  * __global__ functions must have void return type.
  * Any call to a __global__ function must specify its execution configuration (i.e. how many threads are supposed to be dispatched)
  * A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution (kernels are serialized from the GPU perspective though and hence this should not be a restriction).

3) Some implications and suggestions
  - all pointer to pointer member data structures should be flattened
  - all functions supposed to act on all cells of the mesh and taking a cell as input now should launch a kernel performing the action for all cells of the mesh at once
  - ufc::cell should contain pointers to GPU memory for the data needed
  - ufc::cell should have a member for the number of cells it represents
  - to keep the flexibility of UFC for different platforms, the UFC functions should be gateway functions that dispatch kernels on the device
  - FFC should be augmented to generate device-specific code (kernels and their dispatching calls) + wrapper code (as it does already) based on the -l switch
  - the UFC functions taking a ** parameter dispatch a kernel and dynamically generate the right number of parameters for the kernel call (i.e. in the case of tabulate_tensor where the coefficients are given as **, but as mentioned above that is not a suitable format to pass to the kernel)


Work Items

This blueprint contains Public information 
Everyone can see this information.