FFC

Generate PyOP2 code

Registered by Graham Markall

Generate code for tabulate_tensor that conforms to the PyOP2 kernel specification, for doing assembly on GPUs and other multi/manycore processors.

Whiteboard

As presented at FEniCS '12, the plan for our work in the MAPDES project at
Imperial College London on GPU assembly revolves around generating code that
uses the OP2 domain-specific language for parallel computations on unstructured
meshes. We see OP2 as the right abstraction below UFL to decouple the
finite-element implementation from the efficient parallel execution of the
resulting computations on unstructured meshes. OP2 provides source-to-source
translation and automatic code synthesis for performance-portable parallel loop
computations on various hardware platforms. Currently multi-core CPUs and GPUs
are supported via OpenMP, CUDA or OpenCL, optionally combined with MPI for
inter-node parallelisation.

For generating OP2-compatible intermediate code from UFL sources we had
developed our own many-core form compiler (MCFC). MCFC gave us a good platform
for experimenting with different code generation options and discovering some
of the design issues in building a form compiler, but it doesn't support many
forms. Now that we have a much clearer idea of the abstractions and interfaces
between the form compiler and the intermediate layers, we hope to base all our
current and future work on FFC with the goal of making it accessible to many
users and contributing improvements back to the FEniCS project.

Our present plan, for the moment at least, is to maintain a branch of FFC which
supports OP2 and attempts to replicate the functionality of the FEniCS
toolchain. This will fit into a software stack consisting of UFL, FFC (branch),
OP2, and Fluidity. We hope to ensure that this toolchain can be used for the
same things as the UFL, FFC (trunk), UFC and Dolfin toolchain, with
performance portability across the targets that OP2 supports.

Throughout the development of our FFC branch, we hope to ensure that it stays in
sync with the FFC trunk - in particular, to ensure that it can pass all the unit
and regression tests. Our motivation is that merging this branch into the trunk
should not require extensive refactoring, should a merge ever be considered
appropriate.

OP2 Options
-----------

The OP2 ecosystem presents the user with a couple of different choices for
writing and generating code:

1. A C++/Fortran embedded DSL where code is generated at compile-time using
   static analysis.
2. A Python-embedded DSL where code is generated at run-time and JIT-compiled -
   referred to as PyOP2.

PyOP2 is a much more attractive option than the static OP2 library as it is much
easier to integrate into FFC and supports operations on matrices, which the
static OP2 does not.

A brief overview of PyOP2
-------------------------

In order to write an application using PyOP2, one must declare mesh data
structures, provide kernels which will operate on these data structures, and
implement calls to parallel loops that accept a kernel and some mesh data
structures. The (currently in-progress) documentation for PyOP2 is located at:
http://op2.github.com/PyOP2/ - there is some documentation on the OP2 project
page as well: http://www.oerc.ox.ac.uk/research/op2

The PyOP2 repository is at https://github.com/OP2/PyOP2

Mesh data structures
--------------------

The basic data types include Sets, Maps, and Dats. Mesh entities such as edges,
vertices, faces, etc., are declared as Sets. For example:

# op2.Set(num_entities, name)
edges = op2.Set(nedges, "edges")

Mappings between different sets (e.g. from elements to vertices where the
elements are triangular) are declared as (for example):

# op2.Map(from_set, to_set, map_arity, map_values, name)
ele_vertex_map = op2.Map(elements, vertices, 3, e_v_map_values, "ele_vertex")

where map_values is an iterable containing the numeric values for the map. Dats
store a value for every element of a set, for example the coefficients of a
field. Dats are declared as (for example):

# op2.Dat(set, dim, values, valuetype, name)
temperature = op2.Dat(nodes, 1, temp_vals, numpy.float64, "temperature")
velocity = op2.Dat(nodes, 2, velo_vals, numpy.float64, "velocity")

Matrices can also be declared in PyOP2. First, one must declare the sparsity for
the matrix using a pair of maps:

# op2.Sparsity((test_map, trial_map), dim, name)
sparsity = op2.Sparsity((elem_node, elem_node), 1, "sparsity")

where dim is the dimension of an entry in the matrix: when assembling a matrix
for a scalar field this is 1; when solving for a vector field in 2D it is 2, and
so on.

Matrices can then be declared on this sparsity:

# op2.Mat(sparsity, valuetype, name)
mass = op2.Mat(sparsity, numpy.float64, "mass")
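
For a 2D vector-valued field, the same declarations might look like the sketch
below (the names vec_sparsity and stiffness are illustrative, not taken from the
PyOP2 demos); each entry of such a sparsity is then a 2x2 block:

# Hedged sketch: dim = 2 for a 2D vector field, so each matrix entry is a
# 2x2 block. Names are illustrative only.
vec_sparsity = op2.Sparsity((elem_node, elem_node), 2, "vec_sparsity")
stiffness = op2.Mat(vec_sparsity, numpy.float64, "stiffness")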

Kernels
-------

In both OP2 and PyOP2, a kernel has to be supplied as a string of C code that
implements an operation to be performed on a single mesh entity. It is the job
of the runtime system to invoke this kernel for every element of the set that is
being iterated over. An example kernel that adds two scalar fields together
would be declared as:

add_fields = """\
void add_fields(double *result, double *field1, double *field2)
{
  *result = *field1 + *field2;
}"""
# op2.Kernel(kernel_code, kernel_name)
add_fields_kernel = op2.Kernel(add_fields, "add_fields")

Parallel loops
--------------

PyOP2 provides the user with the ability to execute a kernel in parallel over
every element of a set. Dats that are defined on the iteration set can be passed
as arguments to the kernel, as can Dats on Sets that are accessed through a Map
from the iteration set. The form of a parallel loop is:

op2.par_loop(kernel, iteration_space, *args)

The kernel must be an instance of op2.Kernel. The iteration_space is either an
op2.Set or an iteration space, produced by calling an op2.Set (more on this
below). Each of the args corresponds to one of the arguments passed to the
kernel code. An arg is constructed by calling the Dat or Mat that is to be
passed as the argument. The __call__ method of a Dat is defined as:

def __call__(self, path, access):

where path is a Map, and access is an access descriptor. If the map is indexed,
then the index value is used to index into the map to only pass a single element
through the Map (e.g. only the first vertex of each element is passed). If the
Map is not indexed, then all of the values from the target of the map are
gathered and passed to the kernel (e.g. all the vertices of an element). The
access descriptor is one of op2.READ, op2.WRITE, op2.RW, or op2.INC to inform
PyOP2 of the intended access mode of the data.
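
As a concrete illustration, the add_fields kernel above could be launched over
every node as follows. This is a hedged sketch: it assumes a nodes Set with Dats
result, field1 and field2 declared on it, and it assumes that direct (un-mapped)
access is expressed by passing op2.IdentityMap as the path; the exact spelling
may differ.

# Hedged sketch: result, field1 and field2 are assumed Dats on "nodes";
# using op2.IdentityMap as the path for directly-accessed Dats is an
# assumption, not taken from the text above.
op2.par_loop(add_fields_kernel, nodes,
             result(op2.IdentityMap, op2.WRITE),
             field1(op2.IdentityMap, op2.READ),
             field2(op2.IdentityMap, op2.READ))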

Iteration spaces
----------------

An issue in assembling local matrices on a GPU is the allocation of resources.
When the polynomial order of elements is low, there is only a small number of
matrix elements, and it can make sense to assign one thread per element to
construct the local matrices. However, when the polynomial order is higher (even
for P=3 or P=4), the local matrix has enough entries that the working set for
computing it in a single thread becomes too large, and register spilling will
impact performance significantly. Therefore, it is necessary to divide the
work for computing a single local matrix between threads.

Since the optimum partitioning of the computation will vary between devices and
kernels, PyOP2 takes control of this partitioning in order to isolate
higher-level abstractions (e.g. a form compiler, or the user) from this
lower-level complexity. This is achieved by handing control of the iteration
space of the local matrix to PyOP2. Instead of kernels containing a loop over
the i and j indices of the matrix, i and j are instead passed in as parameters.
OP2 can then manage the iteration space as necessary (and this may involve
performing some source-to-source transformations on the kernel to remove the
overhead of passing in i and j multiple times).

In summary, in a software stack that uses PyOP2, the form compiler is no longer
responsible for generating the code for assembling an entire local assembly
matrix. The form compiler is instead responsible for generating the code that
computes a single element of a local assembly matrix. PyOP2 is then responsible
for managing the execution of this code, in order to assemble the entire local
matrix.

In order to construct an iteration space, a Set is called with the dimensions of
the iteration space when creating the par_loop. For an iteration space for
assembling 3x3 local matrices, the par_loop call may look like:

op2.par_loop(mass, elements(3,3),
             mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
             coords(elem_node, op2.READ))

Because the mass kernel is now required to assemble only a single element of the
local matrix, the mat argument is indexed using a pair of maps (here the
elem_node map twice) that are themselves indexed. The index op2.i[x] instructs
PyOP2 to index the map using the current value of the x'th component of the
iteration space.
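
For illustration, a hand-written kernel matching this par_loop might look
something like the sketch below. The body is elided; only the shape of the
interface follows the description above, and FFC-generated code will differ in
detail.

# Hedged sketch: the kernel receives the (i, j) pair of the iteration space
# from PyOP2 and computes only that single entry of the local matrix.
mass_code = """
void mass(double A[1][1], double **x, int i, int j)
{
  /* quadrature contribution of basis functions i and j on this element,
     computed from the vertex coordinates x (elided here) */
  A[0][0] += 0.0; /* placeholder for the computed contribution */
}"""
mass = op2.Kernel(mass_code, "mass")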

Kernel interface
----------------

The only function FFC needs to generate when targeting PyOP2 is the
tabulate_tensor function, since all the other functions are managed by PyOP2.
Its signature is fairly similar to that of the UFC tabulate_tensor function, and
is as follows when assembling a matrix:

name(double localTensor[dim][dim],
     double *coords[geo_dim],
     double **coeff0, double **coeff1, ...
     int j, int k);

where name = (prefix)_(measure)_integral_(form)_(integral)

where:
* prefix is an arbitrary name
* measure is one of cell, exterior_facet, interior_facet
* form is the id of the form (0 <= id)
* integral is the id of the integral (0 <= id)

A single element of the local tensor is passed to the kernel; dim is always
instantiated as an integer when the kernel code is generated, and is 1 in the
scalar case and n for an n-dimensional vector field. Since PyOP2 passes in
each dat as an individual argument, the coefficients cannot be passed in as a
single array of arrays. Instead, each coefficient is passed as a separate
parameter.
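
Putting this together, invoking a generated matrix kernel from PyOP2 might look
like the following sketch. The source string kernel_code and the coefficient Dat
f are assumptions for illustration; the generated kernel body itself is not
shown, and "mass" is an arbitrary prefix.

# Hedged sketch: kernel_code is assumed to hold the FFC-generated source for
# a cell integral with prefix "mass"; f is an assumed coefficient Dat.
mass_kernel = op2.Kernel(kernel_code, "mass_cell_integral_0_0")
op2.par_loop(mass_kernel, elements(3, 3),
             mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
             coords(elem_node, op2.READ),
             f(elem_node, op2.READ))  # one argument per coefficient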

For the RHS assembly, the interface is similar, but the iteration space and
local tensor are rank-1 rather than rank-2:

name(double localTensor[dim],
     double *coords[geo_dim],
     double **coeff0, double **coeff1, ...
     int j);
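
A corresponding RHS assembly call might look like the sketch below, assuming the
one-dimensional iteration space is requested by calling the set with a single
extent; rhs_code, b and f are assumptions for illustration.

# Hedged sketch: rhs_code and the Dats b and f are assumed; the iteration
# space is one-dimensional, so only a single index j reaches the kernel.
rhs_kernel = op2.Kernel(rhs_code, "rhs_cell_integral_0_0")
op2.par_loop(rhs_kernel, elements(3),
             b(elem_node[op2.i[0]], op2.INC),
             coords(elem_node, op2.READ),
             f(elem_node, op2.READ))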

Implementation in FFC
---------------------

The branch https://code.launchpad.net/~grm08/ffc/pyop2 contains a modified
version of FFC in which part of the PyOP2 interface has been implemented.
It is intended that this branch will be kept in sync with the FFC trunk, and
should pass all the same tests for generating UFC code, as well as being able to
test the PyOP2 code generation (we have a buildbot testing this branch with the
unit and regression tests at
http://buildbot-ocean.ese.ic.ac.uk:8080/builders/ffc ).

This work began as an experimental branch during a sprint on the development of
PyOP2 - before retargeting MCFC to generate PyOP2 (as opposed to the static OP2
code we had been generating before), this branch was started in order to see how
difficult it would be to modify FFC to generate PyOP2 code. Since this turned
out to be quite straightforward, we continued work on this branch.

So far only the un-optimised quadrature transformer has been altered, since this
is the simplest code to work on. Once we have a stable set of changes, they
could probably be incorporated into the optimised quadrature transformer as
well. The tensor transformer will require some consideration as to how to fit it
in with PyOP2, since the optimised tensor code doesn't offer the same obvious
ways to split the iteration space as the quadrature representation does.

The main changes that have been made so far include:

* Code snippets. Some alternative snippets have been added where the code
  differs between PyOP2 and UFC. The changes to this interface might be tidied
  up in future - in some cases, in order to minimise the amount of change made
  in other parts of the code, certain strings have been replaced with dicts
  containing both the UFC and PyOP2 code, from which the appropriate version is
  selected by the code that uses it. (These changes are the ones in cpp.py and
  codesnippets.py.)

* Addition of a "pyop2" format alongside the existing "ufc" format in the
  parameters.

* Changes to the signature of the tabulate_tensor function in
  quadrature/quadraturegenerator.py - the _arglist function assists in
  generating the correct signature.

* Changes to the way the local tensor is dereferenced - this is due to the
  declaration of the local tensor as A[dim][dim] in the matrix case, and **A in
  the vector case.

* In the case of assembling a matrix, the loops over j and k are no longer
  generated.

* When assembling vector fields, extra loops over r (and s, for a matrix) are
  generated, the upper bound of r and s being the dimension of the vectors being
  assembled. (mainly in _create_loop_entry in quadraturetransformerbase.py).

* When the format is "pyop2", the pyop2_utils module (from the PyOP2 repository)
  is used for the templates for code generation.

Support/examples in PyOP2
-------------------------

In the demo/ folder of the PyOP2 repository, there are several working examples
that test the use of FFC with PyOP2. These exercise the following in PyOP2/FFC
(in rough terms):

* Matrix and RHS vector assembly for vector fields and scalar fields
* Iteration spaces
* Surface integrals, used to implement weak boundary conditions

The files that implement these demos are: mass2d_ffc.py, laplace_ffc.py,
weak_bcs_ffc.py, and advection_diffusion.py.

Further steps
-------------

Support for using the FFC test framework:

In order to assess what portion of FFC's functionality is covered by the PyOP2
modifications, the regression tests need to be modified to test the generated
PyOP2 code. This includes abstracting some parts of the regression tests so that
they are format-specific, since the requirements for running UFC and PyOP2 code
are very different. Since only the tabulate_tensor kernel is generated in the
PyOP2 code, the test harness alone does not have enough information to test the
code - to get around this, the analysis from FFC could be saved to disk for the
test harness to read. Some work is in progress in this direction, and is in the
lp:~grm08/+junk/ffc-pyop2-testing branch at the moment.

Testing/bug fixes:

Once we have implemented the test harness, the results can be compared against
the expected UFC results for the tests. This should indicate what is broken and
needs fixing.

Mixed elements:

The current interface for PyOP2 code doesn't accommodate mixed elements. There
are some brief notes in this direction at:
https://github.com/OP2/PyOP2/wiki/Vector-field-assembly
