# Generate PyOP2 code

Generate code for tabulate_tensor that conforms to the PyOP2 kernel specification, enabling assembly on GPUs and other multi- and many-core processors.

## Blueprint information

- Status: Not started
- Approver: None
- Priority: Undefined
- Drafter: None
- Direction: Needs approval
- Assignee: None
- Definition: New
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by
- Completed by

### Whiteboard

As presented at FEniCS '12, the plan for our work in the MAPDES project at Imperial College London on GPU assembly revolves around generating code that uses the OP2 domain-specific language for parallel computations on unstructured meshes. We see OP2 as the right abstraction below UFL to decouple the finite-element implementation from the efficient parallel execution of the resulting computations on unstructured meshes. OP2 provides source-to-source translation and automatic code synthesis for performance-portable computations on various hardware platforms. Currently, multi-core CPUs and GPUs are supported via OpenMP, CUDA or OpenCL, optionally combined with MPI for inter-node parallelisation.

For generating OP2-compatible intermediate code from UFL sources we had developed our own many-core form compiler (MCFC). MCFC gave us a good platform for experimenting with different code generation options and discovering some of the design issues in building a form compiler, but it doesn't support many forms. Now that we have a much clearer idea of the abstractions and interfaces between the form compiler and the intermediate layers, we hope to base all our current and future work on FFC, with the goal of making it accessible to many users and contributing improvements back to the FEniCS project.

Our present plan, for the moment at least, is to maintain a branch of FFC which supports OP2 and attempts to replicate the functionality of the FEniCS toolchain. This will fit into a software stack consisting of UFL, FFC (branch), OP2, and Fluidity. We hope to ensure that this toolchain can do the same things that the UFL, FFC (trunk), UFC and Dolfin toolchain can be used for, with performance portability across the targets that OP2 supports.

Throughout the development of our FFC branch, we hope to ensure that it stays in sync with the FFC trunk - in particular, to ensure that it can pass all the unit and regression tests. Our motivation for this is to ensure that merging this branch with the trunk should not require extensive refactoring, should it ever be considered appropriate to merge.

OP2 Options
-----------

The OP2 ecosystem presents the user with two choices for writing and generating code:

1. A C++/Fortran-embedded DSL where code is generated at compile time using static analysis.
2. A Python-embedded DSL where code is generated at run time and JIT-compiled, referred to as PyOP2.

PyOP2 is a much more attractive option than the static OP2 library, as it is easier to integrate into FFC and it supports operations on matrices, which static OP2 does not.

A brief overview of PyOP2
-------------------------

In order to write an application using PyOP2, one must declare mesh data structures, provide kernels which will operate on these data structures, and implement calls to parallel loops that accept a kernel and some mesh data structures. The (currently in-progress) documentation for PyOP2 is located at: http:// and there is a further page as well: http:// The PyOP2 repository is at https:/

Mesh data structures
--------------------

The basic data types include Sets, Maps, and Dats. Mesh entities such as edges, vertices, faces, etc. are declared as Sets. For example:

```python
# op2.Set(size, name)
edges = op2.Set(nedges, "edges")
```

Mappings between different sets (e.g. from elements to vertices, where the elements are triangular) are declared as Maps, for example:

```python
# op2.Map(from_set, to_set, map_arity, map_values, name)
ele_vertex_map = op2.Map(elements, vertices, 3, e_v_map_values, "ele_vertex")
```

where map_values is an iterable containing the numeric values for the map. Dats store a value for every element of a set, for example the coefficients of a field. Dats are declared as (for example):

```python
# op2.Dat(set, dim, values, valuetype, name)
temperature = op2.Dat(nodes, 1, temp_vals, numpy.float64, "temperature")
velocity = op2.Dat(nodes, 2, velo_vals, numpy.float64, "velocity")
```

Matrices can also be declared in PyOP2. First, one must declare the sparsity for the matrix using a pair of maps:

```python
# op2.Sparsity((row_map, col_map), dim, name)
sparsity = op2.Sparsity((ele_vertex_map, ele_vertex_map), 1, "sparsity")
```

where dim is the dimension of an element in the matrix - e.g. when assembling a scalar field matrix, this will be 1; when one is solving for a vector field in 2D, this would be 2, etc.
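The role of dim can be made concrete with a small sketch. This follows the usual blocked-matrix convention (an entry of a matrix with dim d stands for a d x d block of scalars); the helper function here is purely illustrative and is not part of PyOP2:

```python
# Illustrative sketch of the blocked-matrix indexing implied by dim:
# each (row, col) entry of a matrix with dim d is a d x d block of
# scalars, so component c of node n occupies scalar index n * d + c.
# This is a hypothetical helper, not PyOP2 API.

def scalar_index(node, component, dim):
    """Scalar row/column index of one component of a blocked entry."""
    return node * dim + component

# For a scalar field (dim = 1) the node index is the scalar index.
print(scalar_index(3, 0, 1))  # 3

# For a 2D vector field (dim = 2), the y-component (c = 1) at node 3
# lands in scalar row 7.
print(scalar_index(3, 1, 2))  # 7
```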

Matrices can then be declared on this sparsity:

```python
# op2.Mat(sparsity, valuetype, name)
mass = op2.Mat(sparsity, numpy.float64, "mass")
```
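To illustrate how these data types fit together, the following is a plain-Python model of the Set/Map/Dat concepts (an illustrative sketch only; the real PyOP2 objects carry far more machinery, and the concrete numbers here are invented):

```python
# Plain-Python model of the PyOP2 Set/Map/Dat concepts above.
# All values are illustrative, not taken from a real mesh.

nelements = 2

# A Set is essentially a size: here, 2 triangles sharing an edge.
elements = range(nelements)

# A Map of arity 3 from elements to vertices: one row of three
# vertex indices per element (cf. ele_vertex_map above).
e_v_map_values = [[0, 1, 2],
                  [1, 3, 2]]

# A Dat on the vertices with dim 2: x/y coordinates per vertex.
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Gathering through the map collects, for each element, the data of
# all three of its vertices -- this is what a kernel would receive.
gathered = [[coords[v] for v in e_v_map_values[e]] for e in elements]

print(gathered[0])  # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```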

Kernels
-------

In both OP2 and PyOP2, a kernel has to be supplied as a string of C code that implements an operation to be performed on a single mesh entity. It is the job of the runtime system to invoke this kernel for every element of the set that is being iterated over. An example kernel that adds two scalar fields together would be declared as:

```python
add_fields = """\
void add_fields(double *result, double *field1, double *field2)
{
  *result = *field1 + *field2;
}"""

# op2.Kernel(code, name)
add_fields_kernel = op2.Kernel(add_fields, "add_fields")
```

Parallel loops
--------------

PyOP2 provides the user with the ability to execute a kernel in parallel over every element of a set. Dats that are defined on the set can be arguments to the kernel, as can Dats on Sets that are accessed through a map from the iteration set. The form of a parallel loop is:

```python
op2.par_loop(kernel, iteration_space, *args)
```

The kernel must be an instance of op2.Kernel. The iteration_space is either an op2.Set, or an iteration space, which is produced by calling an op2.Set (more about this later). Each of the args corresponds to one of the arguments that is passed to the code in the kernel. An arg is constructed by calling the Dat or Mat which is to be passed as the argument. The `__call__` method of a Dat is defined as:

```python
def __call__(self, path, access):
```

where path is a Map and access is an access descriptor. If the map is indexed, then the index value is used to index into the map so that only a single element is passed through the Map (e.g. only the first vertex of each element is passed). If the Map is not indexed, then all of the values from the target of the map are gathered and passed to the kernel (e.g. all the vertices of an element). The access descriptor is one of op2.READ, op2.WRITE, op2.RW, or op2.INC, to inform PyOP2 of the intended access mode of the data.
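The execution semantics of a direct parallel loop can be sketched in plain Python (a serial model of what the runtime does, not the PyOP2 implementation; the function names mirror the add_fields example above but are otherwise invented):

```python
# Serial model of a direct op2.par_loop over a set: one kernel
# invocation per set element. PyOP2 would run these iterations
# concurrently on the target hardware; the semantics are the same.

def add_fields(result, field1, field2, i):
    # Python stand-in for the C kernel: *result = *field1 + *field2
    result[i] = field1[i] + field2[i]

def par_loop(kernel, set_size, result, field1, field2):
    # The runtime, not the kernel, owns the loop over the set.
    for i in range(set_size):
        kernel(result, field1, field2, i)

field1 = [1.0, 2.0, 3.0]   # a Dat on the set, dim 1
field2 = [10.0, 20.0, 30.0]
result = [0.0] * 3
par_loop(add_fields, 3, result, field1, field2)
print(result)  # [11.0, 22.0, 33.0]
```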

Iteration spaces
----------------

An issue in assembling local matrices on a GPU is the allocation of resources. When the polynomial order of elements is low, there is only a small number of matrix elements, and it can make sense to assign one thread per element to construct the local matrices. However, when the polynomial order is higher (even for P=3 or P=4), there will be enough entries in the local matrix that the resource requirements for computing the entire local matrix are increased such that the working set is too large for a single thread, and register spilling will impact performance significantly. Therefore, it is necessary to divide the work for computing a single local matrix between threads.

Since the optimum partitioning of the computation will vary between devices and kernels, PyOP2 takes control of this partitioning in order to isolate higher-level abstractions (e.g. a form compiler, or the user) from this lower-level complexity. This is achieved by handing control of the iteration space of the local matrix to PyOP2. Instead of kernels containing a loop over the i and j indices of the matrix, i and j are instead passed in as parameters. OP2 can then manage the iteration space as necessary (and this may involve performing some source-to-source transformations on the kernel to remove the overhead of passing in i and j multiple times).

In summary, in a software stack that uses PyOP2, the form compiler is no longer responsible for generating the code for assembling an entire local assembly matrix. The form compiler is instead responsible for generating the code that computes a single element of a local assembly matrix. PyOP2 is then responsible for managing the execution of this code in order to assemble the entire local matrix.

In order to construct an iteration space, a Set is called with the dimensions of the iteration space when creating the par_loop. For an iteration space for assembling 3x3 local matrices, the par_loop call may look like:

```python
op2.par_loop(mass, elements(3, 3),
             mat((elem_node[op2.i[0]], elem_node[op2.i[1]]), op2.INC),
             ...)
```

Because the mass kernel is now required to assemble only a single element of the local matrix, the mat argument is indexed into using a map (the elem_node map) that is itself indexed. The index op2.i[x] instructs OP2 to index the map using the current value of the x'th component of the iteration space.
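This division of labour between the kernel and the runtime can be sketched in plain Python (names, map values, and the trivial kernel body are invented for illustration; the point is who owns which loop):

```python
# Sketch of iteration-space assembly: the kernel computes a single
# entry of the 3x3 local matrix, while the runtime owns the loops
# over (i, j) and the scatter into the global matrix through the
# doubly-indexed map, with op2.INC (accumulate) semantics.

nnodes = 4
elem_node = [[0, 1, 2], [1, 3, 2]]   # element -> node map, arity 3

def mass_entry(elem, i, j):
    # Stand-in for the generated kernel: returns one (i, j) entry of
    # the local matrix rather than the whole 3x3 block.
    return 1.0 if i == j else 0.5

# Global matrix as a dense list-of-lists, for illustration only.
A = [[0.0] * nnodes for _ in range(nnodes)]

for elem in range(len(elem_node)):
    for i in range(3):                  # first iteration-space index
        for j in range(3):              # second iteration-space index
            row = elem_node[elem][i]    # map indexed by op2.i[0]
            col = elem_node[elem][j]    # map indexed by op2.i[1]
            A[row][col] += mass_entry(elem, i, j)   # op2.INC

print(A[1][1])  # node 1 is touched by both elements: 1.0 + 1.0 = 2.0
```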

Kernel interface
----------------

The only function generated by FFC that is needed when generating PyOP2 code is the tabulate_tensor function, since all the other functions are managed by PyOP2. The function signature for this code is fairly similar to that of the UFC tabulate_tensor function, and is as follows when assembling a matrix:

```c
name(double localTensor[dim][dim],
     double *coords[geo_dim],
     double **coeff0, double **coeff1, ...,
     int j, int k);
```

where name = (prefix)_(measure)_integral_(form)_(integral), and:

* prefix is an arbitrary name
* measure is one of cell, exterior_facet, interior_facet
* form is the id of the form (0 <= id)
* integral is the id of the integral (0 <= id)

A single element of the local tensor is passed to the kernel - the dim is always instantiated as an integer at the time the kernel code is generated, and is 1 in the scalar case and n for an n-dimensional vector field. Since PyOP2 passes each Dat in as an individual argument, the coefficients cannot be passed in as a single array of arrays; instead, each coefficient is passed as a separate parameter.

For the RHS assembly, the interface is similar, but the iteration space and local tensor are rank-1 rather than rank-2:

```c
name(double localTensor[dim],
     double *coords[geo_dim],
     double **coeff0, double **coeff1, ...,
     int j);
```

Implementation in FFC
---------------------

The branch https:/ contains a version of FFC which has some implementation of the PyOP2 interface completed. It is intended that this branch will be kept in sync with the FFC trunk, and it should pass all the same tests for generating UFC code, as well as being able to test the PyOP2 code generation (we have a buildbot testing this branch with the unit and regression tests at http://

This work began as an experimental branch during a sprint on the development of PyOP2 - before retargeting MCFC to generate PyOP2 (as opposed to the static OP2 code we had been generating before), this branch was started in order to see how difficult it would be to modify FFC to generate PyOP2 code. Since this turned out to be quite straightforward, we continued work on this branch.

So far only the un-optimised quadrature transformer has been altered, since this is the simplest code to work on - once we have a stable set of changes, these could probably be incorporated into the optimised quadrature transformer as well. The tensor transformer will require some consideration as to how to fit it in with PyOP2, since the optimised tensor code doesn't have the same obvious ways to split the iteration space that the quadrature representation does.

The main changes that have been made so far include:

* Code snippets. Some alternative snippets have been added where the code differs between PyOP2 and UFC. The changes to this interface might be tidied up in future - in some cases, in order to minimise the amount of change made in other parts of the code, certain strings have been replaced with dicts containing both the UFC and PyOP2 code, the appropriate one being selected by the code which uses it. (These changes are the ones in cpp.py and codesnippets.py.)
* Addition of the "pyop2" format, in addition to the "ufc" format, in the parameters.
* Changes to the signature of the tabulate_tensor function in quadrature/ for generating the correct signature.
* Changes to the way the local tensor is dereferenced - this is due to the declaration of the local tensor as A[dim][dim] in the matrix case, and **A in the vector case.
* In the case of assembling a matrix, the loops over j and k are no longer generated.
* When assembling vector fields, extra loops over r (and s, for a matrix) are generated, the upper bound of r and s being the dimension of the vectors being assembled. (Mainly in _create_loop_entry in quadraturetrans
* When the format is "pyop2", the pyop2_utils module (from the PyOP2 repository) is used for the templates for code generation.

Support/examples in PyOP2
-------------------------

In the demo/ folder, there are several functioning examples which test the use of FFC with PyOP2. These exercise the following in PyOP2/FFC (in rough terms):

* Matrix and RHS vector assembly for vector fields and scalar fields
* Iteration spaces
* Surface integrals, by using them for implementing weak boundary conditions

The files that implement these demos are: mass2d_ffc.py, laplace_ffc.py, weak_bcs_ffc.py, and advection_

Further steps
-------------

Support for using the FFC test framework:

In order to assess what portion of FFC's functionality is covered by the PyOP2 modifications, the regression tests need to be modified to test the generated PyOP2 code. This includes abstracting some parts of the regression test into specific formats, since the requirements for running UFC and PyOP2 code are very different. Since only the tabulate_tensor kernel is generated in the PyOP2 code, the test harness alone does not have enough information to test the code - to get around this, the analysis from FFC could be saved to disk for the test harness to read. Some work is in progress in this direction, in the lp:~grm08/+junk/ffc-pyop2-testing branch at the moment.

Testing/bug fixes:

Once we have implemented the test harness, the results can be compared against the expected UFC results for the tests. This should provide pointers to what needs fixing up or is broken.

Mixed elements:

The current interface for PyOP2 code doesn't accommodate mixed elements. There are some brief notes in this direction at: https:/