Reduction of tabulated tensor for integral over a subset of a MixedSpace

Registered by Johan Hake

When an integral over a MixedSpace only includes Arguments from a subset of the MixedSpace, alot of zeros are inserted into the global matrix. This becomes utterly slow when one uses a _alot_ of global dofs ("Real") in a MixedSpace. If a Matrix is assembled using no Global dofs alot of zeros are added for each cell during Assemble. This can be avoided with comparable small changes in the Assembler together with a change in ufc::integral.

Blueprint information

Status:
Not started
Approver:
None
Priority:
High
Drafter:
None
Direction:
Needs approval
Assignee:
None
Definition:
Discussion
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Whiteboard

The expensive part is the call to:

      A.add(&ufc.A[0], dofs);

I have timed the cost of the above operation for the following stripped code:

for i in [0, 16]:
    V = MixedFunctionSpace([FunctionSpace(mesh, "CG", 1)]+\
                           [FunctionSpace(mesh, "R", 0)]*i)
    v = TestFunction(V)
    u = TrialFunction(V)
    A = assemble(u[0]*v[0]*dx)

With a repetition of 1.5M cell assembles this is the timing:

Without global dofs 1.2 s
With 16 global dofs 16.9 s

A slowdown of 10 times. Note I have used the Epetra backend, which is much faster than the PETSc backend.

I figure that one do not have to alter the tabulate_tensor interface, as no memory is created or copied for this operation. This means that the time is not spent in tabulate_tensor. One just need to tabulate the degrees of freedom for each rank that is used in an integral. Such a tabulation can be added to the ufc::integral:

  uint value_size(uint rank)
  *uint local_dofs(uint rank)

If value_size == dofmap.max_cell_dimension() the integral tabulates all dofs, and no reduction is needed. If a reduction is needed we can use the local_dofs to retabulate the local to global dofs. Helper functions for example in UFC can be added to facilitate this. A corresponding smaller block can be allocated and filled for the original tabulated tensor and passed to the A.add method.

GNW: I don't quite follow the above.

JH: The point is to not reduce the size of the values that get tabulated, as it is not the tabulation part that takes time. The local tensor UFC::A will have the same size as before. We need to construct a reduced vector<double> A_reduced (and a reduced dofmap), which follows the size of the non zeros pattern of the integral. Then we grab the values from the tabulated tensor (UFC::A) and put them into A_reduced () and ship this to GenericTensor::Add. The copying of values cost far less time than the communication to the underlying linear algebra backend.

I hacked the Assembler.cpp to see if any speed up could be expected with this method, and found that for the particular case showed above the timings dropped to:

Without global dofs 1.0 s
With 16 global dofs 1.0 s

for the Add method. The extra time for constructing the subdofmap for each cell was insignificant compared to the speed up.

This feature if added can also be used for other forms using a Mixed space.

The suggestions implies changes to ufc, see above. FFC/SFC and maybe also UFL if there is no convenient way of extracting the indices and the corresponding local dofs of the used Arguments in a form, and last to DOLFIN. But I consider the changes quite moderate for such a useful feature!

AM: I ran into the same slow-down problem when using stabilized P1-P0 elements for the Stokes flow. Here the bilinearform is defined by

a = (inner(grad(u), grad(v)) - div(v)*p - q*div(u))*dx - beta_h*inner(jump(p),jump(q))*dS

and the jump terms are only assemble for the pressure elements. Nevertheless ffc
generates code to fill in and scatter a ((4*3 + 1)*2)** = 676 matrix instead of a 2*2 matrix which is sufficient to assemble the contributions from the interior facets.

I wrote a script illustrating that point by 1) assemble the entire form and 2) assembling the 4 blocks (not using a mixed space). The result for 30*30*30 UnitCube is
entire form: 67 s
blocks: : 11 s

AL: I suggest we add the following functions to the UFC interface:

class foo_integral
{
  uint num_nonzero_entries() const;
  void tabulate_nonzero_entries(uint* nonzero_entries) const;
}

One may then check if num_nonzero_entries is smaller than the number of total entries and in that case extract only a subset to send to A.add().

JH:
Sounds good. I just realized that this fix will work when we also get a domain constrained FunctionSpace, where we would need to have a flexible sized element tensor for each integral

(?)

Work Items

This blueprint contains Public information 
Everyone can see this information.