Speed up Python assembly to match C++ assembly

Registered by Anders Logg

Assembling from Python is slower than from C++, with an overhead of 10-20%.

Example:

  V = VectorFunctionSpace(mesh, "CG", 1)
  Q = FunctionSpace(mesh, "CG", 1)
  q = TestFunction(Q)
  u = Function(V)
  L = - q*div(u)*dx

C++: 0.052 s
Python: 0.062 s (with parameters["optimize"] = True, otherwise even slower)

Results seem to be the same when commenting out the following lines in UFC::update():

  for (uint i = 0; i < coefficients.size(); i++)
    coefficients[i]->interpolate(w[i], this->cell, cell.index());

With DOLFIN 0.7.2, assembly is even faster: 0.032 s.

Blueprint information

Status: Complete
Approver: None
Priority: High
Drafter: None
Direction: Needs approval
Assignee: None
Definition: Obsolete
Series goal: None
Implementation: Unknown
Milestone target: 0.9.7
Completed by: Garth Wells

Whiteboard

* Do you have any test script for these figures?

AL: See below.

* What mesh size do you use? You should probably use a larger mesh.

AL: Yes, we could do that, but the difference is significant even for a smaller mesh.
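
A minimal sketch of how one could repeat the measurement for a few mesh sizes (the sizes 8, 16, 32 below are arbitrary; everything else follows the assemble_python.py script further down), to see whether the relative Python overhead shrinks as the mesh grows:

  from dolfin import *
  from time import time

  parameters["ffc_representation"] = "tensor"
  parameters["optimize"] = True

  for N in (8, 16, 32):  # arbitrary mesh sizes
      mesh = UnitCube(N, N, N)
      V = VectorFunctionSpace(mesh, "CG", 1)
      Q = FunctionSpace(mesh, "CG", 1)
      q = TestFunction(Q)
      u = Function(V)
      L = - q*div(u)*dx

      b = assemble(L)  # warm-up call (triggers JIT compilation on the first pass)
      t = time()
      for i in range(5):
          b = assemble(L)
      print N, (time() - t) / 5.0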

* Is the time measured only for the call to assemble(), or is it the time to execute the whole Python script?

AL: The time is for the call to assemble_cells in Assembler.cpp, as reported by summary().
JH: OK, I thought it was the time for calling the assemble function.
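
For reference, a small self-contained sketch (same mesh and form as in assemble_python.py below) that records both numbers: the wall-clock time of a single Python assemble() call and, via summary(), DOLFIN's internal timings including the assembly over cells. A large gap between the two would point to Python-side overhead outside assemble_cells:

  from dolfin import *
  from time import time

  mesh = UnitCube(16, 16, 16)
  V = VectorFunctionSpace(mesh, "CG", 1)
  Q = FunctionSpace(mesh, "CG", 1)
  q = TestFunction(Q)
  u = Function(V)
  L = - q*div(u)*dx

  b = assemble(L)  # warm-up call (includes JIT compilation)
  t0 = time()
  b = assemble(L)
  print "Wall-clock time for one assemble() call:", time() - t0

  summary()  # compare against the cell assembly timing reported here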

* Are we sure that the extension module built with instant is using the same compile flags as DOLFIN, i.e., the same optimization?

AL: Yes, the difference is larger otherwise. In particular, I have tried both setting -O0 when compiling the C++ code and setting parameters["optimize"] = True in the Python code.

* We also need to measure the time it takes to create a dolfin.Form. In C++ this is done outside the assembly loop, whereas in Python we do it inside. If this contributes significantly to the overhead, we might be able to cache the dolfin.Forms.

AL: I think we already cache the forms.
JH: Yes, we cache the UFC forms but not the dolfin.Forms. But I realize now that this is not included in the timing; see above.
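
If the per-call dolfin.Form creation does turn out to matter, a possible workaround is sketched below. This is only a sketch and assumes an API where a dolfin Form object can wrap the UFL form once and then be passed directly to assemble() (as in later DOLFIN releases); the exact names available at the time of this blueprint may differ:

  from dolfin import *
  from time import time

  mesh = UnitCube(16, 16, 16)
  V = VectorFunctionSpace(mesh, "CG", 1)
  Q = FunctionSpace(mesh, "CG", 1)
  q = TestFunction(Q)
  u = Function(V)
  L = - q*div(u)*dx

  form = Form(L)          # wrap (and JIT-compile) the UFL form once, outside the loop
  b = assemble(form)      # warm-up call
  n = 5
  t = time()
  for i in range(n):
      b = assemble(form)  # no per-call dolfin.Form creation inside the loop
  print "Mean time per assemble:", (time() - t) / float(n)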

Here are the code listings. The mesh I used originally is too big to distribute, so I have switched to a UnitCube; I don't know what a reasonable size is.

assemble_python.py
------------------------

#!/usr/bin/env python

from dolfin import *
from time import time

# Use the tensor representation (as for the C++ version) and FFC optimization
parameters["ffc_representation"] = "tensor"
parameters["optimize"] = True

mesh = UnitCube(16, 16, 16)

V = VectorFunctionSpace(mesh, "CG", 1)
Q = FunctionSpace(mesh, "CG", 1)
q = TestFunction(Q)
u = Function(V)
L = - q*div(u)*dx

# Time n repeated assemblies and report the mean
n = 5
t = time()
for i in range(n):
    print i
    b = assemble(L)
t = (time() - t) / float(n)
print "Time to assemble:", t

summary()

assemble_cpp.cpp
------------------------

#include <iostream>
#include <dolfin.h>
#include "RHS.h"

using namespace dolfin;

int main()
{
  UnitCube mesh(16, 16, 16);

  RHS::FunctionSpace Q(mesh);
  RHS::LinearForm L(Q);

  // The coefficient u lives in the coefficient space generated by FFC
  RHS::CoefficientSpace_u V(mesh);
  Function u(V);
  L.u = u;

  // Time n repeated assemblies and report the mean
  Vector b;
  int n = 5;
  double t = time();
  for (int i = 0; i < n; ++i)
  {
    std::cout << i << std::endl;
    assemble(b, L);
  }
  t = (time() - t) / static_cast<double>(n);
  std::cout << "Time to assemble: " << t << std::endl;

  summary();
}

RHS.ufl
----------

V = VectorElement("CG", tetrahedron, 1)
Q = FiniteElement("CG", tetrahedron, 1)
u = Function(V)
q = TestFunction(Q)
L = - q*div(u)*dx

Should be compiled with ffc -l dolfin -r tensor RHS.ufl.

GNW (11-02-22): I've run this on a finer mesh (32x32x32) and cannot detect a difference between Python and C++.
