Complete work for TMUDF compile time interface

Registered by Hans Zeller

Suresh and others implemented much of a compile time interface for TMUDFs (Table-Mapping UDFs). Such an interface allows a TMUDF to be polymorphic (input and output columns decided at compile time, not at create time) and to do optimizations like elimination of unneeded columns, pushing predicates below or into the TMUDF, getting better cardinality and cost estimates, and using sort order and partitioning of input tables.

A TMUDF can have zero or more input tables. Using more than one input table is neither tested
nor supported at the moment, but the design should allow it.

The interface will be a C++ class that the TMUDF writer can derive from. Implementing a compiler
interface for a TMUDF is completely optional and is done by overriding virtual methods of the
default implementation. In a later step, we also want to replace the C run-time interface,
designed a few years ago, with this C++ interface. We should be able to define a
Java compile time interface that's fairly similar to the one in C++. A TMUDF writer only needs to
override those interfaces that need to be different from the default implementation. For example,
a TMUDF could define its output columns through the compiler interface, but it might not
support pushing predicates into the TMUDF. Another example could be a TMUDF that only
implements the compiler interface that determines the degree of parallelism.
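
The "override only what you need" pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual interface: the class names CompilerInterfaceSketch and SerialOnlyUDF and both method names are hypothetical stand-ins chosen for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the base class; NOT the real Trafodion interface.
class CompilerInterfaceSketch
{
public:
  virtual ~CompilerInterfaceSketch() {}

  // Default behavior: the output columns are exactly those declared in the DDL.
  virtual std::vector<std::string> describeOutputColumns(
       const std::vector<std::string> &ddlColumns)
  {
    return ddlColumns;
  }

  // Default behavior: accept the degree of parallelism the compiler proposes.
  virtual int chooseDegreeOfParallelism(int proposed)
  {
    assert(proposed >= 1);
    return proposed;
  }
};

// A UDF writer who only cares about parallelism overrides one method
// and inherits the default behavior for everything else.
class SerialOnlyUDF : public CompilerInterfaceSketch
{
public:
  int chooseDegreeOfParallelism(int /* proposed */) override
  {
    return 1; // always run serially
  }
};

} // namespace sketch
```

Calling chooseDegreeOfParallelism through a base class reference reaches the override, while describeOutputColumns falls back to the default, which is the behavior the paragraph above asks for.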

There are three main classes in this interface, all defined in file core/sql/sqludr/sqludr.h:

UDRInvocationInfo: This is similar to SQLUDR_TMUDFINFO in the C interface. It describes
the metadata of the TMUDF, scalar input parameters, the table-valued result, PARTITION BY
and ORDER BY clauses specified for table-valued inputs, etc. There is one of these for every
TMUDF invocation in a query. In some cases, the compiler may create additional
UDRInvocationInfo objects when it transforms the TMUDF, for example by placing it under
a nested join, with a different set of predicates to be pushed down. There are additional
classes to describe parameters, table-valued inputs and outputs, data types, similar to
the existing C interface.

UDRPlanInfo: There are zero or more UDRPlanInfo objects for every UDRInvocationInfo
object. The optimizer creates one for every optimization goal (context) where it needs to
call the TMUDF interface.

TMUDRInterface: This class represents the code associated with a TMUDF. The class itself
represents the default behavior of a TMUDF without a compile time interface. UDF writers
can define a derived class and implement virtual methods to customize the optimizer
interface. Trafodion tries to find a C function named
<UDF external function name>_CreateCompilerInterfaceObject
in the UDF library. If that function exists, it is assumed to return an object of a class
derived from TMUDRInterface, and the compiler calls the virtual methods on that object.
Some of these methods may be defined in the derived class and others in the base class;
a derived class method may also call the base class method to do part of the work.
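
The lookup-with-fallback behavior described for TMUDRInterface can be sketched as follows. This is only an illustration: the SymbolTable map is a hypothetical stand-in for the symbols exported by a UDF library, the classes InterfaceBase and CustomInterface are invented for the example, and the factory takes no arguments here for brevity. How Trafodion actually resolves the symbol is not shown.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical base class representing the default (no compiler interface) behavior.
class InterfaceBase
{
public:
  virtual ~InterfaceBase() {}
  virtual std::string name() const { return "default"; }
};

// Hypothetical derived class a UDF writer might provide.
class CustomInterface : public InterfaceBase
{
public:
  std::string name() const override { return "custom"; }
};

typedef InterfaceBase *(*FactoryFn)();

// Stand-in for the exported symbols of a UDF library (an assumption of this sketch).
typedef std::map<std::string, FactoryFn> SymbolTable;

// Build the factory symbol name from the UDF external function name.
inline std::string factoryName(const std::string &externalName)
{
  return externalName + "_CreateCompilerInterfaceObject";
}

// Use the UDF-provided factory if it exists, otherwise fall back to the default object.
inline std::unique_ptr<InterfaceBase> createInterface(
     const SymbolTable &library, const std::string &externalName)
{
  SymbolTable::const_iterator it = library.find(factoryName(externalName));
  if (it != library.end())
    {
      assert(it->second != nullptr);
      return std::unique_ptr<InterfaceBase>(it->second());
    }
  return std::unique_ptr<InterfaceBase>(new InterfaceBase());
}

// Example factory, playing the role of SESSIONIZE_CreateCompilerInterfaceObject.
inline InterfaceBase *makeCustom() { return new CustomInterface(); }

} // namespace sketch
```

With a table that contains only the SESSIONIZE factory, createInterface returns the custom object for "SESSIONIZE" and the default object for any other external name.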

Here are the methods we plan to support:

- Validate scalar input parameters, possibly allow those parameters to deviate from the
  parameter list declared at DDL time.
- Allow the compiler interface to look at constant values that are passed in as input
  parameters.
- Define the table-valued result columns, based on scalar parameters and column
  layout of the input (child) table(s).
- Eliminate unneeded columns from the TMUDF result and also from the input tables.
- Allow predicates to be pushed down through the TMUDF operator to the child table(s).
- Allow predicates to be absorbed into the TMUDF.
- Return a cost estimate of the TMUDF, based on information available at compile time.
- Influence the degree of parallelism chosen for the TMUDF.
- Make use of natural partitioning and sort order of input (child) tables to produce
  partitioned and sorted results.
- Gather, during the compile time interaction, the information that is
  needed at runtime.
- At any time in the process, the compile time interface can raise an exception. If
  it does, the compilation will fail and an error message provided by the TMUDF
  writer will be returned in the diagnostics area.

Some more design choices:

The C++ interface uses its own C++ namespace, to avoid naming collisions and
to make it look more similar to Java. Objects are allocated on the system heap and
are deleted after statement compilation is finished. The interface does not use any
of the Trafodion objects like NAHeap, ComDiagsArea, etc. The interface does use
C++ STL for strings and collection templates, again with the goal to stay close
to Java.

NAString ==> std::string
NAHeap ==> C++ system heap
ComDiagsArea ==> Throw exception with an attached SQLSTATE and error message
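
To illustrate the last mapping, here is a minimal sketch of an exception that carries an SQLSTATE value and a printf-style formatted message. The class SketchException and the helper checkTableInputs are hypothetical names for this example; the real UDRException may be defined differently.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <exception>
#include <string>

namespace sketch {

// Hypothetical exception type; the actual UDRException may differ.
class SketchException : public std::exception
{
public:
  // Formats the message printf-style and stores it with the SQLSTATE.
  SketchException(int sqlState, const char *format, ...)
    : sqlState_(sqlState)
  {
    assert(format != nullptr);
    char buf[512];
    va_list args;
    va_start(args, format);
    std::vsnprintf(buf, sizeof(buf), format, args);
    va_end(args);
    text_ = buf;
  }

  int getSQLState() const { return sqlState_; }
  const char *what() const noexcept override { return text_.c_str(); }

private:
  int sqlState_;
  std::string text_;
};

// Plays the role of the compiler: runs a check and turns an exception
// into a diagnostics string instead of using a ComDiagsArea.
inline std::string checkTableInputs(int numTableInputs)
{
  try
    {
      if (numTableInputs != 1)
        throw SketchException(38001,
                              "Expecting one table-valued input, got %d",
                              numTableInputs);
      return "OK";
    }
  catch (const SketchException &e)
    {
      return std::to_string(e.getSQLState()) + ": " + e.what();
    }
}

} // namespace sketch
```

The UDF code only throws; the caller decides how to report the SQLSTATE and message, which keeps the interface free of Trafodion-internal types.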

In this first implementation, we only support a C++ interface and the compiler will
call that interface directly, without going through the tdm_udrserv process. We may
need a special privilege to allow a user to define code that's executed in the
Trafodion process (the privilege to create a TMUDF that has a compile time interface).
In the longer term we hope to support the following flavors:

- C++ and Java interfaces for the TMUDF, both at compile time and run time.
- Trusted and isolated modes, both at compile and run time.

Example code for a "sessionize" TMUDF that expects a single table input and
passes all columns through to the result, in addition to the session id column
(assume that the session id column is defined as the only output column in the DDL):

class SessionizeUDFInterface : public TMUDRInterface
{
public:
  // override any methods where the UDF author would
  // like to change the default behavior

  void describeParamsAndColumns(UDRInvocationInfo &info);

};

void SessionizeUDFInterface::describeParamsAndColumns(
     UDRInvocationInfo &info)
{
      // sessionize is intended to work with a single input table
      if (info.getNumTableInputs() != 1)
        throw UDRException(38001,
                           "Expecting one table-valued input, got %d",
                           info.getNumTableInputs());

      // add all input table columns as output columns
      info.addPassThruColumns(0);

}

extern "C" TMUDRInterface * SESSIONIZE_CreateCompilerInterfaceObject(
     const UDRInvocationInfo *info)
{
  return new SessionizeUDFInterface();
}
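
To make the effect of the example concrete, the following self-contained mock shows what passing through the columns of input table 0 could look like. MockInvocationInfo is a hypothetical class written for this sketch; in particular, appending the pass-through columns after the DDL-defined columns is an assumption, not documented behavior of UDRInvocationInfo.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical mock of the invocation info; NOT the real UDRInvocationInfo.
class MockInvocationInfo
{
public:
  MockInvocationInfo(const std::vector<std::vector<std::string> > &inputs,
                     const std::vector<std::string> &ddlOutputColumns)
    : inputs_(inputs), output_(ddlOutputColumns) {}

  int getNumTableInputs() const { return static_cast<int>(inputs_.size()); }

  // Append every column of the given table-valued input to the output columns.
  // The ordering (DDL columns first) is an assumption of this sketch.
  void addPassThruColumns(int inputTableNum)
  {
    assert(inputTableNum >= 0 && inputTableNum < getNumTableInputs());
    const std::vector<std::string> &cols = inputs_[inputTableNum];
    output_.insert(output_.end(), cols.begin(), cols.end());
  }

  const std::vector<std::string> &getOutputColumns() const { return output_; }

private:
  std::vector<std::vector<std::string> > inputs_; // columns of each input table
  std::vector<std::string> output_;               // result columns
};

} // namespace sketch
```

For an input table with columns USER_ID and TS and a DDL that declares only SESSION_ID, the mock ends up with three output columns after the call.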

Blueprint information

Status: Started
Approver: Suresh Subbiah
Priority: Medium
Drafter: Hans Zeller
Direction: Approved
Assignee: Hans Zeller
Definition: Approved
Series goal: Accepted for trunk
Implementation: Good progress
Milestone target: r1.1
Started by: Hans Zeller

Whiteboard

Gerrit topic: https://review.trafodion.org/#q,topic:bp/cmp-tmudf-compile-time-interface,n,z

Addressed by: https://review.trafodion.org/787
    TMUDF C++ compiler interface, part of log-reading TMUDF

Addressed by: https://review.trafodion.org/824
    Phase 2 for log reader TMUDF

Addressed by: https://review.trafodion.org/848
    Log reading TMUDF, phase 3

Gerrit topic: https://review.trafodion.org/#q,topic:bug/1420539,n,z

Addressed by: https://review.trafodion.org/1125
    C++ run-time interface for TMUDFs

Gerrit topic: https://review.trafodion.org/#q,topic:bug/cmp-tmudf-compile-time-interface,n,z

Addressed by: https://review.trafodion.org/1249
    Normalizer interface for TMUDFs, blueprint cmp-tmudf-compile-time-interface

Addressed by: https://review.trafodion.org/1279
    Normalizer interface for TMUDFs, blueprint cmp-tmudf-compile-time-interface

Gerrit topic: https://review.trafodion.org/#q,topic:langman_for_tmudf,n,z

Addressed by: https://review.trafodion.org/1655
    Using the language manager for UDF compiler interface

Gerrit topic: https://review.trafodion.org/#q,topic:bug/1433192,n,z

Addressed by: https://review.trafodion.org/1717
    Costing and statistics compiler interfaces for UDFs
