Native User Defined Function

UDF is mainly suitable for scenarios where the analytical capabilities that users need do not possess. Users can implement custom functions according to their own needs, and register with Doris through the UDF framework to expand Doris’ capabilities and solve user analysis needs.

There are two types of analysis requirements that UDF can meet: UDF and UDAF. UDF in this article refers to both.

  1. UDF: User-defined function, this function will operate on a single line and output a single line result. When users use UDFs for queries, each row of data will eventually appear in the result set. Typical UDFs are string operations such as concat().
  2. UDAF: User-defined aggregation function. This function operates on multiple lines and outputs a single line of results. When the user uses UDAF in the query, each group of data after grouping will finally calculate a value and expand the result set. A typical UDAF is the set operation sum(). Generally speaking, UDAF will be used together with group by.

This document mainly describes how to write a custom UDF function and how to use it in Doris.

If users use the UDF function and extend Doris’ function analysis, and want to contribute their own UDF functions back to the Doris community for other users, please see the document Contribute UDF.

Writing UDF functions

Before using UDF, users need to write their own UDF functions under Doris’ UDF framework. In the contrib/udf/src/udf_samples/udf_sample.h|cpp file is a simple UDF Demo.

Writing a UDF function requires the following steps.

Writing functions

Create the corresponding header file and CPP file, and implement the logic you need in the CPP file. Correspondence between the implementation function format and UDF in the CPP file.

Users can put their own source code in a folder. Taking udf_sample as an example, the directory structure is as follows:

  1. └── udf_samples
  2. ├── uda_sample.cpp
  3. ├── uda_sample.h
  4. ├── udf_sample.cpp
  5. └── udf_sample.h

Non-variable parameters

For UDFs with non-variable parameters, the correspondence between the two is straightforward. For example, the UDF of INT MyADD(INT, INT) will correspond to IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2).

  1. AddUdf can be any name, as long as it is specified when creating UDF.
  2. The first parameter in the implementation function is always FunctionContext*. The implementer can obtain some query-related content through this structure, and apply for some memory to be used. The specific interface used can refer to the definition in udf/udf.h.
  3. In the implementation function, the second parameter needs to correspond to the UDF parameter one by one, for example, IntVal corresponds to INT type. All types in this part must be referenced with const.
  4. The return parameter must correspond to the type of UDF parameter.

variable parameter

For variable parameters, you can refer to the following example, corresponding to UDFString md5sum(String, ...) The implementation function is StringVal md5sumUdf(FunctionContext* ctx, int num_args, const StringVal* args)

  1. md5sumUdf can also be changed arbitrarily, just specify it when creating.
  2. The first parameter is the same as the non-variable parameter function, and the passed in is a FunctionContext*.
  3. The variable parameter part consists of two parts. First, an integer is passed in, indicating that there are several parameters behind. An array of variable parameter parts is passed in later.

Type correspondence

UDF TypeArgument Type
TinyIntTinyIntVal
SmallIntSmallIntVal
IntIntVal
BigIntBigIntVal
LargeIntLargeIntVal
FloatFloatVal
DoubleDoubleVal
DateDateTimeVal
DatetimeDateTimeVal
CharStringVal
VarcharStringVal
DecimalDecimalVal

Compile UDF function

Since the UDF implementation relies on Doris’ UDF framework, the first step in compiling UDF functions is to compile Doris, that is, the UDF framework.

After the compilation is completed, the static library file of the UDF framework will be generated. Then introduce the UDF framework dependency and compile the UDF.

Compile Doris

Running sh build.sh in the root directory of Doris will generate a static library file of the UDF framework headers|libs in output/udf/

  1. ├── output
  2. └── udf
  3. ├── include
  4. ├── uda_test_harness.h
  5. └── udf.h
  6. └── lib
  7. └── libDorisUdf.a

Writing UDF compilation files

  1. Prepare thirdparty

    The thirdparty folder is mainly used to store thirdparty libraries that users’ UDF functions depend on, including header files and static libraries. It must contain the two files udf.h and libDorisUdf.a in the dependent Doris UDF framework.

    Taking udf_sample as an example here, the source code is stored in the user’s own udf_samples directory. Create a thirdparty folder in the same directory to store the static library. The directory structure is as follows:

    1. ├── thirdparty
    2. │── include
    3. └── udf.h
    4. └── lib
    5. └── libDorisUdf.a
    6. └── udf_samples

    udf.h is the UDF frame header file. The storage path is doris/output/udf/include/udf.h. Users need to copy the header file in the Doris compilation output to their include folder of thirdparty.

    libDorisUdf.a is a static library of UDF framework. After Doris is compiled, the file is stored in doris/output/udf/lib/libDorisUdf.a. The user needs to copy the file to the lib folder of his thirdparty.

    *Note: The static library of UDF framework will not be generated until Doris is compiled.

  2. Prepare to compile UDF’s CMakeFiles.txt

    CMakeFiles.txt is used to declare how UDF functions are compiled. Stored in the source code folder, level with user code. Here, taking udf_samples as an example, the directory structure is as follows:

    1. ├── thirdparty
    2. └── udf_samples
    3. ├── CMakeLists.txt
    4. ├── uda_sample.cpp
    5. ├── uda_sample.h
    6. ├── udf_sample.cpp
    7. └── udf_sample.h
    • Need to show declaration reference libDorisUdf.a
    • Declare udf.h header file location

    Take udf_sample as an example

    1. # Include udf
    2. include_directories(thirdparty/include)
    3. # Set all libraries
    4. add_library(udf STATIC IMPORTED)
    5. set_target_properties(udf PROPERTIES IMPORTED_LOCATION thirdparty/lib/libDorisUdf.a)
    6. # where to put generated libraries
    7. set(LIBRARY_OUTPUT_PATH "${BUILD_DIR}/src/udf_samples")
    8. # where to put generated binaries
    9. set(EXECUTABLE_OUTPUT_PATH "${BUILD_DIR}/src/udf_samples")
    10. add_library(udfsample SHARED udf_sample.cpp)
    11. target_link_libraries(udfsample
    12. udf
    13. -static-libstdc++
    14. -static-libgcc
    15. )
    16. add_library(udasample SHARED uda_sample.cpp)
    17. target_link_libraries(udasample
    18. udf
    19. -static-libstdc++
    20. -static-libgcc
    21. )

    If the user’s UDF function also depends on other thirdparty libraries, you need to declare include, lib, and add dependencies in add_library.

The complete directory structure after all files are prepared is as follows:

  1. ├── thirdparty
  2. │── include
  3. └── udf.h
  4. └── lib
  5. └── libDorisUdf.a
  6. └── udf_samples
  7. ├── CMakeLists.txt
  8. ├── uda_sample.cpp
  9. ├── uda_sample.h
  10. ├── udf_sample.cpp
  11. └── udf_sample.h

Prepare the above files and you can compile UDF directly

Execute compilation

Create a build folder under the udf_samples folder to store the compilation output.

Run the command cmake ../ in the build folder to generate a Makefile, and execute make to generate the corresponding dynamic library.

  1. ├── thirdparty
  2. ├── udf_samples
  3. └── build

Compilation result

After the compilation is completed, the UDF dynamic link library is successfully generated. Under build/src/, taking udf_samples as an example, the directory structure is as follows:

  1. ├── thirdparty
  2. ├── udf_samples
  3. └── build
  4. └── src
  5. └── udf_samples
  6. ├── libudasample.so
  7. └── libudfsample.so

Create UDF function

After following the above steps, you can get the UDF dynamic library (that is, the .so file in the compilation result). You need to put this dynamic library in a location that can be accessed through the HTTP protocol.

Then log in to the Doris system and create a UDF function in the mysql-client through the CREATE FUNCTION syntax. You need to have ADMIN authority to complete this operation. At this time, there will be a UDF created in the Doris system.

  1. CREATE [AGGREGATE] FUNCTION
  2. name ([argtype][,...])
  3. [RETURNS] rettype
  4. PROPERTIES (["key"="value"][,...])

Description:

  1. “Symbol” in PROPERTIES means that the symbol corresponding to the entry function is executed. This parameter must be set. You can get the corresponding symbol through the nm command, for example, _ZN9doris_udf6AddUdfEPNS_15FunctionContextERKNS_6IntValES4_ obtained by nm libudfsample.so | grep AddUdf is the corresponding symbol.
  2. The object_file in PROPERTIES indicates where it can be downloaded to the corresponding dynamic library. This parameter must be set.
  3. name: A function belongs to a certain DB, and the name is in the form of dbName.funcName. When dbName is not explicitly specified, the db where the current session is located is used as dbName.

For specific use, please refer to CREATE FUNCTION for more detailed information.

Use UDF

Users must have the SELECT permission of the corresponding database to use UDF/UDAF.

The use of UDF is consistent with ordinary function methods. The only difference is that the scope of built-in functions is global, and the scope of UDF is internal to DB. When the link session is inside the data, directly using the UDF name will find the corresponding UDF inside the current DB. Otherwise, the user needs to display the specified UDF database name, such as dbName.funcName.

Delete UDF

When you no longer need UDF functions, you can delete a UDF function by the following command, you can refer to DROP FUNCTION.