This topic describes custom formatters and how to use them with Greenplum Streaming Server.
A custom formatter is a C function that performs specific formatting or processing on data that is accessed by a Greenplum Database external table. A custom formatter may support options that you provide to direct the processing performed by the function. Greenplum includes built-in and custom formatters. You can also develop your own custom formatter.
You compile the C custom formatter functions that you develop into a shared library. These functions are available to Greenplum Database users after the shared library is installed in the Greenplum Database cluster and the custom formatter functions are registered as SQL UDFs.
This topic includes the following sections:
A custom formatter is a PostgreSQL C language function; refer to C-Language Functions in the PostgreSQL documentation for detailed information about developing C language functions.
Important header files for custom formatter development include:
fmgr.h. These headers define the functions and macros required to interact with PostgreSQL and formatter C structures.
For an example Greenplum Database custom formatter implementation, refer to the formatter_fixedwidth example in the Greenplum Database open source
See Custom Formatter for Kafka for a GPSS- and Kafka-specific custom formatter example.
You can develop and test a custom formatter against a Greenplum Database external table that specifies the
file: protocol in the
LOCATION URI. Any custom formatter that you develop and test in this fashion should be compatible with GPSS.
The remainder of this topic describes special considerations when developing a custom formatter for use with GPSS.
GPSS handles data boundaries from the source; a custom formatter can expect to receive a complete Kafka message. Certain formatter API calls are expected to behave differently than a typical Greenplum Database formatter:
FORMATTER_SET_DATACURSOR()has no effect.
FORMATTER_GET_DATALEN()always returns the full size of the message. The message is guaranteed to be complete.
When the custom formatter encounters an unrecoverable error, it should invoke
ereport() with the error code
ERRCODE_INTERNAL_ERROR (or its siblings) to indicate that the process should be terminated. The bad row data will not be written to the error log in this case.
If the custom formatter encounters an ignorable error and the data loading should continue, it should invoke
ereport() with the error code
ERRCODE_DATA_EXCEPTION (or its siblings). In this scenario, GPSS writes the bad row data to the error log automatically.
The GPSS extension invokes the
FORMATTER_SET_BAD_ROW_DATA() function; the function has no effect when invoked by the custom formatter.
Greenplum Database truncates bad row data written to the error log at the first 0x00 byte. As a result, the
gp_read_persistent_error_log() functions may return an incomplete
You compile the custom formatter function that you write into a shared library that the Greenplum Database server loads on demand.
You can use the PostgreSQL build extension infrastructure (PGXS) to build the source code for your custom formatter function against a Greenplum Database installation. This framework automates common build rules for simple modules. If you have a more complicated use case, you must write your own build system.
To use the PGXS infrastructure to generate a shared library for a custom formatter function that you develop, create a simple
Makefile that sets PGXS-specific variables.
Note: Refer to Extension Building Infrastructure in the PostgreSQL documentation for information about the
Makefile variables supported by PGXS.
For example, the following
Makefile generates a shared library named
customfmt_example.so from a C source file named
MODULE_big = customfmt_example OBJS = customfmt.o PG_CPPFLAGS = -I$(shell $(PG_CONFIG) --includedir) SHLIB_LINK = -L$(shell $(PG_CONFIG) --libdir) PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS)
MODULE_big identifes the base name of the shared library generated by the
PG_CPPFLAGS adds the Greenplum Database installation include directory to the compiler header file search path.
SHLIB_LINK adds the Greenplum Database installation library directory to the linker search path.
PGXS variable settings and the
include statement are required and typically reside in the last three lines of the
Before you can use a custom formatter, you must register the function with Greenplum Database.
Registering the function involves mapping the function signature to a SQL user-defined function. You define this mapping with the
CREATE FUNCTION .. AS command specifying the shared library name. You may choose to use the same name or differing names for the custom formatter and SQL functions.
CREATE FUNCTION ... AS syntax follows:
CREATE FUNCTION <sql_function_name>(<arg>[, ...]) RETURNS <return_type> AS '<shared_library_path>'[, '<formatter_function_name>'] LANGUAGE C STABLE;
You may omit the shared library
.so extension when you specify
The following command registers a C function named
customfmt_import() to a SQL UDF named
customfmt_in() when the function is compiled and linked in a shared library named
CREATE FUNCTION customfmt_in() RETURNS record AS 'customfmt_example.so', 'customfmt_import' LANGUAGE C STABLE;
You can use a custom formatter in your GPSS load job by identifying the
custom data format and providing the formatter function name and parameters in the load configuration file. The names of these properties differ in version 2 and version 3 (Beta) format configuration files. Version 2 format example:
FORMAT: custom CUSTOM_OPTION: NAME: formatter_in PARAMSTR: aaa="test",bbb="123"
Version 3 (Beta) format example:
custom: columns: - name: value type: text name: formatter_in options: - aaa="test" - bbb="123"
When you specify a custom formatter in your GPSS load configuration file, GPSS invokes the formatter to process the data before loading it into Greenplum Database.