Loading File Data into Greenplum

You can use the gpsscli utility to load data from a file or from the stdout of a command into Greenplum Database.

Load Procedure

You will perform the following tasks when you use the Greenplum Streaming Server to load file or command output data into a Greenplum Database table:

Ensure that you meet the Prerequisites.
Register the Greenplum Streaming Server extension.
Identify the format of the data.
Construct the load configuration file.
Create the target Greenplum Database table.
Assign Greenplum Database role permissions to the table, if required.
Run the gpsscli Client Commands to load the data into Greenplum Database.
Check for load errors.

Prerequisites

Before using the gpsscli utilities to load file or command output data to Greenplum Database, ensure that:

Your systems meet the Prerequisites documented for the Greenplum Streaming Server.
The file or command is accessible on the ETL server host, and the operating system user running the gpss server process has the appropriate permissions to access the file or run the command.

About Supported Data Formats

To write file or command output data into a Greenplum Database table, you must identify the format of the data in the load configuration file.

The Greenplum Streaming Server supports loading files of the following formats:

Format	Description
avro	Avro-format data. Specify the `avro` format when you want to load a single-object encoded Avro file. GPSS reads Avro data from the file and loads it into a single JSON-type column. You must define a mapping if you want GPSS to write the data into specific columns in the target Greenplum Database table. GPSS supports `libz`-, `lzma`-, and `snappy`-compressed Avro data.
binary	Binary format data. Specify the `binary` format when your file is binary format. GPSS reads binary data from the file and loads it into a single bytea-type column.
csv	Comma-delimited text format data. Specify the `csv` format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF).
custom	Data of a custom format, parsed by a custom formatter function.
delimited	Text data separated by a configurable delimiter. The `delimited` data format supports a multi-byte delimiter.
json, jsonl (version 2 only)	JSON- or JSONB-format data. Specify the `json` format when the file is in JSON or JSONB format. GPSS can read JSON data as a single object or can read a single JSON record per line. You must define a mapping if you want GPSS to write the data into specific columns in the target Greenplum Database table. Note: GPSS supports JSONB-format data only when loading to Greenplum 6. Note: Specify `FORMAT: jsonl` in version 2 format load configuration files. Specify `json` with `is_jsonl: true` in version 3 (Beta) format load configuration files.

Constructing the filesource.yaml Configuration File

You configure a data load operation from a file or command output to Greenplum Database via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file or command and information about the Greenplum Database connection and target table, as well as error thresholds for the operation.

The Greenplum Streaming Server supports versions 2 and 3 (Beta) of the YAML configuration file when you load data into Greenplum from a file or command output. Versions 2 and 3 of the configuration file format support loading file/command and meta data to Greenplum.

Refer to the filesource-v2.yaml reference page for configuration file format and the configuration properties supported.

A sample version 2 file load YAML configuration file named loadfromfile2.yaml follows:

DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 5432
VERSION: 2
FILE:
   INPUT:
      SOURCE:
         URL: file:///tmp/file.csv
      VALUE:
         COLUMNS:
           - NAME: id
             TYPE: int
           - NAME: cname
             TYPE: text
           - NAME: oname
             TYPE: text
         FORMAT: delimited
         DELIMITED_OPTION:
           DELIMITER: "\t"
           EOL_PREFIX: "$$EOL$$"
           QUOTE: '&'
           ESCAPE: '|'
      META:
         COLUMNS:
           - NAME: meta
             TYPE: json
         FORMAT: json
      ERROR_LIMIT: 25
   OUTPUT:
      SCHEMA: gpschema
      TABLE: gptable
      MODE: INSERT
      MAPPING:
        - NAME: id
          EXPRESSION: id
        - NAME: cname
          EXPRESSION: cname
        - NAME: fname
          EXPRESSION: (meta->>'filename')::text
SCHEDULE:
  RETRY_INTERVAL: 500ms
  MAX_RETRIES: 2

Greenplum Database Options (Version 2-Focused)

You identify the Greenplum Database connection options via the DATABASE, USER, PASSWORD, HOST, and PORT parameters.

The VERSION parameter identifies the version of the GPSS YAML configuration file.

Note
You must specify version 2 when you load from a file or command output data source into Greenplum using this format.

Input Options

You can direct GPSS to load from a file or from the stdout of a command:

Specify a file location using the SOURCE:URL property. GPSS supports wildcards in the file path. If you want to read all files in a directory, specify dirname/*.
Alternatively, you can load the stdout of a command by specifying the command and execution options using the properties in the the SOURCE:EXEC block.

The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in VALUE:COLUMNS:NAME with a column name in the target Greenplum Database OUTPUT:TABLE:

You must identify the data elements in the order in which they appear in the file or command output.
You must provide the same name for each data element and its associated Greenplum Database table column.
You must specify a compatible data type for each data element and its associated Greenplum Database table column.

The VALUE block must specify a FORMAT. The VALUE:FORMAT keyword identifies the format of the file. GPSS supports comma-delimited text (csv), binary (binary), or JSON/JSONB (json), and Avro (avro) format files. GPSS also supports data that is separated by a configurable multi-byte delimiter (delimited).

When you provide a META block, you must specify a single JSON-type COLUMNS and the FORMAT: json. Meta data for a file is a single text property named filename. GPSS does not support meta data when loading from command output.

The FILTER parameter identifies a filter to apply to the data before it is loaded into Greenplum Database. If the filter evaluates to true, GPSS loads the data. The data is dropped if the filter evaluates to false. The filter string must be a valid SQL conditional expression and may reference one or more VALUE column names.

The ERROR_LIMIT parameter identifies the number of errors or the error percentage threshold after which GPSS should exit the load operation.

FILE:OUTPUT Options

You identify the target Greenplum Database schema name and table name via the FILE:OUTPUT: SCHEMA and TABLE parameters. You must pre-create the Greenplum Database table before you attempt to load file or command output data.

The default load mode is to insert data into the Greenplum Database table. GPSS also supports updating and merging data into a Greenplum table. You specify the load MODE, the MATCH_COLUMNS and UPDATE_COLUMNS, and any UPDATE_CONDITIONs that must be met to merge or update the data. In MERGE MODE, you can also specify ORDER_COLUMNS to filter out duplicates and a DELETE_CONDITION.

You can override the default mapping of the INPUT:VALUE:COLUMNS by specifying a MAPPING block in which you identify the association between a specific column in the target Greenplum Database table and a data value element. You can also map the META data column, and map a Greenplum Database table column to a value expression.

Note
When you specify a MAPPING block, ensure that you provide entries for all data elements of interest - GPSS does not automatically match column names when you provide a MAPPING.

About the Merge Load Mode

MERGE mode is similar to an UPSERT operation; GPSS may insert new rows in the database, or may update an existing database row that satisfies match and update conditions. GPSS deletes rows in MERGE mode when the data satisfies an optional DELETE_CONDITION that you specify.

GPSS stages a merge operation in a temporary table, generating the SQL to populate the temp table from the set of OUTPUT configuration properties that you provide.

GPSS uses the following algorithm for MERGE mode processing:

Create a temporary table like the target table.
Generate the SQL to insert the source data into the temporary table.
1. Add the MAPPINGS.
2. Add the FILTER.
3. Use MATCH_COLUMNS and ORDER_COLUMNS to filter out duplicates.
Update the target table from rows in the temporary table that satisfy MATCH_COLUMNS, UPDATE_COLUMNS, and UPDATE_CONDITION.
Insert non-matching rows into the target table.
Delete rows in the target table that satisfy MATCH_COLUMNS and the DELETE_CONDITION.
Truncate the temporary table.

About the JSON Format and Column Type

When you specify FORMAT: json or FORMAT: jsonl, valid COLUMN:TYPEs for the data include json or jsonb. You can also specify the new GPSS gp_jsonb (Beta) or gp_json (Beta) column types.

gp_jsonb is an enhanced JSONB type that adds support for \u escape sequences and unicode. For example, gp_jsonb can escape \uDD8B and \u0000 as text format, but jsonb treats these characters as illegal.
gp_json is an enhanced JSON type that can tolerate certain illegal unicode sequences. For example, gp_json automatically escapes incorrect surrogate pairs and processes \u0000 as \\u0000. Note that unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.

You can use the gp_jsonb (Beta) and gp_json (Beta) data types as follows:

As the COLUMN:TYPE when the target Greenplum Database table column type is json or jsonb.
In a MAPPING when the target Greenplum Database column is text or varchar. For example:
```
EXPRESSION: (j->>'a')::text
```
In a MAPPING when FORMAT: avro and the target Greenplum Database column is json or jsonb. For example:
```
EXPRESSION: j::gp_jsonb
```
or
```
EXPRESSION: j::gp_json
```
In a MAPPING when FORMAT: avro and the target Greenplum Database column is text or varchar. For example:
```
EXPRESSION: (j::gp_jsonb->>'a')::text
```
or
```
EXPRESSION: (j::gp_json->>'a')::text
```

Note
The gp_jsonb (Beta) and gp_json (Beta) data types are defined in an extension named dataflow. You must CREATE EXTENSION dataflow; in each database in which you choose to use these (Beta) data types.

Preserving Ill-Formed JSON Escape Sequences

GPSS exposes a configuration parameter that you can use with the gp_jsonb and gp_json types. The name of this parameter is gpss.json_preserve_ill_formed_prefix. When set, GPSS does not return an error when it encounters an ill-formed JSON escape sequence with these types, but instead prepends it with the prefix that you specify.

For example, if gpss.json_preserve_ill_formed_prefix is set to the string "##" as follows:

SET gpss.json_preserve_ill_formed_prefix = "##";

and GPSS encounters an ill-formed JSON sequence such as the orphaned low surrogate \ude04X, GPSS writes the data as ##\ude04X instead.

About META, VALUEs, and FORMATs

You can specify the avro, binary, csv, delimited, or json data format in the Version 2 configuration file INPUT:VALUE:FORMAT, with some restrictions. You cannot specify a META block when INPUT:VALUE:FORMAT is csv.

About Transforming and Mapping Input Data

You can define a MAPPING between the input data (VALUE:COLUMNS and META:COLUMNS) and the columns in the target Greenplum Database table. Defining a mapping may be useful when you have a multi-field input column (such as a JSON-type column), and you want to assign individual components of the input field to specific columns in the target table.

You might also use a MAPPING to assign a value expression to a target table column. The expression must be one that you could specify in the SELECT list of a query, and can include a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so forth.

If you choose to map more than one input column in an expression, you can can create a user-defined function to parse and transform the input column and return the columns of interest.

For example, suppose you are loading a JSON file with the following contents:

{ "customer_id": 1313131, "some_intfield": 12 }
{ "customer_id": 77, "some_intfield": 7 }
{ "customer_id": 1234, "some_intfield": 56 }

You could define a user-defined function, udf_parse_json(), to parse the data as follows:

=> CREATE OR REPLACE FUNCTION udf_parse_json(value json)
     RETURNS TABLE (x int, y text)
   LANGUAGE plpgsql AS $$
     BEGIN
        RETURN query
        SELECT ((value->>'customer_id')::int), ((value->>'some_intfield')::text);
     END $$;

This function returns the two fields in each JSON record, casting the fields to integer and text, respectively.

An example MAPPING for file data in a JSON-format FILE:INPUT:COLUMNS named jdata follows:

MAPPING: 
  cust_id: (jdata->>'customer_id')
  field2: ((jdata->>'some_intfield') * .075)::decimal
  j1, j2: (udf_parse_json(jdata)).*

The Greenplum Database table definition for this example scenario is:

=> CREATE TABLE t1map( cust_id int, field2 decimal(7,2), j1 int, j2 text );

Creating the Greenplum Table

You must pre-create the Greenplum table before you load data into Greenplum Database. You use the FILE:OUTPUT: SCHEMA and TABLE load configuration file parameters to identify the schema and table names.

The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.

The name and data type that you specify for a column of the target Greenplum Database table must match the name and data type of the related data element. If you have defined a column mapping, the name of the Greenplum Database column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.

The CREATE TABLE command for the target Greenplum Database table receiving the data defined in the loadfromfile2.yaml file presented in the Constructing the filesource.yaml Configuration File section follows:

 testdb=# CREATE TABLE payables.expenses2( id int8, cname text, fname text );