You can use the gpsscli utility to load data from a file or from the stdout of a command into Tanzu Greenplum.
You use the gpsscli utilities with a YAML-formatted load configuration file when you use the VMware Tanzu Greenplum streaming server (GPSS) to load file or command output data into a Tanzu Greenplum table.

Before using the gpsscli utilities to load file or command output data into Tanzu Greenplum, ensure that the gpss server process has the appropriate permissions to access the file or run the command.

To write file or command output data into a Tanzu Greenplum table, you must identify the format of the data in the load configuration file.
The Tanzu Greenplum streaming server supports loading files of the following formats:
| Format | Description |
|---|---|
| avro | Avro-format data. Specify the `avro` format when you want to load a single-object encoded Avro file. GPSS reads Avro data from the file and loads it into a single JSON-type column. You must define a mapping if you want GPSS to write the data into specific columns in the target Tanzu Greenplum table. GPSS supports `libz`-, `lzma`-, and `snappy`-compressed Avro data. |
| binary | Binary-format data. Specify the `binary` format when your file is in binary format. GPSS reads binary data from the file and loads it into a single bytea-type column. |
| csv | Comma-delimited text-format data. Specify the `csv` format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF). |
| custom | Data of a custom format, parsed by a custom formatter function. |
| delimited | Text data separated by a configurable delimiter. The `delimited` data format supports a multi-byte delimiter. |
| json, jsonl (version 2 only) | JSON- or JSONB-format data. Specify the `json` format when the file is in JSON or JSONB format. GPSS can read JSON data as a single object or can read a single JSON record per line. You must define a mapping if you want GPSS to write the data into specific columns in the target Tanzu Greenplum table. Note: GPSS supports JSONB-format data only when loading to Greenplum 6. Note: Specify `FORMAT: jsonl` in version 2 format load configuration files; specify `json` with `is_jsonl: true` in version 3 format load configuration files. |
You configure a data load operation from a file or command output to Tanzu Greenplum via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file or command and information about the Tanzu Greenplum connection and target table, as well as error thresholds for the operation.
The Tanzu Greenplum streaming server supports versions 2 and 3 of the YAML configuration file when you load data into Greenplum from a file or command output. Both versions of the configuration file format support loading the file or command output data and its meta data to Greenplum.
Refer to the filesource-v2.yaml reference page for the configuration file format and the configuration properties supported. You may find a quick start guide and sample YAML configuration files under the $GPHOME/docs/cli_help/gpss directory.
A sample version 2 file load YAML configuration file named loadfromfile2.yaml follows:
```yaml
DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 5432
VERSION: 2
FILE:
  INPUT:
    SOURCE:
      URL: file:///tmp/file.csv
    VALUE:
      COLUMNS:
        - NAME: id
          TYPE: int
        - NAME: cname
          TYPE: text
        - NAME: oname
          TYPE: text
      FORMAT: delimited
      DELIMITED_OPTION:
        DELIMITER: "\t"
        EOL_PREFIX: "$$EOL$$"
        QUOTE: '&'
        ESCAPE: '|'
    META:
      COLUMNS:
        - NAME: meta
          TYPE: json
      FORMAT: json
    ERROR_LIMIT: 25
  OUTPUT:
    SCHEMA: gpschema
    TABLE: gptable
    MODE: INSERT
    MAPPING:
      - NAME: id
        EXPRESSION: id
      - NAME: cname
        EXPRESSION: cname
      - NAME: fname
        EXPRESSION: (meta->>'filename')::text
SCHEDULE:
  RETRY_INTERVAL: 500ms
  MAX_RETRIES: 2
```
You identify the Tanzu Greenplum connection options via the `DATABASE`, `USER`, `PASSWORD`, `HOST`, and `PORT` parameters.

The `VERSION` parameter identifies the version of the GPSS YAML configuration file.

Note: You must specify version 2 when you load from a file or command output data source into Greenplum using this format.
You can direct GPSS to load from a file or from the stdout of a command:

- To load from a file, specify the file path in the `SOURCE:URL` property. GPSS supports wildcards in the file path. If you want to read all files in a directory, specify `dirname/*`.
- To load from the stdout of a command, specify the command and execution options using the properties in the `SOURCE:EXEC` block.

The default source-to-target data mapping behavior of GPSS is to match a column name as defined in `VALUE:COLUMNS:NAME` with a column name in the target Tanzu Greenplum `OUTPUT:TABLE`.
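For example, a minimal `SOURCE` block that reads every file in a directory via a wildcard follows; the directory path is illustrative:

```yaml
FILE:
  INPUT:
    SOURCE:
      URL: file:///tmp/data/*
```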
The `VALUE` block must specify a `FORMAT`. The `VALUE:FORMAT` keyword identifies the format of the file. GPSS supports comma-delimited text (`csv`), binary (`binary`), JSON/JSONB (`json`), and Avro (`avro`) format files. GPSS also supports data that is separated by a configurable multi-byte delimiter (`delimited`).
When you provide a `META` block, you must specify a single JSON-type column in `COLUMNS` and `FORMAT: json`. The meta data for a file is a single `text` property named `filename`. GPSS does not support meta data when loading from command output.
The `FILTER` parameter identifies a filter to apply to the data before it is loaded into Tanzu Greenplum. If the filter evaluates to `true`, GPSS loads the data. The data is dropped if the filter evaluates to `false`. The filter string must be a valid SQL conditional expression and may reference one or more `VALUE` column names.
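For example, a sketch of a filter that keeps only rows whose `id` value exceeds 100 follows. The placement of `FILTER` within the `INPUT` block is based on the version 2 configuration file layout; confirm against the filesource-v2.yaml reference:

```yaml
FILE:
  INPUT:
    VALUE:
      COLUMNS:
        - NAME: id
          TYPE: int
      FORMAT: delimited
    FILTER: id > 100
```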
The `ERROR_LIMIT` parameter identifies the number of errors or the error percentage threshold after which GPSS should exit the load operation.
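For example, the parameter may be set to an absolute count or to a percentage; the percentage string syntax shown here is an assumption based on the parameter description, so verify it against the filesource-v2.yaml reference:

```yaml
ERROR_LIMIT: 25      # stop the load after 25 errors
# or, as a percentage threshold:
ERROR_LIMIT: "10%"   # stop when 10 percent of rows error
```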
You identify the target Tanzu Greenplum schema name and table name via the `FILE:OUTPUT:SCHEMA` and `TABLE` parameters. You must pre-create the Tanzu Greenplum table before you attempt to load file or command output data.
The default load mode is to insert data into the Tanzu Greenplum table. GPSS also supports updating and merging data into a Greenplum table. You specify the load `MODE`, the `MATCH_COLUMNS` and `UPDATE_COLUMNS`, and any `UPDATE_CONDITION` that must be met to merge or update the data. In `MERGE` mode, you can also specify `ORDER_COLUMNS` to filter out duplicates and a `DELETE_CONDITION`.
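A sketch of an `OUTPUT` block that merges rows follows. All property names appear in this section; the specific column names and condition expressions are illustrative only:

```yaml
OUTPUT:
  SCHEMA: gpschema
  TABLE: gptable
  MODE: MERGE
  MATCH_COLUMNS:
    - id
  ORDER_COLUMNS:
    - cname
  UPDATE_COLUMNS:
    - cname
  UPDATE_CONDITION: cname IS NOT NULL
  DELETE_CONDITION: cname = '__delete__'
```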
You can override the default mapping of the `INPUT:VALUE:COLUMNS` by specifying a `MAPPING` block in which you identify the association between a specific column in the target Tanzu Greenplum table and a data value element. You can also map the `META` data column, and map a Tanzu Greenplum table column to a value expression.

Note: When you specify a `MAPPING` block, ensure that you provide entries for all data elements of interest; GPSS does not automatically match column names when you provide a `MAPPING`.
`MERGE` mode is similar to an `UPSERT` operation; GPSS may insert new rows in the database, or may update an existing database row that satisfies match and update conditions. GPSS deletes rows in `MERGE` mode when the data satisfies an optional `DELETE_CONDITION` that you specify.

GPSS stages a merge operation in a temporary table, generating the SQL to populate the temp table from the set of `OUTPUT` configuration properties that you provide.

GPSS uses the following algorithm for `MERGE` mode processing:
1. Applies any `MAPPINGS`.
2. Applies the `FILTER`.
3. Uses `MATCH_COLUMNS` and `ORDER_COLUMNS` to filter out duplicates.
4. Updates rows based on `MATCH_COLUMNS`, `UPDATE_COLUMNS`, and `UPDATE_CONDITION`.
5. Deletes rows based on `MATCH_COLUMNS` and the `DELETE_CONDITION`.

When you specify `FORMAT: json` or `FORMAT: jsonl`, valid `COLUMN:TYPE`s for the data include `json` or `jsonb`. You can also specify the new GPSS `gp_jsonb` or `gp_json` column types.
- `gp_jsonb` is an enhanced JSONB type that adds support for `\u` escape sequences and unicode. For example, `gp_jsonb` can escape `\uDD8B` and `\u0000` as text format, but `jsonb` treats these characters as illegal.
- `gp_json` is an enhanced JSON type that can tolerate certain illegal unicode sequences. For example, `gp_json` automatically escapes incorrect surrogate pairs and processes `\u0000` as `\\u0000`. Note that unicode escape values cannot be used for code point values above `007F` when the server encoding is not `UTF8`.

You can use the `gp_jsonb` and `gp_json` data types as follows:
- As the `COLUMN:TYPE` when the target Tanzu Greenplum table column type is `json` or `jsonb`.
- In a `MAPPING` when the target Tanzu Greenplum column is `text` or `varchar`. For example: `EXPRESSION: (j->>'a')::text`
- In a `MAPPING` when `FORMAT: avro` and the target Tanzu Greenplum column is `json` or `jsonb`. For example: `EXPRESSION: j::gp_jsonb` or `EXPRESSION: j::gp_json`
- In a `MAPPING` when `FORMAT: avro` and the target Tanzu Greenplum column is `text` or `varchar`. For example: `EXPRESSION: (j::gp_jsonb->>'a')::text` or `EXPRESSION: (j::gp_json->>'a')::text`
Note: The `gp_jsonb` and `gp_json` data types are defined in an extension named `dataflow`. You must run `CREATE EXTENSION dataflow;` in each database in which you choose to use these data types.
Preserving Ill-Formed JSON Escape Sequences
GPSS exposes a configuration parameter that you can use with the `gp_jsonb` and `gp_json` types. The name of this parameter is `gpss.json_preserve_ill_formed_prefix`. When it is set, GPSS does not return an error when it encounters an ill-formed JSON escape sequence with these types, but instead prepends the sequence with the prefix that you specify.
For example, if `gpss.json_preserve_ill_formed_prefix` is set to the string `"##"` as follows:

```sql
SET gpss.json_preserve_ill_formed_prefix = "##";
```

and GPSS encounters an ill-formed JSON sequence such as the orphaned low surrogate `\ude04X`, GPSS writes the data as `##\ude04X` instead.
You can specify the `avro`, `binary`, `csv`, `delimited`, or `json` data format in the version 2 configuration file `INPUT:VALUE:FORMAT` property, with some restrictions. You cannot specify a `META` block when `INPUT:VALUE:FORMAT` is `csv`.
You can define a `MAPPING` between the input data (`VALUE:COLUMNS` and `META:COLUMNS`) and the columns in the target Tanzu Greenplum table. Defining a mapping may be useful when you have a multi-field input column (such as a JSON-type column) and you want to assign individual components of the input field to specific columns in the target table.
You might also use a `MAPPING` to assign a value expression to a target table column. The expression must be one that you could specify in the `SELECT` list of a query, and can include a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so forth. If you choose to map more than one input column in an expression, you can create a user-defined function to parse and transform the input columns and return the columns of interest.
For example, suppose you are loading a JSON file with the following contents:
```json
{ "customer_id": 1313131, "some_intfield": 12 }
{ "customer_id": 77, "some_intfield": 7 }
{ "customer_id": 1234, "some_intfield": 56 }
```
You could define a user-defined function, `udf_parse_json()`, to parse the data as follows:

```sql
=> CREATE OR REPLACE FUNCTION udf_parse_json(value json)
     RETURNS TABLE (x int, y text)
   LANGUAGE plpgsql AS $$
   BEGIN
     RETURN QUERY
     SELECT ((value->>'customer_id')::int), ((value->>'some_intfield')::text);
   END $$;
```
This function returns the two fields in each JSON record, casting the fields to integer and text, respectively.
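To sanity-check the function before referencing it in a mapping, you can invoke it directly in psql; the literal shown is one of the sample records above:

```sql
SELECT * FROM udf_parse_json('{ "customer_id": 77, "some_intfield": 7 }'::json);
-- returns one row: x = 77, y = '7'
```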
An example `MAPPING` for file data in a JSON-format `FILE:INPUT:COLUMNS` named `jdata` follows:
```yaml
MAPPING:
  cust_id: (jdata->>'customer_id')
  field2: ((jdata->>'some_intfield') * .075)::decimal
  j1, j2: (udf_parse_json(jdata)).*
```
The Tanzu Greenplum table definition for this example scenario is:

```sql
=> CREATE TABLE t1map( cust_id int, field2 decimal(7,2), j1 int, j2 text );
```
You must pre-create the Greenplum table before you load data into Tanzu Greenplum. You use the `FILE:OUTPUT:SCHEMA` and `TABLE` load configuration file parameters to identify the schema and table names.
The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.
The name and data type that you specify for a column of the target Tanzu Greenplum table must match the name and data type of the related data element. If you have defined a column mapping, the name of the Tanzu Greenplum column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.
The `CREATE TABLE` command for the target Tanzu Greenplum table receiving the data defined in the `loadfromfile2.yaml` file presented in the Constructing the filesource.yaml Configuration File section follows:

```sql
testdb=# CREATE TABLE payables.expenses2( id int8, cname text, fname text );
```