Load configuration file for a GPSS file data source.
DATABASE: <db_name>
USER: <user_name>
PASSWORD: <password>
HOST: <coordinator_host>
PORT: <greenplum_port>
VERSION: 2
FILE:
INPUT:
SOURCE:
{ URL: <file_path> |
EXEC:
COMMAND: <command_to_run>
WORKDIR: <directory>
STDERR_AS_FAIL: <boolean> }
VALUE:
[COLUMNS:
- NAME: <column_name>
TYPE: <column_data_type>
[ ... ]]
FORMAT: <value_data_format>
[AVRO_OPTION:
BYTES_TO_BASE64: <boolean>]
[CSV_OPTION:
[DELIMITER: <delim_char>]
[QUOTE: <quote_char>]
[NULL_STRING: <nullstr_val>]
[ESCAPE: <escape_char>]
[FORCE_NOT_NULL: <columns>]
[FILL_MISSING_FIELDS: <boolean>]]
[NEWLINE: <newline_str>]]
[DELIMITED_OPTION:
[DELIMITER: <delimiter_string>]
[EOL_PREFIX: <prefix_string>]
[QUOTE: <quote_char>]
[ESCAPE: <escape_char>]]
[JSONL_OPTION:
[NEWLINE: <newline_str>]]
[META:
COLUMNS:
- NAME: <meta_column_name>
TYPE: { json | jsonb }
FORMAT: json
[FILTER: <filter_string>]
[ENCODING: <char_set>]
[ERROR_LIMIT: { <num_errors> | <percentage_errors> }]
OUTPUT:
[SCHEMA: <output_schema_name>]
TABLE: <table_name>
[FILTER: <output_filter_string>]
[MODE: <mode>]
[MATCH_COLUMNS:
- <match_column_name>
[ ... ]]
[ORDER_COLUMNS:
- <order_column_name>
[ ... ]]
[UPDATE_COLUMNS:
- <update_column_name>
[ ... ]]
[UPDATE_CONDITION: <update_condition>]
[DELETE_CONDITION: <delete_condition>]
TASK:
PREPARE_SQL: <udf_or_sql_command_to_run>
TEARDOWN_SQL: <udf_or_sql_command_to_run>
[SCHEDULE:
RETRY_INTERVAL: <retry_time>
MAX_RETRIES: <num_retries>
RUNNING_DURATION: <run_time>
AUTO_STOP_RESTART_INTERVAL: <restart_time>
MAX_RESTART_TIMES: <num_restarts>
QUIT_AT_EOF_AFTER: <clock_time>]
[ALERT:
COMMAND: <command_to_run>
WORKDIR: <directory>
TIMEOUT: <alert_time>]
Where you may specify any property value with a template variable that GPSS substitutes at runtime using the following syntax:
<PROPERTY:> {{<template_var>}}
You specify the configuration parameters for a Greenplum Streaming Server (GPSS) file load job in a YAML-formatted configuration file that you provide to the gpsscli submit
or gpsscli load
commands. There are two types of configuration parameters in this file - those that identify the Greenplum Database connection and target table, and parameters specific to the file data source that you will load into Greenplum.
This reference page uses the name filesource.yaml
to refer to this file; you may choose your own name for the file.
The gpsscli
utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant, and keywords are case-sensitive.
Greenplum Database Options
FILE:INPUT: Options
The file input configuration parameters. You must provide exactly one of URL
or an EXEC
block.
dirname/*
.
stdout
GPSS loads into Greenplum Database.
COMMAND: command_to_run
: The program that the GPSS server runs on the local host, including the arguments. The command must be executable by GPSS, and can include pipe and quote characters.
WORKDIR: directory
: The working directory for the child process. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.
STDERR_AS_FAIL: boolean
: Specifies whether data written to `stderr` constitutes failure of the command, regardless of the command return value. The default value is `false`; GPSS does not consider writing to `stderr` a failure, and will write a message to the GPSS log file. When true, GPSS treats any output to `stderr` as a failure, and rolls back the operation.
The field names, types, and format of the file data. You must specify all data elements in the order in which they appear in the file.
: The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in COLUMNS:NAME
with a column name in the target Greenplum Database TABLE
. You can override the default mapping by specifying a MAPPING
block.
FORMAT
of
avro
,
binary
,
csv
,
json
, or
jsonl
for the value data, with some restrictions.
avro
data format, you must define only a single
json
type column in
COLUMNS
. If the schema is registered in a Confluent Schema Registry, you must also provide the
AVRO_OPTION
.
binary
data format, you must define only a single
bytea
type column in
COLUMNS
.
csv
data format, the message content cannot contain line ending characters (CR and LF). You may also choose to provide
CSV_OPTION
s.
FORMAT: csv
, you must not provide a
META
block.
delimited
data format, the delimited message content may contain a multi-byte delimiter. You must provide
DELIMITED_OPTION
s.
json
data format, you must define only a single
json
type column in
COLUMNS
.
jsonl
data format, you may provide a
JSONL_OPTION
to define a newline character.
true
, GPSS converts Avro
bytes
fields into base64-encoded strings. The default value is
false
, GPSS does not perfrom the conversion.
FORMAT: csv
, you may also provide the following options:
,
).
null_string
(nothing between two delimiters), missing values are evaluated as zero-length strings.
false
, GPSS returns an error when it encounters a row with missing trailing field values.
true
, GPSS sets missing trailing field values to
NULL
. Blank rows, fields with a
NOT NULL
constraint, and trailing delimiters on a line will still generate an error.
FORMAT: delimited
, you may also provide the following options:
delimited
data format, delimiter_string is required and must identify the data element delimiter. delimiter_string may be a multi-byte value, and up to 32 bytes in length. It may not contain quote and escape characters.
\n
) that indicates the end of a row. The default prefix is empty.
FORMAT: jsonl
, you may choose to provide the
JSONL_OPTION
properties.
"\n"
.
META
must specify a single
json
or
jsonb
(Greenplum 6 only) type column and
FORMAT: json
. The available meta data for a file is a single
text
-type property named
filename
. You can load this property into the target table with a
MAPPING
, or use the property in the update or merge criteria for a load operation.
true
, GPSS loads the message. If the filter evaluates to
false
, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more
META
or
VALUE
column names.
csv
,
custom
,
delimited
, or
json
format. GPSS supports the character sets identified in
Character Set Support in the Greenplum Database documentation.
ERROR_LIMIT
is zero; GPSS deactivates error logging and stops the load operation when it encounters the first error. Due to a limitation of the Greenplum Database external table framework, GPSS does not accept
ERROR_LIMIT: 1
.
FILE:OUTPUT: Options
public
schema.
true
, GPSS loads the message. If the filter evaluates to
false
, the message is dropped. output_filter_string must be a valid SQL conditional expression and may reference one or more
META
or
VALUE
column names.
The table load mode. Valid mode values are INSERT
, MERGE
, or UPDATE
. The default value is INSERT
.
UPDATE
- Updates the target table columns that are listed in UPDATE_COLUMNS
when the input columns identified in MATCH_COLUMNS
match the named target table columns and the optional UPDATE_CONDITION
is true.
UPDATE
is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
MERGE
- Inserts new rows and updates existing rows when:
UPDATE_COLUMNS
,MATCH_COLUMNS
target table column values are equal to the input data, andUPDATE_CONDITION
is specified and met.Deletes rows when:
MATCH_COLUMNS
target table column values are equal to the input data, andDELETE_CONDITION
is specified and met.New rows are identified when the MATCH_COLUMNS
value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the MATCH_COLUMNS
and UPDATE_COLUMNS
. If there are multiple new MATCH_COLUMNS
values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specify ORDER_COLUMNS
, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
MERGE
is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
Required if MODE
is MERGE
or UPDATE
.
Optional. May be specified in MERGE
MODE
to sort the input data rows.
ORDER_COLUMNS
is used with
MATCH_COLUMNS
to determine the input row with the largest value; GPSS uses that row to write/update the target.
Required if MODE
is MERGE
or UPDATE
.
MATCH_COLUMNS
criteria and the optional
UPDATE_CONDITION
.
WHERE
clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a
MERGE
).
MERGE
MODE
, specifies a boolean condition, similar to that which you would declare in a
WHERE
clause, that must be met for GPSS to delete rows in the target table that meet the
MATCH_COLUMNS
criteria.
Optional. Overrides the default source-to-target column mapping. GPSS supports two mapping syntaxes.
NoteWhen you specify a
MAPPING
, ensure that you provide a mapping for all data value elements of interest. GPSS does not automatically match column names when you provide aMAPPING
.
COLUMNS:NAME
(source_column_name) or an expression. When you specify an expression, you may provide a value expression that you would specify in the
SELECT
list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
MAPPING
syntax, specify the target_column_name and {source_column_name | expression} as described above.
FILE:TASK: Options
Job SCHEDULE: Options
Controls the frequency and interval of restarting jobs.
d
), hour (
h
), minute (
m
), second (
s
), or millisecond (
ms
) integer units; do not mix units. The default retry interval is
5m
(5 minutes).
RUNNING_DURATION
.
RUNNING_DURATION
. The default is 0, do not restart the job.
clock_time
, even when GPSS encounters an EOF.
Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).
$GPSSJOB_NAME
,
$GPSSJOB_STATUS
, and
$GPSSJOB_DETAIL
.
d
), hour (
h
), minute (
m
), or second (
s
) integer units; do not mix units. The default alert timeout is
-1s
(no timeout).
GPSS supports using template variables to specify property values in the load configuration file.
You specify a template variable value in the load configuration file as follows:
<PROPERTY>: {{<template_var>}}
For example:
MAX_RETRIES: {{numretries}}
GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value>
option to the gpsscli dryrun
, gpsscli submit
, or gpsscli load
command.
For example, if the command line specifies:
--property numretries=10
GPSS substitutes occurrences of {{numretries}}
in the load configuration file with the value 10
before submitting the job, and uses that value while the job is running.
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the filesource.yaml
configuration file. For example, if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
Your filesource.yaml
YAML configuration file would refer to the above table and column names as:
COLUMNS:
- name: '"MyColumn"'
type: text
OUTPUT:
TABLE: '"MyTable"'
You can specify backslash escape sequences in the CSV DELIMITER
, QUOTE
, and ESCAPE
options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x
). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.
Submit a job to load data from an Avro file as defined in the version 2 load configuration file named loadfromfile.yaml
:
$ gpsscli submit loadfromfile.yaml
Example loadfromfile.yaml
configuration file:
DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 15432
VERSION: 2
FILE:
INPUT:
SOURCE:
URL: file:///tmp/file.avro
VALUE:
COLUMNS:
- NAME: value
TYPE: json
FORMAT: avro
META:
COLUMNS:
- NAME: meta
TYPE: json
FORMAT: json
FILTER: (value->>'x')::int < 10
ERROR_LIMIT: 25
OUTPUT:
SCHEMA: gpschema
TABLE: gptable
MODE: INSERT
MAPPING:
- NAME: a
EXPRESSION: (value->>'x')::int
- NAME: b
EXPRESSION: (value->>'y')::text
- NAME: c
EXPRESSION: (meta->>'filename')::text
SCHEDULE:
RETRY_INTERVAL: 500ms
MAX_RETRIES: 2