GPSS load configuration file (version 3).
version: v3
targets:
- gpdb:
host: <host>
port: <greenplum_port>
user: <user_name>
password: <password>
database: <db_name>
work_schema: <work_schema_name>
error_limit: <num_errors> | <percentage_errors>
filter_expression: <filter_string>
tables:
- table: <table_name>
schema: <schema_name>
mode:
# specify a single mode property block (described below)
insert: {}
update:
<mode_specific_property>: <value>
...
merge:
<mode_specific_property>: <value>
...
mapping:
<target_column_name> : <source_column_name> | <expression>
...
filter: <output_filter_string>
...
sources:
- <DATASOURCE>:
<DATASOURCE_specific_properties>
content:
<data_format>:
<column_spec>
<other_props>
option:
schedule:
max_retries: <num_retries>
retry_interval: <retry_time>
running_duration: <run_time>
auto_stop_restart_interval: <restart_time>
max_restart_times: <num_restarts>
quit_at_eof_after: <clock_time>
alert:
command: <command_to_run>
workdir: <directory>
timeout: <alert_time>
Where the mode_specific_propertys that you can specify for update
and merge
mode follow:
update:
match_columns: [<match_column_names>]
order_columns: [<order_column_names>]
update_columns: [<update_column_names>]
update_condition: <update_condition>
merge:
match_columns: [<match_column_names>]
update_columns: [<update_column_names>]
order_columns: [<order_column_names>]
update_condition: <update_condition>
delete_condition: <delete_condition>
Where data_format, column_spec, and other_props are one of the following blocks (data source-specific):
avro:
source_column_name: <column_name>
schema_url: <http://schemareg_host:schemareg_port> %, ...%
bytes_to_base64: <boolean>
binary:
source_column_name: <column_name>
csv:
columns:
- name: <column_name>
type: <column_data_type>
...
delimiter: <delim_char>
quote: <quote_char>
null_string: <nullstr_val>
escape: <escape_char>
force_not_null: <columns>
fill_missing_fields: <boolean>
custom:
columns:
- name: <column_name>
type: <column_data_type>
...
name: <formatter_name>
options:
- <optname>=<optvalue>
...
delimited:
columns:
- name: <column_name>
type: <column_data_type>
...
delimiter: <delimiter_string>
eol_prefix: <prefix_string>
quote: <quote_char>
escape: <escape_char>
json:
column:
name: <column_name>
type: json | jsonb
is_jsonl: <boolean>
newline: <newline_str>
And where you may specify any property value with a template variable that GPSS substitutes at runtime using the following syntax:
<property:> {{<template_var>}}
NoteVersion 3 of the GPSS load configuration file is different in both content and format than previous versions of the file. Certain symbols used in the GPSS version 1 and 2 configuration file reference page syntax have different meanings in version 3 syntax:
- Brackets
[]
are literal and are used to specify a list in version 3. They are no longer used to signify the optionality of a property.- Curly braces
{}
are literal and are used to specify YAML mappings in version 3 syntax. They are no longer used with the pipe symbol (|
) to identify a list of choices.
You specify the configuration properties for a Greenplum Streaming Server (GPSS) job in a YAML-formatted configuration file that you provide to the gpsscli submit
or gpsscli load
command. There are three types of configuration information in this file - target Greenplum Database connection and data import properties, properties specific to the data source from which you will load data into Greenplum, and properties specific to the GPSS job.
This reference page uses the name gpsscli-v3.yaml
to refer to this file; you may choose your own name for the file.
The gpsscli
utility processes the YAML configuration file in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.
You can use the gpsscli convert
command to convert a V2 load configuration file to V3 syntax.
version Property
version: v3
.
targets:gpdb Properties
public
.
true
, GPSS loads the message. If the filter evaluates to
false
, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more source value, key, or meta column names.
The Greenplum Database tables, and the data that GPSS will load into each.
public
schema.
insert
,
merge
, or
update
. The default mode is
insert
.
Note
update
andmerge
are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
update_columns
when the input columns identified in
match_columns
match the named target table columns and the optional
update_condition
is true.
update_columns
,match_columns
target table column values are equal to the input data, andupdate_condition
is specified and met. Deletes rows when:match_columns
target table column values are equal to the input data, anddelete_condition
is specified and met.New rows are identified when the match_columns
value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the match_columns
and update_columns
. If there are multiple new match_columns
values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specify order_columns
, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
mode
supports one or more of the following properties as specified in the Synopsis.
mode
is
merge
or
update
.
order_columns
is used with
match_columns
to determine the input row with the largest value; GPSS uses that row to write/update the target.
merge
mode
to sort the input data rows.
match_columns
criteria and the optional
update_condition
.
mode
is
merge
or
update
.
WHERE
clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a
merge
). Optional.
merge
mode
, specifies a boolean condition, similar to that which you would declare in a
WHERE
clause, that must be met for GPSS to delete rows in the target table that meet the
match_columns
criteria. Optional.
NoteWhen you specify a
mapping
, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide amapping
block.
target\_column\_name: source\_column\_name \| expression
: target\_column\_name specifies the target Greenplum Database table column name. GPSS maps this column name to the source column name specified in source\_column\_name, or to an expression. When you specify an expression, you may provide a value expression that you would specify in the `SELECT` list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
true
, GPSS loads the message. If the filter evaluates to
false
, the message is dropped. output_filter_string must be a valid SQL conditional expression and may reference one or more
META
or
VALUE
column names.
sources: Properties
The data source.
file
,
kafka
,
rabbitmq
, and
s3
data sources.
file
,
kafka
,
rabbitmq
, or
s3
data source; refer to
filesource-v3.yaml (Beta),
gpkafka-v3.yaml (Beta),
rabbitmq-v3.yaml (Beta), and
s3source-v3.yaml (Beta) for version 3 configuration file format and properties for these sources.
option: Properties
Controls the frequency and interval of restarting jobs.
d
), hour (
h
), minute (
m
), second (
s
), or millisecond (
ms
) integer units; do not mix units. The default retry interval is
5m
(5 minutes).
running_duration
.
running_duration
. The default is 0, do not restart the job.
clock_time
, even when GPSS encounters an EOF.
Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).
$GPSSJOB_NAME
,
$GPSSJOB_STATUS
, and
$GPSSJOB_DETAIL
.
d
), hour (
h
), minute (
m
), or second (
s
) integer units; do not mix units. The default alert timeout is
-1s
(no timeout).
GPSS supports using template variables to specify property values in the load configuration file.
You specify a template variable value in the load configuration file as follows:
<property>: {{<template_var>}}
For example:
max_retries: {{numretries}}
GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value>
option to the gpsscli dryrun
, gpsscli submit
, or gpsscli load
command.
For example, if the command line specifies:
--property numretries=10
GPSS substitutes occurrences of {{numretries}}
in the load configuration file with the value 10
before submitting the job, and uses that value while the job is running.
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:
CREATE TABLE "MyTable" (c1 text);
Your YAML configuration file would refer to the table name as:
targets:
- gpdb:
tables:
- table: '"MyTable"'
You can specify backslash escape sequences in the CSV delimiter
, quote
, and escape
options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x
). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.
Submit a job to load data into Greenplum Database as defined in the v3 load configuration file named loadit_v3.yaml
:
$ gpsscli submit loadit_v3.yaml
Example Greenplum Database configuration properties in loadit_v3.yaml
:
version: v3
targets:
- gpdb:
host: mdw-1
port: 5432
user: gpadmin
password: changeme
database: testdb
work_schema: public
error_limit: "25"
tables:
- table: orders
schema: public
mode:
insert {}
sources:
- kafka:
<kafka_specific_properties>
gpsscli convert, gpsscli submit, filesource-v3.yaml (Beta), gpkafka-v3.yaml (Beta), rabbitmq-v3.yaml (Beta), s3source-v3.yaml (Beta)