GPSS load configuration file for a File data source (version 3).

Synopsis

version: v3
targets:
- gpdb
    host: <host>
    port: <greenplum_port>
    user: <user_name>
    password: <password>
    database: <db_name>
    work_schema: <work_schema_name>
    error_limit: <num_errors> | <percentage_errors>
    filter_expression: <filter_string>
    tables:
      - table: <table_name>
        schema: <schema_name>
        mode:
          # specify a single mode property block (described below)
          insert: {}
          update:
            <mode_specific_property>: <value>
            ...
          merge:
            <mode_specific_property>: <value>
            ...
        mapping:
          <target_column_name> : <source_column_name> | <expression>
          ...
      ...
sources:
- file:
    uri:
      - <file_path>
      ...
    content:
      <data_format>:
        <column_spec>
        <other_props>
    task:
      prepare_statement: <udf_or_sql_to_run>
      teardown_statement: <udf_or_sql_to_run>
    meta:
      json:
        column:
          name: meta
          type: json
option:
  schedule:
    max_retries: <num_retries>
    retry_interval: <retry_time>

Where the mode_specific_propertys that you can specify for update and merge mode follow:

update:
  match_columns: [<match_column_names>]
  order_columns: [<order_column_names>]
  update_columns: [<update_column_names>]
  update_condition: <update_condition>
merge:
  match_columns: [<match_column_names>]
  update_columns: [<update_column_names>]
  order_columns: [<order_column_names>]
  update_condition: <update_condition>
  delete_condition: <delete_condition>

Where data_format, column_spec, and other_props are one of the following blocks (data source-specific):

avro:
  source_column_name: <column_name>
  schema_url: <http://schemareg_host:schemareg_port> %, ...%
  bytes_to_base64: <boolean>
binary:
  source_column_name: <column_name>
csv:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delim_char>
  quote: <quote_char>
  null_string: <nullstr_val>
  escape: <escape_char>
  force_not_null: <columns>
  fill_missing_fields: <boolean>
custom:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  name: <formatter_name>
  options:
    - <optname>=<optvalue>
    ...
delimited:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delimiter_string>
json:
  column:
    name: <column_name>
    type: json | jsonb
  is_jsonl: <boolean>
  newline: <newline_str>

Description

Note: Version 3 of the GPSS load configuration file is different in both content and format than previous versions of the file. Certain symbols used in the GPSS version 1 and 2 configuration file reference page syntax have different meanings in version 3 syntax:

  • Brackets [] are literal and are used to specify a list in version 3. They are no longer used to signify the optionality of a property.
  • Curly braces {} are literal and are used to specify YAML mappings in version 3 syntax. They are no longer used with the pipe symbol (|) to identify a list of choices.

You specify the configuration properties for a Greenplum Streaming Server (GPSS) file load job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load commands. There are three types of configuration properties in this file - those that identify the Greenplum Database connection and target table, properties specific to the file data source that you will load into Greenplum, and job-related properties.

This reference page uses the name filesource-v3.yaml to refer to this file; you may choose your own name for the file.

The gpsscli utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.

Keywords and Values

version Property

version: v3
The version of the configuration file. You must specify version: v3.

targets:gpdb Properties

host: host
The host name or IP address of the Greenplum Database master host.
port: greenplum_port
The port number of the Greenplum Database server on the master host.
user: user_name
The name of the Greenplum Database user/role. This user_name must have permissions as described in the Configuring Greenplum Database Role Privileges.
password: password
The password for the Greenplum Database user/role.
database: db_name
The name of the Greenplum database.
work_schema: work_schema_name
The name of the Greenplum Database schema in which GPSS creates internal tables. The default work_schema_name is public.
error_limit: num_errors | percentage_errors
The error threshold, specified as either an absolute number or a percentage. GPSS stops running the job when this limit is reached.
filter_expression: filter_string
The filter to apply to the input data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more source value, key, or meta column names.
tables:

The Greenplum Database tables, and the data that GPSS will load into each.

table: table_name
The name of the Greenplum Database table into which GPSS loads the data.
schema: schema_name
The name of the Greenplum Database schema in which table_name resides. Optional, the default schema is the public schema.
mode:
The table load mode; insert, merge, or update. The default mode is insert.

Note: update and merge are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.

insert:
Inserts source data into Greenplum.
update:
Updates the target table columns that are listed in update_columns when the input columns identified in match_columns match the named target table columns and the optional update_condition is true.
merge:

Inserts new rows and updates existing rows when:

  • columns are listed in update_columns,
  • the match_columns target table column values are equal to the input data, and
  • an optional update_condition is specified and met.

    Deletes rows when:
  • the match_columns target table column values are equal to the input data, and
  • an optional delete_condition is specified and met.

    New rows are identified when the match_columns value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the match_columns and update_columns. If there are multiple new match_columns values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specify order_columns, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
mode_property_name: value
The name to value mapping for a mode property. Each mode supports one or more of the following properties as specified in the Synopsis.
match_columns: [match_column_names]
A comma-separated list that specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
Required when mode is merge or update.
order_columns: [order_column_names]
A comma-separated list that specifies the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch, order_columns is used with match_columns to determine the input row with the largest value; GPSS uses that row to write/update the target.
Optional. May be specified in merge mode to sort the input data rows.
update_columns: [update_column_names]
A column-sparated list that specifies the column(s) to update for the rows that meet the match_columns criteria and the optional update_condition.
Required when mode is merge or update.
update_condition: update_condition
Specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a merge). Optional.
delete_condition: delete_condition
In merge mode, specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met for GPSS to delete rows in the target table that meet the match_columns criteria. Optional.
mapping:
Optional. Overrides the default source-to-target column mapping.

Note: When you specify a mapping, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide a mapping block.

target_column_name: source_column_name | expression
target_column_name specifies the target Greenplum Database table column name. GPSS maps this column name to the source column name specified in source_column_name, or to an expression. When you specify an expression, you may provide a value expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.

sources:file: Options

uri

The path to the file.

file_path
A URL identifying a file or files to be loaded. You can specify wildcards in any element of the path. To load all files in a directory, specify dirname/*.
content:
The file type, field names, and type-specific properties of the file data. You must specify all data elements in the order in which they appear in the file.
column_spec

The source to Greenplum column mapping. The supported column specification differs for different data formats as described below.

The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in source_column_name, column:name, or columns:name with a column name in the target Greenplum Database table. You can override the default mapping by specifying a mapping: block.
data_format

The format of the key or value data. You may specify a data_format of avro, binary, csv, custom, delimited, or json for the key and value, with some restrictions.

avro
When you specify the avro data format for a key or value, GPSS reads the data into a single json-type column. You may specify a schema registery location and whether or not you want GPSS to convert bytes fields into base64-encoded strings.
source_column_name: column_name
The name of the single json-type column into which GPSS reads the key or value data.
schema_url: schemareg_host:schemareg_port
When you specify the avro format and the Avro schema of the JSON data that you want to load is registered in the Confluent Schema Registry, you must identify the host name and port number of each Confluent Schema Registry server in your Kafka cluster. You may specify more than one address, and at least one of the addresses must be legal.
bytes_to_base64: boolean
When true, GPSS converts Avro bytes fields into base64-encoded strings. The default value is false, GPSS does not perform the conversion.
binary
When you specify the binary data format, GPSS reads the data into a single bytea-type column.
source_column_name: column_name
The name of the single bytea-type column into which GPSS reads the key or value data.
csv
When you specify the csv data format, GPSS reads the data into the list of columns that you specify. The file content cannot contain line ending characters (CR and LF).
columns:
A set of column name/type mappings. The value [] specifies all columns.
name: column_name
The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
type: column_data_type
The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
delimiter: delim_char
Specifies a single ASCII character that separates columns within each message or row of data. The default delimiter is a comma ( ,).
quote: quote_char
Specifies the quotation character. Because GPSS does not provide a default value for this property, you must specify a value.
null_string: nullstr_val
Specifies the string that represents the null value. Because GPSS does not provide a default value for this property, you must specify a value.
force_not_null: columns
Specifies a comma-separated list of column names to process as though each column were quoted and hence not a NULL value. For the default null_string (nothing between two delimiters), missing values are evaluated as zero-length strings.
fill_missing_fields: boolean
Specifies the action of GPSS when it reads a row of data that has missing trailing field values (the row has missing data fields at the end of a line or row). The default value is false, GPSS returns an error when it encounters a row with missing trailing field values.
If set to true, GPSS sets missing trailing field values to NULL. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still generate an error.
newline: newline_str
Specifies the string that represents a new line. + GPSS does not specify a default value.
custom
When you specify the custom data format, GPSS uses the custom formatter that you specify to process the input data before writing it to Greenplum Database.
columns:
A set of column name/type mappings. The value [] specifies all columns.
name: column_name
The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
type: column_data_type
The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
name: formatter_name
When you specify the custom data format, formatter_name is required and must identify the name of the formatter user-defined function that GPSS should use when loading the data.
options:
A set of function argument name=value pairs.
optname=optvalue
The name and value of the set of arguments to pass into the formatter_name UDF.
delimited
When you specify the delimited data format, GPSS reads the data into the list of columns that you specify. You must specify the data delimiter.
columns:
A set of column name/type mappings. The value [] specifies all columns.
name: column_name
The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
type: column_data_type
The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
delimiter: delimiter_string
Optional. When you specify the delimited data format, delimiter_string is required and must identify the data element delimiter. delimiter_string may be a multi-byte value, and up to 32 bytes in length. It may not contain quote and escape characters.
json
When you specify the json data format, GPSS can read the data as a single JSON object or as a single JSON record per line.
column:
A single column name/type mapping.
name: column_name
The name of the key or value column. column_name must match the column name of the target Greenplum Database table.
type: json | jsonb | gp_jsonb (Beta) | gp_json (Beta)
The data type of the column.
is_jsonl: boolean
Identifies whether or not GPSS reads the JSON data as a single object or single-record-per-line. The default is false, GPSS reads the JSON data as a single object.
newline: newline_str
A string that specifies the new line character(s) that end each JSON record. The default newline is "\n".
task:

File data source task properties.

prepare_statement: udf_or_sql_to_run
A user-defined function or SQL command(s) that you want GPSS to run before it executes the job. The default is null, no command to execute.
teardown_statement: udf_or_sql_to_run
A user-defined function or SQL command(s) that you want GPSS to run after the job completes. The default is null, no command to execute.
meta:

The data type and field name of the file meta data. meta: must specify the json or jsonb (Greenplum 6 only) data format, and a single json-type column.

You can load this property into the target table with a mapping, or use the property in the update or merge criteria for a load operation

option: Properties

schedule:

Controls the frequency and interval of restarting failed jobs.

retry_interval: retry_time
The period of time that GPSS waits before retrying the job. You can specify the time interval in day ( d), hour ( h), minute ( m), second ( s), or millisecond ( ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
max_retries: num_retries
The maximum number of times that GPSS attempts to retry the job. The default is 0, do not retry. If you specify a negative value, GPSS retries the job indefinitely.

Notes

If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" (c1 text);

Your YAML configuration file would refer to the table name as:

targets:
- gpdb
  - tables
    table: '"MyTable"'

You can specify backslash escape sequences in the CSV delimiter, quote, and escape options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.

Examples

Submit a job to load data from an Avro file as defined in the version 3 load configuration file named loadfromfile_v3.yaml:

$ gpsscli submit loadfromfile_v3.yaml

Example loadfromfile_v3.yaml configuration file:

version: v3
targets:
- gpdb:
    host: mdw-1
    port: 15432
    user: gpadmin
    password: changeme
    database: testdb
    work_schema: public
    error_limit: "25"
    tables:
      - table: orders
        schema: public
        mode:
          merge:
            match_columns: [pk]
            order_columns: [seq]
            delete_condition: flag = 0
        mapping:
          data: data
          pk: pk
          seq: seq
sources:
- file:
    uri:
      - file:///tmp/data.csv
    content:
      csv:
        columns:
          - name: pk
            type: int
          - name: seq
            type: int
          - name: data
            type: text
          - name: flag
            type: int
option:
  schedule: {}

See Also

gpsscli load, gpsscli submit

check-circle-line exclamation-circle-line close-line
Scroll to top icon