filesource-v2.yaml

Load configuration file for a GPSS file data source.

Synopsis

DATABASE: <db_name>
USER: <user_name>
PASSWORD: <password>
HOST: <coordinator_host>
PORT: <greenplum_port>
VERSION: 2
FILE:
    INPUT:
       SOURCE:
         { URL: <file_path> |
         EXEC:
           COMMAND: <command_to_run>
           WORKDIR: <directory>
           STDERR_AS_FAIL: <boolean> }
       VALUE:
         [COLUMNS:
            - NAME: <column_name>
              TYPE: <column_data_type>
            [ ... ]]
         FORMAT: <value_data_format>
         [AVRO_OPTION:
            BYTES_TO_BASE64: <boolean>]
         [CSV_OPTION:
            [DELIMITER: <delim_char>]
            [QUOTE: <quote_char>]
            [NULL_STRING: <nullstr_val>]
            [ESCAPE: <escape_char>]
            [FORCE_NOT_NULL: <columns>]
            [FILL_MISSING_FIELDS: <boolean>]
            [NEWLINE: <newline_str>]]
         [DELIMITED_OPTION:
            DELIMITER: <delimiter_string>
            [EOL_PREFIX: <prefix_string>]
            [QUOTE: <quote_char>]
            [ESCAPE: <escape_char>]]
         [JSONL_OPTION:
            [NEWLINE: <newline_str>]]
       [META:
         COLUMNS:
            - NAME: <meta_column_name>
              TYPE: { json | jsonb }
         FORMAT: json]
       [FILTER: <filter_string>]
       [ENCODING: <char_set>]
       [ERROR_LIMIT: { <num_errors> | <percentage_errors> }]
    OUTPUT:
       [SCHEMA: <output_schema_name>]
       TABLE: <table_name>
       [FILTER: <output_filter_string>]
       [MODE: <mode>]
       [MATCH_COLUMNS:
          - <match_column_name>
          [ ... ]]
       [ORDER_COLUMNS:
          - <order_column_name>
          [ ... ]]
       [UPDATE_COLUMNS:
          - <update_column_name>
          [ ... ]]
       [UPDATE_CONDITION: <update_condition>]
       [DELETE_CONDITION: <delete_condition>]
       [MAPPING:
          - NAME: <target_column_name>
            EXPRESSION: { <source_column_name> | <expression> }
          [ ... ]
            |
          <target_column_name>: { <source_column_name> | <expression> }
          [ ... ]]
    TASK:
       PREPARE_SQL: <udf_or_sql_command_to_run>
       TEARDOWN_SQL: <udf_or_sql_command_to_run>
[SCHEDULE:
   RETRY_INTERVAL: <retry_time>
   MAX_RETRIES: <num_retries>
   RUNNING_DURATION: <run_time>
   AUTO_STOP_RESTART_INTERVAL: <restart_time>
   MAX_RESTART_TIMES: <num_restarts>
   QUIT_AT_EOF_AFTER: <clock_time>]
[ALERT:
   COMMAND: <command_to_run>
   WORKDIR: <directory>
   TIMEOUT: <alert_time>]

You may specify any property value with a template variable, which GPSS substitutes at runtime, using the following syntax:

<PROPERTY>: {{<template_var>}}

Description

You specify the configuration parameters for a Greenplum Streaming Server (GPSS) file load job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load commands. There are two types of configuration parameters in this file: those that identify the Greenplum Database connection and target table, and those specific to the file data source that you will load into Greenplum Database.

This reference page uses the name filesource.yaml to refer to this file; you may choose your own name for the file.

The gpsscli utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant, and keywords are case-sensitive.

Keywords and Values

Greenplum Database Options

DATABASE: db_name: The name of the Greenplum database.
USER: user_name: The name of the Greenplum Database user/role. This user_name must have permissions as described in Configuring Greenplum Database Role Privileges.
PASSWORD: password: The password for the Greenplum Database user/role.
HOST: coordinator_host: The host name or IP address of the Greenplum Database coordinator host.
PORT: greenplum_port: The port number of the Greenplum Database server on the coordinator host.
VERSION: 2: The version of the GPSS load configuration file. GPSS supports version 2 of this format for a file data source.

FILE:INPUT: Options

SOURCE:

The file input configuration parameters. You must provide exactly one of URL or an EXEC block.

URL: file_path: The URL identifying the file or files to be loaded. You can specify wildcards in any element of the path. To load all files in a directory, specify dirname/*.
EXEC: The execution options for the command whose stdout GPSS loads into Greenplum Database.

COMMAND: command_to_run
:   The program that the GPSS server runs on the local host, including the arguments. The command must be executable by GPSS, and can include pipe and quote characters.

WORKDIR: directory
:   The working directory for the child process. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.

STDERR_AS_FAIL: boolean
:   Specifies whether data written to `stderr` constitutes failure of the command, regardless of the command return value. The default value is `false`; GPSS does not consider writing to `stderr` a failure, and writes a message to the GPSS log file. When `true`, GPSS treats any output to `stderr` as a failure and rolls back the operation.
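
The two SOURCE alternatives described above might look like the following sketch. The directory, file names, and command are hypothetical and chosen only for illustration; the file: URL form follows the example at the end of this page.

```yaml
# Alternative 1: load every file in a directory that matches a wildcard.
SOURCE:
   URL: file:///tmp/data/*.csv
---
# Alternative 2: load whatever a command writes to stdout.
# The command, its arguments, and WORKDIR are hypothetical.
SOURCE:
   EXEC:
      COMMAND: "zcat orders.csv.gz | grep -v '^#'"
      WORKDIR: /tmp/data        # relative paths resolve against the GPSS server start directory
      STDERR_AS_FAIL: true      # any stderr output fails the command and rolls back the operation
```

The two fragments are separate alternatives; a real configuration file contains only one of them under FILE:INPUT:.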

VALUE:

The field names, types, and format of the file data. You must specify all data elements in the order in which they appear in the file.

COLUMNS:NAME: column_name: The name of a data value column. column_name must match the column name of the target Greenplum Database table.

The default source-to-target data mapping behavior of GPSS is to match a column name as defined in COLUMNS:NAME with a column name in the target Greenplum Database TABLE. You can override the default mapping by specifying a MAPPING block.

COLUMNS:TYPE: data_type: The data type of the column. You must specify a compatible data type for each data element and the associated Greenplum Database table column.
FORMAT: data_format: The format of the value data. You may specify a FORMAT of avro, binary, csv, delimited, json, or jsonl for the value data, with some restrictions.
avro: When you specify the avro data format, you must define only a single json type column in COLUMNS. If the schema is registered in a Confluent Schema Registry, you must also provide the AVRO_OPTION.
binary: When you specify the binary data format, you must define only a single bytea type column in COLUMNS.
csv: When you specify the csv data format, the message content cannot contain line ending characters (CR and LF). You may also choose to provide CSV_OPTIONs. When you specify FORMAT: csv, you must not provide a META block.
delimited: When you specify the delimited data format, the delimited message content may contain a multi-byte delimiter. You must provide DELIMITED_OPTIONs.
json: When you specify the json data format, you must define only a single json type column in COLUMNS.
jsonl: When you specify the jsonl data format, you may provide a JSONL_OPTION to define a newline character.
AVRO_OPTION:BYTES_TO_BASE64: boolean: When true, GPSS converts Avro bytes fields into base64-encoded strings. The default value is false; GPSS does not perform the conversion.
CSV_OPTION: When you specify FORMAT: csv, you may also provide the following options:
DELIMITER: delim_char: Specifies a single ASCII character that separates columns within each message or row of data. The default delimiter is a comma (,).
QUOTE: quote_char: Specifies the quotation character. Because GPSS does not provide a default value for this property, you must specify a value.
NULL_STRING: nullstr_val: Specifies the string that represents the null value. Because GPSS does not provide a default value for this property, you must specify a value.
ESCAPE: escape_char: Specifies the single character that is used for escaping data characters in the content that might otherwise be interpreted as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. Because GPSS does not provide a default value for this property, you must specify a value.
FORCE_NOT_NULL: columns: Specifies a comma-separated list of column names to process as though each column were quoted and hence not a NULL value. For the default null_string (nothing between two delimiters), missing values are evaluated as zero-length strings.
FILL_MISSING_FIELDS: boolean: Specifies the action of GPSS when it reads a row of data that has missing trailing field values (the row has missing data fields at the end of a line or row). The default value is false; GPSS returns an error when it encounters a row with missing trailing field values. If set to true, GPSS sets missing trailing field values to NULL. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still generate an error.
NEWLINE: newline_str: Specifies the string that represents a new line. GPSS does not specify a default value.
DELIMITED_OPTION: When you specify FORMAT: delimited, you must provide DELIMITED_OPTION with at least a DELIMITER. The available options follow:
DELIMITER: delimiter_string: When you specify the delimited data format, delimiter_string is required and must identify the data element delimiter. delimiter_string may be a multi-byte value, and up to 32 bytes in length. It may not contain quote and escape characters.
EOL_PREFIX: prefix_string: Specifies the prefix before the end of line character (\n) that indicates the end of a row. The default prefix is empty.
QUOTE: quote_char: Specifies the single ASCII quotation character. The default quote character is empty. If you do not specify a quotation character, GPSS assumes that all columns are unquoted. If you do not specify a quotation character and do specify an escape character, GPSS assumes that all columns are unquoted and escapes the delimiter, end-of-line prefix, and escape itself. When you specify a quotation character, you must specify an escape character. GPSS reads any content between quote characters as-is, except for escaped characters.
ESCAPE: escape_char: Specifies the single ASCII character used to escape special characters (for example, the delimiter, end-of-line prefix, quote, or escape itself). The default escape character is empty. When you specify an escape character and do not specify a quotation character, GPSS escapes only the delimiter, end-of-line prefix, and escape itself. When you specify both an escape character and a quotation character, GPSS escapes only these characters.
JSONL_OPTION: Optional. When you specify FORMAT: jsonl, you may choose to provide the JSONL_OPTION properties.
NEWLINE: newline_str: A string that specifies the new line character(s) that end each JSON record. The default newline is "\n".
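
As an illustration of the VALUE options above, a csv block might be sketched as follows. The column names, data types, and option values are hypothetical; in particular, the escape character is an arbitrary choice assumed not to occur in the data.

```yaml
# Hypothetical three-column CSV file; columns are listed in file order.
VALUE:
   COLUMNS:
      - NAME: id
        TYPE: int
      - NAME: item
        TYPE: text
      - NAME: price
        TYPE: numeric
   FORMAT: csv
   CSV_OPTION:
      DELIMITER: "|"
      QUOTE: '"'
      NULL_STRING: "NULL"
      ESCAPE: "~"                 # assumed not to appear in the column data
      FILL_MISSING_FIELDS: true   # missing trailing fields become NULL
```

A delimited block for the same hypothetical columns could look like this. QUOTE and ESCAPE are omitted, so per the descriptions above GPSS treats all columns as unquoted.

```yaml
VALUE:
   COLUMNS:
      - NAME: id
        TYPE: int
      - NAME: item
        TYPE: text
      - NAME: price
        TYPE: numeric
   FORMAT: delimited
   DELIMITED_OPTION:
      DELIMITER: "||"       # multi-byte delimiter, at most 32 bytes
      EOL_PREFIX: "@@"      # each row ends with this prefix followed by \n
```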

META:

The field name, type, and format of the file meta data. META must specify a single json or jsonb (Greenplum 6 only) type column and FORMAT: json. The available meta data for a file is a single text-type property named filename. You can load this property into the target table with a MAPPING, or use the property in the update or merge criteria for a load operation.

FILTER: filter_string

The filter to apply to the input data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more META or VALUE column names.

ENCODING: char_set

The source data encoding. You can specify an encoding character set when the source data is of the csv, custom, delimited, or json format. GPSS supports the character sets identified in Character Set Support in the Greenplum Database documentation.

ERROR_LIMIT: { num_errors | percentage_errors }

The error threshold, specified as either an absolute number or a percentage. GPSS stops the load operation when this limit is reached. The default ERROR_LIMIT is zero; GPSS deactivates error logging and stops the load operation when it encounters the first error. Due to a limitation of the Greenplum Database external table framework, GPSS does not accept ERROR_LIMIT: 1.

FILE:OUTPUT: Options

SCHEMA: output_schema_name

The name of the Greenplum Database schema in which table_name resides. Optional. The default schema is the public schema.

TABLE: table_name

The name of the Greenplum Database table into which GPSS loads the file data.

FILTER: output_filter_string

The filter to apply to the output data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. output_filter_string must be a valid SQL conditional expression and may reference one or more META or VALUE column names.

MODE: mode

The table load mode. Valid mode values are INSERT, MERGE, or UPDATE. The default value is INSERT.

UPDATE - Updates the target table columns that are listed in UPDATE_COLUMNS when the input columns identified in MATCH_COLUMNS match the named target table columns and the optional UPDATE_CONDITION is true.

UPDATE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.

MERGE - Inserts new rows and updates existing rows when:

columns are listed in UPDATE_COLUMNS,
the MATCH_COLUMNS target table column values are equal to the input data, and
an optional UPDATE_CONDITION is specified and met.

Deletes rows when:

the MATCH_COLUMNS target table column values are equal to the input data, and
an optional DELETE_CONDITION is specified and met.

New rows are identified when the MATCH_COLUMNS value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the MATCH_COLUMNS and UPDATE_COLUMNS. If there are multiple new MATCH_COLUMNS values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specify ORDER_COLUMNS, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.

MERGE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.

MATCH_COLUMNS:

Required if MODE is MERGE or UPDATE.

match_column_name: Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.

ORDER_COLUMNS:

Optional. May be specified in MERGE MODE to sort the input data rows.

order_column_name: Specify the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch, ORDER_COLUMNS is used with MATCH_COLUMNS to determine the input row with the largest value; GPSS uses that row to write/update the target.

UPDATE_COLUMNS:

Required if MODE is MERGE or UPDATE.

update_column_name: Specifies the column(s) to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION.

UPDATE_CONDITION: update_condition

Optional. Specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a MERGE).

DELETE_CONDITION: delete_condition

Optional. In MERGE MODE, specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met for GPSS to delete rows in the target table that meet the MATCH_COLUMNS criteria.
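
To show how the MERGE-related properties fit together, here is a minimal sketch of an OUTPUT block. The table, column names, and conditions are hypothetical. This page does not spell out how column references inside the conditions are resolved, so treat the condition strings as placeholders.

```yaml
FILE:
   OUTPUT:
      SCHEMA: gpschema
      TABLE: gptable
      MODE: MERGE
      MATCH_COLUMNS:            # join condition between input rows and target rows
         - id
      ORDER_COLUMNS:            # for duplicate ids, use the input row with the largest value
         - updated_at
      UPDATE_COLUMNS:           # columns rewritten on a match
         - item
         - price
      UPDATE_CONDITION: "price > 0"                # hypothetical condition
      DELETE_CONDITION: "item = 'discontinued'"    # hypothetical condition
```

Because the column names here are lowercase and unquoted, they stay within the MERGE restriction on reserved keywords, capital letters, and quoted identifiers.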

MAPPING:

Optional. Overrides the default source-to-target column mapping. GPSS supports two mapping syntaxes.

Note
When you specify a MAPPING, ensure that you provide a mapping for all data value elements of interest. GPSS does not automatically match column names when you provide a MAPPING.

NAME: target_column_name: Specifies the target Greenplum Database table column name.
EXPRESSION: { source_column_name | expression }: Specifies a value or meta COLUMNS:NAME (source_column_name) or an expression. When you specify an expression, you may provide a value expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
target_column_name: { source_column_name | expression }: When you use this MAPPING syntax, specify the target_column_name and {source_column_name | expression} as described above.
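
The second, shorthand MAPPING syntax might be written as in the sketch below. The target columns a, b, and c and the source columns value and meta are borrowed from the example at the end of this page; the exact indentation is an assumption.

```yaml
OUTPUT:
   TABLE: gptable
   MAPPING:
      # <target_column_name>: <source_column_name or expression>
      a: (value->>'x')::int
      b: (value->>'y')::text
      c: (meta->>'filename')::text
```

This expresses the same mappings as the NAME/EXPRESSION form, one line per target column.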

FILE:TASK: Options

PREPARE_SQL: udf_or_sql_to_run: The user-defined function or SQL command(s) that you want GPSS to run before it executes the job. The default is null (no command to run).
TEARDOWN_SQL: udf_or_sql_to_run: The user-defined function or SQL command(s) that you want GPSS to run after the job stops. GPSS runs the function or command(s) on both job success and job failure. The default is null (no command to run).
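
A TASK block could be sketched as follows. The staging table and the SQL statements are hypothetical and serve only to show where each command runs in the job lifecycle.

```yaml
FILE:
   TASK:
      # Runs before the job executes (hypothetical staging table).
      PREPARE_SQL: "TRUNCATE TABLE gpschema.gptable_staging"
      # Runs after the job stops, on success and on failure.
      TEARDOWN_SQL: "ANALYZE gpschema.gptable"
```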

Job SCHEDULE: Options

SCHEDULE:

Controls the frequency and interval of restarting jobs.

RETRY_INTERVAL: retry_time: The period of time that GPSS waits before retrying a failed job. You can specify the time interval in day (d), hour (h), minute (m), second (s), or millisecond (ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
MAX_RETRIES: num_retries: The maximum number of times that GPSS attempts to retry a failed job. The default is 0 (do not retry). If you specify a negative value, GPSS retries the job indefinitely.
RUNNING_DURATION: run_time: The amount of time after which GPSS automatically stops a job. GPSS does not automatically stop a job by default.
AUTO_STOP_RESTART_INTERVAL: restart_time: The amount of time after which GPSS restarts a job that it stopped due to reaching RUNNING_DURATION.
MAX_RESTART_TIMES: num_restarts: The maximum number of times that GPSS restarts a job that it stopped due to reaching RUNNING_DURATION. The default is 0, do not restart the job.
QUIT_AT_EOF_AFTER: clock_time: The clock time after which GPSS stops a job every day when it encounters an EOF. By default, GPSS does not automatically stop a job that reaches EOF. GPSS never stops a job when the current time is before clock_time, even when GPSS encounters an EOF.
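
The retry and restart settings might be combined as in this sketch. All values are illustrative. It assumes that RUNNING_DURATION and AUTO_STOP_RESTART_INTERVAL accept the same unit suffixes documented for RETRY_INTERVAL, which this page does not state explicitly.

```yaml
SCHEDULE:
   RETRY_INTERVAL: 30s               # wait 30 seconds before retrying a failed job
   MAX_RETRIES: 5                    # give up after five retries
   RUNNING_DURATION: 2h              # assumed unit syntax; stop the job after two hours
   AUTO_STOP_RESTART_INTERVAL: 10m   # assumed unit syntax; restart ten minutes after the stop
   MAX_RESTART_TIMES: 3              # restart at most three times
```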

Job ALERT: Options

Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).

COMMAND: command_to_run: The program that the GPSS server runs on the GPSS server host, including arguments. The command must be executable by GPSS. command_to_run has access to job-related environment variables that GPSS sets, including: $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL.
WORKDIR: directory: The working directory for command_to_run. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.
TIMEOUT: alert_time: The amount of time that must elapse after a job stops before GPSS triggers the alert (and runs command_to_run). You can specify the time interval in day (d), hour (h), minute (m), or second (s) integer units; do not mix units. The default alert timeout is -1s (no timeout).
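
An ALERT block might look like the following sketch. The script path is hypothetical; such a script could read $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL from its environment to build a notification.

```yaml
ALERT:
   # Hypothetical notification script; it must be executable by GPSS.
   COMMAND: /home/gpadmin/bin/notify_job.sh
   WORKDIR: /home/gpadmin
   TIMEOUT: 30s
```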

Template Variables

GPSS supports using template variables to specify property values in the load configuration file.

You specify a template variable value in the load configuration file as follows:

<PROPERTY>: {{<template_var>}}

For example:

MAX_RETRIES: {{numretries}}

GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value> option to the gpsscli dryrun, gpsscli submit, or gpsscli load command.

For example, if the command line specifies:

--property numretries=10

GPSS substitutes occurrences of {{numretries}} in the load configuration file with the value 10 before submitting the job, and uses that value while the job is running.

Notes

If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the filesource.yaml configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" ("MyColumn" text);

Your filesource.yaml YAML configuration file would refer to the above table and column names as:

  COLUMNS:
     - NAME: '"MyColumn"'
       TYPE: text

OUTPUT:
   TABLE: '"MyTable"'

You can specify backslash escape sequences in the CSV DELIMITER, QUOTE, and ESCAPE options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.

Examples

Submit a job to load data from an Avro file as defined in the version 2 load configuration file named loadfromfile.yaml:

$ gpsscli submit loadfromfile.yaml

Example loadfromfile.yaml configuration file:

DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 15432
VERSION: 2
FILE:
   INPUT:
      SOURCE:
         URL: file:///tmp/file.avro
      VALUE:
         COLUMNS:
           - NAME: value
             TYPE: json
         FORMAT: avro
      META:
         COLUMNS:
           - NAME: meta
             TYPE: json
         FORMAT: json
      FILTER: (value->>'x')::int < 10
      ERROR_LIMIT: 25
   OUTPUT:
      SCHEMA: gpschema
      TABLE: gptable
      MODE: INSERT
      MAPPING:
        - NAME: a
          EXPRESSION: (value->>'x')::int
        - NAME: b
          EXPRESSION: (value->>'y')::text
        - NAME: c
          EXPRESSION: (meta->>'filename')::text
SCHEDULE:
  RETRY_INTERVAL: 500ms
  MAX_RETRIES: 2