unload-file-v3.yaml

GPSS configuration file for unloading to file data (version 3).

Synopsis

version: v3

sources:
  - gpdb:
      host: <host>
      port: <greenplum_port> 
      user: <user_name>
      password: <password>
      database: <db_name>
      work_schema: <work_schema_name>
      tables: 
        - table: <table_name>
          schema: <schema_name>
          filter: <input_filter_string>
          mapping:
            - <target_column_name_1>: <source_column_name_1>
            - <target_column_name_2>: <source_column_name_2>
            ...
      queries:
          - <queries>
      task:
      	batch_size:
      		max_count: <max_count_value>
      		interval_ms: <interval_value>

targets:
  - file:
      url: <file_path>
      content:
        <data_format> 
          columns:
          <column_spec>

option:
  schedule:
    max_retries: <num_retries>
    retry_interval: <retry_time>
    running_duration: <run_time>
    auto_stop_restart_interval: <restart_time>
    max_restart_times: <num_restarts>
    quit_at_eof_after: <clock_time>
  alert:
    command: <command_to_run>
    workdir: <directory>

Where data_format and column_spec are one of the following blocks:

csv:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delim_char>
  quote: <quote_char>
  newline: <newline_str>
  header: <header_value>
  null: <null_value>
  escape: <escape_char>
  force_quote: <boolean>

json:
  column:
    name: <column_name>
    type: json | jsonb
  is_jsonl: <boolean>
  newline: <newline_str>

Description

You specify the configuration properties for a VMware Tanzu Greenplum streaming server (GPSS) file load job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load commands. There are three types of configuration properties in this file - those that identify the source VMware Tanzu Greenplum connection and origin table, properties specific to the target file you will unload the data to, and job-related properties.

This reference page uses the name unload-file-v3.yaml to refer to this file; you may choose your own name for the file.

The gpsscli utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.

Keywords and Values

version Property

version: v3

The version of the configuration file. You must specify version: v3.

sources: gpdb Properties

host: host

The host name or IP address of the Tanzu Greenplum coordinator host.

port: <greenplum_port>

The port number of the Tanzu Greenplum server on the coordinator host.

user: <user_name>

The name of the Tanzu Greenplum user/role. This user_name must have permissions as described in Configuring Tanzu Greenplum Role Privileges.

password: <password>

The password for the Tanzu Greenplum user/role.

database: <db_name>

The name of the Tanzu Greenplum.

work_schema: <work_schema_name>

The name of the Tanzu Greenplum schema in which GPSS creates internal tables. The default work_schema_name is public.

tables

The Tanzu Greenplum tables, and the data that GPSS will unload from each.

table: <table_name>

The name of the Tanzu Greenplum table from where GPSS unloads the data.

schema: <schema_name>

The name of the Tanzu Greenplum schema in which table_name resides. Optional, the default schema is the public schema.

filter: <output_filter_string>

The filter to apply to the input data before GPSS unloads the data from Tanzu Greenplum. If the filter evaluates to true, GPSS unloads the data. If the filter evaluates to false, the data is not unloaded. output_filter_string must be a valid SQL conditional expression and may reference one or more column names.

mapping

Optional. Overrides the default external-to-database column mapping. It specifies the column to be unloaded into the target file, and provides mapping between the internal writable external table column name and the database column name expression. If you do not specify anything under mapping, all columns are unloaded into the target file.

Note
When you specify a mapping, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide a mapping block.

<target_column_name_X>: <source_column_name_X>: target_column_name_X specifies the column name of the writable external table that GPSS creates internally. GPSS maps the Tanzu Greenplum column name expression specified in source_column_name_X, which is the column name in the source Tanzu Greenplum to unload data from, to the writable external table column name.
queries: Optional. You may specify SQL queries to retrieve from the database, so the unloaded data is a result of the union all of all the queries and the specified tables.

Note
If you specify a system column name in the queries, you must specify as an ordinary column name, otherwise it will error out. For example: select id, item, price, gp_segment_id as segment from test_orders2 where id > 10000 specifies the system column gp_segment_id as an ordinary column name.

task: The data source task properties.
batch_size: Optional. Tune the batch size to unload into the files. When GPSS fetches data from Tanzu Greenplum, the data is stored in the GPSS memory and flushed onto the files using the bath size set by the parameters set by this option.

max_count: <max_count_value> : Indicates the number of lines to fetch before flushing to disk.

interval_ms: <interval_value>: Represents the amount of time (in ms) since the last flush to the file.

targets: file Properties

url: A URL identifying the file where the data will be unloaded to. It must point to a location within the same host where the Tanzu Greenplum streaming server is running. It cannot point to a remote hostname or IP address. If you need to unload data from Tanzu Greenplum to a remote URL, start Tanzu Greenplum streaming server on the remote host in order to unload the files from the database into this host.
content: The file type, field names, and type-specific properties of the file data. You must specify all data elements in the order in which they appear in the file.

data_format : The format of the key or value data. You may specify a data_format of csv or json.

column_spec : The Greenplum column mapping. The supported column specification differs for different data formats, as described below.

For csv format:: When you specify the csv data format, GPSS writes the data to the list of columns that you specify. The file content cannot contain line ending characters (CR and LF).

columns: A set of column name/type mappings. The value [] specifies all columns.
name: <column_name>: The name of the column of the writable external table that GPSS creates internally.
header: <header_name>: The name of the mapped column name in the output file.
delimiter: <delim_char>: Specifies a single ASCII character that separates columns within each message or row of data. The default delimiter is a comma (,).
quote: <quote_char>: Specifies the quotation character. Because GPSS does not provide a default value for this property, you must specify a value.
newline: <newline_str>: Specifies the string that represents a new line. GPSS does not specify a default value.
header: <header_value>: Pending description.
null: <null_value>: Pending description.
escape: <escape_char>: Pending description.
force_quote: <boolean>: Pending description.

For JSON format:

When you specify the json data format, GPSS can read the data as a single JSON object or as a single JSON record per line.

columns: A single column name/type mapping.
name: <column_name>: The name of the column of the writable external table that GPSS creates internally.
key: <key_name>: The name of the mapped column name in the output file.
is_jsonl: boolean: Identifies whether or not GPSS writes the JSON data as a single object or single-record-per-line. The default is false, GPSS writes the JSON data as a single object.
newline: newline_str: A string that specifies the new line character(s) that end each JSON record. The default newline is "\n".

option: Job-related Properties

schedule: Controls the frequency and interval of restarting jobs.
max_retries: <num_retries>: The maximum number of times that GPSS attempts to retry a failed job. The default is 0, do not retry. If you specify a negative value, GPSS retries the job indefinitely.
retry_interval: <retry_time>: The period of time that GPSS waits before retrying a failed job. You can specify the time interval in day ( d), hour ( h), minute ( m), second ( s), or millisecond ( ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
running_duration: <run_time>: The amount of time after which GPSS automatically stops a job. GPSS does not automatically stop a job by default.
auto_stop_restart_interval: <restart_time>: The amount of time after which GPSS restarts a job that it stopped due to reaching running_duration.
max_restart_times: <num_restarts>: The maximum number of times that GPSS restarts a job that it stopped due to reaching running_duration. The default is 0, do not restart the job. If you specify the value -1, GPSS restarts the job indefinitely. You may use gpsscli stop to stop the jobs from being restarted indefinitely.
quit_at_eof_after: <clock_time>: The clock time after which GPSS stops a job every day when it encounters an EOF. By default, GPSS does not automatically stop a job that reaches EOF. GPSS never stops a job when the current time is before clock_time, even when GPSS encounters an EOF.
alert: Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).
command: <command_to_run>: The program that the GPSS server runs on the GPSS server host, including arguments. The command must be executable by GPSS. command_to_run has access to job-related environment variables that GPSS sets, including: $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL. The shell script of the alert program that command_to_run specifes must contain #!/bin/bash in the first line.
workdir: <directory>: The working directory for command_to_run. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.
timeout: <alert_time>: The amount of time after a job stops, prompting GPSS to trigger the alert (and run command_to_run). You can specify the time interval in day ( d), hour ( h), minute ( m), or second ( s) integer units; do not mix units. The default alert timeout is -1s (no timeout).

Template Variables

GPSS supports using template variables to specify property values in the unload configuration file.

You specify a template variable value in the load configuration file as follows:

<property>: {{<template_var>}}

For example:

max_retries: {{numretries}}

GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value> option to the gpsscli dryrun, gpsscli submit, or gpsscli load command.

For example, if the command line specifies:

--property numretries=10

GPSS substitutes occurrences of {{numretries}} in the load configuration file with the value 10 before submitting the job, and uses that value while the job is running.

Notes

If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" (c1 text);

Your YAML configuration file would refer to the table name as:

sourcess:
- gpdb:
    tables:
      - table: '"MyTable"'

You can specify backslash escape sequences in the CSV delimiter, quote, and escape options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.

Examples

The following sample configuration file gathers data from Tanzu Greenplum using two queries run against the tables test_orders3 and test_orders4, and the specified columns from the table test_orders, and unloads them into two different targets, a csv file and a json file:

version: v3
sources:
  - gpdb:
      host: localhost
      port: 6000
      user: gpadmin
      database: testdb
      work_schema: public
      tables:
        - table: test_orders
          schema: public
          filter: id > 10000
          mapping:
            - id: id
            - item: item
            - price: price
            - segment: gp_segment_id as segment
      queries:
        - select id, item, price, gp_segment_id as segment from test_orders3 where id > 10000
        - select id, item, price, gp_segment_id as segment from test_orders4 where id > 10000
targets:
  - file:
      url: /home/gpadmin/path_to_unload/unload1.csv
      content:
        csv:
          columns:
            - name: id
              header: ID
            - name: item
              header: ITEM
            - name: price
              header: PRICE
            - name: segment
              header: SEGMENT
          delimiter: ","
          quote: "'"
          newline: LF
          header: true
  - file:
      url: /home/gpadmin/path_to_unload/unload1.json
      content:
        json:
          columns:
            - name: id
              key: "ID"
            - name: item
              key: "ITEM"
            - name: price
              key: "PRICE"
            - name: segment
              key: "SEGMENT"
          is_jsonl: true
          new_line: "\n"

Example of a configuration file that uses batch_size to tune the unload performace.

version: v3
sources:
  - gpdb:
      host: localhost
      port: 6000
      user: gpadmin
      database: testdb
      work_schema: public
      tables:
        - table: test_orders
          schema: public
          filter: id > 10000
          mapping:
            - id: id
            - item: item
            - price: price
            - segment: gp_segment_id as segment
      task:
          batch_size:
              max_count: 100000
              interval_ms: 1000
targets:
  - file:
      url: /home/gpadmin/path_to_unload/unload1.csv
      content:
        csv:
          columns:
            - name: id
              header: ID
            - name: item
              header: ITEM
            - name: price
              header: PRICE
            - name: segment
              header: SEGMENT
          delimiter: ","
          quote: "'"
          newline: LF
          header: true
  - file:
      url: /home/gpadmin/path_to_unload/unload1.json
      content:
        json:
          columns:
            - name: id
              key: "ID"
            - name: item
              key: "ITEM"
            - name: price
              key: "PRICE"
            - name: segment
              key: "SEGMENT"
          is_jsonl: true
          new_line: "\n"