GPSS configuration file for unloading data to a file (version 3).

version: v3
sources:
- gpdb:
    host: <host>
    port: <greenplum_port>
    user: <user_name>
    password: <password>
    database: <db_name>
    work_schema: <work_schema_name>
    tables:
    - table: <table_name>
      schema: <schema_name>
      filter: <output_filter_string>
      mapping:
      - <target_column_name_1>: <source_column_name_1>
      - <target_column_name_2>: <source_column_name_2>
      ...
    queries:
    - <queries>
    task:
      batch_size:
        max_count: <max_count_value>
        interval_ms: <interval_value>
targets:
- file:
    url: <file_path>
    content:
      <data_format>:
        columns:
          <column_spec>
option:
  schedule:
    max_retries: <num_retries>
    retry_interval: <retry_time>
    running_duration: <run_time>
    auto_stop_restart_interval: <restart_time>
    max_restart_times: <num_restarts>
    quit_at_eof_after: <clock_time>
  alert:
    command: <command_to_run>
    workdir: <directory>
    timeout: <alert_time>
Where <data_format> and <column_spec> are one of the following blocks:

csv:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delim_char>
  quote: <quote_char>
  newline: <newline_str>
  header: <header_value>
  null: <null_value>
  escape: <escape_char>
  force_quote: <boolean>

json:
  column:
    name: <column_name>
    type: json | jsonb
  is_jsonl: <boolean>
  newline: <newline_str>
You specify the configuration properties for a VMware Tanzu Greenplum streaming server (GPSS) file unload job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load commands. There are three types of configuration properties in this file: those that identify the source VMware Tanzu Greenplum connection and origin table, properties specific to the target file to which you unload the data, and job-related properties.

This reference page uses the name unload-file-v3.yaml to refer to this file; you may choose your own name for the file.

The gpsscli utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.
version Property

version: v3
    The version of the configuration file. You must specify version: v3.
sources: gpdb Properties

host: <host>
    The host name or IP address of the Tanzu Greenplum coordinator host.
port: <greenplum_port>
    The port number of the Tanzu Greenplum coordinator host.
user: <user_name>
    The name of the Tanzu Greenplum user/role. user_name must have permissions as described in Configuring Tanzu Greenplum Role Privileges.
password: <password>
    The password for user_name.
database: <db_name>
    The name of the Tanzu Greenplum database.
work_schema: <work_schema_name>
    The name of the schema in which GPSS creates internal tables. The default work_schema_name is public.
tables

table: <table_name>
    The name of the Tanzu Greenplum table from which GPSS unloads the data.
schema: <schema_name>
    The name of the schema in which table_name resides. Optional, the default schema is the public schema.
filter: <output_filter_string>
    A filter to apply to the source table data. output_filter_string must be a valid SQL conditional expression and may reference one or more column names.
mapping

    Optional. If you do not specify a mapping, all columns are unloaded into the target file.
    Note: When you specify a mapping, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide a mapping block.

<target_column_name_X>: <source_column_name_X>
    target_column_name_X specifies the column name of the writable external table that GPSS creates internally. source_column_name_X is the column name or expression in the source Tanzu Greenplum table from which to unload data; GPSS maps it to the writable external table column name.
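For example, a hypothetical mapping block that unloads three columns and aliases the gp_segment_id system column to the ordinary name segment (the table and column names here mirror the sample jobs at the end of this page):

```yaml
mapping:
- id: id                              # unload column id as-is
- price: price                        # unload column price as-is
- segment: gp_segment_id as segment   # alias a system column to an ordinary name
```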
queries

    One or more SQL queries to run against the source Tanzu Greenplum database; GPSS unloads the result set of each query to the target file.
    Note: If you specify a system column name in a query, you must alias it as an ordinary column name; otherwise the job errors out. For example, the query
    select id, item, price, gp_segment_id as segment from test_orders2 where id > 10000
    aliases the system column gp_segment_id as the ordinary column name segment.
task

batch_size

max_count: <max_count_value>
    The number of lines to fetch before flushing to disk.
interval_ms: <interval_value>
    The interval, in milliseconds, after which GPSS flushes the batch to disk.
targets: file Properties

url: <file_path>
    The path to the file to which GPSS writes the unloaded data.
content
    The content properties of the output file.
<data_format>
    The format of the output data. You may specify a data_format of csv or json.
<column_spec>
    The Greenplum column mapping. The supported column specification differs for different data formats, as described below.

When you specify the csv data format:

columns
    A set of column name/type mappings. The value [] specifies all columns.
name: <column_name>
    The name of the column of the writable external table that GPSS creates internally.
header: <header_name>
    The name of the mapped column in the output file.
delimiter: <delim_char>
    Specifies a single ASCII character that separates columns within each row of data. The default delimiter is a comma (,).
quote: <quote_char>
    Specifies the quotation character. Because GPSS does not provide a default value for this property, you must specify a value.
newline: <newline_str>
    Specifies the string that represents a new line. GPSS does not specify a default value.
header: <header_value>
    Specifies whether GPSS writes a header row containing the column names to the output file.
null: <null_value>
    Specifies the string that represents a null value in the output.
escape: <escape_char>
    Specifies the single character used to escape the quote character within a data value.
force_quote: <boolean>
    Specifies whether GPSS quotes all non-null column values in the output.

When you specify the json data format, GPSS can write the data as a single JSON object or as one JSON record per line.
columns
    The column name/key mappings.
name: <column_name>
    The name of the column of the writable external table that GPSS creates internally.
key: <key_name>
    The name of the mapped key in the output file.
is_jsonl: <boolean>
    Identifies whether GPSS writes the JSON data as a single object or as one record per line. The default is false; GPSS writes the JSON data as a single object.
newline: <newline_str>
    A string that specifies the new line character(s) that end each JSON record. The default newline is "\n".

option: Job-related Properties
schedule

max_retries: <num_retries>
    The number of times that GPSS attempts to retry the job if it fails.
retry_interval: <retry_time>
    The period of time that GPSS waits before retrying a failed job. You can specify the time interval in day (d), hour (h), minute (m), second (s), or millisecond (ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
running_duration: <run_time>
    The amount of time after which GPSS automatically stops the job.
auto_stop_restart_interval: <restart_time>
    The amount of time after which GPSS restarts a job that it automatically stopped due to reaching running_duration.
max_restart_times: <num_restarts>
    The number of times that GPSS restarts a job that it stopped due to reaching running_duration. The default is 0, do not restart the job. If you specify the value -1, GPSS restarts the job indefinitely. You may use gpsscli stop to stop a job from being restarted indefinitely.
quit_at_eof_after: <clock_time>
    The clock time after which GPSS stops the job when it encounters an EOF. GPSS never stops the job before clock_time, even when GPSS encounters an EOF.
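As an illustration, a hypothetical schedule block that retries a failed job up to 3 times at 30-second intervals and stops the job after it has run for one hour; note that each interval value uses a single time unit:

```yaml
schedule:
  max_retries: 3          # retry a failed job up to 3 times
  retry_interval: 30s     # wait 30 seconds between retries (do not mix units)
  running_duration: 1h    # automatically stop the job after one hour
```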
alert

command: <command_to_run>
    The program that GPSS runs when the state of the job changes. command_to_run has access to job-related environment variables that GPSS sets, including: $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL. The shell script of the alert program that command_to_run specifies must contain #!/bin/bash in the first line.
workdir: <directory>
    The working directory for command_to_run. The default working directory is the directory from which you started the GPSS server process; if you specify a relative path, it is relative to that directory.
timeout: <alert_time>
    The timeout for the alert program (command_to_run). You can specify the time interval in day (d), hour (h), minute (m), or second (s) integer units; do not mix units. The default alert timeout is -1s (no timeout).
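For example, a hypothetical alert block; the script name and directory are placeholders, and the script itself must begin with #!/bin/bash and may read $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL:

```yaml
alert:
  command: ./notify_job_status.sh   # hypothetical script; first line must be #!/bin/bash
  workdir: /home/gpadmin/scripts    # command_to_run executes from this directory
  timeout: 30s                      # allow the alert program 30 seconds to complete
```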
GPSS supports using template variables to specify property values in the unload configuration file.

You specify a template variable value in the configuration file as follows:

<property>: {{<template_var>}}

For example:

max_retries: {{numretries}}

GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value> option to the gpsscli dryrun, gpsscli submit, or gpsscli load command.

For example, if the command line specifies:

--property numretries=10

GPSS substitutes occurrences of {{numretries}} in the configuration file with the value 10 before submitting the job, and uses that value while the job is running.
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the configuration file. For example, if you create a table as follows:

CREATE TABLE "MyTable" (c1 text);

Your YAML configuration file would refer to the table name as:

sources:
- gpdb:
    tables:
    - table: '"MyTable"'
You can specify backslash escape sequences in the CSV delimiter, quote, and escape options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.
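For example, a hypothetical csv content block that uses backslash escapes: a tab (\t) delimiter and a quote character specified in hexadecimal (\x22 is the double-quote character):

```yaml
csv:
  columns: []          # [] specifies all columns
  delimiter: "\t"      # tab character via a standard backslash escape
  quote: "\x22"        # double-quote character specified in hexadecimal format
```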
The following sample configuration file gathers data from Tanzu Greenplum using two queries run against the tables test_orders3 and test_orders4, and the specified columns from the table test_orders, and unloads them into two different targets, a csv file and a json file:

version: v3
sources:
- gpdb:
    host: localhost
    port: 6000
    user: gpadmin
    database: testdb
    work_schema: public
    tables:
    - table: test_orders
      schema: public
      filter: id > 10000
      mapping:
      - id: id
      - item: item
      - price: price
      - segment: gp_segment_id as segment
    queries:
    - select id, item, price, gp_segment_id as segment from test_orders3 where id > 10000
    - select id, item, price, gp_segment_id as segment from test_orders4 where id > 10000
targets:
- file:
    url: /home/gpadmin/path_to_unload/unload1.csv
    content:
      csv:
        columns:
        - name: id
          header: ID
        - name: item
          header: ITEM
        - name: price
          header: PRICE
        - name: segment
          header: SEGMENT
        delimiter: ","
        quote: "'"
        newline: LF
        header: true
- file:
    url: /home/gpadmin/path_to_unload/unload1.json
    content:
      json:
        columns:
        - name: id
          key: "ID"
        - name: item
          key: "ITEM"
        - name: price
          key: "PRICE"
        - name: segment
          key: "SEGMENT"
        is_jsonl: true
        new_line: "\n"
The following example configuration file uses batch_size to tune the unload performance:

version: v3
sources:
- gpdb:
    host: localhost
    port: 6000
    user: gpadmin
    database: testdb
    work_schema: public
    tables:
    - table: test_orders
      schema: public
      filter: id > 10000
      mapping:
      - id: id
      - item: item
      - price: price
      - segment: gp_segment_id as segment
    task:
      batch_size:
        max_count: 100000
        interval_ms: 1000
targets:
- file:
    url: /home/gpadmin/path_to_unload/unload1.csv
    content:
      csv:
        columns:
        - name: id
          header: ID
        - name: item
          header: ITEM
        - name: price
          header: PRICE
        - name: segment
          header: SEGMENT
        delimiter: ","
        quote: "'"
        newline: LF
        header: true
- file:
    url: /home/gpadmin/path_to_unload/unload1.json
    content:
      json:
        columns:
        - name: id
          key: "ID"
        - name: item
          key: "ITEM"
        - name: price
          key: "PRICE"
        - name: segment
          key: "SEGMENT"
        is_jsonl: true
        new_line: "\n"