You can use the Tanzu Greenplum streaming server (GPSS) to unload data from Tanzu Greenplum to a file on the host where the Tanzu Greenplum streaming server is running.
Perform the following tasks when you use the Tanzu Greenplum streaming server to unload data from Tanzu Greenplum to a file:
Before using the gpsscli utilities to unload data from Tanzu Greenplum to a file, ensure that:
The Tanzu Greenplum streaming server supports unloading to files in the following formats:
Format | Description |
---|---|
csv | Comma-delimited text format data. Specify the csv format to create an output file that is comma-delimited text and conforms to RFC 4180. The data values may not contain line-ending characters (CR and LF). |
json | JSON format data. Specify the json format to create an output file in JSON format. GPSS can write the data as a single JSON object or as one JSON record per line. You must define a mapping if you want GPSS to write the data into specific columns in the output file. |
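The choice between the two JSON layouts is made with the is_jsonl property shown in the sample configuration later in this topic. The following is a minimal sketch; treating a single JSON object as the is_jsonl: false behavior is an assumption based on the format description above, so confirm it in the file-unload-v3.yaml reference:

```yaml
# Sketch: choosing the JSON output layout (see the full sample below).
json:
  columns:
    - name: id
      key: "ID"
  is_jsonl: true    # true: one JSON record per line (JSONL);
                    # false (assumed): records written as a single JSON object
  new_line: "\n"
```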
Start an instance of the Tanzu Greenplum streaming server on the local host using the gpss utility. By default, GPSS listens on port 5000 on localhost, and gpfdist uses port 8080.
$ gpss
You can optionally provide the name of a JSON configuration file that defines the properties of the GPSS and gpfdist service instances. For example:
$ gpss --config gpss.json
Where the gpss.json configuration file contains the following:
{
"ListenAddress": {
"Host": "",
"Port": 5019
},
"Gpfdist": {
"Host": "",
"Port": 8319,
"ReuseTables": false
}
}
Locate and prepare the source tables from which you plan to unload data. The following example creates three tables and inserts 1,000,000 rows into each:
# CREATE TABLE test_orders(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders SELECT generate_series(1,1000000), 'bag', '$100';
# CREATE TABLE test_orders3(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders3 SELECT * FROM test_orders;
# CREATE TABLE test_orders4(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders4 SELECT * FROM test_orders;
You configure a data unload operation from Greenplum to a file via a YAML-formatted configuration file. This configuration file includes parameters that identify the source tables or queries, the Tanzu Greenplum connection information, and the target file location.
Note: The target url must point to a location on the host where the Tanzu Greenplum streaming server is running; it must not point to a remote hostname or IP address. If you need to unload data from Tanzu Greenplum to a remote location, start a Tanzu Greenplum streaming server instance on the remote host and use it to unload the files from the database onto that host.
The Tanzu Greenplum streaming server supports only version 3 of the YAML configuration file when you unload data from Greenplum to a file or command output. Refer to the file-unload-v3.yaml reference page for the configuration file format and the supported configuration properties.
The following is a sample version 3 file unload YAML configuration file named unload.yaml:
version: v3
sources:
- gpdb:
host: localhost
port: 6000
user: gpadmin
database: testdb
work_schema: public
tables:
- table: test_orders
schema: public
filter: id > 10000
mapping:
- id: id
- item: item
- price: price
- segment: gp_segment_id as segment
queries:
- select id, item, price, gp_segment_id as segment from test_orders3 where id > 10000
- select id, item, price, gp_segment_id as segment from test_orders4 where id > 10000
targets:
- file:
url: /home/gpadmin/path_to_unload/unload1.csv
content:
csv:
columns:
- name: id
header: ID
- name: item
header: ITEM
- name: price
header: PRICE
- name: segment
header: SEGMENT
delimiter: ","
quote: "'"
newline: LF
header: true
- file:
url: /home/gpadmin/path_to_unload/unload1.json
content:
json:
columns:
- name: id
key: "ID"
- name: item
key: "ITEM"
- name: price
key: "PRICE"
- name: segment
key: "SEGMENT"
is_jsonl: true
new_line: "\n"
The configuration file above unloads data from the specified queries (from tables test_orders3 and test_orders4) and the specified columns from the table test_orders to two different target files: a text file named unload1.csv and a JSON file named unload1.json.
The name fields under the file section must match the table columns and the columns specified by the queries.
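For instance, in the sample above, the mapping entry exposes the system column gp_segment_id under the name segment, and each target column of that name refers to it. The following is an abbreviated sketch of that correspondence, omitting the other required properties:

```yaml
# Sketch: target column names must match the mapped source column names.
sources:
  - gpdb:
      tables:
        - table: test_orders
          mapping:
            - segment: gp_segment_id as segment   # source column exposed as "segment"
targets:
  - file:
      content:
        csv:
          columns:
            - name: segment    # must match the mapped name above
              header: SEGMENT
```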
Note that for queries, if you include a system column in the query, you must alias it as an ordinary column name; otherwise the job errors out. For example, consider the following query:
queries:
- select id, item, price, gp_segment_id from test_orders2 where id > 10000
Because gp_segment_id is a system column, this query errors out. Instead, specify the following query:
queries:
- select id, item, price, gp_segment_id as segment from test_orders2 where id > 10000
The following sample YAML configuration file, named unload_add_task.yaml, tunes performance by setting different batch sizes:
version: v3
sources:
- gpdb:
host: localhost
port: 6000
user: gpadmin
database: testdb
work_schema: public
tables:
- table: test_orders
schema: public
filter: id > 10000
mapping:
- id: id
- item: item
- price: price
- segment: gp_segment_id as segment
task:
batch_size:
max_count: 100000
interval_ms: 1000
targets:
- file:
url: /home/gpadmin/path_to_unload/unload1.csv
content:
csv:
columns:
- name: id
header: ID
- name: item
header: ITEM
- name: price
header: PRICE
- name: segment
header: SEGMENT
delimiter: ","
quote: "'"
newline: LF
header: true
- file:
url: /home/gpadmin/path_to_unload/unload1.json
content:
json:
columns:
- name: id
key: "ID"
- name: item
key: "ITEM"
- name: price
key: "PRICE"
- name: segment
key: "SEGMENT"
is_jsonl: true
new_line: "\n"
The max_count and interval_ms options under the batch_size section tune the batch size that GPSS uses when unloading to the files. Each time the Tanzu Greenplum streaming server fetches data from Greenplum, it stores the data in GPSS memory and flushes it to the files according to the batch size set in the configuration file. max_count specifies the number of rows to fetch before flushing to disk. interval_ms specifies the maximum amount of time, in milliseconds, since the last flush to the file. The larger max_count and interval_ms are, the more data GPSS flushes to disk at a time, which increases unload performance.
The schedule option controls the frequency and interval for restarting unload jobs.
The alert option manages notifications when a job stops. You can use this option to count the number of rows unloaded or to record other statistics about the job.
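A minimal sketch of these two sections follows. The property names shown (retry_interval, max_retries, command, workdir, timeout) are assumptions based on the GPSS load configuration format; confirm them in the file-unload-v3.yaml reference before use:

```yaml
# Sketch only -- property names are assumptions; see file-unload-v3.yaml.
schedule:
  retry_interval: 5s        # how long to wait before restarting a stopped job
  max_retries: 3            # how many times to restart the job
alert:
  command: ./log_stats.sh   # hypothetical script run when the job stops;
                            # it could count unloaded rows or record statistics
  workdir: /home/gpadmin
  timeout: 30s
```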
The encoding, prepare_statement, and teardown_statement fields are not supported.
See file-unload-v3.yaml for more details.
Run the gpsscli load client command to unload the data from Tanzu Greenplum into a file.
$ gpsscli load unload.yaml
The data is appended to the files defined by the url field. If a file already contains data, the unload job keeps the existing data and appends the new data to it. If no file exists at the target URL, GPSS creates a new file.
If the table is too large to be unloaded into one file, you can create multiple jobs, each specifying a different value for the filter field, to split a big file into smaller pieces. For example, the first job specifies the filter id < 10000, the second specifies id >= 10000 and id < 20000, and the last specifies id >= 20000. You can submit the different jobs and run them in parallel, as shown in the sketch below.
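The following sketch shows the only line that differs between such job files; the file names unload_part1.yaml through unload_part3.yaml are hypothetical:

```yaml
# unload_part1.yaml -- identical to unload.yaml except for this line:
filter: id < 10000
# unload_part2.yaml uses:  filter: id >= 10000 and id < 20000
# unload_part3.yaml uses:  filter: id >= 20000
```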
Incremental data unloading is not supported; only full data unloading is. However, using queries, you can split a large table into partitions, for example using a date column as the partition strategy. In this way, GPSS can unload the most recent data into separate files.
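As a sketch, suppose a hypothetical sales_orders table with an order_date column; a per-partition job could then use a query such as the following to unload a single date range:

```yaml
# Sketch: date-based partitioning via queries; the table and column names
# are hypothetical. Run one job per date range.
queries:
  - select id, item, price, gp_segment_id as segment from sales_orders
    where order_date >= '2024-06-01' and order_date < '2024-07-01'
```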
Because the Tanzu Greenplum streaming server uses the gpfdist protocol to unload the data, the data is unloaded directly from the segments to the files in batches; as a result, the order of the data across different segments cannot be guaranteed.