You can use the Greenplum Streaming Server (GPSS) to unload data from your VMware Greenplum database to a file on the host where the Greenplum Streaming Server is running. This method supports the following features:

  • Specify multiple queries to retrieve data from the database; the unloaded data is the UNION ALL of the query results.
  • Specify one or more table columns, apply filters, and create a mapping of the columns.
  • Specify the delimiter, quote character, newline mark, and header for the target files.
  • Specify multiple targets on the Greenplum Streaming Server host, so that more than one file is created.
  • Use other configuration YAML file features such as schedule, task, and alerts.

Unload Procedure

Perform the following tasks when you use the Greenplum Streaming Server to unload data from VMware Greenplum to a file:

  1. Ensure that you meet the Prerequisites.
  2. Decide on the format of the target data.
  3. Start the Greenplum Streaming Server.
  4. Prepare the Data to Unload.
  5. Construct the unload configuration file.
  6. Use the gpsscli Client Command to Unload the Data.

Prerequisites

Before using the gpsscli utilities to unload data from VMware Greenplum to a file, ensure that:

About Supported Data Formats

The Greenplum Streaming Server supports unloading to files in the following formats:

Format  Description
csv     Comma-delimited text data. Specify the csv format to create an output file that is comma-delimited text and conforms to RFC 4180. The data may not contain the line ending characters CR and LF.
json    JSON format data. Specify the json format to create an output file in JSON format. GPSS can write the JSON data as a single object or write one JSON record per line. You must define a mapping if you want GPSS to write the data into specific columns in the output file.
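
To illustrate the difference, assume a single source row (1, bag, $100). With a header enabled, the csv output contains lines like the following, while the json format with one record per line writes one object per row. The column names shown here are illustrative; the actual headers, keys, and quoting depend on your configuration file:

id,item,price
1,bag,$100

{"id": 1, "item": "bag", "price": "$100"}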

Start the Greenplum Streaming Server

Start an instance of the Greenplum Streaming Server on the local host using the gpss utility. By default, gpss listens on port 5000 on localhost, and the gpfdist service uses port 8080.

$ gpss

You may provide the name of a JSON configuration file that defines the properties of the GPSS and gpfdist service instances. For example:

$ gpss --config gpss.json

Where the gpss.json configuration file contains the following:

{
    "ListenAddress": {
        "Host": "",
        "Port": 5019
    },
    "Gpfdist": {
        "Host": "",
        "Port": 8319,
        "ReuseTables": false
    }
}
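
When you override the default port as shown above, point the gpsscli client commands at the configured port. For example, assuming the gpss.json above and a job configuration file named unload.yaml (refer to the gpsscli reference for the exact client connection options):

$ gpsscli load --gpss-port 5019 unload.yaml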

Prepare the Data to Unload

You may need to locate and prepare the source tables from which you plan to unload the data. The following example creates three tables and inserts 1,000,000 rows into each of them.

# CREATE TABLE test_orders(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders SELECT generate_series(1,1000000), 'bag', '$100';
 
# CREATE TABLE test_orders3(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders3 SELECT * FROM test_orders;
 
# CREATE TABLE test_orders4(id int, item text, price text) DISTRIBUTED BY (id);
# INSERT INTO test_orders4 SELECT * FROM test_orders;
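
You can verify the row counts before unloading; given the INSERT statements above, each query returns 1,000,000:

# SELECT count(*) FROM test_orders;
# SELECT count(*) FROM test_orders3;
# SELECT count(*) FROM test_orders4;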

Construct the unload configuration file

You configure a data unload operation from VMware Greenplum to a file output via a YAML-formatted configuration file. This configuration file includes parameters that identify the source tables or queries, and information about the VMware Greenplum connection and target file location.

Note

The target URL must point to a location on the same host where the Greenplum Streaming Server is running; it must not point to a remote host name or IP address. If you need to unload data from VMware Greenplum to a remote location, start a Greenplum Streaming Server instance on the remote host and unload the files from the database onto that host.

The Greenplum Streaming Server only supports version 3 of the YAML configuration file when you unload data from Greenplum into a file or command output. Refer to the file-unload-v3.yaml reference page for configuration file format and the configuration properties supported.

The following is a sample version 3 file unload YAML configuration file named unload.yaml:

version: v3
sources:
  - gpdb:
      host: localhost
      port: 6000
      user: gpadmin
      database: testdb
      work_schema: public
      tables:
        - table: test_orders
          schema: public
          filter: id > 10000
          mapping:
            - id: id
            - item: item
            - price: price
            - segment: gp_segment_id as segment
      queries:
        - select id, item, price, gp_segment_id as segment from test_orders3 where id > 10000
        - select id, item, price, gp_segment_id as segment from test_orders4 where id > 10000
targets:
  - file:
      url: /home/gpadmin/path_to_unload/unload1.csv
      content:
        csv:
          columns:
            - name: id
              header: ID
            - name: item
              header: ITEM
            - name: price
              header: PRICE
            - name: segment
              header: SEGMENT
          delimiter: ","
          quote: "'"
          newline: LF
          header: true
  - file:
      url: /home/gpadmin/path_to_unload/unload1.json
      content:
        json:
          columns:
            - name: id
              key: "ID"
            - name: item
              key: "ITEM"
            - name: price
              key: "PRICE"
            - name: segment
              key: "SEGMENT"
          is_jsonl: true
          new_line: "\n"

The configuration file above unloads the data returned by the specified queries (from the tables test_orders3 and test_orders4) and the specified columns of the table test_orders to two different target files: a CSV file named unload1.csv and a JSON file named unload1.json.

The name fields under the file section must match the table columns and the columns returned by the queries.

Note that if you include a system column in a query, you must alias it to an ordinary column name; otherwise the query errors out. For example, if you specify the following query:

queries:
  - select id, item, price, gp_segment_id from test_orders2 where id > 10000

Because gp_segment_id is a system column, this query errors out. Instead, specify the following query:

queries:
  - select id, item, price, gp_segment_id as segment from test_orders2 where id > 10000

The following sample YAML configuration file named unload_add_task.yaml tunes the performance by setting different batch sizes:

version: v3
sources:
  - gpdb:
      host: localhost
      port: 6000
      user: gpadmin
      database: testdb
      work_schema: public
      tables:
        - table: test_orders
          schema: public
          filter: id > 10000
          mapping:
            - id: id
            - item: item
            - price: price
            - segment: gp_segment_id as segment
      task:
          batch_size:
              max_count: 100000
              interval_ms: 1000
targets:
  - file:
      url: /home/gpadmin/path_to_unload/unload1.csv
      content:
        csv:
          columns:
            - name: id
              header: ID
            - name: item
              header: ITEM
            - name: price
              header: PRICE
            - name: segment
              header: SEGMENT
          delimiter: ","
          quote: "'"
          newline: LF
          header: true
  - file:
      url: /home/gpadmin/path_to_unload/unload1.json
      content:
        json:
          columns:
            - name: id
              key: "ID"
            - name: item
              key: "ITEM"
            - name: price
              key: "PRICE"
            - name: segment
              key: "SEGMENT"
          is_jsonl: true
          new_line: "\n"

The options max_count and interval_ms under the batch_size section tune the size of the batches that GPSS unloads to the files. Each time the Greenplum Streaming Server fetches data from Greenplum, it stores the data in GPSS memory, then flushes it to the files according to the batch size set in the configuration file. max_count specifies the number of rows to fetch before flushing to disk, and interval_ms specifies the maximum amount of time (in milliseconds) since the last flush to the file. The larger max_count and interval_ms are, the more data GPSS flushes to disk at a time, which increases unload performance.

The option schedule controls the frequency and interval of restarting the unload jobs.

The option alert manages notifications when a job is stopped. You may use this option to count the number of lines unloaded or record any other statistics about the job.
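
For example, a job can retry after a failure and run a command whenever it stops. The property names in the sketch below (retry_interval and max_retries under schedule; command, workdir, and timeout under alert) follow the conventions of the GPSS load configuration files, and the notification script is hypothetical; confirm the exact properties supported for unload jobs in the file-unload-v3.yaml reference:

schedule:
  retry_interval: 5s
  max_retries: 3
alert:
  command: ./notify_job_stopped.sh
  workdir: /home/gpadmin
  timeout: 30s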

The fields encoding, prepare_statement, and teardown_statement are not supported.

See file-unload-v3.yaml for more details.

Use the gpsscli Client Command to Unload the Data

Run the gpsscli load client command to unload the data from VMware Greenplum into a file.

$ gpsscli load unload.yaml
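
gpsscli load submits and starts the job in a single step. Alternatively, you can manage the job in separate steps and monitor it while it runs; the job name below is illustrative, and the exact options are described in the gpsscli reference:

$ gpsscli submit --name unload_job unload.yaml
$ gpsscli start unload_job
$ gpsscli list --all
$ gpsscli stop unload_job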

GPSS appends the data to the files defined by the url field. If a file already contains data, the unload job keeps the existing data and appends the new data to the file. If the file identified by the target URL does not exist, GPSS creates it.

Other Considerations

If the table is too large to unload into a single file, you can create multiple jobs, each specifying a different value for the filter field, in order to split a big file into smaller pieces. For example, the first job specifies the filter id < 10000, the second specifies id >= 10000 and id < 20000, and so on. You can submit the different jobs and run them in parallel.
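
For example, the job configuration files can be identical except for the filter property; the boundaries shown here are illustrative:

# Job 1
filter: id < 10000
# Job 2
filter: id >= 10000 and id < 20000
# Job 3
filter: id >= 20000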

Incremental data unloading is not supported, only full data unloading. However, using queries, you can split a large table into partitions, for example using a date column as the partition key. In this way, GPSS can unload the most recent data into separate files.
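
For example, assuming a hypothetical orders table with an order_date column, each job can unload one month of data:

queries:
  - select * from orders where order_date >= '2024-01-01' and order_date < '2024-02-01'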

Since Greenplum Streaming Server uses the gpfdist protocol to unload the data, the data is unloaded directly from the segments to the files in batches, hence it is not possible to guarantee the order of the data among different segments.
