Loading from S3 into Greenplum (Beta)

You can use the gpsscli utility to load data from S3 into Greenplum Database. The GPSS s3 data source uses the Greenplum Database s3 Protocol to read data into an s3 external table and write it to a Greenplum table. Greenplum Database segment instances read the data from S3 in parallel, and similarly write the data to Greenplum in parallel. GPSS is not involved in the data transfer.

Note
The GPSS s3 data source does not read directly from S3.

The GPSS s3 data source can load text and CSV files residing on S3. The data source also supports loading gzip-compressed versions of these files.

Load Procedure

You will perform the following tasks when you use the Greenplum Streaming Server to load S3 data into a Greenplum Database table:

Ensure that you meet the Prerequisites.
Register the Greenplum Streaming Server extension.
Identify the format of the data.
Construct the load configuration file.
Create the target Greenplum Database table.
Assign Greenplum Database role permissions to the table, if required.
Run the gpsscli Client Commands to load the data into Greenplum Database.
Check for load errors.

Prerequisites

Before using the gpsscli utilities to load S3 data to Greenplum Database, ensure that:

Your systems meet the Prerequisites documented for the Greenplum Streaming Server.
You have configured the s3 protocol as described in the Greenplum Database s3 Protocol documentation.
You can identify the URI of the S3 file that you want to load.
You can identify an S3 access ID and secret key that have the permissions required to access the file.

About Supported File Formats

To write data from S3 into a Greenplum Database table, you must identify the format of the file in the load configuration file.

The Greenplum Streaming Server s3 data source supports loading files of the following formats:

Format	Description
csv	Comma-delimited text format data. Specify the `csv` format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF).
text	Plain text format. Specify the `csv` format and an empty `delimiter` when you want to read a plain text file from S3.
gzipped csv and text files	Gzip-compressed file. Specify the `csv` format and the `delimiter` (the gzipped file is a csv file) or an empty `delimiter` (the gzipped file is a plain text file) when you want to read a `.gz` file from S3.

Constructing the s3source.yaml Configuration File

You configure a data load operation from a file to Greenplum Database via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file and information about the Greenplum Database connection and target table, as well as error thresholds for the operation.

The Greenplum Streaming Server supports version 3 (Beta) of the YAML configuration file when you load data into Greenplum from S3. Refer to the s3source-v3.yaml reference page for the configuration file format and the configuration properties supported.

A sample version 3 s3 load YAML configuration file named loadfroms3.yaml follows:

version: v3
targets:
  - gpdb:
      host: localhost
      port: 6000
      user: bill
      password: changeme
      database: testdb
      work_schema: public
      error_limit: "25"
      filter_expression: "test_filter"
      tables:
        - table: s3_target
          schema: public
          mode:
            insert: {}
sources:
  - s3:
      uri:
        - "s3://s3-us-east-1.amazonaws.com/mydir/mybucket/data0000"
      content:
        csv: 
          columns:
            - name: c1
              type: text
            - name: c2
              type: int
          delimiter: ""
      s3param:
        version: 1
        accessid: 123
        secret: 456
        chunksize: 4096
        threadnum: 4
        gpcheckcloud_newline: "\n"
        autocompress: false
        encryption: false
        verifycert: false
        low_speed_limit: 1

Creating the Greenplum Table

You must pre-create the Greenplum table before you load S3 data into Greenplum Database. You use the table:schema and tables:table load configuration file properties to identify the schema and table names.

The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.

The name and data type that you specify for a column of the target Greenplum Database table must match the name and data type of the related, S3 file data element. If you have defined a column mapping, the name of the Greenplum Database column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.

A CREATE TABLE command for the target Greenplum Database table receiving the file data defined in the loadfroms3.yaml file presented in the Constructing the s3source.yaml Configuration File section follows:

 testdb=# CREATE TABLE s3_target(c1 text, c2 int);