You can use the gpsscli utility to load data from S3 into Greenplum Database. The GPSS s3
data source uses the Greenplum Database s3 Protocol to read data into an s3
external table and write it to a Greenplum table. Greenplum Database segment instances read the data from S3 in parallel, and similarly write the data to Greenplum in parallel. GPSS is not involved in the data transfer.
NoteThe GPSS s3 data source does not read directly from S3.
The GPSS s3 data source can load text and CSV files residing on S3. The data source also supports loading gzip-compressed versions of these files.
You will perform the following tasks when you use the Greenplum Streaming Server to load S3 data into a Greenplum Database table:
Before using the gpsscli
utilities to load S3 data to Greenplum Database, ensure that:
s3
protocol as described in the Greenplum Database s3 Protocol documentation.To write data from S3 into a Greenplum Database table, you must identify the format of the file in the load configuration file.
The Greenplum Streaming Server s3 data source supports loading files of the following formats:
Format | Description |
---|---|
csv | Comma-delimited text format data. Specify the csv format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF). |
text | Plain text format. Specify the csv format and an empty delimiter when you want to read a plain text file from S3. |
gzipped csv and text files | Gzip-compressed file. Specify the csv format and the delimiter (the gzipped file is a csv file) or an empty delimiter (the gzipped file is a plain text file) when you want to read a .gz file from S3. |
You configure a data load operation from a file to Greenplum Database via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file and information about the Greenplum Database connection and target table, as well as error thresholds for the operation.
The Greenplum Streaming Server supports version 3 (Beta) of the YAML configuration file when you load data into Greenplum from S3. Refer to the s3source-v3.yaml reference page for the configuration file format and the configuration properties supported.
A sample version 3 s3 load YAML configuration file named loadfroms3.yaml
follows:
version: v3
targets:
- gpdb:
host: localhost
port: 6000
user: bill
password: changeme
database: testdb
work_schema: public
error_limit: "25"
filter_expression: "test_filter"
tables:
- table: s3_target
schema: public
mode:
insert: {}
sources:
- s3:
uri:
- "s3://s3-us-east-1.amazonaws.com/mydir/mybucket/data0000"
content:
csv:
columns:
- name: c1
type: text
- name: c2
type: int
delimiter: ""
s3param:
version: 1
accessid: 123
secret: 456
chunksize: 4096
threadnum: 4
gpcheckcloud_newline: "\n"
autocompress: false
encryption: false
verifycert: false
low_speed_limit: 1
You must pre-create the Greenplum table before you load S3 data into Greenplum Database. You use the table:schema
and tables:table
load configuration file properties to identify the schema and table names.
The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.
The name and data type that you specify for a column of the target Greenplum Database table must match the name and data type of the related, S3 file data element. If you have defined a column mapping, the name of the Greenplum Database column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.
A CREATE TABLE
command for the target Greenplum Database table receiving the file data defined in the loadfroms3.yaml
file presented in the Constructing the s3source.yaml Configuration File section follows:
testdb=# CREATE TABLE s3_target(c1 text, c2 int);