You can use the gpsscli utility to load data from S3 into VMware Tanzu Greenplum. The GPSS s3
data source uses the Tanzu Greenplum s3 Protocol to read data into an s3
external table and write it to a Greenplum table. Tanzu Greenplum segment instances read the data from S3 in parallel, and similarly write the data to Greenplum in parallel. GPSS is not involved in the data transfer.
NoteThe GPSS s3 data source does not read directly from S3.
The GPSS s3 data source can load text and CSV files residing on S3. The data source also supports loading gzip-compressed versions of these files.
You will perform the following tasks when you use the VMware Tanzu Greenplum streaming server to load S3 data into a Tanzu Greenplum table:
Before using the gpsscli
utilities to load S3 data to Tanzu Greenplum, ensure that:
s3
protocol as described in the VMware Tanzu Greenplum s3 Protocol documentation.To write data from S3 into a Tanzu Greenplum table, you must identify the format of the file in the load configuration file.
The Tanzu Greenplum streaming server s3 data source supports loading files of the following formats:
Format | Description |
---|---|
csv | Comma-delimited text format data. Specify the csv format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF). |
text | Plain text format. Specify the csv format and an empty delimiter when you want to read a plain text file from S3. |
gzipped csv and text files | Gzip-compressed file. Specify the csv format and the delimiter (the gzipped file is a csv file) or an empty delimiter (the gzipped file is a plain text file) when you want to read a .gz file from S3. |
You configure a data load operation from a file to Tanzu Greenplum via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file and information about the Tanzu Greenplum connection and target table, as well as error thresholds for the operation.
The Tanzu Greenplum streaming server supports version 3 (Beta) of the YAML configuration file when you load data into Greenplum from S3. Refer to the s3source-v3.yaml reference page for the configuration file format and the configuration properties supported.
A sample version 3 s3 load YAML configuration file named loadfroms3.yaml
follows:
version: v3
targets:
- gpdb:
host: localhost
port: 6000
user: bill
password: changeme
database: testdb
work_schema: public
error_limit: "25"
filter_expression: "test_filter"
tables:
- table: s3_target
schema: public
mode:
insert: {}
sources:
- s3:
uri:
- "s3://s3-us-east-1.amazonaws.com/mydir/mybucket/data0000"
content:
csv:
columns:
- name: c1
type: text
- name: c2
type: int
delimiter: ""
s3param:
version: 1
accessid: 123
secret: 456
chunksize: 4096
threadnum: 4
gpcheckcloud_newline: "\n"
autocompress: false
encryption: false
verifycert: false
low_speed_limit: 1
You must pre-create the Greenplum table before you load S3 data into Tanzu Greenplum. You use the table:schema
and tables:table
load configuration file properties to identify the schema and table names.
The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.
The name and data type that you specify for a column of the target Tanzu Greenplum table must match the name and data type of the related, S3 file data element. If you have defined a column mapping, the name of the Tanzu Greenplum column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.
A CREATE TABLE
command for the target Tanzu Greenplum table receiving the file data defined in the loadfroms3.yaml
file presented in the Constructing the s3source.yaml Configuration File section follows:
testdb=# CREATE TABLE s3_target(c1 text, c2 int);