The PXF S3 connector supports reading certain CSV-format and Parquet-format data from S3 using the Amazon S3 Select service. S3 Select provides direct query-in-place features on data stored in Amazon S3.
When you enable it, PXF uses S3 Select to filter the contents of S3 objects to retrieve the subset of data that you request. This typically reduces both the amount of data transferred to Greenplum Database and the query time.
You can use the PXF S3 Connector with S3 Select to read:
gzip
-compressed or bzip2
-compressed CSV
filesParquet
files with gzip
-compressed or snappy
-compressed columnsThe data must be UTF-8
-encoded, and may be server-side encrypted.
PXF supports column projection as well as predicate pushdown for AND
, OR
, and NOT
operators when using S3 Select.
The S3_SELECT
external table custom option governs PXF’s use of S3 Select when accessing the S3 object store. You can provide the following values when you set the S3_SELECT
option:
S3-SELECT Value | Description |
---|---|
OFF | PXF does not use S3 Select; the default. |
ON | PXF always uses S3 Select. |
AUTO | PXF uses S3 Select when it will benefit access or performance. |
By default, PXF does not use S3 Select (S3_SELECT=OFF
). You can enable PXF to always use S3 Select, or to use S3 Select only when PXF determines that it could be beneficial for performance. For example, when S3_SELECT=AUTO
, PXF automatically uses S3 Select when a query on the external table utilizes column projection or predicate pushdown, or when the referenced CSV file has a header row.
Note: The IGNORE_MISSING_PATH
custom option is not available when you use a PXF external table to read CSV text and Parquet data from S3 using S3 Select.
PXF supports reading Parquet data from S3 as described in Reading and Writing Parquet Data in an Object Store. If you want PXF to use S3 Select when reading the Parquet data, you add the S3_SELECT
custom option and value to the CREATE EXTERNAL TABLE
LOCATION
URI.
If columns in the Parquet file are gzip
-compressed or snappy
-compressed, use the COMPRESSION_CODEC
custom option in the LOCATION
URI to identify the compression codec alias. For example:
&COMPRESSION_CODEC=gzip
Or,
&COMPRESSION_CODEC=snappy
Use the following syntax to create a Greenplum Database external table that references a Parquet file on S3 that you want PXF to access with the S3 Select service:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=s3:parquet&SERVER=<server_name>&S3_SELECT=ON|AUTO[&<other-custom-option>=<value>[...]]')
FORMAT 'CSV';
FORMAT 'CSV'
when you enable PXF to use S3 Select on an external table that accesses a Parquet file on S3.
For example, use the following command to have PXF use S3 Select to access a Parquet file on S3 when optimal:
CREATE EXTERNAL TABLE parquet_on_s3 ( LIKE table1 )
LOCATION ('pxf://bucket/file.parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=AUTO')
FORMAT 'CSV';
PXF supports reading CSV data from S3 as described in Reading and Writing Text Data in an Object Store. If you want PXF to use S3 Select when reading the CSV data, you add the S3_SELECT
custom option and value to the CREATE EXTERNAL TABLE
LOCATION
URI. You may also specify the delimiter formatter option and the file header and compression custom options.
CSV files may include a header line. When you enable PXF to use S3 Select to access a CSV-format file, you use the FILE_HEADER
custom option in the LOCATION
URI to identify whether or not the CSV file has a header row and, if so, how you want PXF to handle the header. PXF never returns the header row.
Note: You must specify S3_SELECT=ON
or S3_SELECT=AUTO
when the CSV file has a header row. Do not specify S3_SELECT=OFF
in this case.
The FILE_HEADER
option takes the following values:
FILE_HEADER Value | Description |
---|---|
NONE | The file has no header row; the default. |
IGNORE | The file has a header row; ignore the header. Use when the order of the columns in the external table and the CSV file are the same. (When the column order is the same, the column names and the CSV header names may be different.) |
USE | The file has a header row; read the header. Use when the external table column names and the CSV header names are the same, but are in a different order. |
If both the order and the names of the external table columns and the CSV header are the same, you can specify either FILE_HEADER=IGNORE
or FILE_HEADER=USE
.
Some headers in the query are missing from the file
.
For example, if the order of the columns in the CSV file header and the external table are the same, add the following to the CREATE EXTERNAL TABLE
LOCATION
URI to have PXF ignore the CSV header:
&FILE_HEADER=IGNORE
If the CSV file is gzip
- or bzip2
-compressed, use the COMPRESSION_CODEC
custom option in the LOCATION
URI to identify the compression codec alias. For example:
&COMPRESSION_CODEC=gzip
Or,
&COMPRESSION_CODEC=bzip2
Use the following syntax to create a Greenplum Database external table that references a CSV file on S3 that you want PXF to access with the S3 Select service:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>
?PROFILE=s3:text&SERVER=<server_name>&S3_SELECT=ON|AUTO[&FILE_HEADER=IGNORE|USE][&COMPRESSION_CODEC=gzip|bzip2][&<other-custom-option>=<value>[...]]')
FORMAT 'CSV' [(delimiter '<delim_char>')];
Note: Do not use the (HEADER)
formatter option in the CREATE EXTERNAL TABLE
command.
Note: PXF does not support the SKIP_HEADER_COUNT
custom option when you read a CSV file on S3 using the S3 Select service.
For example, use the following command to have PXF always use S3 Select to access a gzip
-compressed file on S3, where the field delimiter is a pipe (‘|’) character and the external table and CSV header columns are in the same order.
CREATE EXTERNAL TABLE gzippedcsv_on_s3 ( LIKE table2 )
LOCATION ('pxf://bucket/file.csv.gz?PROFILE=s3:text&SERVER=s3srvcfg&S3_SELECT=ON&FILE_HEADER=USE')
FORMAT 'CSV' (delimiter '|');