Runs a load job as defined in a YAML formatted control file.
gpload -f <control_file> [-l <log_file>] [-h <hostname>] [-p <port>] [-U <username>] [-d <database>] [-W] [--gpfdist_timeout <seconds>] [--no_auto_trans] [--max_retries <retry_times>] [[-v | -V] [-q]] [-D] gpload -? gpload --version
The client machine where
gpload is run must have the following:
$PATH. This program is located in
$GPHOME/binof your Greenplum Database server installation.
gpload is a data loading utility that acts as an interface to the Greenplum Database external table parallel loading feature. Using a load specification defined in a YAML formatted control file,
gpload runs a load by invoking the Greenplum Database parallel file server (gpfdist), creating an external table definition based on the source data defined, and running an
MERGE operation to load the source data into the target table in the database.
gpfdist is compatible only with the Greenplum Database major version in which it is shipped. For example, a
gpfdist utility that is installed with Greenplum Database 4.x cannot be used with Greenplum Database 5.x or 6.x.
Note: The Greenplum Database 5.22 and later
gpload for Linux is compatible with Greenplum Database 6.x. The Greenplum Database 6.x
gpload for both Linux and Windows is compatible with Greenplum 5.x.
UPDATE operations are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
The operation, including any SQL commands specified in the
SQL collection of the YAML control file (see Control File Format), are performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.
gpfdistparallel file distribution program to send a response. Enter a value from
30seconds (entering "
0" to disables timeouts). Note that you might need to increase this value when operating on high-traffic networks.
~/gpAdminLogs/gpload_YYYYMMDD. For more information about the log file, see Log File Format.
--no_auto_trans to disable processing the load operation as a single transaction if you are performing a single load operation on the target table.
gploadprocesses each load operation as a single transaction to prevent inconsistent data when performing multiple, simultaneous operations on a target table.
$PGDATABASEor defaults to the current system user name.
$PGHOSTor defaults to
$PGPORTor defaults to 5432.
gploadattempts to connect to Greenplum Database after a connection timeout. The default value is
0, do not attempt to connect after a connection timeout. A negative integer, such as
-1, specifies an unlimited number of attempts.
$PGUSERor defaults to the current system user name.
$PGPASSWORDor from a password file specified by
~/.pgpass. If these are not set, then
gploadwill prompt for a password even if
-Wis not supplied.
gpload control file uses the YAML 1.1 document format and then implements its own schema for defining the various steps of a Greenplum Database load operation. The control file must be a valid YAML document.
gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used simply for formatting purposes, and tabs should not be used at all.
The basic structure of a load control file is:
--- [VERSION](#cfversion): 18.104.22.168 [DATABASE](#cfdatabase): <db_name> [USER](#cfuser): <db_username> [HOST](#cfhost): <master_hostname> [PORT](#cfport): <master_port> [GPLOAD](#cfgpload): [INPUT](#cfinput): - [SOURCE](#cfsource): [LOCAL\_HOSTNAME](#cfsourcelocalname): - <hostname_or_ip> [PORT](#cfsourceport): <http_port> | [PORT\_RANGE](#cfversion): [<start_port_range>, <end_port_range>] [FILE](#cfsourcefile): - </path/to/input_file> [SSL](#cfsourcessl): true | false [CERTIFICATES\_PATH](#cfsourcecertificatespath): </path/to/certificates> - [FULLY\_QUALIFIED\_DOMAIN\_NAME](#fqdn): true | false - [COLUMNS](#cfcolumns): - <field_name>: <data_type> - [TRANSFORM](#cftransform): '<transformation>' - [TRANSFORM\_CONFIG](#cftransformconfig): '<configuration-file-path>' - [MAX\_LINE\_LENGTH](#cfmaxlinelength): <integer> - [FORMAT](#cfformat): text | csv - [DELIMITER](#cfdelimiter): '<delimiter_character>' - [ESCAPE](#cfescape): '<escape_character>' | 'OFF' - [NEWLINE](#newline): 'LF' | 'CR' | 'CRLF' - [NULL\_AS](#cfnullas): '<null_string>' - [FILL\_MISSING\_FIELDS](#cfillfields): true | false - [FORCE\_NOT\_NULL](#cfforcenotnull): true | false - [QUOTE](#cfquote): '<csv_quote_character>' - [HEADER](#cfheader): true | false - [ENCODING](#cfencoding): <database_encoding> - [ERROR\_LIMIT](#cferrorlimit): <integer> - [LOG\_ERRORS](#cferrorlog): true | false [EXTERNAL](#cfexternal): - [SCHEMA](#cfschema): <schema> | '%' [OUTPUT](#cfoutput): - [TABLE](#cftable): <schema.table_name> - [MODE](#cfmode): insert | update | merge - [MATCH\_COLUMNS](#cfmatchcolumns): - <target_column_name> - [UPDATE\_COLUMNS](#cfupdatecolumns): - <target_column_name> - [UPDATE\_CONDITION](#cfupdatecondition): '<boolean_condition>' - [MAPPING](#cfmapping): <target_column_name>: <source_column_name> | '<expression>' [PRELOAD](#cfpreload): - [TRUNCATE](#cftruncate): true | false - [REUSE\_TABLES](#cfreusetables): true | false - [STAGING\_TABLE](#cfstagetbl): <external_table_name> - [FAST\_MATCH](#cffastmatch): true | false [SQL](#cfsql): - [BEFORE](#cfbefore): "<sql_command>" - [AFTER](#cfafter): "<sql_command>"
gploadcontrol file schema. The current version is 22.214.171.124.
$PGDATABASEif set or the current system user name. You can also specify the database on the command line using the
Optional. Specifies which database role to use to connect. If not specified, defaults to the current user or
$PGUSER if set. You can also specify the database role on the command line using the
gploadis not a Greenplum Database superuser, then the appropriate rights must be granted to the user for the load to be processed. See the Greenplum Database Reference Guide for more information.
$PGHOSTif set. You can also specify the master host name on the command line using the
$PGPORTif set. You can also specify the master port on the command line using the
Required. Begins the load specification section. A
GPLOAD specification must have an
INPUT and an
OUTPUT section defined.
gploadwill start one or more instances of the gpfdist file distribution program on the current host and create the required external table definition(s) in Greenplum Database that point to the source data. Note that the host from which you run
gploadmust be accessible over the network by all Greenplum Database hosts (master and segments).
SOURCEblock of an
INPUTspecification defines the location of a source file. An
INPUTsection can have more than one
SOURCEblock defined. Each
SOURCEblock defined corresponds to one instance of the gpfdist file distribution program that will be started on the local machine. Each
SOURCEblock defined must have a
For more information about using the
gpfdist parallel file server and single and multiple
gpfdist instances, see Loading and Unloading Data.
gploadis running. If this machine is configured with multiple network interface cards (NICs), you can specify the host name or IP of each individual NIC to allow network traffic to use all NICs simultaneously. The default is to use the local machine's primary host name or IP only.
PORT_RANGEto select an available port from the specified range. If both
PORT_RANGEare defined, then
PORTtakes precedence. If neither
PORT_RANGEare defined, the default is to select an available port between 8000 and 9000.
If multiple host names are declared in
LOCAL_HOSTNAME, this port number is used for all hosts. This configuration is desired if you want to use all NICs to load the same file or set of files in a given directory location.
PORTto supply a range of port numbers from which
gploadcan choose an available port for this instance of the gpfdist file distribution program.
If the files are compressed using
bzip2 (have a
.bz2 file extension), the files will be uncompressed automatically (provided that
bunzip2 is in your path).
When specifying which source files to load, you can use the wildcard character (
*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the current directory from which
gpload is run (or you can declare an absolute path).
SSLis set to true,
gpfdistserver with the
--ssloption and uses the
true; cannot be specified when SSL is
falseor unspecified. The location specified in
CERTIFICATES_PATHmust contain the following files:
The root directory (
/) cannot be specified as
gploadresolve hostnames to the fully qualified domain name (FQDN) or the local hostname. If the value is set to
true, names are resolved to the FQDN. If the value is set to
false, resolution is to the local hostname. The default is
A fully qualified domain name might be required in some situations. For example, if the Greenplum Database system is in a different domain than an ETL application that is being accessed by
DELIMITERcharacter in the source file is what separates two data value fields (columns). A row is determined by a line feed character (
If the input
COLUMNS are not specified, then the schema of the output
TABLE is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table.
The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target
TABLE. This default mapping can be overridden using the
gpload. For information about XML transformations, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
TRANSFORMis specified. Specifies the location of the transformation configuration file that is specified in the
TEXT) or comma separated values (
CSV) format. Defaults to
TEXTif not specified. For more information about the format of the source data, see Loading and Unloading Data.
"\u001B". The escape string syntax,
E'<character-code>', is also supported for non-printable characters. The ASCII or unicode character must be enclosed in single quotes. For example:
\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a
"(double quote) for csv-formatted files, however it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value
'OFF'as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
If not specified, Greenplum Database detects the newline type by examining the first row of data that it receives, and uses the first newline type that it encounters.
TEXTmode, and an empty value with no quotations in
CSVmode. You might prefer an empty string even in
TEXTmode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
false. When reading a row of data that has missing trailing field values (the row of data has missing data fields at the end of a line or row), Greenplum Database returns an error.
If the value is
true, when reading a row of data that has missing trailing field values, the values are set to
NULL. Blank rows, fields with a
NOT NULL constraint, and trailing delimiters on a line will still report an error.
FILL MISSING FIELDS clause of the CREATE EXTERNAL TABLE command.
CSV. Specifies the quotation character for
CSVmode. The default is double-quote (
'SQL_ASCII'), an integer encoding number, or
'DEFAULT'to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.
Note: If you change the
ENCODING value in an existing
gpload control file, you must manually drop any external tables that were creating using the previous
gpload does not drop and recreate external tables to use the new
REUSE_TABLES is set to
ERROR_LIMITis declared. Value is either
false. The default value is
false. If the value is
true, rows with formatting errors are logged internally when running in single row error isolation mode. You can examine formatting errors with the Greenplum Database built-in SQL function
gp_read_error_log('<table_name>'). If formatting errors are detected when loading data,
gploadgenerates a warning message with the name of the table that contains the error information similar to this message.
<timestamp>|WARN|1 bad row, please use GPDB built-in function gp_read_error_log('table-name') to access the detailed error row
LOG_ERRORS: true is specified,
REUSE_TABLES: true must be specified to retain the formatting errors in Greenplum Database error logs. If
REUSE_TABLES: true is not specified, the error information is deleted after the
gpload operation. Only summary information about formatting errors is returned. You can delete the formatting errors from the error logs with the Greenplum Database function
gpfdist reads data and encounters a data formatting error, the error message includes a row number indicating the location of the formatting error.
gpfdist attempts to capture the row that contains the error. However,
gpfdist might not capture the exact row for some formatting errors.
For more information about handling load errors, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide. For information about the
gp_read_error_log() function, see the CREATE EXTERNAL TABLE command.
The default is to use the Greenplum Database
EXTERNAL is declared. The name of the schema of the external table. If the schema does not exist, an error is returned.
% (percent character) is specified, the schema of the table name specified by
TABLE in the
OUTPUT section is used. If the table name does not specify a schema, the default schema is used.
INSERTif not specified. There are three available load modes:
INSERT - Loads data into the target table using the following method:
INSERT INTO <target_table> SELECT * FROM <input_data>;
UPDATE - Updates the
UPDATE_COLUMNS of the target table where the rows have
MATCH_COLUMNS attribute values equal to those of the input data, and the optional
UPDATE_CONDITION is true.
UPDATE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
MERGE - Inserts new rows and updates the
UPDATE_COLUMNS of existing rows where
FOOBAR attribute values are equal to those of the input data, and the optional
MATCH_COLUMNS is true. New rows are identified when the
MATCH_COLUMNS value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only the
UPDATE columns. If there are multiple new
MATCH_COLUMNS values that are the same, only one new row for that value will be inserted. Use
UPDATE_CONDITION to filter out the rows to discard.
MERGE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
MERGE. Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
MERGE. Specifies the column(s) to update for the rows that meet the
MATCH_COLUMNScriteria and the optional
WHEREclause) that must be met in order for a row in the target table to be updated.
COLUMNSsection and the column names of the target
TABLE. A mapping is specified as either:
Where <expression> is any expression that you would specify in the
SELECT list of a query, such as a constant value, a column reference, an operator invocation, a function call, and so on.
Optional. Specifies operations to run prior to the load operation. Right now the only preload operation is
gploadwill remove all rows in the target table prior to loading it. Default is false.
gploadwill not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves performance of trickle loads (ongoing small loads to the same target table).
LOG_ERRORS: true is specified,
REUSE_TABLES: true must be specified to retain the formatting errors in Greenplum Database error logs. If
REUSE_TABLES: true is not specified, formatting error information is deleted after the
If the <external_table_name> exists, the utility uses the existing table. The utility returns an error if the table schema does not match the
OUTPUT table schema.
gploadoperation. The external table is used by
REUSE_TABLES: truemust also specified. If
REUSE_TABLESis false or not specified,
STAGING_TABLEis ignored. By default,
gploadcreates a temporary external table with a randomly generated name.
If external_table_name contains a period (.),
gpload returns an error. If the table exists, the utility uses the table. The utility returns an error if the existing table schema does not match the
OUTPUT table schema.
The utility uses the value of SCHEMA in the
EXTERNAL section as the schema for <external_table_name>. If the
SCHEMA value is
%, the schema for <external_table_name> is the same as the schema of the target table, the schema of TABLE in the
SCHEMA is not set, the utility searches for the table (using the schemas in the database
search_path). If the table is not found, external_table_name is created in the default
gpload creates the staging table using the distribution key(s) of the target table as the distribution key(s) for the staging table. If the target table was created
MATCH_COLUMNS as the staging table's distribution key(s).
gploadonly searches the database for matching external table objects when reusing external tables. The utility does not check the external table column names and column types in the catalog table
pg_attributeto ensure that the table can be used for a
gploadoperation. Set the value to true to improve
gploadperformance when reusing external table objects and the database catalog table
pg_attributecontains a large number of rows. The utility returns an error and quits if the column definitions are not compatible.
The default value is false, the utility checks the external table definition column names and column types.
REUSE_TABLES: true must also specified. If
REUSE_TABLES is false or not specified and
FAST_MATCH: true is specified,
gpload returns a warning message.
Optional. Defines SQL commands to run before and/or after the load operation. You can specify multiple
AFTER commands. List commands in the order of desired execution.
Log files output by
gpload have the following format:
Where <timestamp> takes the form:
YYYY-MM-DD HH:MM:SS, level is one of
ERROR, and message is a normal text message.
INFO messages that may be of interest in the log files are (where # corresponds to the actual number of seconds, units of data, or failed rows):
INFO|running time: <#.##> seconds INFO|transferred <#.#> kB of <#.#> kB. INFO|gpload succeeded INFO|gpload succeeded with warnings INFO|gpload failed INFO|1 bad row INFO|<#> bad rows
If your database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the
gpload control file. For example, if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
gpload control file would refer to the above table and column names as follows:
- COLUMNS: - '"MyColumn"': text OUTPUT: - TABLE: public.'"MyTable"'
If the YAML control file contains the
ERROR_TABLE element that was available in Greenplum Database 4.3.x,
gpload logs a warning stating that
ERROR_TABLE is not supported, and load errors are handled as if the
REUSE_TABLE elements were set to
true. Rows with formatting errors are logged internally when running in single row error isolation mode.
Run a load job as defined in
gpload -f my_load.yml
Example load control file:
--- VERSION: 126.96.36.199 DATABASE: ops USER: gpadmin HOST: mdw-1 PORT: 5432 GPLOAD: INPUT: - SOURCE: LOCAL_HOSTNAME: - etl1-1 - etl1-2 - etl1-3 - etl1-4 PORT: 8081 FILE: - /var/load/data/* - COLUMNS: - name: text - amount: float4 - category: text - descr: text - date: date - FORMAT: text - DELIMITER: '|' - ERROR_LIMIT: 25 - LOG_ERRORS: true OUTPUT: - TABLE: payables.expenses - MODE: INSERT PRELOAD: - REUSE_TABLES: true SQL: - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)" - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"