Description of the different ways to add data to Greenplum Database.
Parent topic: Greenplum Database Best Practices
A singleton INSERT
statement with values adds a single row to a table. The row flows through the coordinator and is distributed to a segment. This is the slowest method and is not suitable for loading large amounts of data.
The PostgreSQL COPY
statement copies data from an external file into a database table. It can insert multiple rows more efficiently than an INSERT
statement, but the rows are still passed through the coordinator. All of the data is copied in one command; it is not a parallel process.
Data input to the COPY
command is from a file or the standard input. For example:
COPY table FROM '/data/mydata.csv' WITH CSV HEADER;
Use COPY
to add relatively small sets of data, for example dimension tables with up to ten thousand rows, or one-time data loads.
Use COPY
when scripting a process that loads small amounts of data, less than 10 thousand rows.
Since COPY is a single command, there is no need to deactivate autocommit when you use this method to populate a table.
You can run multiple concurrent COPY
commands to improve performance.
External tables provide access to data in sources outside of Greenplum Database. They can be accessed with SELECT
statements and are commonly used with the Extract, Load, Transform (ELT) pattern, a variant of the Extract, Transform, Load (ETL) pattern that takes advantage of Greenplum Database's fast parallel data loading capability.
With ETL, data is extracted from its source, transformed outside of the database using external transformation tools, such as Datastage, and then loaded into the database.
With ELT, Greenplum external tables provide access to data in external sources, which could be read-only files (for example, text, CSV, or XML files), Web servers, Hadoop file systems, executable OS programs, or the Greenplum gpfdist
file server, described in the next section. External tables support SQL operations such as select, sort, and join so the data can be loaded and transformed simultaneously, or loaded into a load table and transformed in the database into target tables.
The external table is defined with a CREATE EXTERNAL TABLE
statement, which has a LOCATION
clause to define the location of the data and a FORMAT
clause to define the formatting of the source data so that the system can parse the input data. Files use the file://
protocol, and must reside on a segment host in a location accessible by the Greenplum superuser. The data can be spread out among the segment hosts with no more than one file per primary segment on each host. The number of files listed in the LOCATION
clause is the number of segments that will read the external table in parallel.
The fastest way to load large fact tables is to use external tables with gpfdist
. gpfdist
is a file server program using an HTTP protocol that serves external data files to Greenplum Database segments in parallel. A gpfdist
instance can serve 200 MB/second and many gpfdist
processes can run simultaneously, each serving up a portion of the data to be loaded. When you begin the load using a statement such as INSERT INTO <table> SELECT * FROM <external_table>
, the INSERT
statement is parsed by the coordinator and distributed to the primary segments. The segments connect to the gpfdist
servers and retrieve the data in parallel, parse and validate the data, calculate a hash from the distribution key data and, based on the hash key, send the row to its destination segment. By default, each gpfdist
instance will accept up to 64 connections from segments. With many segments and gpfdist
servers participating in the load, data can be loaded at very high rates.
Primary segments access external files in parallel when using gpfdist
up to the value of gp_external_max_segs
. When optimizing gpfdist
performance, maximize the parallelism as the number of segments increase. Spread the data evenly across as many ETL nodes as possible. Split very large data files into equal parts and spread the data across as many file systems as possible.
Run two gpfdist
instances per file system. gpfdist
tends to be CPU bound on the segment nodes when loading. But if, for example, there are eight racks of segment nodes, there is lot of available CPU on the segments to drive more gpfdist
processes. Run gpfdist
on as many interfaces as possible. Be aware of bonded NICs and be sure to start enough gpfdist
instances to work them.
It is important to keep the work even across all these resources. The load is as fast as the slowest node. Skew in the load file layout will cause the overall load to bottleneck on that resource.
The gp_external_max_segs
configuration parameter controls the number of segments each gpfdist
process serves. The default is 64. You can set a different value in the postgresql.conf
configuration file on the coordinator. Always keep gp_external_max_segs
and the number of gpfdist
processes an even factor; that is, the gp_external_max_segs
value should be a multiple of the number of gpfdist
processes. For example, if there are 12 segments and 4 gpfdist
processes, the planner round robins the segment connections as follows:
Segment 1 - gpfdist 1
Segment 2 - gpfdist 2
Segment 3 - gpfdist 3
Segment 4 - gpfdist 4
Segment 5 - gpfdist 1
Segment 6 - gpfdist 2
Segment 7 - gpfdist 3
Segment 8 - gpfdist 4
Segment 9 - gpfdist 1
Segment 10 - gpfdist 2
Segment 11 - gpfdist 3
Segment 12 - gpfdist 4
Drop indexes before loading into existing tables and re-create the index after loading. Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded.
Run ANALYZE
on the table after loading. Deactivate automatic statistics collection during loading by setting gp_autostats_mode
to NONE
. Run VACUUM
after load errors to recover space.
Performing small, high frequency data loads into heavily partitioned column-oriented tables can have a high impact on the system because of the number of physical files accessed per time interval.
gpload
is a data loading utility that acts as an interface to the Greenplum external table parallel loading feature.
Beware of using gpload
as it can cause catalog bloat by creating and dropping external tables. Use gpfdist
instead, since it provides the best performance.
gpload
runs a load using a specification defined in a YAML-formatted control file. It performs the following operations:
gpfdist
processesINSERT
, UPDATE
, or MERGE
operation to load the source data into the target table in the databasegpfdist
processesThe load is accomplished in a single transaction.
Drop any indexes on an existing table before loading data and recreate the indexes after loading. Newly creating an index is faster than updating an index incrementally as each row is loaded.
Deactivate automatic statistics collection during loading by setting the gp_autostats_mode
configuration parameter to NONE
.
External tables are not intended for frequent or ad hoc access.
When using gpfdist
, maximize network bandwidth by running one gpfdist
instance for each NIC on the ETL server. Divide the source data evenly between the gpfdist
instances.
When using gpload
, run as many simultaneous gpload
instances as resources allow. Take advantage of the CPU, memory, and networking resources available to increase the amount of data that can be transferred from ETL servers to the Greenplum Database.
Use the SEGMENT REJECT LIMIT
clause of the COPY
statement to set a limit for the number or percentage of rows that can have errors before the COPY FROM
command is cancelled. The reject limit is per segment; when any one segment exceeds the limit, the command is cancelled and no rows are added. Use the LOG ERRORS
clause to save error rows. If a row has errors in the formatting—for example missing or extra values, or incorrect data types—Greenplum Database stores the error information and row internally. Use the gp_read_error_log()
built-in SQL function to access this stored information.
If the load has errors, run VACUUM
on the table to recover space.
After you load data into a table, run VACUUM
on heap tables, including system catalogs, and ANALYZE
on all tables. It is not necessary to run VACUUM
on append-optimized tables. If the table is partitioned, you can vacuum and analyze just the partitions affected by the data load. These steps clean up any rows from prematurely ended loads, deletes, or updates and update statistics for the table.
Recheck for segment skew in the table after loading a large amount of data. You can use a query like the following to check for skew:
SELECT gp_segment_id, count(*)
FROM schema.table
GROUP BY gp_segment_id ORDER BY 2;
By default, gpfdist
assumes a maximum record size of 32K. To load data records larger than 32K, you must increase the maximum row size parameter by specifying the -m <*bytes*>
option on the gpfdist
command line. If you use gpload
, set the MAX_LINE_LENGTH
parameter in the gpload
control file.
See the Greenplum Database Reference Guide for detailed instructions for loading data using gpfdist
and gpload
.