Description of the different ways to add data to Greenplum Database.
Parent topic: Greenplum Database Best Practices
INSERT statement with values adds a single row to a table. The row flows through the master and is distributed to a segment. This is the slowest method and is not suitable for loading large amounts of data.
COPY statement copies data from an external file into a database table. It can insert multiple rows more efficiently than an
INSERT statement, but the rows are still passed through the master. All of the data is copied in one command; it is not a parallel process.
Data input to the
COPY command is from a file or the standard input. For example:
COPY table FROM '/data/mydata.csv' WITH CSV HEADER;
COPY to add relatively small sets of data, for example dimension tables with up to ten thousand rows, or one-time data loads.
COPY when scripting a process that loads small amounts of data, less than 10 thousand rows.
Since COPY is a single command, there is no need to disable autocommit when you use this method to populate a table.
You can run multiple concurrent
COPY commands to improve performance.
External tables provide access to data in sources outside of Greenplum Database. They can be accessed with
SELECT statements and are commonly used with the Extract, Load, Transform (ELT) pattern, a variant of the Extract, Transform, Load (ETL) pattern that takes advantage of Greenplum Database's fast parallel data loading capability.
With ETL, data is extracted from its source, transformed outside of the database using external transformation tools, such as Informatica or Datastage, and then loaded into the database.
With ELT, Greenplum external tables provide access to data in external sources, which could be read-only files (for example, text, CSV, or XML files), Web servers, Hadoop file systems, executable OS programs, or the Greenplum
gpfdist file server, described in the next section. External tables support SQL operations such as select, sort, and join so the data can be loaded and transformed simultaneously, or loaded into a load table and transformed in the database into target tables.
The external table is defined with a
CREATE EXTERNAL TABLE statement, which has a
LOCATION clause to define the location of the data and a
FORMAT clause to define the formatting of the source data so that the system can parse the input data. Files use the
file:// protocol, and must reside on a segment host in a location accessible by the Greenplum superuser. The data can be spread out among the segment hosts with no more than one file per primary segment on each host. The number of files listed in the
LOCATION clause is the number of segments that will read the external table in parallel.
The fastest way to load large fact tables is to use external tables with
gpfdist is a file server program using an HTTP protocol that serves external data files to Greenplum Database segments in parallel. A
gpfdist instance can serve 200 MB/second and many
gpfdist processes can run simultaneously, each serving up a portion of the data to be loaded. When you begin the load using a statement such as
INSERT INTO <table> SELECT * FROM <external_table>, the
INSERT statement is parsed by the master and distributed to the primary segments. The segments connect to the
gpfdist servers and retrieve the data in parallel, parse and validate the data, calculate a hash from the distribution key data and, based on the hash key, send the row to its destination segment. By default, each
gpfdist instance will accept up to 64 connections from segments. With many segments and
gpfdist servers participating in the load, data can be loaded at very high rates.
Primary segments access external files in parallel when using
gpfdist up to the value of
gp_external_max_segs. When optimizing
gpfdist performance, maximize the parallelism as the number of segments increase. Spread the data evenly across as many ETL nodes as possible. Split very large data files into equal parts and spread the data across as many file systems as possible.
gpfdist instances per file system.
gpfdist tends to be CPU bound on the segment nodes when loading. But if, for example, there are eight racks of segment nodes, there is lot of available CPU on the segments to drive more
gpfdist processes. Run
gpfdist on as many interfaces as possible. Be aware of bonded NICs and be sure to start enough
gpfdist instances to work them.
It is important to keep the work even across all these resources. The load is as fast as the slowest node. Skew in the load file layout will cause the overall load to bottleneck on that resource.
gp_external_max_segs configuration parameter controls the number of segments each
gpfdist process serves. The default is 64. You can set a different value in the
postgresql.conf configuration file on the master. Always keep
gp_external_max_segs and the number of
gpfdist processes an even factor; that is, the
gp_external_max_segs value should be a multiple of the number of
gpfdist processes. For example, if there are 12 segments and 4
gpfdist processes, the planner round robins the segment connections as follows:
Segment 1 - gpfdist 1 Segment 2 - gpfdist 2 Segment 3 - gpfdist 3 Segment 4 - gpfdist 4 Segment 5 - gpfdist 1 Segment 6 - gpfdist 2 Segment 7 - gpfdist 3 Segment 8 - gpfdist 4 Segment 9 - gpfdist 1 Segment 10 - gpfdist 2 Segment 11 - gpfdist 3 Segment 12 - gpfdist 4
Drop indexes before loading into existing tables and re-create the index after loading. Creating an index on pre-existing data is faster than updating it incrementally as each row is loaded.
ANALYZE on the table after loading. Disable automatic statistics collection during loading by setting
VACUUM after load errors to recover space.
Performing small, high frequency data loads into heavily partitioned column-oriented tables can have a high impact on the system because of the number of physical files accessed per time interval.
gpload is a data loading utility that acts as an interface to the Greenplum external table parallel loading feature.
Beware of using
gpload as it can cause catalog bloat by creating and dropping external tables. Use
gpfdist instead, since it provides the best performance.
gpload runs a load using a specification defined in a YAML-formatted control file. It performs the following operations:
MERGEoperation to load the source data into the target table in the database
The load is accomplished in a single transaction.
Drop any indexes on an existing table before loading data and recreate the indexes after loading. Newly creating an index is faster than updating an index incrementally as each row is loaded.
Disable automatic statistics collection during loading by setting the
gp_autostats_mode configuration parameter to
External tables are not intended for frequent or ad hoc access.
gpfdist, maximize network bandwidth by running one
gpfdist instance for each NIC on the ETL server. Divide the source data evenly between the
gpload, run as many simultaneous
gpload instances as resources allow. Take advantage of the CPU, memory, and networking resources available to increase the amount of data that can be transferred from ETL servers to the Greenplum Database.
SEGMENT REJECT LIMIT clause of the
COPY statement to set a limit for the number or percentage of rows that can have errors before the
COPY FROM command is cancelled. The reject limit is per segment; when any one segment exceeds the limit, the command is cancelled and no rows are added. Use the
LOG ERRORS clause to save error rows. If a row has errors in the formatting—for example missing or extra values, or incorrect data types—Greenplum Database stores the error information and row internally. Use the
gp_read_error_log() built-in SQL function to access this stored information.
If the load has errors, run
VACUUM on the table to recover space.
After you load data into a table, run
VACUUM on heap tables, including system catalogs, and
ANALYZE on all tables. It is not necessary to run
VACUUM on append-optimized tables. If the table is partitioned, you can vacuum and analyze just the partitions affected by the data load. These steps clean up any rows from prematurely ended loads, deletes, or updates and update statistics for the table.
Recheck for segment skew in the table after loading a large amount of data. You can use a query like the following to check for skew:
SELECT gp_segment_id, count(*) FROM schema.table GROUP BY gp_segment_id ORDER BY 2;
gpfdist assumes a maximum record size of 32K. To load data records larger than 32K, you must increase the maximum row size parameter by specifying the
-m <*bytes*> option on the
gpfdist command line. If you use
gpload, set the
MAX_LINE_LENGTH parameter in the
gpload control file.
Note: Integrations with Informatica Power Exchange are currently limited to the default 32K record length.
See the Greenplum Database Reference Guide for detailed instructions for loading data using