The topics in this section describe methods for loading and writing data into and out of a Greenplum Database, and how to format data files.
Greenplum Database supports high-performance parallel data loading and unloading, and for smaller amounts of data, single file, non-parallel data import and export.
Greenplum Database can read from and write to several types of external data sources, including text files, Hadoop file systems, Amazon S3, and web servers.
COPY
SQL command transfers data between an external text file on the master host, or multiple text files on segment hosts, and a Greenplum Database table.SELECT
, JOIN
, or SORT EXTERNAL TABLE DATA
, and you can create views for external tables. External tables are often used to load external data into a regular database table using a command such as CREATE TABLE table AS SELECT * FROM ext\_table
.gpfdist
utility is the Greenplum Database parallel file distribution program. It is an HTTP server that is used with external tables to allow Greenplum Database segments to load external data in parallel, from multiple file systems. You can run multiple instances of gpfdist
on different hosts and network interfaces and access them in parallel.gpload
utility automates the steps of a load task using gpfdist
and a YAML-formatted control file.The method you choose to load data depends on the characteristics of the source data—its location, size, format, and any transformations required.
In the simplest case, the COPY
SQL command loads data into a table from a text file that is accessible to the Greenplum Database master instance. This requires no setup and provides good performance for smaller amounts of data. With the COPY
command, the data copied into or out of the database passes between a single file on the master host and the database. This limits the total size of the dataset to the capacity of the file system where the external file resides and limits the data transfer to a single file write stream.
More efficient data loading options for large datasets take advantage of the Greenplum Database MPP architecture, using the Greenplum Database segments to load data in parallel. These methods allow data to load simultaneously from multiple file systems, through multiple NICs, on multiple hosts, achieving very high data transfer rates. External tables allow you to access external files from within the database as if they are regular database tables. When used with gpfdist
, the Greenplum Database parallel file distribution program, external tables provide full parallelism by using the resources of all Greenplum Database segments to load or unload data.
Greenplum Database leverages the parallel architecture of the Hadoop Distributed File System to access files on that system.
Accessing External Data with PXF
Data managed by your organization may already reside in external sources such as Hadoop, object stores, and other SQL databases. The Greenplum Platform Extension Framework (PXF) provides access to this external data via built-in connectors that map an external data source to a Greenplum Database table definition.
Transforming External Data with gpfdist and gpload
The gpfdist
parallel file server allows you to set up transformations that enable Greenplum Database external tables to read and write files in formats that are not supported with the CREATE EXTERNAL TABLE
command's FORMAT
clause. An input transformation reads a file in the foreign data format and outputs rows to gpfdist
in the CSV or other text format specified in the external table's FORMAT
clause. An output transformation receives rows from gpfdist
in text format and converts them to the foreign data format.
Parent topic: Greenplum Database Administrator Guide