External tables enable accessing external data as if it were a regular database table. They are often used to move data into and out of a Greenplum database.
To create an external table definition, you specify the format of your input files and the location of your external data sources. For information about input file formats, see Formatting Data Files.
Use one of the following protocols to access external table data sources. You cannot mix protocols in CREATE EXTERNAL TABLE
statements:
file://
accesses external data files on segment hosts that the Greenplum Database superuser (gpadmin
) can access. See file:// Protocol.
gpfdist://
points to a directory on the file host and serves external data files to all Greenplum Database segments in parallel. See gpfdist:// Protocol.
gpfdists://
is the secure version of gpfdist
. See gpfdists:// Protocol.
gphdfs://
accesses files on a Hadoop Distributed File System (HDFS). See gphdfs:// Protocol (Deprecated).
Note: The gphdfs
external table protocol is deprecated and will be removed in the next major release of Greenplum Database.
s3://
accesses files in an Amazon S3 bucket. See s3:// Protocol.
The pxf://
protocol accesses external HDFS files and HBase and Hive tables using the Greenplum Platform Extension Framework (PXF). See pxf:// Protocol.
Note:
The gphdfs://
(deprecated), s3://
, and pxf://
protocols are custom data access protocols, where the file://
, gpfdist://
, and gpfdists://
protocols are implemented internally in Greenplum Database. The custom and internal protocols differ in these ways:
CREATE PROTOCOL
command. The gphdfs://
protocol (deprecated) is preregistered when you install Greenplum Database. Registering the PXF extension in a database creates the pxf://
protocol. (See Accessing External Data with PXF.) You can optionally register the s3://
protocol. (See Configuring and Using S3 External Tables.) Internal protocols are always present and cannot be unregistered.pg_extprotocol
catalog table to specify the handler functions that implement the protocol. The protocol's shared libraries must have been installed on all Greenplum Database hosts. The internal protocols are not represented in the pg_extprotocol
table and have no additional libraries to install.GRANT [SELECT | INSERT | ALL] ON PROTOCOL
. To allow (or deny) users permissions on the internal protocols, you use CREATE ROLE
or ALTER ROLE
to add the CREATEEXTTABLE
(or NOCREATEEXTTABLE
) attribute to each user's role.External tables access external files from within the database as if they are regular database tables. External tables defined with the gpfdist
/gpfdists
, gphdfs
(deprecated), and s3
protocols utilize Greenplum parallelism by using the resources of all Greenplum Database segments to load or unload data. The gphdfs
protocol (deprecated) leverages the parallel architecture of the Hadoop Distributed File System to access files on that system. The s3
protocol utilizes the Amazon Web Services (AWS) capabilities.
You can query external table data directly and in parallel using SQL commands such as SELECT
, JOIN
, or SORT EXTERNAL TABLE DATA
, and you can create views for external tables.
The steps for using external tables are:
Define the external table.
To use the s3
protocol, you must also configure Greenplum Database and enable the protocol. See s3:// Protocol.
Do one of the following:
gpfdist
or gpdists
protocols.gphdfs
protocol (deprecated).s3
protocol.Place the data files in the correct locations.
Query the external table with SQL commands.
Greenplum Database provides readable and writable external tables:
Readable external tables for data loading. Readable external tables support:
WHERE
clause, it may be passed to the external data source. Refer to the gp_external_enable_filter_pushdown server configuration parameter discussion for more information. Note that this feature is currently supported only with the pxf
protocol (see pxf:// Protocol). Readable external tables allow only SELECT
operations.Writable external tables for data unloading. Writable external tables support:
INSERT
operations.External tables can be file-based or web-based. External tables using the file://
protocol are read-only tables.
http://
protocol or by executing OS commands or scripts. External web tables are not rescannable: the data can change while the query runs.Dump and restore operate only on external and external web table definitions, not on the data sources.
file://
protocol is used in a URI that specifies the location of an operating system file.gpfdist://
protocol is used in a URI to reference a running gpfdist
instance.gpfdists://
protocol is a secure version of the gpfdist:// protocol
.gphdfs://
protocol specifies an external file path on a Hadoop Distributed File System (HDFS).pxf://
protocol to access data residing on external Hadoop systems (HDFS, Hive, HBase), object store systems (Azure, Google Cloud Storage, Minio, S3), and SQL databases.s3
protocol is used in a URL that specifies the location of an Amazon S3 bucket and a prefix to use for reading or writing files in the bucket.file://
, gpfdist://
, or gphdfs://
(deprecated) protocols.CREATE EXTERNAL TABLE
command can contain only one protocol.Parent topic: Working with External Data