Serves data files to or writes data files out from Greenplum Database segments.
gpfdist [-d <directory>] [-p <http_port>] [-P <last_http_port>] [-l <log_file>]
[-t <timeout>] [-k <clean_up_timeout>] [-S] [-w <time>] [-v | -V] [-s] [-m <max_length>]
[--ssl <certificate_path> [--ssl_verify_peer <boolean>] [--sslclean <wait_time>] ]
[--compress] [--multi_thread <num_threads>]
[-c <config.yml>]
gpfdist -? | --help
gpfdist --version
gpfdist
is Greenplum Database parallel file distribution program. It is used by readable external tables and gpload
to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.
Note
gpfdist
andgpload
are compatible only with the Greenplum Database major version in which they are shipped. For example, agpfdist
utility that is installed with Greenplum Database 4.x cannot be used with Greenplum Database 5.x or 6.x.
In order for gpfdist
to be used by an external table, the LOCATION
clause of the external table definition must specify the external table data using the gpfdist://
protocol (see the Greenplum Database command CREATE EXTERNAL TABLE
).
NoteIf you specify the
--ssl
option to enable SSL security, you must create the external table with thegpfdists://
protocol.
The benefit of using gpfdist
is that you are guaranteed maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables.
For readable external tables, gpfdist
parses and serves data files evenly to all the segment instances in the Greenplum Database system when users SELECT
from the external table. For writable external tables, gpfdist
accepts parallel output streams from the segments when users INSERT
into the external table, and writes to an output file.
NoteWhen
gpfdist
reads data and encounters a data formatting error, the error message includes a row number indicating the location of the formatting error.gpfdist
attempts to capture the row that contains the error. However,gpfdist
might not capture the exact row for some formatting errors.
For readable external tables, if load files are compressed using gzip
or bzip2
(have a .gz
or .bz2
file extension), gpfdist
uncompresses the data while loading the data (on the fly). For writable external tables, gpfdist
compresses the data using gzip
if the target file has a .gz
extension.
NoteCompression is not supported for readable and writeable external tables when the
gpfdist
utility runs on Windows platforms.
When reading or writing data with the gpfdist
or gpfdists
protocol, Greenplum Database includes X-GP-PROTO
in the HTTP request header to indicate that the request is from Greenplum Database. The utility rejects HTTP requests that do not include X-GP-PROTO
in the request header.
Most likely, you will want to run gpfdist
on your ETL machines rather than the hosts where Greenplum Database is installed. To install gpfdist
on another host, simply copy the utility over to that host and add gpfdist
to your $PATH
.
NoteWhen using IPv6, always enclose the numeric IP address in brackets.
gpfdist
will serve files for readable external tables or create output files for writable external tables. If not specified, defaults to the current directory.
gpfdist
will serve files. Defaults to 8080.
gpfdist
will attempt to serve files.
gpfdist
serves the files on the first port number in the range to which it successfully binds.
gpfdist
process. Default is 5 seconds. Allowed values are 2 to 7200 seconds (2 hours). May need to be increased on systems with a lot of network traffic.
gpfdist
waits before cleaning up the session when there are no
POST
requests from the segments. Default is 300. Allowed values are 300 to 86400. You may increase this value when experiencing heavy network traffic.
Sets the maximum allowed data row length in bytes. Default is 32768. Should be used when user data includes very wide rows (or when line too long
error message occurs). Should not be used otherwise as it increases resource allocation. Valid range is 32K to 256MB. The upper limit is 1MB on Windows systems.
NoteMemory issues might occur if you specify a large maximum row length and run a large number of
gpfdist
concurrent connections. For example, setting this value to the maximum of 256MB with 96 concurrentgpfdist
processes requires approximately 24GB of memory ((96 + 1) x 246MB
).
Enables simplified logging. When this option is specified, only messages with WARN
level and higher are written to the gpfdist
log file. INFO
level messages are not written to the log file. If this option is not specified, all gpfdist
messages are written to the log file.
O_SYNC
flag. Any writes to the resulting file descriptor block
gpfdist
until the data is physically written to the underlying hardware.
Sets the number of seconds that Greenplum Database delays before closing a target file such as a named pipe. The default value is 0, no delay. The maximum value is 7200 seconds (2 hours).
Prompts gpfdist
to use SSL encryption for the data transfer. After running gpfdist
with the --ssl <certificate_path>
option, you must use the gpfdists://
protocol to load data from this file server. For information on gpfdists
, refer to gpfdists:// Protocol in the Greenplum Database Administrator Guide.
The setting of the --ssl_verify_peer
option determines which of the following files are required in the file system location specified by certificate_path:
server.crt
server.key
root.crt
The settings of both the verify_gpfdists_cert server configuration parameter and the --ssl_verify_peer
option also determine which certificate files are required on the Greenplum Database segments as described in gpfdists:// Protocol.
/
) as the certificate_path.
--ssl
option is also provided, specifies whether
gpfdist
enables or disables SSL certificate authentication on the Greenplum Database side.
on
;
gpfdist
checks the identities of clients, and requires that the
root.crt
,
server.crt
, and
server.key
be present in the certificate_path specified.
off
,
gpfdist
does not check the identities of clients, and requires that only the
server.crt
and
server.key
be present in the
--ssl <certificate_path>
specified. The certificate authority file
root.crt
is not required.
When the utility is run with the --ssl
option, sets the number of seconds that the utility delays before closing an SSL session and cleaning up the SSL resources after it completes writing data to or from a Greenplum Database segment. The default value is 0, no delay. The maximum value is 500 seconds. If the delay is increased, the transfer speed decreases.
gpfdist server closed connection
. To avoid the error, you can add a delay, for example
--sslclean 5
.
gpfdist
utilizes the Zstandard (
zstd
) compression algorithm.
gpfdist
uses during data transfer, parallelizing the operation. When specified,
gpfdist
automatically compresses the data (also parallelized) before transferring.
gpfdist
supports a maximum of 256 threads.
Specifies rules that gpfdist
uses to select a transform to apply when loading or extracting data. The gpfdist
configuration file is a YAML 1.1 document.
For information about the file format, see Configuration File Format in the Greenplum Database Administrator Guide. For information about configuring data transformation with gpfdist
, see Transforming External Data with gpfdist and gpload in the Greenplum Database Administrator Guide.
You can set the server configuration parameter gpfdist_retry_timeout to control the time that Greenplum Database waits before returning an error when a gpfdist
server does not respond while Greenplum Database is attempting to write data to gpfdist
. The default is 300 seconds (5 minutes).
If the gpfdist
utility hangs with no read or write activity occurring, you can generate a core dump the next time a hang occurs to help debug the issue. Set the environment variable GPFDIST_WATCHDOG_TIMER
to the number of seconds of no activity to wait before gpfdist
is forced to exit. When the environment variable is set and gpfdist
hangs, the utility is stopped after the specified number of seconds, creates a core dump, and sends relevant information to the log file.
This example sets the environment variable on a Linux system so that gpfdist
exits after 300 seconds (5 minutes) of no activity.
export GPFDIST_WATCHDOG_TIMER=300
When you enable compression, gpfdist
transmits a larger amount of data while maintaining low network usage. Note that compression can be time-intensive, and may potentially reduce transmission speeds. When you utilize multi-threaded execution, the overall time required for compression may decrease, which facilitates faster data transmission while maintaining low network occupancy and high speed.
To serve files from a specified directory using port 8081 (and start gpfdist
in the background):
gpfdist -d /var/load_files -p 8081 &
To start gpfdist
in the background and redirect output and errors to a log file:
gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &
To enable multi-threaded data transfer (with implicit compression) using four threads, start gpfdist
as follows:
gpfdist -d /var/load_files -p 8081 --multi_thread 4
To enable SSL certificate authentication without peer verification, start gpfdist
as follows:
gpfdist -d /var/load_files -p 8081 --ssl /path/to/certs --ssl_verify_peer off
To stop gpfdist
when it is running in the background:
--First find its process id:
ps ax | grep gpfdist
--Then stop the process, for example:
kill 3456
gpload, CREATE EXTERNAL TABLE