You can use the gpcopy
utility to transfer data between databases in different Greenplum Database systems.
gpcopy is a high-performance utility that can copy metadata and data from one Greenplum database to another Greenplum database. You can migrate the entire contents of a database, or just selected tables.
You can migrate data between different Greenplum Database versions. gpcopy
supports migrating data from a Greenplum Database version 4.3.26 (or later) system to a 5.9 (or later) or a 6.x Greenplum system, or from a Greenplum Database version 5.9+ system to a Greenplum 6.x system.
The gpcopy
interface includes options to transfer one or more full databases, or one or more database tables. A full database transfer includes the database schema, table data, indexes, views, roles, user-defined functions, resource queues, and resource groups. If a copied table or database does not exist in the destination system, gpcopy
creates it automatically, along with indexes as necessary.
Configuration files, including postgresql.conf
and pg_hba.conf
, must be transferred manually by an administrator. Extensions installed in the database with gppkg
, such as MADlib and programming language extensions, must be installed in the destination database by an administrator.
gpcopy is a command-line tool that includes these features:

- gpcopy can migrate data between Greenplum Database systems where the source and destination systems are configured with a different number of segment instances.
- gpcopy provides detailed reporting and summary information about all aspects of the copy operation.
- gpcopy allows the source table data to change while the data is being copied. A lock is not acquired on the source table when data is copied.
- The gpcopy utility includes the --truncate-source-after option to help migrate data from one VMware Greenplum system to another on the same hardware, requiring minimal free space.
The source and destination Greenplum Database systems must already exist, have network access between all hosts, and have a coordinator host and primary segment hosts in both systems.
If you are copying data between Greenplum Database systems having different versions, each system must have gpcopy
installed locally. You can download gpcopy
from Broadcom Support Portal.
Note: For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.
gpcopy
is dependent on the pg_dump
, pg_dumpall
, and psql
utilities installed with Greenplum Database. In most cases, you run gpcopy
from a Greenplum Database host system, so the dependencies are automatically met. If you need to run gpcopy
on a remote server, such as an ETL system, copy the gpcopy
binary and install a compatible Greenplum Clients package to meet the gpcopy
dependencies.
gpcopy
transfers data from user databases only; the postgres
, template0
, and template1
databases cannot be transferred. Administrators must transfer configuration files manually and install extensions into the destination database with gppkg
.
For information about how gpcopy
handles dependent database objects when copying a specific database or a set of tables, see the Notes section in the gpcopy
reference.
If you are copying data from a VMware Greenplum system running on RHEL7 to a system running on RHEL8, the default --parallelize-leaf-partitions
setting may lead to data being copied to an incorrect partition in range-partitioned tables because of glibc
changes between the operating system versions. In this scenario, always set this option to false
to copy the data as a single table based on the root partition table. See Migrating VMware Greenplum from Enterprise Linux 7 to 8 in the VMware Greenplum documentation for more information.
gpcopy
cannot copy a row that is larger than 1GB in size.
The --on-segment-threshold setting determines where gpcopy performs the data transfer:

- When --on-segment-threshold is set to -1 (the default), gpcopy copies table data using the source and destination Greenplum Database segment instances.
- When --on-segment-threshold is set to -2, gpcopy copies table data using the source and destination Greenplum Database coordinators.
- When --on-segment-threshold is set to a positive value, it identifies a row number threshold. If a table contains this number of rows or less, gpcopy copies the table data using the source and destination coordinators. If the number of rows in a table is more than the threshold, gpcopy copies the data using the source and destination segment instances. gpcopy uses the source table statistics to determine the number of table rows. If a source table has not been analyzed, gpcopy assumes that it is a small table, ignores the threshold setting, and copies the table data using only the coordinators. If your database has large tables without statistics, set the --on-segment-threshold option to -1 to force gpcopy to copy table data using segment instances.

gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table, or when a leaf table is defined with a distribution policy that is different from that of the root partitioned table. You can still copy those tables in a gpcopy operation by specifying the --no-distribution-check option to deactivate checking of data distribution.

Caution: Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables being copied are the same in the source and destination databases. Copying data into segment instances with incorrect data distribution can cause incorrect query results and database corruption.
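For example, a sketch combining the options above, forcing segment-based transfer regardless of statistics and skipping the distribution check for a partitioned table whose leaf distribution differs from the root; the host, port, user, and table names are placeholders:

```shell
# Copy a partitioned table via segment instances, skipping the
# data distribution check (all connection values are placeholders).
gpcopy --source-host smdw --source-port 5432 --source-user gpuser \
       --dest-host dmdw --dest-port 5432 --dest-user gpuser \
       --include-table db1.public.sales_part \
       --on-segment-threshold -1 \
       --no-distribution-check
```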
When the pg_hba.conf
file specifies password authentication, gpcopy
can use the value of the PGPASSWORD
or PGPASSFILE
environment variable to obtain the connection password for the source and/or destination Greenplum Database cluster as follows:
PGPASSFILE
specifies the name of the password file to use for lookups. gpcopy
can use this file to obtain passwords for both the source and the destination Greenplum Database clusters.
The PGPASSWORD
environment variable behaves the same as the password connection parameter. The password for the source and destination Greenplum Database clusters must be the same when you specify a PGPASSWORD
.
Use of the PGPASSWORD environment variable is not recommended for security reasons, as some operating systems allow non-root users to see process environment variables via ps; instead, consider using a password file.
When both environment variables are set, the PGPASSWORD
setting takes precedence. When neither is set, gpcopy
attempts to use the default password file, typically ~/.pgpass
, to obtain the password.
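As a sketch, a password file can be created and pointed to via PGPASSFILE before running gpcopy; the host names, user, and passwords below are placeholders, and the file must not be readable by other users:

```shell
# Create a password file for gpcopy lookups; each line is
# hostname:port:database:username:password (placeholder values shown).
export PGPASSFILE=/tmp/gpcopy_pgpass
cat > "$PGPASSFILE" <<'EOF'
source-mdw:5432:*:gpadmin:source_secret
dest-mdw:5432:*:gpadmin:dest_secret
EOF
# libpq ignores the file unless it is readable only by the owner.
chmod 600 "$PGPASSFILE"
```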
When the pg_hba.conf
file specifies the SSL/TLS connection type, you can configure gpcopy
to initiate an encrypted connection to both the source and the destination Greenplum Database clusters. In this scenario, gpcopy
is the SSL client and the source and destination Greenplum Database clusters are the SSL servers.
gpcopy
supports SSL encryption on the connection to Greenplum as described in the PostgreSQL SSL Support documentation.
You have two options to specify the SSL key and certificate files:

- Specify the file names to gpcopy via the PGSSLCERT, PGSSLKEY, and PGSSLROOTCERT environment variables. Refer to the PostgreSQL documentation for more information about these variables.
- Place the files, using their default file names, in the ~/.postgresql directory on the gpcopy host.

When both the source and the destination Greenplum Database clusters are configured for SSL/TLS, gpcopy uses the specified SSL private key and certificate for both connections.
You can use the PGSSLMODE environment variable to specify the SSL mode for the connections. Because gpcopy does not support the default SSL mode, prefer, you must set PGSSLMODE to disable when one of the Greenplum clusters is configured for encryption and the other is not.
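For example, before invoking gpcopy in a mixed environment, you might deactivate SSL for the session explicitly (a sketch; adjust to your environment):

```shell
# gpcopy does not support the prefer SSL mode; deactivate SSL
# explicitly when only one cluster is configured for encryption.
export PGSSLMODE=disable
```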
gpcopy
supports using SSL/TLS encryption during data transfer from the source to the destination Greenplum Database cluster. In this scenario, a TLS connection is established between a gpcopy_helper
running in the source Greenplum Database cluster and a gpcopy_helper
running in the destination Greenplum Database cluster, and the data to copy is sent over this connection.
If you choose to direct gpcopy to use this encryption method for the data channel between gpcopy_helper processes:

- Provide the --ssl-key <key_file> and --ssl-cert <cert_file> options to the gpcopy command. These files must reside in the same file system location on all hosts (coordinator and segment) in the destination Greenplum Database cluster.
- Optionally, provide the root certificate authority with the --ssl-ca <ca_file> option to the gpcopy command. When you specify the root CA, <ca_file> must reside in the same file system location on all hosts (coordinator and segment) in the source Greenplum Database cluster.
- Optionally, specify the minimum TLS version (--ssl-min-tls <min_version_str>) for the encrypted connection. Valid <min_version_str> values are '1.0', '1.1', '1.2', or '1.3'. The default minimum TLS version is 1.0.

gpcopy determines the sslmode to use for the connection based on the --ssl-<xxx> options provided to the command.
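A hedged sketch of a copy with an encrypted data channel; the host names, table name, and certificate paths are placeholders, and the key and certificate files must exist at the same path on every destination host (and the CA file on every source host), as described above:

```shell
# Copy a table over a TLS 1.2 (or later) data channel between the
# gpcopy_helper processes; all paths and connection values are placeholders.
gpcopy --source-host smdw --source-port 5432 --source-user gpuser \
       --dest-host dmdw --dest-port 5432 --dest-user gpuser \
       --include-table db1.public.sales \
       --ssl-cert /home/gpadmin/certs/helper.crt \
       --ssl-key /home/gpadmin/certs/helper.key \
       --ssl-ca /home/gpadmin/certs/root.crt \
       --ssl-min-tls '1.2'
```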
To avoid copying data changes that occur while gpcopy is transferring data, you can choose to run the gpcopy operation in a transaction snapshot.
Note: You can instruct gpcopy to run the copy operation in a specific transaction only when it is initiated from a version 6 or 7 source Greenplum Database cluster; gpcopy does not support transaction snapshots with version 5.x source Greenplum clusters.
If you choose this option, you must start the transaction before you run gpcopy
, and you must provide the identifier of the transaction snapshot to the command, as follows:
Open a psql
session to the source Greenplum cluster and start a transaction:
db1=# BEGIN;
BEGIN
Note: Do not terminate the transaction until the gpcopy operation completes.
Use the pg_export_snapshot()
function to export the transaction snapshot identifier. For example:
db1=# SELECT pg_export_snapshot();
pg_export_snapshot
---------------------
00000006-0007686E-1
(1 row)
Provide the --snapshot
option with the returned identifier when you invoke the gpcopy
command. For example:
$ gpcopy --source-host 1.1.1.1 --source-port 6000 \
--dest-host 1.1.1.1 --dest-port 6000 \
--include-table db1.public.test --dest-table newdb.public.newtbl \
--snapshot 00000006-0007686E-1
gpcopy
will internally invoke SET TRANSACTION SNAPSHOT <snapshot_id>
before it initiates the copy operation.
When the gpcopy
operation completes, end the transaction:
db1=# END;
END
The degree of parallelism when running gpcopy is determined by the --jobs option, which controls the number of processes that gpcopy runs in parallel. The default is 4. The range is from 1 to 64512.
The --jobs
value, n, produces 2*n+1
database connections. For example, the default --jobs
value of 4 creates 9 connections.
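The connection arithmetic can be checked with a one-liner (a sketch of the 2*n+1 rule described above):

```shell
# Estimate the number of database connections for a given --jobs value.
jobs=4
connections=$(( 2 * jobs + 1 ))
echo "$connections"   # prints 9 for the default of 4 jobs
```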
If you are copying partitioned tables, the default is to copy the leaf partition tables of a partitioned table in parallel. The --jobs value might affect the number of leaf partitions that are copied in parallel. Increasing the --jobs value might improve performance depending on factors including the number of Greenplum Database hosts and segments, the number of leaf partitions, and the amount of data being transferred. You can deactivate the copying of leaf partitions in parallel; see the --parallelize-leaf-partitions option for more information.
If you increase the --jobs
option, ensure that the Greenplum Database systems are configured with a sufficient maximum concurrent connection value to accommodate the gpcopy
connections and any other concurrent connections (such as user connections) that you require. See the Greenplum Database server configuration parameter max_connections
.
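You can inspect the current limit on each coordinator with psql; the host, port, and user values below are placeholders:

```shell
# Show the configured connection limit on a cluster coordinator
# (connection values are placeholders).
psql -h mytest -p 1234 -U gpuser -d postgres -c 'SHOW max_connections;'
```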
By default, gpcopy does not validate the transferred data. You can request validation using the --validate <type> option. You must include --validate <type> if you specify the --truncate-source-after option. The validation type can be one of the following:

- count - compares the row counts between the source and destination tables.
- md5xor - validates by selecting all rows of the source and destination tables, converting all columns in a row to text, and then calculating the MD5 value of each row. gpcopy then performs an XOR over the MD5 values to ensure that all rows were successfully copied for the table.

Note: Avoid using --append with either validation option. If you use --append and the source table already includes rows, then either validation method will fail due to the different number of rows in the source and destination tables.
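For example, a sketch of a single-table copy with MD5/XOR validation; the hosts, ports, user, and table name are placeholders:

```shell
# Copy one table and validate it with the md5xor method
# (connection values and table name are placeholders).
gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
       --dest-host demohost --dest-port 1234 --dest-user gpuser \
       --include-table db1.public.sales \
       --validate md5xor
```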
When gpcopy
encounters errors and quits or is cancelled by the user, current copy operations on tables in the destination database are rolled back. Copy operations that have completed are not rolled back.
If an error occurs during the process of copying a table, or table validation fails, gpcopy
continues copying the other specified tables. After gpcopy
finishes, it displays a list of tables where errors occurred or validation failed and displays a gpcopy
command. You can use the provided command to retry copying the failed tables.
The gpcopy utility logs messages in the log file gpcopy_<date>.log in the ~/gpAdminLogs directory on the coordinator host. If you run multiple gpcopy commands on the same day, the utility appends messages to that day's log file.
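To review recent messages, you can tail the current log file; the YYYYMMDD date suffix used here is an assumption about the file naming in your environment:

```shell
# Show the last messages from today's gpcopy log, if it exists.
LOG=~/gpAdminLogs/gpcopy_$(date +%Y%m%d).log
if [ -f "$LOG" ]; then
  tail -n 50 "$LOG"
fi
```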
After gpcopy completes, it displays a summary of the operations performed. If the utility fails to copy tables, they are highlighted in the summary, which also provides a gpcopy command that you can use to copy only the failed tables. The information is displayed at the command prompt and in the gpcopy log file.
After resolving the issues that caused the copy operations to fail, you can run the provided command to copy the tables that failed in the previous gpcopy
command.
Follow this procedure to migrate data from one Greenplum Database system to another with gpcopy
:
Start both the source and destination systems.
Perform a full database backup in the source Greenplum Database system.
As a best practice, source the greenplum_path.sh
file in the source Greenplum 5 installation, so that you run gpcopy
from the source system. For example:
$ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
Use gpcopy with the --full option to migrate your data to the destination system. A full migration automatically copies all database objects including tables, indexes, views, roles, functions, user-defined types (UDT), resource queues, and resource groups for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). For example:
gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
--dest-host demohost --dest-port 1234 --dest-user gpuser \
--full --drop --validate count
With the above command, the utility drops tables in the destination database (--drop
option) and uses the row count of the source and destination tables to validate the data transfer (--validate count
option). The other gpcopy
options specify the source and destination Greenplum Database system coordinator hosts, ports, and the User ID to use to connect to the Greenplum Database systems.
Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6.0. See the VMware Greenplum 6.0 Release Notes for features or changes that may necessitate post-migration tasks.
See the gpcopy reference page for complete syntax and usage information.
In order to migrate data between two systems on the same hardware, you should have enough free disk space to accommodate over 5 times the original data set. This enables you to maintain 2 full copies of the primary and mirror data sets (on the source and destination systems), plus the original backup data in ASCII format.
If you attempt to migrate on the same system but you run out of disk space, the gpcopy
utility provides the --truncate-source-after
option to help you complete the operation with only a minimum of free disk space. The --truncate-source-after
option instructs the utility to truncate each source table after successfully copying the table data to the destination system and validating that the copy succeeded.
If you choose to use --truncate-source-after, consider the following:

- The --truncate-source-after option does not allow for an easy rollback of the source system to its original condition if errors occur or validation checks fail during the gpcopy operation. Table errors or validation failures during the gpcopy operation can leave some tables intact in the source system, while other tables may be empty (having been truncated after being copied to the new system). Back up all source data before using gpcopy with --truncate-source-after.
- Migrating data with --truncate-source-after still requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, with a --jobs setting of 5, you must ensure that you have free space equal to the sum of the 5 largest tables copied in the batch. The procedure below provides sample commands to determine the largest table sizes.
- Always use the --validate option with --truncate-source-after to ensure that data is successfully copied before source tables are truncated.

If you attempt to use the instructions in Performing a Basic Data Migration to migrate systems that use the same hardware, but you do not have the required free space:
Start both the source and destination systems.
Perform a full database backup in the source Greenplum Database system.
Determine if you have enough free space to migrate your data using --truncate-source-after
. Migrating data "in-place" requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy
. For example, if you want to use a --jobs
setting of 5, ensure that you have free space equal to the sum of the 5 largest tables copied in the batch.
The following query lists the largest 5 tables in your source system; modify the query as needed depending on the --jobs
setting you intend to use:
gpadmin=# SELECT n.nspname, c.relname, c.relstorage, pg_relation_size(c.oid)
FROM
pg_class c JOIN pg_namespace n ON (c.relnamespace=n.oid)
JOIN pg_catalog.gp_distribution_policy p ON (c.oid = p.localoid)
WHERE
n.nspname NOT IN ('gpexpand', 'pg_bitmap', 'information_schema', 'gp_toolkit')
AND n.nspname NOT LIKE 'pg_temp_%' AND c.relstorage <> 'v'
ORDER BY 4 DESC LIMIT 5;
Either free enough disk space to cover the sum of the table sizes shown in the above query, or consider using a smaller --jobs
value to reduce the free space requirements.
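A rough pre-flight check can compare available disk space against the batch requirement; required_kb below is a placeholder for the sum of the table sizes returned by the query above, and the filesystem checked should be the one holding your data directories:

```shell
# Compare free space against the space needed for the largest batch.
required_kb=$(( 50 * 1024 * 1024 ))   # placeholder: 50 GiB, from the size query
avail_kb=$(df -Pk "$HOME" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -ge "$required_kb" ]; then
  echo "enough free space for this batch"
else
  echo "free more space or use a smaller --jobs value"
fi
```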
As a best practice, source the greenplum_path.sh
file in the source Greenplum 5 installation, so that you run gpcopy
from the source system. For example:
$ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
Use gpcopy with the --truncate-source-after and --validate options to migrate your data to the destination system. A full migration automatically copies all database objects including tables, views, roles, functions, user-defined types (UDT), resource queues, and resource groups for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). The --truncate-source-after option truncates each source table only after copying and validating the table data in the destination system. For example:
gpcopy --source-host my_host --source-port 1234 --source-user gpuser \
--dest-host my_host --dest-port 1235 --dest-user gpuser --full --drop \
--truncate-source-after --analyze --validate count
The above command performs a full database copy, first dropping tables in the destination database (--drop
option) if they already exist. gpcopy
truncates each table in the source system only after successfully copying and validating the table data in the destination system. The other gpcopy
options specify the source and destination Greenplum Database system coordinator hosts, ports, and the User ID to use to connect to the Greenplum Database systems.
Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6. See Migrating Data from Greenplum 4.3 or 5 for features and changes that may necessitate post-migration tasks.
See the gpcopy reference page for complete syntax and usage information.