You can use the gpcopy utility to transfer data between databases in different Greenplum Database systems.
gpcopy is a high-performance utility that can copy metadata and data from one Greenplum database to another Greenplum database. You can migrate the entire contents of a database, or just selected tables.
You can migrate data between different Greenplum Database versions. gpcopy supports migrating data from a Greenplum Database version 4.3.26 (or later) system to a 5.9 (or later) or 6.x Greenplum system, or from a Greenplum Database version 5.9+ system to a Greenplum 6.x system.
The gpcopy interface includes options to transfer one or more full databases, or one or more database tables. A full database transfer includes the database schema, table data, indexes, views, roles, user-defined functions, resource queues, and resource groups. If a copied table or database does not exist in the destination system, gpcopy creates it automatically, along with indexes as necessary.
Configuration files, including postgresql.conf and pg_hba.conf, must be transferred manually by an administrator. Extensions installed in the database with gppkg, such as MADlib and programming language extensions, must be installed in the destination database by an administrator.
gpcopy is a command-line tool that includes these features:

- gpcopy can migrate data between Greenplum Database systems where the source and destination systems are configured with a different number of segment instances.
- gpcopy provides detailed reporting and summary information about all aspects of the copy operation.
- gpcopy allows the source table data to change while the data is being copied. A lock is not acquired on the source table when data is copied.
- The gpcopy utility includes the --truncate-source-after option to help migrate data from one Tanzu Greenplum Database system to another on the same hardware, requiring minimal free space.
The source and destination Greenplum Database systems must already exist, and all hosts in both systems (master hosts and primary segment hosts) must have network access to each other.
If you are copying data between Greenplum Database systems having different versions, each system must have gpcopy installed locally. You can download gpcopy from VMware Tanzu Network.
gpcopy depends on the pg_dump, pg_dumpall, and psql utilities installed with Greenplum Database. In most cases, you run gpcopy from a Greenplum Database host system, so the dependencies are automatically met. If you need to run gpcopy on a remote server, such as an ETL system, copy the gpcopy binary and install a compatible Greenplum Clients package to meet the gpcopy dependencies.
gpcopy transfers data from user databases only; the postgres, template0, and template1 databases cannot be transferred. Administrators must transfer configuration files manually and install extensions into the destination database with gppkg.
For information about how gpcopy handles dependent database objects when copying a specific database or a set of tables, see the Notes section in the gpcopy reference.
gpcopy cannot copy a row that is larger than 1GB in size.
gpcopy uses the source table statistics to determine the number of table rows and whether to copy table data using Greenplum Database segment instances or only the master instance. If a source table is not analyzed, gpcopy assumes the table is a small table and copies the table data using only the master. If your database has large tables without statistics, you can set the --on-segment-threshold option to -1 to force gpcopy to copy table data using segment instances.
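As an illustrative sketch only (the host names, port, user, and database name below are placeholders, not values from this document), a command that forces segment-based copying for unanalyzed tables might look like:

```shell
# Hypothetical invocation -- connection values and database name are placeholders.
gpcopy --source-host src_mdw --source-port 5432 --source-user gpadmin \
    --dest-host dest_mdw --dest-port 5432 --dest-user gpadmin \
    --dbname mydb --on-segment-threshold -1
```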
gpcopy does not currently support SSL encryption for its connections.
gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table, or if a leaf table is defined with a distribution policy that differs from the root partitioned table. You can still copy those tables in a gpcopy operation by specifying the --no-distribution-check option to deactivate checking of data distribution.
Warning: Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables being copied are the same in the source and destination database. Copying data into segment instances with incorrect data distribution can cause incorrect query results and can cause database corruption.
The degree of parallelism when running gpcopy is determined by the --jobs option, which controls the number of processes that gpcopy runs in parallel. The default is 4. The range is from 1 to 64512.
The --jobs value, n, produces 2*n+1 database connections. For example, the default --jobs value of 4 creates 9 connections.
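The 2*n+1 relationship is plain arithmetic; this throwaway shell sketch simply restates it:

```shell
# For a --jobs value of n, gpcopy opens 2*n + 1 database connections in total.
jobs=4                            # the documented default
connections=$(( 2 * jobs + 1 ))
echo "$connections"               # prints 9
```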
If you are copying partitioned tables, the default is to copy the leaf partition tables of a partitioned table in parallel. The --jobs value might affect the number of leaf partitions that are copied in parallel. Increasing the --jobs value might improve performance depending on factors including the number of Greenplum Database hosts and segments, the number of leaf partitions, and the amount of data being transferred. You can deactivate the copying of leaf partitions in parallel. See the --parallelize-leaf-partitions option for more information.
If you increase the --jobs option, ensure that the Greenplum Database systems are configured with a sufficient maximum concurrent connection value to accommodate the gpcopy connections and any other concurrent connections (such as user connections) that you require. See the Greenplum Database server configuration parameter max_connections.
By default, gpcopy does not validate the transferred data. You can request validation using the --validate=type option. You must include --validate=type if you specify the --truncate-source-after option. The validation type can be one of the following:
- count - compares the row counts between the source and destination tables.
- md5xor - validates by selecting all rows of the source and destination tables, converting all columns in a row to text, and then calculating the MD5 value of each row. gpcopy then performs an XOR over the MD5 values to ensure that all rows were successfully copied for the table.

Note: Avoid using --append with either validation option. If you use --append and the source table already includes rows, then either validation method will fail due to the different number of rows in the source and destination tables.
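This is not gpcopy's implementation, but the md5xor idea can be sketched in shell: hash the text form of each row, then XOR the digests together. Because XOR is commutative, the result is independent of row order, which suits tables whose rows are read in parallel:

```shell
# Illustrative sketch of an md5xor-style check (not gpcopy's actual code).
xor_of_hashes() {
  acc=0
  for row in "$@"; do
    h=$(printf '%s' "$row" | md5sum | cut -c1-15)  # 60 bits fits shell math
    acc=$(( acc ^ 0x$h ))
  done
  echo "$acc"
}

a=$(xor_of_hashes "1|alice" "2|bob")
b=$(xor_of_hashes "2|bob" "1|alice")   # same rows, different order
[ "$a" = "$b" ] && echo "match"        # prints match
```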
When gpcopy encounters errors and quits or is cancelled by the user, current copy operations on tables in the destination database are rolled back. Copy operations that have completed are not rolled back.
If an error occurs during the process of copying a table, or table validation fails, gpcopy continues copying the other specified tables. After gpcopy finishes, it displays a list of tables where errors occurred or validation failed, along with a gpcopy command. You can use the provided command to retry copying the failed tables.
The gpcopy utility logs messages in the log file gpcopy_date.log in the ~/gpAdminLogs directory on the master host. If you run multiple gpcopy commands on the same day, the utility appends messages to that day's log file.
After gpcopy completes, it displays a summary of the operations performed. If the utility fails to copy any tables, they are highlighted in the summary, and the summary includes a gpcopy command that you can use to copy only the failed tables. The information is displayed at the command prompt and in the gpcopy log file.
After resolving the issues that caused the copy operations to fail, you can run the provided command to copy the tables that failed in the previous gpcopy command.
Follow this procedure to migrate data from one Greenplum Database system to another with gpcopy:
Start both the source and destination systems.
Perform a full database backup in the source Greenplum Database system.
As a best practice, source the greenplum_path.sh file in the source Greenplum 5 installation, so that you run gpcopy from the source system. For example:
$ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
Use gpcopy with the --full option to migrate your data to the destination system. A full migration automatically copies all database objects including tables, indexes, views, roles, functions, user-defined types (UDTs), resource queues, and resource groups for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). For example:
gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
--dest-host demohost --dest-port 1234 --dest-user gpuser \
--full --drop --validate count
With the above command, the utility drops tables in the destination database (--drop option) and uses the row count of the source and destination tables to validate the data transfer (--validate count option). The other gpcopy options specify the source and destination Greenplum Database master hosts, ports, and the user ID to use to connect to the Greenplum Database systems.
Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6.0. See the Tanzu Greenplum 6.0 Release Notes for features or changes that may necessitate post-migration tasks.
See the gpcopy reference page for complete syntax and usage information.
To migrate data between two systems on the same hardware, you need enough free disk space to accommodate over 5 times the original data set. This enables you to maintain 2 full copies of the primary and mirror data sets (on the source and destination systems), plus the original backup data in ASCII format.
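The "over 5 times" figure follows from simple counting; assuming a hypothetical 1 TB data set, the sketch below tallies the live copies plus the backup:

```shell
# Primary + mirror copies on both the source and destination systems
# (2 systems * 2 copies of the data), plus at least one backup copy.
dataset_tb=1
needed_tb=$(( dataset_tb * 2 * 2 + dataset_tb ))
echo "$needed_tb"    # prints 5 (TB), hence "over 5 times" once overhead is added
```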
If you attempt to migrate on the same system but you run out of disk space, the gpcopy utility provides the --truncate-source-after option to help you complete the operation with only a minimum of free disk space. The --truncate-source-after option instructs the utility to truncate each source table after successfully copying the table data to the destination system and validating that the copy succeeded.
If you choose to use --truncate-source-after, consider the following:

- The --truncate-source-after option does not allow for an easy rollback of the source system to its original condition if errors occur or validation checks fail during the gpcopy operation. Table errors or validation failures during the gpcopy operation can leave some tables remaining in the source system, while other tables may be empty (having been truncated after being copied to the new system). Back up all source data before using gpcopy with --truncate-source-after.
- --truncate-source-after still requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, with a --jobs setting of 5, you must ensure that you have free space equal to the sum of the 5 largest tables copied in the batch. The procedure below provides sample commands to determine the largest table sizes.
- Use the --validate option with --truncate-source-after to ensure that data is successfully copied before source tables are truncated.

If you attempt to use the instructions in Performing a Basic Data Migration to migrate systems that use the same hardware, but you do not have the required free space:
Start both the source and destination systems.
Perform a full database backup in the source Greenplum Database system.
Determine if you have enough free space to migrate your data using --truncate-source-after. Migrating data "in-place" requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, if you want to use a --jobs setting of 5, ensure that you have free space equal to the sum of the 5 largest tables copied in the batch.
The following query lists the largest 5 tables in your source system; modify the query as needed depending on the --jobs setting you intend to use:
gpadmin=# SELECT n.nspname, c.relname, c.relstorage, pg_relation_size(c.oid)
FROM
    pg_class c JOIN pg_namespace n ON (c.relnamespace=n.oid)
    JOIN pg_catalog.gp_distribution_policy p ON (c.oid = p.localoid)
WHERE
    n.nspname NOT IN ('gpexpand', 'pg_bitmapindex', 'information_schema', 'gp_toolkit')
    AND n.nspname NOT LIKE 'pg_temp_%' AND c.relstorage <> 'v'
ORDER BY 4 DESC LIMIT 5;
Either free enough disk space to cover the sum of the table sizes shown in the above query, or consider using a smaller --jobs value to reduce the free space requirements.
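The check is then just a sum; the byte counts below are stand-ins for whatever pg_relation_size returns in your system, not real measurements:

```shell
# Stand-in sizes (bytes) for the 5 largest tables; with --jobs 5, free disk
# space must cover their sum.
sizes="5000000000 4200000000 3100000000 2800000000 2500000000"
required=0
for s in $sizes; do
  required=$(( required + s ))
done
echo "$required"    # prints 17600000000
```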
As a best practice, source the greenplum_path.sh file in the source Greenplum 5 installation, so that you run gpcopy from the source system. For example:
$ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
Use gpcopy with the --truncate-source-after and --validate options to migrate your data to the destination system. A full migration automatically copies all database objects including tables, views, roles, functions, user-defined types (UDTs), resource queues, and resource groups for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). The --truncate-source-after option truncates each source table only after copying and validating the table data in the destination system. For example:
gpcopy --source-host my_host --source-port 1234 --source-user gpuser \
--dest-host my_host --dest-port 1235 --dest-user gpuser --full --drop \
--truncate-source-after --analyze --validate count
The above command performs a full database copy, first dropping tables in the destination database (--drop option) if they already exist. gpcopy truncates each table in the source system only after successfully copying and validating the table data in the destination system. The other gpcopy options specify the source and destination Greenplum Database master hosts, ports, and the user ID to use to connect to the Greenplum Database systems.
Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.
The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.
The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.
After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6. See Migrating Data from Greenplum 4.3 or 5 for features and changes that may necessitate post-migration tasks.
See the gpcopy reference page for complete syntax and usage information.