You can use the gpcopy utility to transfer data between databases in different Greenplum Database systems.

gpcopy is a high-performance utility that can copy metadata and data from one Greenplum database to another Greenplum database. You can migrate the entire contents of a database, or just selected tables.

You can migrate data between different Greenplum Database versions. gpcopy supports migrating data from a Greenplum Database version 4.3.26 (or later) system to a 5.9 (or later) or a 6.x Greenplum system, or from a Greenplum Database version 5.9+ system to a Greenplum 6.x system.

The gpcopy interface includes options to transfer one or more full databases, or one or more database tables. A full database transfer includes the database schema, table data, indexes, views, roles, user-defined functions, resource queues, and resource groups. If a copied table or database does not exist in the destination system, gpcopy creates it automatically, along with indexes as necessary.

Configuration files, including postgresql.conf and pg_hba.conf, must be transferred manually by an administrator. Extensions installed in the database with gppkg, such as MADlib and programming language extensions, must be installed in the destination database by an administrator.

gpcopy is a command-line tool that includes these features:

  • gpcopy can migrate data between Greenplum Database systems where the source and destination systems are configured with a different number of segment instances.

  • gpcopy provides detailed reporting and summary information about all aspects of the copy operation.

  • gpcopy allows the source table data to change while the data is being copied. A lock is not acquired on the source table when data is copied.

  • The gpcopy utility includes the --truncate-source-after option to help migrate data from one VMware Greenplum system to another on the same hardware when only minimal free disk space is available.

Prerequisites

The source and destination Greenplum Database systems must already exist, all hosts in both systems must have network access to each other, and both systems must have a coordinator host and primary segment hosts.

If you are copying data between Greenplum Database systems that have different versions, each system must have gpcopy installed locally. You can download gpcopy from the Broadcom Support Portal.

Note

For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.

gpcopy is dependent on the pg_dump, pg_dumpall, and psql utilities installed with Greenplum Database. In most cases, you run gpcopy from a Greenplum Database host system, so the dependencies are automatically met. If you need to run gpcopy on a remote server, such as an ETL system, copy the gpcopy binary and install a compatible Greenplum Clients package to meet the gpcopy dependencies.

Limitations for the Source and Destination Systems

gpcopy transfers data from user databases only; the postgres, template0, and template1 databases cannot be transferred. Administrators must transfer configuration files manually and install extensions into the destination database with gppkg.

For information about how gpcopy handles dependent database objects when copying a specific database or a set of tables, see the Notes section in the gpcopy reference.

If you are copying data from a VMware Greenplum system running on RHEL7 to a system running on RHEL8, the default --parallelize-leaf-partitions setting may cause data to be copied to an incorrect partition in range-partitioned tables because of glibc changes between the operating system versions. In this scenario, always set this option to false so that each partitioned table is copied as a single table based on the root partition. See Migrating VMware Greenplum from Enterprise Linux 7 to 8 in the VMware Greenplum documentation for more information.
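As a hedged sketch of this scenario (the host and table names below are hypothetical), an EL7-to-EL8 copy with the flag set as recommended might look like this:

```shell
# Placeholder hosts and table; --parallelize-leaf-partitions false copies
# each partitioned table as a single table based on the root partition.
gpcopy --source-host el7-mdw --source-port 5432 \
    --dest-host el8-mdw --dest-port 5432 \
    --include-table db1.public.sales_by_date \
    --parallelize-leaf-partitions false
```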

gpcopy cannot copy a row that is larger than 1GB in size.

The --on-segment-threshold setting determines where gpcopy performs the data transfer:

  • When --on-segment-threshold is set to -1 (the default), gpcopy copies table data using the source and destination Greenplum Database segment instances.
  • When --on-segment-threshold is set to -2, gpcopy copies table data using the source and destination Greenplum Database coordinators.
  • When --on-segment-threshold is set to a positive value, it identifies a row number threshold. If a table contains this number of rows or less, gpcopy copies the table data using the source and destination coordinators. If the number of rows in a table is more than the threshold, gpcopy copies the data using the source and destination segment instances. gpcopy uses the source table statistics to determine the number of table rows. If a source table is not analyzed, gpcopy assumes that the table is a small table, ignores the threshold setting, and copies the table data using only the coordinators. If your database has large tables without statistics, set the --on-segment-threshold option to -1 to force gpcopy to copy table data using segment instances.
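As a sketch of the threshold in practice (the hosts, database, and table below are placeholders, and 100000 is an arbitrary example value), you can refresh statistics first so the threshold is honored, then copy:

```shell
# ANALYZE first: without statistics, gpcopy assumes a table is small and
# copies it through the coordinators regardless of the threshold.
psql -h src-mdw -d db1 -c 'ANALYZE public.sales;'
gpcopy --source-host src-mdw --dest-host dest-mdw \
    --include-table db1.public.sales \
    --on-segment-threshold 100000
```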

gpcopy does not support table data distribution checking when copying a partitioned table that is defined with a leaf table that is an external table or if a leaf table is defined with a distribution policy that is different from the root partitioned table. You can copy those tables in a gpcopy operation and specify the --no-distribution-check option to deactivate checking of data distribution.

Caution

Before you perform a gpcopy operation with the --no-distribution-check option, ensure that you have a backup of the destination database and that the distribution policies of the tables that are being copied are the same in the source and destination database. Copying data into segment instances with incorrect data distribution can cause incorrect query results and can cause database corruption.

About Connecting Using Password Authentication

When the pg_hba.conf file specifies password authentication, gpcopy can use the value of the PGPASSWORD or PGPASSFILE environment variable to obtain the connection password for the source and/or destination Greenplum Database cluster as follows:

  • PGPASSFILE specifies the name of the password file to use for lookups. gpcopy can use this file to obtain passwords for both the source and the destination Greenplum Database clusters.

  • The PGPASSWORD environment variable behaves the same as the password connection parameter. The password for the source and destination Greenplum Database clusters must be the same when you specify a PGPASSWORD.

    Note: Use of the PGPASSWORD environment variable is not recommended for security reasons, as some operating systems allow non-root users to see process environment variables via ps; instead consider using a password file.

When both environment variables are set, the PGPASSWORD setting takes precedence. When neither is set, gpcopy attempts to use the default password file, typically ~/.pgpass, to obtain the password.
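For example, a hypothetical ~/.pgpass file (which must have 0600 permissions) can hold different passwords for the source and destination coordinators; the hosts, ports, and passwords below are placeholders:

```
# hostname:port:database:username:password
source-mdw:5432:*:gpadmin:sourcePassw0rd
dest-mdw:5432:*:gpadmin:destPassw0rd
```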

About Connecting Using SSL/TLS

When the pg_hba.conf file specifies the SSL/TLS connection type, you can configure gpcopy to initiate an encrypted connection to both the source and the destination Greenplum Database clusters. In this scenario, gpcopy is the SSL client and the source and destination Greenplum Database clusters are the SSL servers.

gpcopy supports SSL encryption on the connection to Greenplum as described in the PostgreSQL SSL Support documentation.

You have two options to specify the SSL key and certificate files:

  • Option 1: Specify the SSL private key, certificate, and certificate authority files for gpcopy via the PGSSLCERT, PGSSLKEY, and PGSSLROOTCERT environment variables. Refer to the PostgreSQL documentation for more information about these variables.
  • Option 2: Locate the certificates and keys in the ~/.postgresql directory on the gpcopy host.

When both the source and the destination Greenplum Database clusters are configured for SSL/TLS, gpcopy uses the specified SSL private key and certificate for both connections.

You can use the PGSSLMODE environment variable to specify the SSL mode for the connections. Because gpcopy does not support the prefer SSL mode, you must set PGSSLMODE to disable when one of the Greenplum clusters is configured for encryption and the other is not.
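A minimal environment sketch for Option 1, assuming the standard libpq file locations (all paths below are placeholders you would adjust):

```shell
# Point gpcopy's client connections at the key/certificate files.
export PGSSLCERT=~/.postgresql/postgresql.crt
export PGSSLKEY=~/.postgresql/postgresql.key
export PGSSLROOTCERT=~/.postgresql/root.crt
# Require encryption; use disable instead if only one cluster uses SSL.
export PGSSLMODE=require
```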

About SSL/TLS Encryption on the Data Channel

gpcopy supports using SSL/TLS encryption during data transfer from the source to the destination Greenplum Database cluster. In this scenario, a TLS connection is established between a gpcopy_helper running in the source Greenplum Database cluster and a gpcopy_helper running in the destination Greenplum Database cluster, and the data to copy is sent over this connection.

If you choose to direct gpcopy to use this encryption method for the data channel between gpcopy_helpers:

  • You must specify the private key and certificate via the --ssl-key <key_file> and --ssl-cert <cert_file> options to the gpcopy command. These files must reside in the same file system location on all hosts (coordinator and segment) in the destination Greenplum Database cluster.
  • You can optionally specify the root certificate authority via the --ssl-ca <ca_file> option to the gpcopy command. When you specify the root CA, <ca_file> must reside in the same file system location on all hosts (coordinator and segment) in the source Greenplum Database cluster.
  • You can optionally specify the minimum TLS version (--ssl-min-tls <min_version_str>) for the encrypted connection. Valid <min_version_str> values are '1.0', '1.1', '1.2', or '1.3'. The default minimum TLS version is 1.0.

gpcopy determines the sslmode to use for the connection based on the --ssl-<xxx> options provided to the command.
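A hedged end-to-end sketch of a copy with data-channel encryption (all file paths and hosts below are placeholders):

```shell
# The key and certificate must reside at this path on all destination
# hosts; the CA file must reside at its path on all source hosts.
# --ssl-min-tls is optional (the default minimum is 1.0).
gpcopy --source-host src-mdw --dest-host dest-mdw --full --drop \
    --ssl-key /etc/gpcopy/server.key \
    --ssl-cert /etc/gpcopy/server.crt \
    --ssl-ca /etc/gpcopy/root.crt \
    --ssl-min-tls '1.2'
```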

About Specifying a Transaction Snapshot

To avoid copying data changes that occur while gpcopy is transferring data, you can choose to run the gpcopy operation within a transaction snapshot.

Note

You can instruct gpcopy to run the copy operation in a specific transaction when initiated from a version 6 or 7 source Greenplum Database cluster. gpcopy does not support transaction snapshots with version 5.x source Greenplum clusters.

If you choose this option, you must start the transaction before you run gpcopy, and you must provide the identifier of the transaction snapshot to the command, as follows:

  1. Open a psql session to the source Greenplum cluster and start a transaction:

    db1=# BEGIN;
    BEGIN
    
    Note

    Do not terminate the transaction until the gpcopy operation completes.

  2. Use the pg_export_snapshot() function to export the transaction snapshot identifier. For example:

    db1=# SELECT pg_export_snapshot();
     pg_export_snapshot
    ---------------------
     00000006-0007686E-1
    (1 row)
    
  3. Provide the --snapshot option with the returned identifier when you invoke the gpcopy command. For example:

    $ gpcopy --source-host 1.1.1.1 --source-port 6000 \
         --dest-host 1.1.1.1 --dest-port 6000 \
         --include-table db1.public.test --dest-table newdb.public.newtbl \
         --snapshot 00000006-0007686E-1
    

    gpcopy will internally invoke SET TRANSACTION SNAPSHOT <snapshot_id> before it initiates the copy operation.

  4. When the gpcopy operation completes, end the transaction:

    db1=# END;
    END
    

About YAML Support for gpcopy

To avoid long or complicated command lines, VMware Greenplum supports a YAML configuration file format for gpcopy. In a YAML configuration, settings correspond to gpcopy command-line flags. For example, the following snippet of a sample YAML file (test.yaml) shows the correspondence between YAML settings and command-line flags:

source:
   host: 100.1.1.1
   port: 15432
   user: gpadmin
   dbname: yamltestdb1
dest:
   host: 100.1.1.1
   port: 15432
   user: gpadmin
   dest-dbname: yamltestdb2
task:
   drop: true

If you run gpcopy --config test.yaml or gpcopy -c test.yaml, the result is the same as running gpcopy --source-host 100.1.1.1 --source-port 15432 --source-user gpadmin --dest-host 100.1.1.1 --dest-port 15432 --dest-user gpadmin --dbname yamltestdb1 --dest-dbname yamltestdb2 --drop.

A complete sample gpcopy YAML file is shown below.

Note

This is a sample file; do not use it as-is.

source:
   host: 100.1.1.1
   port: 15432
   user: gpadmin
   full: false
   dbname: yamltestdb1
   include-table:
   - yamltestdb1.public.a1
   - yamltestdb1.public.a2
   exclude-table:
   - yamltestdb1.public.a2
   include-table-file:
   - /tmp/file1.txt
   - /tmp/file2.txt
   exclude-table-file:
   - /tmp/file3.txt
   include-table-json:
   - /tmp/abc1.json
   - /tmp/abc2.json
   snapshot: 00000001
dest:
   host: 100.1.1.1
   port: 15432
   user: gpadmin
   dest-dbname: yamltestdb2
   dest-table:
   - yamltestdb2.public.a1
   - yamltestdb2.public.a2
   mapping-file: /tmp/host_ip_map.txt
task:
   drop: true
   truncate: false
   append: false
   skip-existing: false
   no-compression: true
   analyze: true
   no-distribution-check: true
   truncate-source-after: true
   skip-prompt: true
   validate: "count"
   metadata-only: false
   no-ownership: false
option:
   enable-receive-daemon: false
   jobs: 4
   on-segment-threshold: 0
   parallelize-leaf-partitions: true
   data-port-range: 9010-9020
   timeout: 30
   dumper: pgdump
   rpc-port: 7667
   ssl:
      ssl-key: /tmp/server.key
      ssl-cert: /tmp/server.crt
      ssl-ca: /tmp/root.crt
      ssl-min-tls: 1.3

When you run gpcopy with a --config YAML file, the only other command-line options you can specify are --debug, --verbose, --quiet, and --dryrun. Otherwise, the operation errors out.

You can enter single or multiple inputs for the parameters dbname, include-table, exclude-table, include-table-file, exclude-table-file, include-table-json, dest-dbname, and dest-table as shown below:

include-table: yamltestdb1.public.a1
 
include-table:
- yamltestdb1.public.a1
- yamltestdb1.public.a2

Configuring Parallel Jobs

The degree of parallelism when running gpcopy is determined by the --jobs option, which controls the number of processes that gpcopy runs in parallel. The default is 4. The range is from 1 to 64512.

The --jobs value, n, produces 2*n+1 database connections. For example, the default --jobs value of 4 creates 9 connections.
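The connection arithmetic can be checked directly in the shell:

```shell
# A --jobs value of n produces 2*n + 1 database connections.
jobs=4
connections=$((2 * jobs + 1))
echo "$connections"   # prints 9
```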

If you are copying partitioned tables, the default is to copy the leaf partition tables of a partitioned table in parallel. The --jobs value might affect the number of leaf partitions that are copied in parallel. Increasing the --jobs value might improve performance depending on factors including the number of Greenplum Database hosts and segments, the number of leaf partitions, and the amount of data being transferred. You can deactivate the copying of leaf partitions in parallel; see the --parallelize-leaf-partitions option for more information.

If you increase the --jobs option, ensure that the Greenplum Database systems are configured with a sufficient maximum concurrent connection value to accommodate the gpcopy connections and any other concurrent connections (such as user connections) that you require. See the Greenplum Database server configuration parameter max_connections.

Monitoring and Managing Parallel Jobs Using gpcopy job Subcommand

While gpcopy is running, an RPC server starts automatically to service the gpcopy job command. The default port for the RPC server is 7667. You can use the gpcopy job subcommand to monitor and manage the number of parallel jobs. For example, while copying multiple large tables, the CPU may be fully occupied, impacting server performance; while copying multiple small tables, you can increase the number of parallel jobs to speed up the operation. To achieve higher efficiency, use the gpcopy job subcommand to monitor and adjust the number of jobs running simultaneously.

To check the number of parallel jobs that are running, you can use the following command:

$ gpcopy job --status
 workers | status
 5       | running

In the above example, 5 concurrent jobs are running in parallel. If you want to increase the number of parallel jobs by 1, run the following command:

$ gpcopy job --increase 1
$ gpcopy job --status
 workers | status
 6       | running

In the above example, the number of parallel jobs has increased by one. Now, if you want to reduce the number of parallel jobs by one, run the following command:

$ gpcopy job --decrease 1
$ gpcopy job --status
 workers | status
 5       | running

In the above example, the number of parallel jobs has decreased by one.

Note

The gpcopy job --decrease 1 operation takes a while to complete. You must wait for the operation to finish.

Validating Copied Data

By default, gpcopy does not validate the transferred data. You can request validation using the --validate <type> option. You must include --validate <type> if you specify the --truncate-source-after option. The validation type can be one of the following:

  • count - compares the row counts between the source and destination tables.
  • md5xor - validates by selecting all rows of the source and destination tables, converting all columns in a row to text, and then calculating the MD5 value of each row. gpcopy then performs an XOR over the MD5 values to ensure that all rows were successfully copied for the table.

Note: Avoid using --append with either validation option. If you use --append and the source table already includes rows, then either validation method will fail due to the different number of rows in the source and destination tables.
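As a manual spot check equivalent to --validate count (the hosts, database, and table below are placeholders), you can compare the row counts yourself:

```shell
# The two counts should match after a successful copy.
psql -h src-mdw -d db1 -tAc 'SELECT count(*) FROM public.test'
psql -h dest-mdw -d db1 -tAc 'SELECT count(*) FROM public.test'
```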

Addressing Failed Data Transfers

When gpcopy encounters errors and quits or is cancelled by the user, current copy operations on tables in the destination database are rolled back. Copy operations that have completed are not rolled back.

If an error occurs during the process of copying a table, or table validation fails, gpcopy continues copying the other specified tables. After gpcopy finishes, it displays a list of tables where errors occurred or validation failed and displays a gpcopy command. You can use the provided command to retry copying the failed tables.

The gpcopy utility logs messages in the log file gpcopy_date.log in the ~/gpAdminLogs directory on the coordinator host. If you run multiple gpcopy commands on the same day, the utility appends messages to that day's log file.

After gpcopy completes, it displays a summary of the operations performed. If the utility failed to copy any tables, they are highlighted in the summary, which also provides a gpcopy command that you can use to copy just the failed tables. This information is displayed at the command prompt and in the gpcopy log file.

After resolving the issues that caused the copy operations to fail, you can run the provided command to copy the tables that failed in the previous gpcopy command.

Performing a Basic Data Migration

Follow this procedure to migrate data from one Greenplum Database system to another with gpcopy:

  1. Start both the source and destination systems.

  2. Perform a full database backup in the source Greenplum Database system.

  3. As a best practice, source the greenplum_path.sh file in the source Greenplum 5 installation, so that you run gpcopy from the source system. For example:

    $ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
    
  4. Use gpcopy with the --full option to migrate your data to the destination system. A full migration automatically copies all database objects, including tables, indexes, views, roles, functions, user-defined types (UDTs), resource queues, and resource groups, for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). For example:

    gpcopy --source-host mytest --source-port 1234 --source-user gpuser \
        --dest-host demohost --dest-port 1234 --dest-user gpuser \
        --full --drop --validate count
    

    With the above command, the utility drops tables in the destination database (--drop option) and uses the row count of the source and destination tables to validate the data transfer (--validate count option). The other gpcopy options specify the source and destination Greenplum Database system coordinator hosts, ports, and the User ID to use to connect to the Greenplum Database systems.

    Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.

  5. The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.

  6. The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.

  7. After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6.0. See the VMware Greenplum 6.0 Release Notes for features or changes that may necessitate post-migration tasks.

See the gpcopy reference page for complete syntax and usage information.

Migrating Data Between Systems that Share Hardware

In order to migrate data between two systems on the same hardware, you should have enough free disk space to accommodate over 5 times the original data set. This enables you to maintain 2 full copies of the primary and mirror data sets (on the source and destination systems), plus the original backup data in ASCII format.

If you attempt to migrate on the same system but you run out of disk space, the gpcopy utility provides the --truncate-source-after option to help you complete the operation with only a minimum of free disk space. The --truncate-source-after option instructs the utility to truncate each source table after successfully copying the table data to the destination system and validating that the copy succeeded.

If you choose to use --truncate-source-after, consider the following:

  • The --truncate-source-after option does not allow you to easily roll back the source system to its original condition if errors occur or validation checks fail during the gpcopy operation. Table errors or validation failures during the gpcopy operation can leave some tables intact in the source system, while other tables may be empty (having been truncated after being copied to the new system). Back up all source data before using gpcopy with --truncate-source-after.
  • Migrating data with --truncate-source-after still requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, with a --jobs setting of 5, you must ensure that you have free space equal to the sum of the 5 largest tables copied in the batch. The procedure below provides sample commands to determine the largest table sizes.
  • You must use the --validate option with --truncate-source-after to ensure that data is successfully copied before source tables are truncated.

If you attempt to use the instructions in Performing a Basic Data Migration to migrate systems that use the same hardware, but you do not have the required free space:

  1. Start both the source and destination systems.

  2. Perform a full database backup in the source Greenplum Database system.

  3. Determine if you have enough free space to migrate your data using --truncate-source-after. Migrating data "in-place" requires an amount of free disk space equal to the sum of the largest tables that you will migrate in a single batch using gpcopy. For example, if you want to use a --jobs setting of 5, ensure that you have free space equal to the sum of the 5 largest tables copied in the batch.

    The following query lists the largest 5 tables in your source system; modify the query as needed depending on the --jobs setting you intend to use:

    gpadmin=# SELECT n.nspname, c.relname, c.relstorage, pg_relation_size(c.oid)
    FROM pg_class c JOIN pg_namespace n ON (c.relnamespace=n.oid)
    JOIN pg_catalog.gp_distribution_policy p ON (c.oid = p.localoid)
    WHERE n.nspname NOT IN ('gpexpand', 'pg_bitmap', 'information_schema', 'gp_toolkit')
    AND n.nspname NOT LIKE 'pg_temp_%' AND c.relstorage <> 'v'
    ORDER BY 4 DESC LIMIT 5;
    
    

    Either free enough disk space to cover the sum of the table sizes shown in the above query, or consider using a smaller --jobs value to reduce the free space requirements.

  4. As a best practice, source the greenplum_path.sh file in the source Greenplum 5 installation, so that you run gpcopy from the source system. For example:

    $ source /usr/local/greenplum-db-5.20.0/greenplum_path.sh
    
  5. Use gpcopy with the --truncate-source-after and --validate options to migrate your data to the destination system. A full migration automatically copies all database objects, including tables, views, roles, functions, user-defined types (UDTs), resource queues, and resource groups, for all user-defined databases. Include the --drop option to drop any destination tables that may already exist (recreating them as necessary). The --truncate-source-after option truncates each source table only after copying and validating the table data in the destination system. For example:

    gpcopy --source-host my_host --source-port 1234 --source-user gpuser \
        --dest-host my_host --dest-port 1235 --dest-user gpuser --full --drop \
        --truncate-source-after --analyze --validate count
    

    The above command performs a full database copy, first dropping tables in the destination database (--drop option) if they already exist. gpcopy truncates each table in the source system only after successfully copying and validating the table data in the destination system. The other gpcopy options specify the source and destination Greenplum Database system coordinator hosts, ports, and the User ID to use to connect to the Greenplum Database systems.

    Note: While the example command performs a full system copy, consider migrating only portions of your data at a time, so that you can reduce downtime while addressing table errors or validation failures that may occur during the copy operation.

  6. The gpcopy utility does not copy configuration files such as postgresql.conf and pg_hba.conf. You must set up the destination system configuration as necessary to match the source system.

  7. The gpcopy utility does not copy external objects such as Greenplum Database extensions, third party jar files, and shared object files. You must recreate these external objects as necessary to match the source system.

  8. After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6. See Migrating Data from Greenplum 4.3 or 5 for features and changes that may necessitate post-migration tasks.

See the gpcopy reference page for complete syntax and usage information.
