You can perform a Greenplum Database expansion to add segment instances and segment hosts with minimal downtime. In general, adding nodes to a Greenplum cluster achieves a linear scaling of performance and storage capacity.
Data warehouses typically grow over time, often at a continuous pace, as additional data is gathered and the retention period increases for existing data. At times, it is necessary to increase database capacity to consolidate disparate data warehouses into a single database. The data warehouse may also require additional computing capacity (CPU) to accommodate added analytics projects. It is good to provide capacity for growth when a system is initially specified, but even if you anticipate high rates of growth, it is generally unwise to invest in capacity long before it is required. Database expansion, therefore, is a project that you should expect to have to run periodically.
When you expand your database, you should expect the following qualities:
The planning and physical aspects of an expansion project are a greater share of the work than expanding the database itself. It will take a multi-discipline team to plan and run the project. For on-premise installations, space must be acquired and prepared for the new servers. The servers must be specified, acquired, installed, cabled, configured, and tested. For cloud deployments, similar plans should also be made. Planning New Hardware Platforms describes general considerations for deploying new hardware.
After you provision the new hardware platforms and set up their networks, configure the operating systems and run performance tests using Greenplum utilities. The Greenplum Database software distribution includes utilities that are helpful to test and burn-in the new servers before beginning the software phase of the expansion. See Preparing and Adding Hosts for steps to prepare the new hosts for Greenplum Database.
Once the new servers are installed and tested, the software phase of the Greenplum Database expansion process begins. The software phase is designed to be minimally disruptive, transactionally consistent, reliable, and flexible.
The first step of the software phase of expansion process is preparing the Greenplum Database system: adding new segment hosts and initializing new segment instances. This phase can be scheduled to occur during a period of low activity to avoid disrupting ongoing business operations. During the initialization process, the following tasks are performed:
Greenplum Database software is installed.
Databases and database objects are created in the new segment instances on the new segment hosts.
The gpexpand schema is created in the postgres database. You can use the tables and view in the schema to monitor and control the expansion process. After the system has been updated, the new segment instances on the new segment hosts are available.
New segments are immediately available and participate in new queries and data loads. The existing data, however, is skewed. It is concentrated on the original segments and must be redistributed across the new total number of primary segments.
Because some of the table data is skewed, some queries might be less efficient because more data motion operations might be needed.
The last step of the software phase is redistributing table data. Using the expansion control tables in the gpexpand schema as a guide, tables are redistributed. For each table:
ImportantThe
gprestore
utility cannot restore backups you made before the expansion with thegpbackup
utility, so back up your databases immediately after the system expansion is complete.
Redistributing table data is a long-running process that creates a large volume of network and disk activity. It can take days to redistribute some very large databases. To minimize the effects of the increased activity on business operations, system administrators can pause and resume expansion activity on an ad hoc basis, or according to a predetermined schedule. Datasets can be prioritized so that critical applications benefit first from the expansion.
In a typical operation, you run the gpexpand
utility four times with different options during the complete expansion process.
To create an expansion input file:
gpexpand -f <hosts_file>
To initialize segments and create the expansion schema:
gpexpand -i <input_file>
gpexpand
creates a data directory, copies user tables from all existing databases on the new segments, and captures metadata for each table in an expansion schema for status tracking. After this process completes, the expansion operation is committed and irrevocable.
gpexpand -d <duration>
During initialization, gpexpand
adds and initializes new segment instances. To complete system expansion, you must run gpexpand
to redistribute data tables across the newly added segment instances. Depending on the size and scale of your system, redistribution can be accomplished in a single session during low-use hours, or you can divide the process into batches over an extended period. Each table or partition is unavailable for read or write operations during redistribution. As each table is redistributed across the new segments, database performance should incrementally improve until it exceeds pre-expansion performance levels.
You may need to run gpexpand
several times to complete the expansion in large-scale systems that require multiple redistribution sessions. gpexpand
can benefit from explicit table redistribution ranking; see Planning Table Redistribution.
Users can access Greenplum Database during initialization, but they may experience performance degradation on systems that rely heavily on hash distribution of tables. Normal operations such as ETL jobs, user queries, and reporting can continue, though users might experience slower response times.
To remove the expansion schema:
gpexpand -c
For information about the gpexpand utility and the other utilities that are used for system expansion, see the Greenplum Database Utility Guide.
Parent topic: Expanding a Greenplum System