Routine System Maintenance Tasks

To keep a Greenplum Database system running efficiently, the database must be regularly cleared of expired data and the table statistics must be updated so that the query optimizer has accurate information.

Greenplum Database requires that certain tasks be performed regularly to achieve optimal performance. The tasks discussed here are required, but database administrators can automate them using standard UNIX tools such as cron scripts. An administrator sets up the appropriate scripts and checks that they ran successfully. See Recommended Monitoring and Maintenance Tasks for additional suggested maintenance activities you can implement to keep your Greenplum system running optimally.

Parent topic: Managing a Greenplum System

Routine Vacuum and Analyze

The design of the MVCC transaction concurrency model used in Greenplum Database means that deleted or updated data rows still occupy physical space on disk even though they are not visible to new transactions. If your database has many updates and deletes, many expired rows exist and the space they use must be reclaimed with the VACUUM command. The VACUUM command also collects table-level statistics, such as numbers of rows and pages, so it is also necessary to vacuum append-optimized tables, even when there is no space to reclaim from updated or deleted rows.

Vacuuming an append-optimized table follows a different process than vacuuming heap tables. On each segment, a new segment file is created and visible rows are copied into it from the current segment. When the segment file has been copied, the original is scheduled to be dropped and the new segment file is made available. This requires sufficient available disk space for a copy of the visible rows until the original segment file is dropped.

If the ratio of hidden rows to total rows in a segment file is less than a threshold value (10, by default), the segment file is not compacted. The threshold value can be configured with the gp_appendonly_compaction_threshold server configuration parameter. VACUUM FULL ignores the value of gp_appendonly_compaction_threshold and rewrites the segment file regardless of the ratio.

You can use the __gp_aovisimap_compaction_info() function in the gp_toolkit schema to investigate the effectiveness of a VACUUM operation on append-optimized tables.

For information about the __gp_aovisimap_compaction_info() function, see Checking Append-Optimized Tables.

VACUUM can be deactivated for append-optimized tables using the gp_appendonly_compaction server configuration parameter.

For details about vacuuming a database, see Vacuuming the Database.

For information about the gp_appendonly_compaction_threshold server configuration parameter and the VACUUM command, see the Greenplum Database Reference Guide.

Transaction ID Management

Greenplum's MVCC transaction semantics depend on comparing transaction ID (XID) numbers to determine visibility to other transactions. Transaction ID numbers are compared using modulo 232 arithmetic, so a Greenplum system that runs more than about two billion transactions can experience transaction ID wraparound, where past transactions appear to be in the future. This means past transactions' outputs become invisible. Therefore, it is necessary to VACUUM every table in every database at least once per two billion transactions.

Greenplum Database assigns XID values only to transactions that involve DDL or DML operations, which are typically the only transactions that require an XID.

Important
Greenplum Database monitors transaction IDs. If you do not vacuum the database regularly, Greenplum Database will generate a warning and error.

Greenplum Database issues the following warning when a significant portion of the transaction IDs are no longer available and before transaction ID wraparound occurs:

WARNING: database "database_name" must be vacuumed within *number\_of\_transactions* transactions

When the warning is issued, a VACUUM operation is required. If a VACUUM operation is not performed, Greenplum Database stops creating transactions when it reaches a limit prior to when transaction ID wraparound occurs. Greenplum Database issues this error when it stops creating transactions to avoid possible data loss:

FATAL: database is not accepting commands to avoid 
wraparound data loss in database "database_name"

The Greenplum Database configuration parameter xid_warn_limit controls when the warning is displayed. The parameter xid_stop_limit controls when Greenplum Database stops creating transactions.

Recovering from a Transaction ID Limit Error

When Greenplum Database reaches the xid_stop_limit transaction ID limit due to infrequent VACUUM maintenance, it becomes unresponsive. To recover from this situation, perform the following steps as database administrator:

Shut down Greenplum Database.
Temporarily lower the xid_stop_limit by 10,000,000.
Start Greenplum Database.
Run VACUUM FREEZE on all affected databases.
Reset the xid_stop_limit to its original value.
Restart Greenplum Database.

For information about the configuration parameters, see the Greenplum Database Reference Guide.

For information about transaction ID wraparound see the PostgreSQL documentation.

System Catalog Maintenance

Numerous database updates with CREATE and DROP commands increase the system catalog size and affect system performance. For example, running many DROP TABLE statements degrades the overall system performance due to excessive data scanning during metadata operations on catalog tables. The performance loss occurs between thousands to tens of thousands of DROP TABLE statements, depending on the system.

You should run a system catalog maintenance procedure regularly to reclaim the space occupied by deleted objects. If a regular procedure has not been run for a long time, you may need to run a more intensive procedure to clear the system catalog. This topic describes both procedures.

Regular System Catalog Maintenance

It is recommended that you periodically run REINDEX and VACUUM on the system catalog to clear the space that deleted objects occupy in the system indexes and tables. If regular database operations include numerous DROP statements, it is safe and appropriate to run a system catalog maintenance procedure with VACUUM daily at off-peak hours. You can do this while the system is available.

These are Greenplum Database system catalog maintenance steps.

Perform a REINDEX on the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves VACUUM performance.

Note
REINDEX causes locking of system catalog tables, which could affect currently running queries. To avoid disrupting ongoing business operations, schedule the REINDEX operation during a period of low activity.
Perform a VACUUM on the system catalog tables.
Perform an ANALYZE on the system catalog tables to update the catalog table statistics.

This example script performs a REINDEX, VACUUM, and ANALYZE of a Greenplum Database system catalog. In the script, replace <database-name> with a database name.

#!/bin/bash
DBNAME="<database-name>"
SYSTABLES="' pg_catalog.' || relname || ';' FROM pg_class a, pg_namespace b 
WHERE a.relnamespace=b.oid AND b.nspname='pg_catalog' AND a.relkind='r'"

reindexdb --system -d $DBNAME
psql -tc "SELECT 'VACUUM' || $SYSTABLES" $DBNAME | psql -a $DBNAME
analyzedb -as pg_catalog -d $DBNAME

Note
If you are performing catalog maintenance during a maintenance period and you need to stop a process due to time constraints, run the Greenplum Database function pg_cancel_backend(<PID>) to safely stop the Greenplum Database process.

Intensive System Catalog Maintenance

If system catalog maintenance has not been performed in a long time, the catalog can become bloated with dead space; this causes excessively long wait times for simple metadata operations. A wait of more than two seconds to list user tables, such as with the \d metacommand from within psql, is an indication of catalog bloat.

If you see indications of system catalog bloat, you must perform an intensive system catalog maintenance procedure with VACUUM FULL during a scheduled downtime period. During this period, stop all catalog activity on the system; the VACUUM FULL system catalog maintenance procedure takes exclusive locks against the system catalog.

Running regular system catalog maintenance procedures can prevent the need for this more costly procedure.

These are steps for intensive system catalog maintenance.

Stop all catalog activity on the Greenplum Database system.
Perform a REINDEX on the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves VACUUM performance.
Perform a VACUUM FULL on the system catalog tables. See the following Note.
Perform an ANALYZE on the system catalog tables to update the catalog table statistics.

Note
The system catalog table pg_attribute is usually the largest catalog table. If the pg_attribute table is significantly bloated, a VACUUM FULL operation on the table might require a significant amount of time and might need to be performed separately. The presence of both of these conditions indicate a significantly bloated pg_attribute table that might require a long VACUUM FULL time:

The pg_attribute table contains a large number of records.
The diagnostic message for pg_attribute is significant amount of bloat in the gp_toolkit.gp_bloat_diag view.

Vacuum and Analyze for Query Optimization

Greenplum Database uses a cost-based query optimizer that relies on database statistics. Accurate statistics allow the query optimizer to better estimate selectivity and the number of rows that a query operation retrieves. These estimates help it choose the most efficient query plan. The ANALYZE command collects column-level statistics for the query optimizer.

You can run both VACUUM and ANALYZE operations in the same command. For example:

=# VACUUM ANALYZE mytable;

Running the VACUUM ANALYZE command might produce incorrect statistics when the command is run on a table with a significant amount of bloat (a significant amount of table disk space is occupied by deleted or obsolete rows). For large tables, the ANALYZE command calculates statistics from a random sample of rows. It estimates the number rows in the table by multiplying the average number of rows per page in the sample by the number of actual pages in the table. If the sample contains many empty pages, the estimated row count can be inaccurate.

For a table, you can view information about the amount of unused disk space (space that is occupied by deleted or obsolete rows) in the gp_toolkit view gp_bloat_diag. If the bdidiag column for a table contains the value significant amount of bloat suspected, a significant amount of table disk space consists of unused space. Entries are added to the gp_bloat_diag view after a table has been vacuumed.

To remove unused disk space from the table, you can run the command VACUUM FULL on the table. Due to table lock requirements, VACUUM FULL might not be possible until a maintenance period.

As a temporary workaround, run ANALYZE to compute column statistics and then run VACUUM on the table to generate an accurate row count. This example runs ANALYZE and then VACUUM on the cust_info table.

ANALYZE cust_info;
VACUUM cust_info;

Important
If you intend to run queries on partitioned tables with GPORCA enabled (the default), you must collect statistics on the root partitioned table with the ANALYZE command. For information about GPORCA, see Overview of GPORCA.

Note
You can use the Greenplum Database utility analyzedb to update table statistics. Tables can be analyzed concurrently. For append optimized tables, analyzedb updates statistics only if the statistics are not current. See the analyzedb utility.

Routine Reindexing

For B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been updated many times because logically adjacent pages are usually also physically adjacent in a newly built index. Reindexing older indexes periodically can improve access speed. If all but a few index keys on a page have been deleted, there will be wasted space on the index page. A reindex will reclaim that wasted space. In Greenplum Database it is often faster to drop an index (DROP INDEX) and then recreate it (CREATE INDEX) than it is to use the REINDEX command.

For table columns with indexes, some operations such as bulk updates or inserts to the table might perform more slowly because of the updates to the indexes. To enhance performance of bulk operations on tables with indexes, you can drop the indexes, perform the bulk operation, and then re-create the index.

Database Server Log Files

Greenplum Database log output tends to be voluminous, especially at higher debug levels, and you do not need to save it indefinitely. Administrators should purge older log files periodically.

Greenplum Database by default has log file rotation enabled for the coordinator and segment database logs. Log files are created in the log subdirectory of the coordinator and each segment data directory using the following naming convention: gpdb-*YYYY*-*MM*-*DD_hhmmss*.csv. Administrators need to implement scripts or programs to periodically clean up old log files in the log directory of the coordinator and each segment instance.

Log rotation can be triggered by the size of the current log file or the age of the current log file. The log_rotation_size configuration parameter sets the size of an individual log file that triggers log rotation. When the log file size is equal to or greater than the specified size, the file is closed and a new log file is created. The log_rotation_size value is specified in kilobytes. The default is 1048576 kilobytes, or 1GB. If log_rotation_size is set to 0, size-based rotation is deactivated.

The log_rotation_age configuration parameter specifies the age of a log file that triggers rotation. When the specified amount of time has elapsed since the log file was created, the file is closed and a new log file is created. The default log_rotation_age, 1d, creates a new log file 24 hours after the current log file was created. If log_rotation_age is set to 0, time-based rotation is deactivated.

For information about viewing the database server log files, see Viewing the Database Server Log Files.

Management Utility Log Files

Log files for the Greenplum Database management utilities are written to ~/gpAdminLogs by default. The naming convention for management log files is:

<script_name>_<date>.log

The log entry format is:

<timestamp>:<utility>:<host>:<user>:[INFO|WARN|FATAL]:<message>

The log file for a particular utility execution is appended to its daily log file each time that utility is run.