To keep a Greenplum Database system running efficiently, the database must be regularly cleared of expired data and the table statistics must be updated so that the query optimizer has accurate information.
Greenplum Database requires that certain tasks be performed regularly to achieve optimal performance. The tasks discussed here are required, but database administrators can automate them using standard UNIX tools such as
cron scripts. An administrator sets up the appropriate scripts and checks that they ran successfully. See Recommended Monitoring and Maintenance Tasks for additional suggested maintenance activities you can implement to keep your Greenplum system running optimally.
Parent topic: Managing a Greenplum System
The design of the MVCC transaction concurrency model used in Greenplum Database means that deleted or updated data rows still occupy physical space on disk even though they are not visible to new transactions. If your database has many updates and deletes, many expired rows exist and the space they use must be reclaimed with the
VACUUM command. The
VACUUM command also collects table-level statistics, such as numbers of rows and pages, so it is also necessary to vacuum append-optimized tables, even when there is no space to reclaim from updated or deleted rows.
Vacuuming an append-optimized table follows a different process than vacuuming heap tables. On each segment, a new segment file is created and visible rows are copied into it from the current segment. When the segment file has been copied, the original is scheduled to be dropped and the new segment file is made available. This requires sufficient available disk space for a copy of the visible rows until the original segment file is dropped.
If the ratio of hidden rows to total rows in a segment file is less than a threshold value (10, by default), the segment file is not compacted. The threshold value can be configured with the
gp_appendonly_compaction_threshold server configuration parameter.
VACUUM FULL ignores the value of
gp_appendonly_compaction_threshold and rewrites the segment file regardless of the ratio.
You can use the
__gp_aovisimap_compaction_info() function in the gp_toolkit schema to investigate the effectiveness of a VACUUM operation on append-optimized tables.
For information about the
__gp_aovisimap_compaction_info() function see, "Checking Append-Optimized Tables" in the Greenplum Database Reference Guide.
VACUUM can be disabled for append-optimized tables using the
gp_appendonly_compaction server configuration parameter.
For details about vacuuming a database, see Vacuuming the Database.
For information about the
gp_appendonly_compaction_threshold server configuration parameter and the
VACUUM command, see the Greenplum Database Reference Guide.
Greenplum's MVCC transaction semantics depend on comparing transaction ID (XID) numbers to determine visibility to other transactions. Transaction ID numbers are compared using modulo 232 arithmetic, so a Greenplum system that runs more than about two billion transactions can experience transaction ID wraparound, where past transactions appear to be in the future. This means past transactions' outputs become invisible. Therefore, it is necessary to
VACUUM every table in every database at least once per two billion transactions.
Greenplum Database assigns XID values only to transactions that involve DDL or DML operations, which are typically the only transactions that require an XID.
Important: Greenplum Database monitors transaction IDs. If you do not vacuum the database regularly, Greenplum Database will generate a warning and error.
Greenplum Database issues the following warning when a significant portion of the transaction IDs are no longer available and before transaction ID wraparound occurs:
WARNING: database "database_name" must be vacuumed within *number\_of\_transactions* transactions
When the warning is issued, a
VACUUM operation is required. If a
VACUUM operation is not performed, Greenplum Database stops creating transactions when it reaches a limit prior to when transaction ID wraparound occurs. Greenplum Database issues this error when it stops creating transactions to avoid possible data loss:
FATAL: database is not accepting commands to avoid wraparound data loss in database "database_name"
The Greenplum Database configuration parameter
xid_warn_limit controls when the warning is displayed. The parameter
xid_stop_limit controls when Greenplum Database stops creating transactions.
When Greenplum Database reaches the
xid_stop_limit transaction ID limit due to infrequent
VACUUM maintenance, it becomes unresponsive. To recover from this situation, perform the following steps as database administrator:
VACUUM FREEZEon all affected databases.
xid_stop_limitto its original value.
For information about the configuration parameters, see the Greenplum Database Reference Guide.
For information about transaction ID wraparound see the PostgreSQL documentation.
Numerous database updates with
DROP commands increase the system catalog size and affect system performance. For example, running many
DROP TABLE statements degrades the overall system performance due to excessive data scanning during metadata operations on catalog tables. The performance loss occurs between thousands to tens of thousands of
DROP TABLE statements, depending on the system.
You should run a system catalog maintenance procedure regularly to reclaim the space occupied by deleted objects. If a regular procedure has not been run for a long time, you may need to run a more intensive procedure to clear the system catalog. This topic describes both procedures.
It is recommended that you periodically run
VACUUM on the system catalog to clear the space that deleted objects occupy in the system indexes and tables. If regular database operations include numerous
DROP statements, it is safe and appropriate to run a system catalog maintenance procedure with
VACUUM daily at off-peak hours. You can do this while the system is available.
These are Greenplum Database system catalog maintenance steps.
REINDEX on the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves
REINDEX causes locking of system catalog tables, which could affect currently running queries. To avoid disrupting ongoing business operations, schedule the
REINDEX operation during a period of low activity.
VACUUM on the system catalog tables.
ANALYZE on the system catalog tables to update the catalog table statistics.
This example script performs a
ANALYZE of a Greenplum Database system catalog. In the script, replace
<database-name> with a database name.
#!/bin/bash DBNAME="<database-name>" SYSTABLES="' pg_catalog.' || relname || ';' FROM pg_class a, pg_namespace b WHERE a.relnamespace=b.oid AND b.nspname='pg_catalog' AND a.relkind='r'" reindexdb --system -d $DBNAME psql -tc "SELECT 'VACUUM' || $SYSTABLES" $DBNAME | psql -a $DBNAME analyzedb -as pg_catalog -d $DBNAME
Note: If you are performing catalog maintenance during a maintenance period and you need to stop a process due to time constraints, run the Greenplum Database function
pg_cancel_backend(<PID>) to safely stop the Greenplum Database process.
If system catalog maintenance has not been performed in a long time, the catalog can become bloated with dead space; this causes excessively long wait times for simple metadata operations. A wait of more than two seconds to list user tables, such as with the
\d metacommand from within
psql, is an indication of catalog bloat.
If you see indications of system catalog bloat, you must perform an intensive system catalog maintenance procedure with
VACUUM FULL during a scheduled downtime period. During this period, stop all catalog activity on the system; the
VACUUM FULL system catalog maintenance procedure takes exclusive locks against the system catalog.
Running regular system catalog maintenance procedures can prevent the need for this more costly procedure.
These are steps for intensive system catalog maintenance.
REINDEXon the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves
VACUUM FULLon the system catalog tables. See the following Note.
ANALYZEon the system catalog tables to update the catalog table statistics.
Note: The system catalog table
pg_attribute is usually the largest catalog table. If the
pg_attribute table is significantly bloated, a
VACUUM FULL operation on the table might require a significant amount of time and might need to be performed separately. The presence of both of these conditions indicate a significantly bloated
pg_attribute table that might require a long
VACUUM FULL time:
pg_attributetable contains a large number of records.
significant amount of bloatin the
Greenplum Database uses a cost-based query optimizer that relies on database statistics. Accurate statistics allow the query optimizer to better estimate selectivity and the number of rows that a query operation retrieves. These estimates help it choose the most efficient query plan. The
ANALYZE command collects column-level statistics for the query optimizer.
You can run both
ANALYZE operations in the same command. For example:
=# VACUUM ANALYZE mytable;
Running the VACUUM ANALYZE command might produce incorrect statistics when the command is run on a table with a significant amount of bloat (a significant amount of table disk space is occupied by deleted or obsolete rows). For large tables, the
ANALYZE command calculates statistics from a random sample of rows. It estimates the number rows in the table by multiplying the average number of rows per page in the sample by the number of actual pages in the table. If the sample contains many empty pages, the estimated row count can be inaccurate.
For a table, you can view information about the amount of unused disk space (space that is occupied by deleted or obsolete rows) in the gp_toolkit view gp_bloat_diag. If the
bdidiag column for a table contains the value
significant amount of bloat suspected, a significant amount of table disk space consists of unused space. Entries are added to the gp_bloat_diag view after a table has been vacuumed.
To remove unused disk space from the table, you can run the command VACUUM FULL on the table. Due to table lock requirements, VACUUM FULL might not be possible until a maintenance period.
As a temporary workaround, run ANALYZE to compute column statistics and then run VACUUM on the table to generate an accurate row count. This example runs ANALYZE and then VACUUM on the cust_info table.
ANALYZE cust_info; VACUUM cust_info;
Important: If you intend to run queries on partitioned tables with GPORCA enabled (the default), you must collect statistics on the partitioned table root partition with the ANALYZE command. For information about GPORCA, see Overview of GPORCA.
Note: You can use the Greenplum Database utility analyzedb to update table statistics. Tables can be analyzed concurrently. For append optimized tables, analyzedb updates statistics only if the statistics are not current. See the analyzedb utility.
For B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been updated many times because logically adjacent pages are usually also physically adjacent in a newly built index. Reindexing older indexes periodically can improve access speed. If all but a few index keys on a page have been deleted, there will be wasted space on the index page. A reindex will reclaim that wasted space. In Greenplum Database it is often faster to drop an index (
DROP INDEX) and then recreate it (
CREATE INDEX) than it is to use the
For table columns with indexes, some operations such as bulk updates or inserts to the table might perform more slowly because of the updates to the indexes. To enhance performance of bulk operations on tables with indexes, you can drop the indexes, perform the bulk operation, and then re-create the index.
Greenplum Database log output tends to be voluminous, especially at higher debug levels, and you do not need to save it indefinitely. Administrators should purge older log files periodically.
Greenplum Database by default has log file rotation enabled for the master and segment database logs. Log files are created in the
log subdirectory of the master and each segment data directory using the following naming convention:
gpdb-*YYYY*-*MM*-*DD_hhmmss*.csv. Administrators need to implement scripts or programs to periodically clean up old log files in the
log directory of the master and each segment instance.
Log rotation can be triggered by the size of the current log file or the age of the current log file. The
log_rotation_size configuration parameter sets the size of an individual log file that triggers log rotation. When the log file size is equal to or greater than the specified size, the file is closed and a new log file is created. The
log_rotation_size value is specified in kilobytes. The default is 1048576 kilobytes, or 1GB. If
log_rotation_size is set to 0, size-based rotation is disabled.
log_rotation_age configuration parameter specifies the age of a log file that triggers rotation. When the specified amount of time has elapsed since the log file was created, the file is closed and a new log file is created. The default
log_rotation_age, 1d, creates a new log file 24 hours after the current log file was created. If
log_rotation_age is set to 0, time-based rotation is disabled.
For information about viewing the database server log files, see Viewing the Database Server Log Files.
Log files for the Greenplum Database management utilities are written to
~/gpAdminLogs by default. The naming convention for management log files is:
The log entry format is:
The log file for a particular utility execution is appended to its daily log file each time that utility is run.