To keep a Greenplum Database system running efficiently, the database must be regularly cleared of expired data and the table statistics must be updated so that the query optimizer has accurate information.
Greenplum Database requires that certain tasks be performed regularly to achieve optimal performance. The tasks discussed here are required, but database administrators can automate them using standard UNIX tools such as cron
scripts. An administrator sets up the appropriate scripts and checks that they execute successfully. See Recommended Monitoring and Maintenance Tasks for additional suggested maintenance activities you can implement to keep your Greenplum system running optimally.
Parent topic: Managing a Greenplum System
The design of the MVCC transaction concurrency model used in Greenplum Database means that deleted or updated data rows still occupy physical space on disk even though they are not visible to new transactions. If your database has many updates and deletes, many expired rows exist and the space they use must be reclaimed with the VACUUM
command. The VACUUM
command also collects table-level statistics, such as numbers of rows and pages, so it is also necessary to vacuum append-optimized tables, even when there is no space to reclaim from updated or deleted rows.
Vacuuming an append-optimized table follows a different process than vacuuming heap tables. On each segment, a new segment file is created and visible rows are copied into it from the current segment. When the segment file has been copied, the original is scheduled to be dropped and the new segment file is made available. This requires sufficient available disk space for a copy of the visible rows until the original segment file is dropped.
If the ratio of hidden rows to total rows in a segment file is less than a threshold value (10, by default), the segment file is not compacted. The threshold value can be configured with the gp_appendonly_compaction_threshold
server configuration parameter. VACUUM FULL
ignores the value of gp_appendonly_compaction_threshold
and rewrites the segment file regardless of the ratio.
You can use the __gp_aovisimap_compaction_info()
function in the the gp_toolkit schema to investigate the effectiveness of a VACUUM operation on append-optimized tables.
For information about the __gp_aovisimap_compaction_info()
function see, "Checking Append-Optimized Tables" in the Greenplum Database Reference Guide.
VACUUM
can be deactivated for append-optimized tables using the gp_appendonly_compaction
server configuration parameter.
For details about vacuuming a database, see Vacuuming the Database.
For information about the gp_appendonly_compaction_threshold
server configuration parameter and the VACUUM
command, see the Greenplum Database Reference Guide.
Greenplum's MVCC transaction semantics depend on comparing transaction ID (XID) numbers to determine visibility to other transactions. Transaction ID numbers are compared using modulo 232 arithmetic, so a Greenplum system that runs more than about two billion transactions can experience transaction ID wraparound, where past transactions appear to be in the future. This means past transactions' outputs become invisible. Therefore, it is necessary to VACUUM
every table in every database at least once per two billion transactions.
Greenplum Database assigns XID values only to transactions that involve DDL or DML operations, which are typically the only transactions that require an XID.
Important: Greenplum Database monitors transaction IDs. If you do not vacuum the database regularly, Greenplum Database will generate a warning and error.
Greenplum Database issues the following warning when a significant portion of the transaction IDs are no longer available and before transaction ID wraparound occurs:
WARNING: database "database_name" must be vacuumed within <number_of_transactions> transactions
When the warning is issued, a VACUUM
operation is required. If a VACUUM
operation is not performed, Greenplum Database stops creating transactions when it reaches a limit prior to when transaction ID wraparound occurs. Greenplum Database issues this error when it stops creating transactions to avoid possible data loss:
FATAL: database is not accepting commands to avoid
wraparound data loss in database "database_name"
The Greenplum Database configuration parameter xid_warn_limit
controls when the warning is displayed. The parameter xid_stop_limit
controls when Greenplum Database stops creating transactions.
When Greenplum Database reaches the xid_stop_limit
transaction ID limit due to infrequent VACUUM
maintenance, it becomes unresponsive. To recover from this situation, perform the following steps as database administrator:
xid_stop_limit
by 10,000,000.VACUUM FREEZE
on all affected databases.xid_stop_limit
to its original value.For information about the configuration parameters, see the Greenplum Database Reference Guide.
For information about transaction ID wraparound see the PostgreSQL documentation.
Numerous database updates with CREATE
and DROP
commands increase the system catalog size and affect system performance. For example, running many DROP TABLE
statements degrades the overall system performance due to excessive data scanning during metadata operations on catalog tables. The performance loss occurs between thousands to tens of thousands of DROP TABLE
statements, depending on the system.
You should run a system catalog maintenance procedure regularly to reclaim the space occupied by deleted objects. If a regular procedure has not been run for a long time, you may need to run a more intensive procedure to clear the system catalog. This topic describes both procedures.
It is recommended that you periodically run REINDEX
and VACUUM
on the system catalog to clear the space that deleted objects occupy in the system indexes and tables. If regular database operations include numerous DROP
statements, it is safe and appropriate to run a system catalog maintenance procedure with VACUUM
daily at off-peak hours. You can do this while the system is available.
These are Greenplum Database system catalog maintenance steps.
Perform a REINDEX
on the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves VACUUM
performance.
Note: When performing REINDEX
on the system catalog tables, locking will occur on the tables and might have an impact on currently running queries. You can schedule the REINDEX
operation during a period of low activity to avoid disrupting ongoing business operations.
Perform a VACUUM
on the system catalog tables.
Perform an ANALYZE
on the system catalog tables to update the catalog table statistics.
This example script performs a REINDEX
, VACUUM
, and ANALYZE
of a Greenplum Database system catalog. In the script, replace <database-name>
with a database name.
#!/bin/bash
DBNAME="<<database-name>>"
SYSTABLES="' pg_catalog.' || relname || ';' FROM pg_class a, pg_namespace b
WHERE a.relnamespace=b.oid AND b.nspname='pg_catalog' AND a.relkind='r'"
reindexdb --system -d $DBNAME
psql -tc "SELECT 'VACUUM' || $SYSTABLES" $DBNAME | psql -a $DBNAME
analyzedb -s pg_catalog -d $DBNAME
Note: If you are performing catalog maintenance during a maintenance period and you need to stop a process due to time constraints, run the Greenplum Database function pg_cancel_backend(<PID>)
to safely stop the Greenplum Database process.
If system catalog maintenance has not been performed in a long time, the catalog can become bloated with dead space; this causes excessively long wait times for simple metadata operations. A wait of more than two seconds to list user tables, such as with the \d
metacommand from within psql
, is an indication of catalog bloat.
If you see indications of system catalog bloat, you must perform an intensive system catalog maintenance procedure with VACUUM FULL
during a scheduled downtime period. During this period, stop all catalog activity on the system; the VACUUM FULL
system catalog maintenance procedure takes exclusive locks against the system catalog.
Running regular system catalog maintenance procedures can prevent the need for this more costly procedure.
These are steps for intensive system catalog maintenance.
REINDEX
on the system catalog tables to rebuild the system catalog indexes. This removes bloat in the indexes and improves VACUUM
performance.VACUUM FULL
on the system catalog tables. See the following Note.ANALYZE
on the system catalog tables to update the catalog table statistics.Note: The system catalog table pg_attribute
is usually the largest catalog table. If the pg_attribute
table is significantly bloated, a VACUUM FULL
operation on the table might require a significant amount of time and might need to be performed separately. The presence of both of these conditions indicate a significantly bloated pg_attribute
table that might require a long VACUUM FULL
time:
pg_attribute
table contains a large number of records.pg_attribute
is significant amount of bloat
in the gp_toolkit.gp_bloat_diag
view.Greenplum Database uses a cost-based query optimizer that relies on database statistics. Accurate statistics allow the query optimizer to better estimate selectivity and the number of rows that a query operation retrieves. These estimates help it choose the most efficient query plan. The ANALYZE
command collects column-level statistics for the query optimizer.
You can run both VACUUM
and ANALYZE
operations in the same command. For example:
=# VACUUM ANALYZE mytable;
Running the VACUUM ANALYZE command might produce incorrect statistics when the command is run on a table with a significant amount of bloat (a significant amount of table disk space is occupied by deleted or obsolete rows). For large tables, the ANALYZE
command calculates statistics from a random sample of rows. It estimates the number rows in the table by multiplying the average number of rows per page in the sample by the number of actual pages in the table. If the sample contains many empty pages, the estimated row count can be inaccurate.
For a table, you can view information about the amount of unused disk space (space that is occupied by deleted or obsolete rows) in the gp_toolkit view gp_bloat_diag. If the bdidiag
column for a table contains the value significant amount of bloat suspected
, a significant amount of table disk space consists of unused space. Entries are added to the gp_bloat_diag view after a table has been vacuumed.
To remove unused disk space from the table, you can run the command VACUUM FULL on the table. Due to table lock requirements, VACUUM FULL might not be possible until a maintenance period.
As a temporary workaround, run ANALYZE to compute column statistics and then run VACUUM on the table to generate an accurate row count. This example runs ANALYZE and then VACUUM on the cust_info table.
ANALYZE cust_info;
VACUUM cust_info;
Important: If you intend to execute queries on partitioned tables with GPORCA enabled (the default), you must collect statistics on the partitioned table root partition with the ANALYZE command. For information about GPORCA, see Overview of GPORCA.
Note: You can use the Greenplum Database utility analyzedb to update table statistics. Tables can be analyzed concurrently. For append optimized tables, analyzedb updates statistics only if the statistics are not current. See the analyzedb utility.
For B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been updated many times because logically adjacent pages are usually also physically adjacent in a newly built index. Reindexing older indexes periodically can improve access speed. If all but a few index keys on a page have been deleted, there will be wasted space on the index page. A reindex will reclaim that wasted space. In Greenplum Database it is often faster to drop an index (DROP INDEX
) and then recreate it (CREATE INDEX
) than it is to use the REINDEX
command.
For table columns with indexes, some operations such as bulk updates or inserts to the table might perform more slowly because of the updates to the indexes. To enhance performance of bulk operations on tables with indexes, you can drop the indexes, perform the bulk operation, and then re-create the index.
Greenplum Database log output tends to be voluminous, especially at higher debug levels, and you do not need to save it indefinitely. Administrators rotate the log files periodically so new log files are started and old ones are removed.
Greenplum Database has log file rotation enabled on the master and all segment instances. Daily log files are created in the pg_log
subdirectory of the master and each segment data directory using the following naming convention: gpdb-<YYYY>-<MM>-<DD_hhmmss>.csv
. Although log files are rolled over daily, they are not automatically truncated or deleted. Administrators need to implement scripts or programs to periodically clean up old log files in the pg_log
directory of the master and each segment instance.
For information about viewing the database server log files, see Viewing the Database Server Log Files.
Log files for the Greenplum Database management utilities are written to ~/gpAdminLogs
by default. The naming convention for management log files is:
<script_name>_<date>.log
The log entry format is:
<timestamp>:<utility>:<host>:<user>:[INFO|WARN|FATAL]:<message>
The log file for a particular utility execution is appended to its daily log file each time that utility is run.