The most important prerequisite for good query performance is to begin with accurate statistics for the tables. Updating statistics with the
ANALYZE statement enables the query planner to generate optimal query plans. When a table is analyzed, information about the data is stored in the system catalog tables. If the stored information is out of date, the planner can generate inefficient plans.
Running ANALYZE with no arguments updates statistics for all tables in the database. This can be a very long-running process and it is not recommended. You should
ANALYZE tables selectively when data has changed or use the analyzedb utility.
ANALYZE on a large table can take a long time. If it is not feasible to run
ANALYZE on all columns of a very large table, you can generate statistics for selected columns only using
ANALYZE table(column, ...). Be sure to include columns used in joins,
GROUP BY clauses, or
For a partitioned table, you can run
ANALYZE on just partitions that have changed, for example, if you add a new partition. Note that for partitioned tables, you can run
ANALYZE on the parent (main) table, or on the leaf nodes—the partition files where data and statistics are actually stored. The intermediate files for sub-partitioned tables store no data or statistics, so running
ANALYZE on them does not work. You can find the names of the partition tables in the
pg_partitions system catalog:
SELECT partitiontablename from pg_partitions WHERE tablename='parent_table';
There is a trade-off between the amount of time it takes to generate statistics and the quality, or accuracy, of the statistics.
To allow large tables to be analyzed in a reasonable amount of time,
ANALYZE takes a random sample of the table contents, rather than examining every row. To increase the number of sample values for all table columns adjust the
default_statistics_target configuration parameter. The target value ranges from 1 to 1000; the default target value is 100. The
default_statistics_target variable applies to all columns by default, and specifies the number of values that are stored in the list of common values. A larger target may improve the quality of the query planner’s estimates, especially for columns with irregular data patterns.
default_statistics_target can be set at the coordinator/session level and requires a reload.
DELETEoperations that significantly change the underlying data.
ANALYZE requires only a read lock on the table, so it may be run in parallel with other database activity, but do not run
ANALYZE while performing loads,
CREATE INDEX operations.
gp_autostats_mode configuration parameter, together with the
gp_autostats_on_change_threshold parameter, determines when an automatic analyze operation is triggered. When automatic statistics collection is triggered, the planner adds an
ANALYZE step to the query.
on_no_stats, which triggers statistics collection for
CREATE TABLE AS SELECT,
COPY operations invoked by the table owner on any table that has no existing statistics.
on_change triggers statistics collection only when the number of rows affected exceeds the threshold defined by
gp_autostats_on_change_threshold, which has a default value of 2147483647. The following operations invoked on a table by its owner can trigger automatic statistics collection with
CREATE TABLE AS SELECT,
gp_autostats_allow_nonowner server configuration parameter to
true also instructs Greenplum Database to trigger automatic statistics collection on a table when:
gp_autostats_mode=on_changeand the table is modified by a non-owner.
gp_autostats_mode=on_no_statsand the first user to
COPYinto the table is a non-owner.
none deactivates automatics statistics collection.
For partitioned tables, automatic statistics collection is not triggered if data is inserted from the top-level parent table of a partitioned table. But automatic statistics collection is triggered if data is inserted directly in a leaf table (where the data is stored) of the partitioned table.
Parent topic: System Monitoring and Maintenance