About Database Statistics in Greenplum Database

An overview of statistics gathered by the ANALYZE command in Greenplum Database.

Statistics are metadata that describe the data stored in the database. The query optimizer needs up-to-date statistics to choose the best execution plan for a query. For example, if a query joins two tables and one of them must be broadcast to all segments, the optimizer can choose the smaller of the two tables to minimize network traffic.

The statistics used by the optimizer are calculated and saved in the system catalog by the ANALYZE command. There are three ways to initiate an analyze operation:

You can run the ANALYZE command directly.
You can run the analyzedb management utility outside of the database, at the command line.
An automatic analyze operation can be triggered when DML operations are performed on tables that have no statistics or when a DML operation modifies a number of rows greater than a specified threshold.

These methods are described in the following sections. The VACUUM ANALYZE command is another way to initiate an analyze operation, but its use is discouraged because vacuum and analyze are different operations with different purposes.

Calculating statistics consumes time and resources, so Greenplum Database produces estimates by calculating statistics on samples of large tables. In most cases, the default settings provide the information needed to generate correct execution plans for queries. If the statistics produced are not producing optimal query execution plans, the administrator can tune configuration parameters to produce more accurate statistics by increasing the sample size or the granularity of statistics saved in the system catalog. Producing more accurate statistics has CPU and storage costs and may not produce better plans, so it is important to view explain plans and test query performance to ensure that the additional statistics-related costs result in better query performance.

Parent topic: Greenplum Database Concepts

System Statistics

Table Size

The query planner seeks to minimize the disk I/O and network traffic required to run a query, using estimates of the number of rows that must be processed and the number of disk pages the query must access. The data from which these estimates are derived are the pg_class system table columns reltuples and relpages, which contain the number of rows and pages at the time a VACUUM or ANALYZE command was last run. As rows are added or deleted, the numbers become less accurate. However, an accurate count of disk pages is always available from the operating system, so as long as the ratio of reltuples to relpages does not change significantly, the optimizer can produce an estimate of the number of rows that is sufficiently accurate to choose the correct query execution plan.

When the reltuples column differs significantly from the row count returned by SELECT COUNT(*), an analyze should be performed to update the statistics.

When a REINDEX command finishes recreating an index, the relpages and reltuples columns are set to zero. The ANALYZE command should be run on the base table to update these columns.

The pg_statistic System Table and pg_stats View

The pg_statistic system table holds the results of the last ANALYZE operation on each database table. There is a row for each column of every table. It has the following columns:

starelid: The object ID of the table or index the column belongs to.
staattnum: The number of the described column, beginning with 1.
stainherit: If true, the statistics include inheritance child columns, not just the values in the specified relation.
stanullfrac: The fraction of the column's entries that are null.
stawidth: The average stored width, in bytes, of non-null entries.
stadistinct: A positive number is an estimate of the number of distinct values in the column; the number is not expected to vary with the number of rows. A negative value is the number of distinct values divided by the number of rows, that is, the ratio of rows with distinct values for the column, negated. This form is used when the number of distinct values increases with the number of rows. A unique column, for example, has an n_distinct value of -1.0. Columns with an average width greater than 1024 are considered unique.
stakind N: A code number indicating the kind of statistics stored in the Nth slot of the pg_statistic row.
staop N: An operator used to derive the statistics stored in the Nth slot. For example, a histogram slot would show the < operator that defines the sort order of the data.
stanumbers N: float4 array containing numerical statistics of the appropriate kind for the Nth slot, or NULL if the slot kind does not involve numerical values.
stavalues N: Column data values of the appropriate kind for the Nth slot, or NULL if the slot kind does not store any data values. Each array's element values are actually of the specific column's data type, so there is no way to define these columns' types more specifically than anyarray.

The statistics collected for a column vary for different data types, so the pg_statistic table stores statistics that are appropriate for the data type in four slots, consisting of four columns per slot. For example, the first slot, which normally contains the most common values for a column, consists of the columns stakind1, staop1, stanumbers1, and stavalues1.

The stakindN columns each contain a numeric code to describe the type of statistics stored in their slot. The stakind code numbers from 1 to 99 are reserved for core PostgreSQL data types. Greenplum Database uses code numbers 1, 2, 3, 4, 5, and 99. A value of 0 means the slot is unused. The following table describes the kinds of statistics stored for the three codes.

Table 1. Contents of pg_statistic "slots"
stakind Code	Description
1	Most CommonValues (MCV) Slot `staop` contains the object ID of the "=" operator, used to decide whether values are the same or not. `stavalues` contains an array of the `K` most common non-null values appearing in the column. `stanumbers` contains the frequencies (fractions of total row count) of the values in the `stavalues` array. The values are ordered in decreasing frequency. Since the arrays are variable-size, `K` can be chosen by the statistics collector. Values must occur more than once to be added to the `stavalues` array; a unique column has no MCV slot.
2	Histogram Slot – describes the distribution of scalar data. `staop` is the object ID of the "<" operator, which describes the sort ordering. `stavalues` contains `M` (where `M>=2`) non-null values that divide the non-null column data values into `M-1` bins of approximately equal population. The first `stavalues` item is the minimum value and the last is the maximum value. `stanumbers` is not used and should be `NULL`. If a Most Common Values slot is also provided, then the histogram describes the data distribution after removing the values listed in the MCV array. (It is a compressed histogram in the technical parlance). This allows a more accurate representation of the distribution of a column with some very common values. In a column with only a few distinct values, it is possible that the MCV list describes the entire data population; in this case the histogram reduces to empty and should be omitted.
3	Correlation Slot – describes the correlation between the physical order of table tuples and the ordering of data values of this column. `staop` is the object ID of the "<" operator. As with the histogram, more than one entry could theoretically appear. `stavalues` is not used and should be `NULL`. `stanumbers` contains a single entry, the correlation coefficient between the sequence of data values and the sequence of their actual tuple positions. The coefficient ranges from +1 to -1.
4	Most Common Elements Slot - is similar to a Most Common Values (MCV) Slot, except that it stores the most common non-null elements of the column values. This is useful when the column datatype is an array or some other type with identifiable elements (for instance, `tsvector`). `staop` contains the equality operator appropriate to the element type. `stavalues` contains the most common element values. `stanumbers` contains common element frequencies. Frequencies are measured as the fraction of non-null rows the element value appears in, not the frequency of all rows. Also, the values are sorted into the element type's default order (to support binary search for a particular value). Since this puts the minimum and maximum frequencies at unpredictable spots in `stanumbers`, there are two extra members of `stanumbers` that hold copies of the minimum and maximum frequencies. Optionally, there can be a third extra member that holds the frequency of null elements (the frequency is expressed in the same terms: the fraction of non-null rows that contain at least one null element). If this member is omitted, the column is presumed to contain no `NULL` elements. Note: For `tsvector` columns, the `stavalues` elements are of type `text`, even though their representation within `tsvector` is not exactly `text`.
5	Distinct Elements Count Histogram Slot - describes the distribution of the number of distinct element values present in each row of an array-type column. Only non-null rows are considered, and only non-null elements. `staop` contains the equality operator appropriate to the element type. `stavalues` is not used and should be `NULL`. `stanumbers` contains information about distinct elements. The last member of `stanumbers` is the average count of distinct element values over all non-null rows. The preceding `M` (where `M >=2`) members form a histogram that divides the population of distinct-elements counts into `M-1` bins of approximately equal population. The first of these is the minimum observed count, and the last the maximum.
99	Hyperloglog Slot - for leaf partitions of a partitioned table, stores the `hyperloglog_counter` created for the sampled data. The `hyperloglog_counter` data structure is converted into a `bytea` and stored in a `stavalues5` slot of the `pg_statistic` catalog table.

The pg_stats view presents the contents of pg_statistic in a friendlier format. The pg_stats view has the following columns:

schemaname: The name of the schema containing the table.
tablename: The name of the table.
attname: The name of the column this row describes.
inherited: If true, the statistics include inheritance child columns.
null_frac: The fraction of column entries that are null.
avg_width: The average storage width in bytes of the column's entries, calculated as avg(pg_column_size(column_name)).
n_distinct: A positive number is an estimate of the number of distinct values in the column; the number is not expected to vary with the number of rows. A negative value is the number of distinct values divided by the number of rows, that is, the ratio of rows with distinct values for the column, negated. This form is used when the number of distinct values increases with the number of rows. A unique column, for example, has an n_distinct value of -1.0. Columns with an average width greater than 1024 are considered unique.
most_common_vals: An array containing the most common values in the column, or null if no values seem to be more common. If the n_distinct column is -1, most_common_vals is null. The length of the array is the lesser of the number of actual distinct column values or the value of the default_statistics_target configuration parameter. The number of values can be overridden for a column using ALTER TABLE table SET COLUMN column SET STATISTICS N.
most_common_freqs: An array containing the frequencies of the values in the most_common_vals array. This is the number of occurrences of the value divided by the total number of rows. The array is the same length as the most_common_vals array. It is null if most_common_vals is null.
histogram_bounds: An array of values that divide the column values into groups of approximately the same size. A histogram can be defined only if there is a max() aggregate function for the column. The number of groups in the histogram is the same as the most_common_vals array size.
correlation: Greenplum Database computes correlation statistics for both heap and AO/AOCO tables, but the Postgres-based planner uses these statistics only for heap tables.
most_common_elems: An array that contains the most common element values.
most_common_elem_freqs: An array that contains common element frequencies.
elem_count_histogram: An array that describes the distribution of the number of distinct element values present in each row of an array-type column.

Newly created tables and indexes have no statistics. You can check for tables with missing statistics using the gp_stats_missing view, which is in the gp_toolkit schema:

SELECT * from gp_toolkit.gp_stats_missing;

Sampling

When calculating statistics for large tables, Greenplum Database creates a smaller table by sampling the base table. If the table is partitioned, samples are taken from all partitions.

Updating Statistics

Running ANALYZE with no arguments updates statistics for all tables in the database. This could take a very long time, so it is better to analyze tables selectively after data has changed. You can also analyze a subset of the columns in a table, for example columns used in joins, WHERE clauses, SORT clauses, GROUP BY clauses, or HAVING clauses.

Analyzing a severely bloated table can generate poor statistics if the sample contains empty pages, so it is good practice to vacuum a bloated table before analyzing it.

See the SQL Command Reference in the Greenplum Database Reference Guide for details of running the ANALYZE command.

Refer to the Greenplum Database Management Utility Reference for details of running the analyzedb command.

Analyzing Partitioned Tables

When the ANALYZE command is run on a partitioned table, it analyzes each leaf partition, one at a time. You can run ANALYZE on just new or changed partitioned tables to avoid analyzing partitions that have not changed.

The analyzedb command-line utility skips unchanged partitions automatically. It also runs concurrent sessions so it can analyze several partitions concurrently. It runs five sessions by default, but the number of sessions can be set from 1 to 10 with the -p command-line option. Each time analyzedb runs, it saves state information for append-optimized tables and partitions in the db_analyze directory in the coordinator data directory. The next time it runs, analyzedb compares the current state of each table with the saved state and skips analyzing a table or partition if it is unchanged. Heap tables are always analyzed.

If GPORCA is enabled (the default), you also need to run ANALYZE or ANALYZE ROOTPARTITION on the root partitioned table (not a leaf partition) to refresh the root partition statistics. GPORCA requires statistics at the root level for partitioned tables. The Postgres-based planner does not use these statistics.

The time to analyze a partitioned table is similar to the time to analyze a non-partitioned table with the same data. When all the leaf partitions have statistics, performing ANALYZE ROOTPARTITION to generate root partition statistics should be quick (a few seconds depending on the number of partitions and table columns). If some of the leaf partitions do not have statistics, then all the table data is sampled to generate root partition statistics. Sampling table data takes longer and results in lower quality root partition statistics.

The Greenplum Database server configuration parameter optimizer_analyze_root_partition affects when statistics are collected on the root partitioned table. If the parameter is on (the default), the ROOTPARTITION keyword is not required to collect statistics on the root partition when you run ANALYZE. Root partition statistics are collected when you run ANALYZE on the root partition, or when you run ANALYZE on a leaf partition of the partitioned table and the other leaf partitions have statistics. If the parameter is off, you must run ANALYZE ROOTPARTITION to collect root partition statistics.

If you do not intend to run queries on partitioned tables with GPORCA (setting the server configuration parameter optimizer to off), you can also set the server configuration parameter optimizer_analyze_root_partition to off to limit when ANALYZE updates the root partition statistics.

Configuring Statistics

There are several options for configuring Greenplum Database statistics collection.

Statistics Target

The statistics target is the size of the most_common_vals, most_common_freqs, and histogram_bounds arrays for an individual column. By default, the target is 25. The default target can be changed by setting a server configuration parameter and the target can be set for any column using the ALTER TABLE command. Larger values increase the time needed to do ANALYZE, but may improve the quality of the Postgres-based planner estimates.

Set the system default statistics target to a different value by setting the default_statistics_target server configuration parameter. The default value is usually sufficient, and you should only raise or lower it if your tests demonstrate that query plans improve with the new target. For example, to raise the default statistics target from 100 to 150 you can use the gpconfig utility:

gpconfig -c default_statistics_target -v 150

The statistics target for individual columns can be set with the ALTER TABLE command. For example, some queries can be improved by increasing the target for certain columns, especially columns that have irregular distributions. You can set the target to zero for columns that never contribute to query optimization. When the target is 0, ANALYZE ignores the column. For example, the following ALTER TABLE command sets the statistics target for the notes column in the emp table to zero:

ALTER TABLE emp ALTER COLUMN notes SET STATISTICS 0;

The statistics target can be set in the range 0 to 1000, or set it to -1 to revert to using the system default statistics target.

Setting the statistics target on a parent partitioned table affects the child partitions. If you set statistics to 0 on some columns on the parent table, the statistics for the same columns are set to 0 for all children partitions. However, if you later add or exchange another child partition, the new child partition will use either the default statistics target or, in the case of an exchange, the previous statistics target. Therefore, if you add or exchange child partitions, you should set the statistics targets on the new child table.

Automatic Statistics Collection

Greenplum Database can be set to automatically run ANALYZE on a table that either has no statistics or has changed significantly when certain operations are performed on the table. For partitioned tables, automatic statistics collection is only triggered when the operation is run directly on a leaf table, and then only the leaf table is analyzed.

Automatic statistics collection is governed by a server configuration parameter, and has three modes:

none deactivates automatic statistics collection.
on_no_stats triggers an analyze operation for a table with no existing statistics when any of the commands CREATE TABLE AS SELECT, INSERT, or COPY are run on the table by the table owner.
on_change triggers an analyze operation when any of the commands CREATE TABLE AS SELECT, UPDATE, DELETE, INSERT, or COPY are run on the table by the table owner, and the number of rows affected exceeds the threshold defined by the gp_autostats_on_change_threshold configuration parameter.

The automatic statistics collection mode is set separately for commands that occur within a procedural language function and commands that run outside of a function:

The gp_autostats_mode configuration parameter controls automatic statistics collection behavior outside of functions and is set to on_no_stats by default.
The gp_autostats_mode_in_functions parameter controls the behavior when table operations are performed within a procedural language function and is set to none by default.

With the on_change mode, ANALYZE is triggered only if the number of rows affected exceeds the threshold defined by the gp_autostats_on_change_threshold configuration parameter. The default value for this parameter is a very high value, 2147483647, which effectively deactivates automatic statistics collection; you must set the threshold to a lower number to enable it. The on_change mode could trigger large, unexpected analyze operations that could disrupt the system, so it is not recommended to set it globally. It could be useful in a session, for example to automatically analyze a table following a load.

Setting the gp_autostats_allow_nonowner server configuration parameter to true also instructs Greenplum Database to trigger automatic statistics collection on a table when:

gp_autostats_mode=on_change and the table is modified by a non-owner.
gp_autostats_mode=on_no_stats and the first user to INSERT or COPY into the table is a non-owner.

To deactivate automatic statistics collection outside of functions, set the gp_autostats_mode parameter to none:

gpconfigure -c gp_autostats_mode -v none

To enable automatic statistics collection in functions for tables that have no statistics, change gp_autostats_mode_in_functions to on_no_stats:

gpconfigure -c gp_autostats_mode_in_functions -v on_no_stats

Set the log_autostats system configuration parameter to on if you want to log automatic statistics collection operations.