In most traditional databases, indexes can greatly improve data access times. However, in a distributed database such as Greenplum, indexes should be used more sparingly. Greenplum Database performs very fast sequential scans; indexes use a random seek pattern to locate records on disk. Greenplum data is distributed across the segments, so each segment scans a smaller portion of the overall data to get the result. With table partitioning, the total data to scan may be even smaller. Because business intelligence (BI) query workloads generally return very large data sets, using indexes is not efficient.
First try your query workload without adding indexes. Indexes are more likely to improve performance for OLTP workloads, where the query is returning a single record or a small subset of data. Indexes can also improve performance on compressed append-optimized tables for queries that return a targeted set of rows, as the optimizer can use an index access method rather than a full table scan when appropriate. For compressed data, an index access method means only the necessary rows are uncompressed.
Greenplum Database automatically creates PRIMARY KEY
constraints for tables with primary keys. To create an index on a partitioned table, create an index on the partitioned table that you created. The index is propagated to all the child tables created by Greenplum Database. Creating an index on a table that is created by Greenplum Database for use by a partitioned table is not supported.
Note that a UNIQUE CONSTRAINT
(such as a PRIMARY KEY CONSTRAINT
) implicitly creates a UNIQUE INDEX
that must include all the columns of the distribution key and any partitioning key. The UNIQUE CONSTRAINT
is enforced across the entire table, including all table partitions (if any).
Indexes add some database overhead — they use storage space and must be maintained when the table is updated. Ensure that the query workload uses the indexes that you create, and check that the indexes you add improve query performance (as compared to a sequential scan of the table). To determine whether indexes are being used, examine the query EXPLAIN
plans. See Query Profiling.
Consider the following points when you create indexes.
WHERE
clauses are good candidates for indexes.Parent topic: Defining Database Objects
Using the CLUSTER
command to physically reorder a table based on an index can take a long time with very large tables. To achieve the same results much faster, you can manually reorder the data on disk by creating an intermediate table and loading the data in the desired order. For example:
CREATE TABLE new_table (LIKE old_table)
AS SELECT * FROM old_table ORDER BY myixcolumn;
DROP old_table;
ALTER TABLE new_table RENAME TO old_table;
CREATE INDEX myixcolumn_ix ON old_table;
VACUUM ANALYZE old_table;
Greenplum Database supports the Postgres index types B-tree, hash, GiST, SP-GiST, GIN, and BRIN. Each index type uses a different algorithm that is best suited to different types of queries. B-tree indexes fit the most common situations and are the default index type. See Index Types in the PostgreSQL documentation for a description of these types.
NoteGreenplum Database allows unique indexes only if the columns of the index key are the same as (or a superset of) the Greenplum distribution key. On a partitioned table, a unique index cannot be enforced across all child tables; a unique index is supported only within a child partition.
Greenplum Database provides the Bitmap index type. Bitmap indexes are best suited to data warehousing applications and decision support systems with large amounts of data, many ad hoc queries, and few data modification (DML) transactions.
An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of tuple IDs for each key corresponding to the rows with that key value. Bitmap indexes store a bitmap for each key value. Regular indexes can be several times larger than the data in the table, but bitmap indexes provide the same functionality as a regular index and use a fraction of the size of the indexed data.
Each bit in the bitmap corresponds to a possible tuple ID. If the bit is set, the row with the corresponding tuple ID contains the key value. A mapping function converts the bit position to a tuple ID. Bitmaps are compressed for storage. If the number of distinct key values is small, bitmap indexes are much smaller, compress better, and save considerable space compared with a regular index. The size of a bitmap index is proportional to the number of rows in the table times the number of distinct values in the indexed column.
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE
clause. Rows that satisfy some, but not all, conditions are filtered out before the table is accessed. This improves response time, often dramatically.
Bitmap indexes are best suited to data warehousing applications where users query the data rather than update it. Bitmap indexes perform best for columns that have between 100 and 100,000 distinct values and when the indexed column is often queried in conjunction with other indexed columns. Columns with fewer than 100 distinct values, such as a gender column with two distinct values (male and female), usually do not benefit much from any type of index. On a column with more than 100,000 distinct values, the performance and space efficiency of a bitmap index decline.
Bitmap indexes can improve query performance for ad hoc queries. AND
and OR
conditions in the WHERE
clause of a query can be resolved quickly by performing the corresponding Boolean operations directly on the bitmaps before converting the resulting bitmap to tuple ids. If the resulting number of rows is small, the query can be answered quickly without resorting to a full table scan.
Do not use bitmap indexes for unique columns or columns with high cardinality data, such as customer names or phone numbers. The performance gains and disk space advantages of bitmap indexes start to diminish on columns with 100,000 or more unique values, regardless of the number of rows in the table.
Bitmap indexes are not suitable for OLTP applications with large numbers of concurrent transactions modifying the data.
Use bitmap indexes sparingly. Test and compare query performance with and without an index. Add an index only if query performance improves with indexed columns.
The CREATE INDEX
command defines an index on a table. A B-tree index is the default index type. For example, to create a B-tree index on the column gender in the table employee:
CREATE INDEX gender_idx ON employee (gender);
To create a bitmap index on the column title in the table films:
CREATE INDEX title_bmp_idx ON films USING bitmap (title);
An index column need not be just a column of the underlying table, but can be a function or scalar expression computed from one or more columns of the table. This feature is useful to obtain fast access to tables based on the results of computations.
Index expressions are relatively expensive to maintain, because the derived expressions must be computed for each row upon insertion and whenever it is updated. However, the index expressions are not recomputed during an indexed search, since they are already stored in the index. In both of the following examples, the system sees the query as just WHERE indexedcolumn = 'constant'
and so the speed of the search is equivalent to any other simple index query. Thus, indexes on expressions are useful when retrieval speed is more important than insertion and update speed.
The first example is a common way to do case-insensitive comparisons with the lower
function:
SELECT * FROM test1 WHERE lower(col1) = 'value';
This query can use an index if one has been defined on the result of the lower(col1)
function:
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
This example assumes the following type of query is performed often.
SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';
The query might benefit from the following index.
CREATE INDEX people_names ON people ((first_name || ' ' || last_name));
The syntax of the CREATE INDEX
command normally requires writing parentheses around index expressions, as shown in the second example. The parentheses can be omitted when the expression is just a function call, as in the first example.
Greenplum Database indexes do not require maintenance and tuning. You can check which indexes are used by the real-life query workload. Use the EXPLAIN
command to examine index usage for a query.
The query plan shows the steps or plan nodes that the database will take to answer a query and time estimates for each plan node. To examine the use of indexes, look for the following query plan node types in your EXPLAIN
output:
You have to experiment to determine the indexes to create. Consider the following points.
ANALYZE
after you create or update an index. ANALYZE
collects table statistics. The query optimizer uses table statistics to estimate the number of rows returned by a query and to assign realistic costs to each possible query plan.enable_seqscan
) and nested-loop joins (enable_nestloop
), the most basic plans, to force the system to use a different plan. Time your query with and without indexes and use the EXPLAIN ANALYZE
command to compare the results.Use the REINDEX
command to rebuild a poorly-performing index. REINDEX
rebuilds an index using the data stored in the index's table, replacing the old copy of the index.
REINDEX my_table;
REINDEX my_index;
The DROP INDEX
command removes an index. For example:
DROP INDEX title_idx;
When loading data, it can be faster to drop all indexes, load, then recreate the indexes.
An index column need not be just a column of the underlying table, but can be a function or scalar expression computed from one or more columns of the table. This is useful to obtain fast access to a table based on the results of computations.
For example, a common way to do case-insensitive comparisons is to use the lower()
function:
SELECT * FROM test1 WHERE lower(col1) = 'value';
This query can use an index if one has been defined on the result of the lower(col1)
function:
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
If you declare this index UNIQUE
, it prevents creation of rows whose col1
values differ only in case, as well as rows whose col1
values are actually identical. So, you can use indexes on expressions to enforce constraints that are not definable as simple unique constraints.
As another example, if you often invoke queries like:
SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';
then it might be worth creating an index like this:
CREATE INDEX people_names ON people ((first_name || ' ' || last_name));
The syntax of the CREATE INDEX
command normally requires writing parentheses around index expressions, as shown in the second example. You can omit the parentheses when the expression is just a function call, as in the first example.
Index expressions are relatively expensive to maintain, because Greenplum Database must compute the derived expression(s) for each row insertion and non-HOT update. However, Greenplum does not recompute the index expressions during an indexed search, since they are already stored in the index. In both examples above, Greenplum views the query as just WHERE indexedcolumn = 'constant'
and so the speed of the search is equivalent to any other simple index query. Indexes on expressions are useful when retrieval speed is more important than insertion and update speed.
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate. There are several situations in which a partial index is particularly useful.
One common reason for using a partial index is to avoid indexing common values. Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of the index, which will speed up those queries that do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases.
Suppose you are storing web server access logs in a database. Most accesses originate from the IP address range of your organization but some are from elsewhere (say, employees on dial-up connections). If your searches by IP are primarily for outside accesses, you probably do not need to index the IP range that corresponds to your organization's subnet.
Assume a table defined as such:
CREATE TABLE access_log (
url varchar,
client_ip inet,
...
);
To create a partial index that suits the example scenario, use the following command:
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND
client_ip < inet '192.168.100.255');
A typical query that can use this index follows:
SELECT *
FROM access_log
WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
Here the query's IP address is covered by the partial index. The following query cannot use the partial index, as it uses an IP address that is excluded from the index:
SELECT *
FROM access_log
WHERE url = '/index.html' AND client_ip = inet '192.168.100.23';
Observe that this kind of partial index requires that the common values be predetermined, so such partial indexes are best used for data distributions that do not change. Such indexes can be recreated occasionally to adjust for new data distributions, adding to the maintenance effort.
Another use for a partial index is to exclude values from the index that the typical query workload is not interested in. This results in the same advantages as listed above, but it prevents the "uninteresting" values from being accessed via that index, even if an index scan might be profitable in that case. Setting up partial indexes for this kind of scenario requires a lot of care and experimentation.
If you have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, you can improve performance by creating an index on just the unbilled rows. The following command creates the index:
CREATE INDEX orders_unbilled_index ON orders (order_nr)
WHERE billed is not true;
A query to use this index follows:
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
However, you can also use the index in a query that does not involve order_nr
at all, for example:
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
This is not as efficient as a partial index on the amount
column, since Greenplum must scan the entire index. Yet, if there are relatively few unbilled orders, using this partial index just to find the unbilled orders could be a win.
Note that the following query cannot use this index:
SELECT * FROM orders WHERE order_nr = 3501;
The order 3501
might be among the billed or unbilled orders.
This example also illustrates that the indexed column and the column used in the predicate do not need to match. Greenplum Database supports partial indexes with arbitrary predicates, so long as only columns of the table being indexed are involved. Keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, you can use a partial index in a query only if Greenplum can recognize that the WHERE
condition of the query mathematically implies the predicate of the index. Greenplum cannot recognize mathematically equivalent expressions that are written in different forms. Greenplum can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE
condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index. For example a prepared query with a parameter might specify "x < ?" which will never imply "x < 2" for all possible values of the parameter.
A third possible use for partial indexes does not require the index to be used in queries at all. The idea here is to create a unique index over a subset of a table. This enforces uniqueness among the rows that satisfy the index predicate, without constraining those that do not.
Suppose that you have a table describing test outcomes. You want to ensure that there is only one "successful" entry for a given subject and target combination, but there might be any number of "unsuccessful" entries. Here is one way to satisfy those conditions:
CREATE TABLE tests (
subject text,
target text,
success boolean,
...
);
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
WHERE success;
This is a particularly efficient approach when there are few successful tests and many unsuccessful ones. It is also possible to allow only one null in a column by creating a unique partial index with an IS NULL
restriction.
Finally, a partial index can also be used to override Greenplum's query plan choices. Also, data sets with peculiar distributions might cause the system to use an index when it really should not. In that case the index can be set up so that it is not available for the offending query. Normally, Greenplum makes reasonable choices about index usage (it avoids them when retrieving common values, so the earlier example really only saves index size, it is not required to avoid index usage), and grossly incorrect plan choices are cause for a bug report.
Keep in mind that setting up a partial index indicates that you know at least as much as the query planner knows, in particular you know when an index might be profitable. Forming this knowledge requires experience and understanding of how indexes in Greenplum Database work. In most cases, the advantage of a partial index over a regular index will be minimal. There are cases where they are quite counterproductive.
Suppose you create a large set of non-overlapping partial indexes, for example:
CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1;
CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2;
CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3;
...
CREATE INDEX mytable_cat_N ON mytable (data) WHERE category = N;
This is a bad idea! Almost certainly, you are better off with a single non-partial index, declared like:
CREATE INDEX mytable_cat_data ON mytable (category, data);
While a search in this larger index might have to descend through a couple more tree levels than a search in a smaller index, that's almost certainly going to be cheaper than the planner effort needed to select the appropriate one of the partial indexes. The core of the problem is that Greenplum does not understand the relationship among the partial indexes, and tests each one to see if it is applicable to the current query.
If your table is large enough that a single index really is a bad idea, you should look into using partitioning instead. With that mechanism, Greenplum Database does understand that the tables and indexes are non-overlapping, so far better performance is possible.
NoteGreenplum Database selects index-only and covering index scan options for a query plan only for new tables that you create in Greenplum 7. Greenplum does not select these plan types for tables that you have upgraded from Greenplum 6.
All indexes in Greenplum Database are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap). In an ordinary index scan, each row retrieval requires fetching data from both the index and the heap. While the index entries that match a given indexable WHERE
condition are often close together in the index, the table rows they reference might reside anywhere in the heap. The heap-access portion of an index scan can involve a lot of random access into the heap, which can be slow, particularly on traditional rotating media. Bitmap scans try to alleviate this cost by doing the heap accesses in sorted order, but that only goes so far.
Greenplum Database supports index-only scans to address the performance issue. Index-only scans can answer queries from an index alone without any heap access. Greenplum returns values directly out of each index entry instead of consulting the associated heap entry. There are two fundamental restrictions on when Greenplum can use this method:
The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST indexes support index-only scans for some operator classes but not others. Other index types have no support. The underlying requirement is that the index must physically store, or else be able to reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value.
The query must reference only columns stored in the index. For example, given an index on columns x
and y
of a table that also has a column z
, these queries could use index-only scans:
SELECT x, y FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND y < 42;
but these queries could not:
SELECT x, z FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND z < 42;
(Expression indexes and partial indexes complicate this rule, as discussed below.)
If these two fundamental requirements are met, then all the data values required by the query are available from the index, so an index-only scan is physically possible. But there is an additional requirement for any table scan in Greenplum Database: it must verify that each retrieved row be "visible" to the query's MVCC snapshot. Because visibility information is not stored in index entries, only in heap entries, it might seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently. However, for seldom-changing data there is a way around this problem. Greenplum tracks, for each page in a table's heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table's visibility map. An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it is set, the row is known visible and so Greenplum can return the data with no further work. If it is not set, Greenplum must visit the heap entry to find out whether it is visible, so no performance advantage is gained over a standard index scan. Even in the successful case, this approach trades visibility map accesses for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it describes, far less physical I/O is required to access it. In most situations the visibility map remains cached in memory all the time.
To summarize, while an index-only scan is possible given the two fundamental requirements, it will be a win only if a significant fraction of the table's heap pages have their all-visible map bits set. Because tables in which a large fraction of the rows are unchanging are common enough, this type of scan is very useful in practice.
To make effective use of the index-only scan feature, you may choose to create a covering index, which is an index specifically designed to include the columns required by a particular type of query that you run frequently. Since queries typically need to retrieve more columns than just the ones they search on, Greenplum Database allows you to create an index in which some columns are just "payload" and are not part of the search key. You signal this by adding an INCLUDE
clause that lists the extra columns. For example, if you commonly run queries like:
SELECT y FROM tab WHERE x = 'key';
the traditional approach to speeding up such queries would be to create an index on x
only. However, an index defined as:
CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);
could handle these queries as index-only scans, because y
can be obtained from the index without visiting the heap.
Because column y
is not part of the index's search key, it does not have to be of a data type that the index can handle; it is merely stored in the index and is not interpreted by the index machinery. Also, if the index is a unique index, that is:
CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);
the uniqueness condition applies to just column x
, not to the combination of x
and y
. (An INCLUDE
clause can also be written in UNIQUE
and PRIMARY KEY
constraints, providing alternative syntax for setting up an index like this.)
Be conservative about adding non-key payload columns to an index, especially wide columns. If an index tuple exceeds the maximum size allowed for the index type, data insertion fails. Because non-key columns duplicate data from the index's table and bloat the size of the index, they can potentially slow searches. And remember that there is little point in including payload columns in an index unless the table changes slowly enough that an index-only scan is likely to not need to access the heap. If the heap tuple must be visited anyway, it costs nothing more to get the column's value from there. Other restrictions are that expressions are not currently supported as included columns, and that only B-tree and GiST indexes currently support included columns.
Before Greenplum Database supported the INCLUDE feature, covering indexes were created by including the payload columns as ordinary index columns, for example:
CREATE INDEX tab_x_y ON tab(x, y);
even though they had no intention of ever using y
as part of a WHERE
clause. This works fine as long as the extra columns are trailing columns. However, this method doesn't support the case where you want the index to enforce uniqueness on the key column(s).
Suffix truncation always removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans. The truncation process also removes one or more trailing key column(s) when the remaining prefix of key column(s) happens to be sufficient to describe tuples on the lowest B-Tree level. In practice, covering indexes without an INCLUDE
clause often avoid storing columns that are effectively payload in the upper levels. However, explicitly defining payload columns as non-key columns reliably keeps the tuples in upper levels small.
In principle, index-only scans can be used with expression indexes. For example, given an index on f(x)
where x
is a table column, it should be possible to execute
SELECT f(x) FROM tab WHERE f(x) < 1;
as an index-only scan; and this is very attractive if f()
is an expensive-to-compute function. However, Greenplum Database's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns required by the query are available from the index. In this example, x
is not needed except in the context f(x)
, but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, you can work around this by adding x
as an included column, for example:
CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);
An additional caveat, if the goal is to avoid recalculating f(x)
, is that the planner won't necessarily match uses of f(x)
that aren't in indexable WHERE
clauses to the index column. It will usually get this right in simple queries such as shown above, but not in queries that involve joins.
Partial indexes also have interesting interactions with index-only scans. Consider the following partial index:
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
WHERE success;
In principle, Greenplum could perform an index-only scan on this index to satisfy a query like:
SELECT target FROM tests WHERE subject = 'some-subject' AND success;
But there is a problem: the WHERE
clause refers to success which is not available as a result column of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck that part of the WHERE
clause at run time: all entries found in the index necessarily have success = true
so this need not be explicitly checked in the plan. Greenplum version 7 recognizes such cases and generates index-only scans, but older versions do not.