Spill Files

Greenplum Command Center reports the total size for all spill files created for a query.

VMware Greenplum creates spill files, also called workfiles, to save intermediate results when there is insufficient memory to execute a query in memory. Disk I/O is much slower than memory access, so a query that creates spill files will take longer to complete.

Investigating Spill File Usage

The gp_toolkit schema contains views you can use to see details about spill file usage for current queries. You can see the number and sizes of spill files created for each operator in a query execution plan, and totals by query and segment. This is useful information to detect data skew and to help tune queries.

See the gp_toolkit Administrative Schema reference for descriptions of these views.

Eliminating or Reducing Spill Files

You can work to eliminate spill files by increasing the amount of memory available to the query or by optimizing the query to use the memory available more efficiently.

You may be able to revise the query to prevent spilling by eliminating or postponing memory-intensive operators.

Following are some ways to increase memory available to queries when resource group resource management is enabled in VMware Greenplum.

Decrease the resource group’s concurrency so that each query’s share of memory increases.
Increase the resource group’s MEMORY_SHARED_QUOTA parameter to increase the amount of resource group shared memory.
Decrease the percentage of memory allocated to all resource groups to increase the amount of global shared memory.

When resource queue resource management is active, VMware Greenplum can detect and terminate “runaway” queries that consume a high percentage of available memory. You can prevent runaway queries by limiting the number of spill files created or the total size of spill files created. See the gp_workfile_limit* configuration parameters below for more information.

If you cannot prevent queries from spilling, it is important to ensure that the number of spill files created is minimized and that problems such as CPU or data skew are found and corrected. Skew can create excessive numbers of spill files on one or more segments.

To minimize disk usage and I/O when spill files are created, make sure the gp_workfile_compresson configuration parameter is set toon. When enabled, VMware Greenplum uses Zstandard compression for spill files.

Limiting Spill Files with Server Configuration Parameters

VMware Greenplum by default limits the number of spill files allowed per query for each segment to 100,000. You can raise or lower this limit, and you can also limit the number of spill files for all queries on a segment, and limit the disk space consumed by spill files per query and per segment. Use the following VMware Greenplum server configuration parameters to manage spill files.

gp_workfile_limit_files_per_query

Sets the maximum number of spill files allowed per query per segment. Default is 100,000.

gp_workfile_limit_per_query

Sets the maximum disk size an individual query is allowed to use for spill files at each segment. The default value is 0, which means no limit is enforced.

gp_workfile_limit_per_segment

Sets the maximum total disk size that all running queries are allowed to use for creating spill files at each segment. The default value is 0, which means a limit is not enforced.

gp_workfile_compress_algorithm

Specifies the compression algorithm to use for spill files when a hash aggregation or hash join operation spills to disk during query processing. The default is 'none'. Set to 'zlib' to enable compression. Using compression reduces the number of I/O operations at the expense of increased CPU.