The gp_sparse_vector
module implements a Greenplum Database data type and associated functions that use compressed storage of zeros to make vector computations on floating point numbers faster.
The gp_sparse_vector
module is a Greenplum Database extension.
The gp_sparse_vector
module is installed when you install Greenplum Database. Before you can use any of the functions defined in the module, you must register the gp_sparse_vector
extension in each database where you want to use the functions. Refer to Installing Additional Supplied Modules for more information.
When you use arrays of floating point numbers for various calculations, you will often have long runs of zeros. This is common in scientific, retail optimization, and text processing applications. Each floating point number takes 8 bytes of storage in memory and/or disk. Saving those zeros is often impractical. There are also many computations that benefit from skipping over the zeros.
For example, suppose the following array of double
s is stored as a float8[]
in Greenplum Database:
'{0, 33, <40,000 zeros>, 12, 22 }'::float8[]
This type of array arises often in text processing, where a dictionary may have 40-100K terms and the number of words in a particular document is stored in a vector. This array would occupy slightly more than 320KB of memory/disk, most of it zeros. Any operation that you perform on this data works on 40,001 fields that are not important.
The Greenplum Database built-in array
datatype utilizes a bitmap for null values, but it is a poor choice for this use case because it is not optimized for float8[]
or for long runs of zeros instead of nulls, and the bitmap is not run-length-encoding- (RLE) compressed. Even if each zero were stored as a NULL
in the array, the bitmap for nulls would use 5KB to mark the nulls, which is not nearly as efficient as it could be.
The Greenplum Database gp_sparse_vector
module defines a data type and a simple RLE-based scheme that is biased toward being efficient for zero value bitmaps. This scheme uses only 6 bytes for bitmap storage.
NoteThe sparse vector data type defined by the
gp_sparse_vector
module is namedsvec
.svec
supports onlyfloat8
vector values.
You can construct an svec
directly from a float array as follows:
SELECT ('{0, 13, 37, 53, 0, 71 }'::float8[])::svec;
The gp_sparse_vector
module supports the vector operators <
, >
, *
, **
, /
, =
, +
, sum()
, vec_count_nonzero()
, and so on. These operators take advantage of the efficient sparse storage format, making computations on svec
s faster.
The plus (+
) operator adds each of the terms of two vectors of the same dimension together. For example, if vector a = {0,1,5}
and vector b = {4,3,2}
, you would compute the vector addition as follows:
SELECT ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
float8
---------
{4,4,7}
A vector dot product (%*%
) between vectors a
and b
returns a scalar result of type float8
. Compute the dot product ((0*4+1*3+5*2)=13
) as follows:
SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
?column?
----------
13
Special vector aggregate functions are also useful. sum()
is self explanatory. vec_count_nonzero()
evaluates the count of non-zero terms found in a set of svec
and returns an svec
with the counts. For instance, for the set of vectors {0,1,5},{10,0,3},{0,0,3},{0,1,0}
, the count of non-zero terms would be {1,2,3}
. Use vec_count_nonzero()
to compute the count of these vectors:
CREATE TABLE listvecs( a svec );
INSERT INTO listvecs VALUES ('{0,1,5}'::float8[]),
('{10,0,3}'::float8[]),
('{0,0,3}'::float8[]),
('{0,1,0}'::float8[]);
SELECT vec_count_nonzero( a )::float8[] FROM listvecs;
count_vec
-----------
{1,2,3}
(1 row)
A text classification example that describes a dictionary and some documents follows. You will create Greenplum Database tables representing a dictionary and some documents. You then perform document classification using vector arithmetic on word counts and proportions of dictionary words in each document.
Suppose that you have a dictionary composed of words in a text array. Create a table to store the dictionary data and insert some data (words) into the table. For example:
CREATE TABLE features (dictionary text[][]) DISTRIBUTED RANDOMLY;
INSERT INTO features
VALUES ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
'one,really,second,the,third,this,until}');
You have a set of documents, also defined as an array of words. Create a table to represent the documents and insert some data into the table:
CREATE TABLE documents(docnum int, document text[]) DISTRIBUTED RANDOMLY;
INSERT INTO documents VALUES
(1,'{this,is,one,document,in,the,corpus}'),
(2,'{i,am,the,second,document,in,the,corpus}'),
(3,'{being,third,never,really,bothered,me,until,now}'),
(4,'{the,document,before,me,is,the,third,document}');
Using the dictionary and document tables, find the dictionary words that are present in each document. To do this, you first prepare a Sparse Feature Vector, or SFV, for each document. An SFV is a vector of dimension N, where N is the number of dictionary words, and each SFV contains a count of each dictionary word in the document.
You can use the gp_extract_feature_histogram()
function to create an SFV from a document. gp_extract_feature_histogram()
outputs an svec
for each document that contains the count of each of the dictionary words in the ordinal positions of the dictionary.
SELECT gp_extract_feature_histogram(
(SELECT dictionary FROM features LIMIT 1), document)::float8[], document
FROM documents ORDER BY docnum;
gp_extract_feature_histogram | document
-----------------------------------------+--------------------------------------------------
{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
SELECT * FROM features;
dictionary
--------------------------------------------------------------------------------------------------------
{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}
The SFV of the second document, "i am the second document in the corpus", is {1,3*0,1,1,1,1,6*0,1,2}
. The word "am" is the first ordinate in the dictionary, and there is 1
instance of it in the SFV. The word "before" has no instances in the document, so its value is 0
; and so on.
gp_extract_feature_histogram()
is very speed optimized - it is a single routine version of a hash join that processes large numbers of documents into their SFVs in parallel at the highest possible speeds.
For the next part of the processing, generate a sparse vector of the dictionary dimension (19). The vectors that you generate for each document are referred to as the corpus.
CREATE table corpus (docnum int, feature_vector svec) DISTRIBUTED RANDOMLY;
INSERT INTO corpus
(SELECT docnum,
gp_extract_feature_histogram(
(select dictionary FROM features LIMIT 1), document) from documents);
Count the number of times each feature occurs at least once in all documents:
SELECT (vec_count_nonzero(feature_vector))::float8[] AS count_in_document FROM corpus;
count_in_document
-----------------------------------------
{1,1,1,1,2,3,1,2,2,2,1,1,1,1,1,3,2,1,1}
Count all occurrences of each term in all documents:
SELECT (sum(feature_vector))::float8[] AS sum_in_document FROM corpus;
sum_in_document
-----------------------------------------
{1,1,1,1,2,4,1,2,2,2,1,1,1,1,1,5,2,1,1}
The remainder of the classification process is vector math. The count is turned into a weight that reflects Term Frequency / Inverse Document Frequency (tf/idf). The calculation for a given term in a given document is:
#_times_term_appears_in_this_doc * log( #_docs / #_docs_the_term_appears_in )
#_docs
is the total number of documents (4 in this case). Note that there is one divisor for each dictionary word and its value is the number of times that word appears in the document.
For example, the term "document" in document 1 would have a weight of 1 * log( 4/3 )
. In document 4, the term would have a weight of 2 * log( 4/3 )
. Terms that appear in every document would have weight 0
.
This single vector for the whole corpus is then scalar product multiplied by each document SFV to produce the tf/idf.
Calculate the tf/idf:
SELECT docnum, (feature_vector*logidf)::float8[] AS tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector)) AS logidf FROM corpus)
AS foo, corpus ORDER BY docnum;
docnum | tf_idf
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
1 | {0,0,0,0,0.693147180559945,0.287682072451781,0,0.693147180559945,0.693147180559945,0,0,0,1.38629436111989,0,0,0.287682072451781,0,1.38629436111989,0}
2 | {1.38629436111989,0,0,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,0,0,0,0,0,1.38629436111989,0.575364144903562,0,0,0}
3 | {0,0,1.38629436111989,1.38629436111989,0,0,0,0,0,0.693147180559945,1.38629436111989,1.38629436111989,0,1.38629436111989,0,0,0.693147180559945,0,1.38629436111989
}
4 | {0,1.38629436111989,0,0,0,0.575364144903562,0,0,0.693147180559945,0.693147180559945,0,0,0,0,0,0.575364144903562,0.693147180559945,0,0}
You can determine the angular distance between one document and the rest of the documents using the ACOS of the dot product of the document vectors:
CREATE TABLE weights AS
(SELECT docnum, (feature_vector*logidf) tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector))
AS logidf FROM corpus) foo, corpus ORDER BY docnum)
DISTRIBUTED RANDOMLY;
Calculate the angular distance between the first document and every other document:
SELECT docnum, trunc((180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(l2norm(tf_idf)*l2norm(testdoc))))/(4.*ATAN(1.))))::numeric,2)
AS angular_distance FROM weights,
(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;
docnum | angular_distance
--------+------------------
1 | 0.00
2 | 78.82
3 | 90.00
4 | 80.02
You can see that the angular distance between document 1 and itself is 0 degrees, and between document 1 and 3 is 90 degrees because they share no features at all.