Natural Language Processing with VMware Greenplum Text Indexes

VMware Greenplum Text includes Apache OpenNLP components to allow you to use named entity recognition (NER). Named entities include the names of people, organizations, and locations. OpenNLP also recognizes parts of speech (POS). The OpenNLP libraries and models required for English language recognition are included with VMware Greenplum Text. For non-English language documents, you can upload to ZooKeeper any of the other models available from the OpenNLP project.

A VMware Greenplum Text index that includes NER and POS tagging must have terms enabled, using the gptext.enable_terms() function. You add a text field definition to the index's configuration, adding POS and NER filters to the analysis chain after the tokenizer. The filters use the OpenNLP models you specify to recognize entities in documents and classify parts of speech. Tokens recognized are tagged and saved as terms in the field's term vector.

NER-tagged terms have the format _ner_<entity-type>_<token>, where <entity-type> is the type of entity, for example person or location, and <token> is the text of the token, produced by the tokenizer. Terms are not case-sensitive. Examples of NER-tagged terms are _ner_person, _ner_person_Alan, and _ner_location_boston. A term like _ner_person matches any person, including a more specific term like _ner_person_alan.

The POS English language model uses part-of-speech tags from the University of Penn Treebank project. POS-tagged terms have the format _pos_<tag>, where <tag> is a Penn Treebank part-of-speech tag. Examples of POS-tagged terms are _pos_nn, _pos_vb, and _pos_rb, for nouns, verbs, and adverbs, respectively.

Enabling NER for VMware Greenplum Text Indexes

The example in this section shows how to add NER support to a VMware Greenplum Text index. The example works with a table named news_demo in the VMware Greenplum demo database.

Download the CSV data file for the table to the gpadmin home directory from this link: news_demo.csv.tgz. Extract the news_demo.csv file from the downloaded file with the following command.
```
$ tar xvf news_demo.csv.tgz
```

=# CREATE TABLE news_demo(
    id bigint, 
    articleid varchar(50),
    news_date date, 
    headline text, 
    content text) 
DISTRIBUTED BY (id);

Load data into the table from the news_demo.csv data file.

=# COPY news_demo from '/home/gpadmin/news_demo.csv' with csv header;

Create the VMware Greenplum Text index and enable terms for the content field.

=# SELECT * FROM gptext.create_index('public', 'news_demo', 'id', 'content');
=# SELECT * FROM gptext.enable_terms('demo.public.news_demo', 'content');

Edit the managed-schema file for the demo.public.news_demo index using the gptext-config utility.

$ gptext-config edit -i demo.public.news_demo -f managed-schema

Add the following text_opennlp field type definition to the list of <fieldType> elements.

<fieldType name="text_opennlp" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory"
       sentenceModel="en-sent.bin"
       tokenizerModel="en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
    <filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
      nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
    <filter class="solr.StopFilterFactory" words="stopwords-ner.txt" ignoreCase="true"/>
    <filter class="com.emc.solr.analysis.opennlp.NERAndTypeAttributeAsSynonymFilterFactory" extractType="true" typePrefix="_pos_"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" pattern="^(_ner_|_pos_).+$" />
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Find the content field and change the type attribute to "text_opennlp".

<field name="content" type="text_opennlp" indexed="true" termOffsets="true" 
       stored="false" termPositions="true" termPayloads="true" termVectors="true"/>

Index the documents and commit the index.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
=# SELECT * FROM gptext.commit_index('demo.public.news_demo');

Example Search Queries for NER-Enabled Indexes

Following are example queries that search for NER-tagged terms in the demo.public.news_demo index.

Retrieve NER person offsets

This query retrieves an array of locations for NER person terms in documents that contain NER persons.

=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_person', NULL, 'hl=true&hl.fl=content&rows=10&sort=score desc');

Following are results from this search (with some rows omitted for space).

    id     |   score    |                                                                                                   
                                                                                                                            
    hs                                                                                                                      
                                                                                                              | rf 
-----------+------------+---------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------+----
 842613544 |  0.7074248 | {"fieldOffsets":[{"field":"content","offsets":[{"start":14,"end":28},{"start":40,"end":44},{"start
":251,"end":261},{"start":726,"end":736},{"start":896,"end":909},{"start":1093,"end":1103},{"start":1118,"end":1133},{"start
":1184,"end":1187},{"start":1188,"end":1194},{"start":1253,"end":1258}]}]}                                                  
                                                                                                              | 
 842613572 |  0.7059102 | {"fieldOffsets":[{"field":"content","offsets":[{"start":61,"end":65},{"start":500,"end":512},{"sta
rt":547,"end":563},{"start":711,"end":715},{"start":883,"end":896},{"start":965,"end":969},{"start":1065,"end":1078}]}]}    
                                                                                                                            
                                                                                                              | 

(ROWS OMITTED)

 842613594 |  0.5854553 | {"fieldOffsets":[{"field":"content","offsets":[{"start":520,"end":533},{"start":559,"end":564},{"s
tart":968,"end":982},{"start":987,"end":1000},{"start":1509,"end":1512}]}]}                                                 
                                                                                                                            
                                                                                                              | 
 842614457 |  0.5810676 | {"fieldOffsets":[{"field":"content","offsets":[{"start":400,"end":423},{"start":723,"end":733},{"s
tart":812,"end":827},{"start":963,"end":970},{"start":1181,"end":1188}]}]}                                                  
                                                                                                                            
                                                                                                              | 
(40 rows)

Retrieve documents containing an NER person term

This query retrieves the content of documents in the news_demo table with terms tagged _ner_person highlighted.

=# SELECT 
     news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
     s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} _ner_person', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Following are three rows of the output from this command with the headings and empty lines omitted.

842613544 | WASHINGTON -- <em>William Taylor</em>, President <em>Bush</em>'s nominee to run the nation's deposit insurance 
system, said bank regulators could have done a better job policing the Bank of Credit & Commerce International. "I think we 
have learned a series of things," <em>Mr. Taylor</em>, currently the head of bank supervision at the Federal Reserve, told t
he Senate Banking Committee at his confirmation hearing. "You shouldn't allow someone in the country that doesn't have super
vision from a strong home-country supervisor," he said. BCCI, with operations in the Middle East, Africa, Europe and the U.S
., fraudulently hid huge losses for months from regulators around the world. No U.S. depositors lost money. Questions about 
BCCI's actions during <em>Mr. Taylor</em>'s tenure at the Federal Reserve were expected to be the only serious hurdle to his
confirmation. But committee members seemed satisfied with his remarks. Sen. <em>Donald Riegle</em> (D., Mich.), chairman of
the committee, said that he expects the committee will recommend the confirmation and that the Senate will vote within a fe
w weeks. If confirmed as expected, <em>Mr. Taylor</em> would succeed <em>William Seidman</em>, whose term expires next month
. In his testimony, <em>Mr.</em> <em>Taylor</em> said he remains troubled by lingering questions involving <em>BCCI.</em> In
the U.S., BCCI was able to evade government prohibitions from purchasing stock in First American Bankshares Inc. by using f
rontmen. "I really have difficulty in knowing how we're going to uncover (such) arrangements anywhere," he said. "I really t
hink it's difficult to determine when two people conspire to change the control of an organization."                        
|  0.7074248
842613572 | WASHINGTON -- The House, in a stunning victory for President <em>Bush</em>, agreed to cut the tax on capital ga
ins, soundly rejecting an alternative proposed by Democratic leaders. After weeks of intense lobbying by both sides, the lea
dership's plan was defeated by a larger-than-expected 239-190 vote. The convincing margin increases the likelihood that a ca
pital gains cut of some sort could become law this year. The vote was a blow to the House's newly elected Democratic leaders
hip, particularly Speaker <em>Thomas Foley</em> of Washington and Majority Leader <em>Richard Gephardt</em> of Missouri. Bot
h had put their personal prestige on the line to defeat the tax-cut measure, which represented their first major showdown wi
th the <em>Bush</em> administration. Still, fully one-quarter of their membership -- 64 Democrats -- deserted them and sided
with a near-solid phalanx of Republicans. Only one Republican, <em>Doug Bereuter</em> of Nebraska, broke ranks and voted, a
gainst the wishes of President <em>Bush</em>, for the Democratic alternative. "This was a watershed for us," glowed House Re
publican Leader <em>Robert Michel</em> of Illinois.                                                                         
|  0.7059102
842613885 | <em>John Kerry</em>, seizing the chance to define his candidacy before a national television audience with his 
presidential nomination acceptance speech, took the fight straight to the two areas where President <em>Bush</em> has enjoye
d his greatest political strengths: national security and social values. Rather than shying away from ground that has someti
mes been shaky for Democrats, Mr. <em>Kerry</em> planted his own flag in a forceful and at times combative speech. "Let ther
e be no mistake: I will never hesitate to use force when it is required," the Massachusetts senator told 4,000 cheering dele
gates on the final night of the Democratic convention in Boston. "Any attack will be met with a swift and certain response,"
he continued, attempting to meet widespread and persistent voter questions about whether a Democrat, even a war veteran, is
tough enough to lead the country in fighting terrorism. At one point, Mr. <em>Kerry</em> appeared to belittle Mr. <em>Bush<
/em>'s record as commander in chief, especially his justification for the war in Iraq. "Now I know there are those who criti
cize me for seeing complexities -- and I do -- because some issues just aren't all that simple," he said. "Saying there are 
weapons of mass destruction in Iraq doesn't make it so." It was one of several oblique shots Mr. <em>Kerry</em> took at the 
president and his advisers, even as he also called directly on President <em>Bush</em> to run a positive campaign. Confronti
ng another of his party's vulnerabilities -- a perception that Democrats are out of the cultural mainstream -- Mr. <em>Kerry
</em>'s 45- minute speech tackled President <em>Bush</em> on social issues. "It's time for those who talk about family value
s to start valuing families," he said.                                                                                      
|  0.7014734

Retrieve documents containing specified NER person terms

This search returns the content of documents that contain the persons "Alan" and "Bush", with the names highlighted.

=# SELECT 
    news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_person_Alan AND _ner_person_Bush', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Retrieve documents containing both NER organization and time terms

This search finds documents that contain both NER organization and time terms, with the terms highlighted.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '_ner_organization AND _ner_time', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Following is an example row returned by this query.

842613848 | NEW YORK--U.S. oil futures declined Tuesday as traders were reluctant to place big bets while <em>Federal Reser
ve</em> officials debated the future of the central bank's key economic stimulus program. Light, sweet crude for January del
ivery settled 26 cents, or 0.3%, lower at $97.22 a barrel on the <em>New York Mercantile Exchange</em>. Nymex prices traded 
in a narrow range for most of the session as market participants chose to wait until Wednesday <em>afternoon</em> for potent
ial clarity on the <em>Fed</em>'s easy-money policies. "It's a directionless trade," said John Kilduff, founding partner of 
<em>Again Capital LLC</em>, a New York hedge fund that focuses on energy, referring to the lack of significant price movemen
t. He added, "You can make a strong argument on both sides, and there's a lot of room for the <em>Fed</em> to surprise us ei
ther way." Many traders expect the <em>Fed</em> to begin scaling back its so-called quantitative-easing program, in which it
buys $85 billion each month in mortgage-backed securities and longer-term <em>Treasury</em> bonds, in the near future. The 
program has boosted oil prices by weakening the dollar, making crude cheaper to buy with other currencies.

Retrieve documents containing NER person or organization terms

This search returns documents containing an NER person term or an NER organization term, or both, with the terms highlighted.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER by 1), 'demo.public.news_demo', 
      '_ner_person _ner_organization', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;

Retrieve documents containing NER person and time terms (forward proximity search)

This query performs a proximity search to find documents with a person term followed by a time term within the next seven terms.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} (_ner_person 7W _ner_time)', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Retrieve documents with a specified NER person and any NER person (unordered proximity search)

Like the previous example, this query performs a proximity search, but the terms can appear in the document in either order and must be within ten terms of each other.

=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content, 
    s.score 
FROM news_demo, 
  gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo', 
      '{!gptextqp} (_ner_person_Taylor 10N _ner_person)', Null, 
      'hl=true&hl.fl=content&rows=10&sort=score desc') s 
WHERE news_demo.id = s.id::bigint 
ORDER BY s.score desc;

Adding Lemmatization to an NER Index

VMware Greenplum Text lemmatization uses Apache POS (parts of speech) labeling and an English language WordNet® Dictionary¹ to find the root (unconjugated) forms of verbs and uninflected forms of nouns, adjectives, and adverbs. The VMware Greenplum Text index stores the lemmatized terms for documents so that a search query can match documents that use the different inflections of the query terms.

To enable lemmatization, you define a field type in the managed-schema configuration file for the index that includes the VMware Greenplum Text WordNetLemmatizer filter in the type's analyzer chain, and then assign the field type to fields you want to lemmatize. The field type you define must use the Solr OpenNLP tokenizer and POS filter, and the POS filter must appear before the VMware Greenplum Text WordNetLemmatizer filter in the analysis chain.

This example defines the type text_lemm with lemmatization:

Edit the managed-schema file for the index.

$ gptext-config edit -i demo.public.news_demo -f managed-schema

Add a field type including the lemmatization filter.

<fieldType name="text_lemm" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="en-token.bin"  sentenceModel="en-sent.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
        <filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
        <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
    </analyzer>
</fieldType>

Change the field type on fields you want to index with lemmatization and save the edited managed-schema file.

 <field name="content" type="text_lemm" indexed="true" termOffsets="true" stored="false"
      termPositions="true" termPayloads="true" termVectors="true"/>

Test the output of the text_lemm analysis chain using the gptext.analyze() function. For example:

=# SELECT * FROM gptext.analyzer('demo.public.news_demo', 'index', 'text_lemm',
    'The elves spoke of geese migration and the better Vivaldi libretti.');
          class          |                                                                                                              tokens

-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------
 OpenNLPTokenizer        | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
 OpenNLPPOSFilter        | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
 WordNetLemmatizerFilter | {{"the"},{"elf"},{"elves"},{"speak"},{"of"},{"goose"},{"migration"},{"and"},{"the"},{"better"},{"good"},{"vivaldi"},{"libretto"},{"."}}
 TypeAsSynonymFilter     | {{"the","@DT"},{"elf","@NNS"},{"elves","@NNS"},{"speak","@VBD"},{"of","@IN"},{"goose","@NN"},{"migration","@NN"},{"and","@CC"},{"the","@DT"},{"better","@JJR"},
{"good","@JJR"},{"vivaldi","@NNP"},{"libretto","@NNS"},{".","@."}}
(4 rows)

The OpenNLPPOSFilter determines the part of speech for each term and the WordNetLemmatizerFilter looks up the root form of the term. Depending on the part of speech, there can be more than one root form for a term. In these cases, the filter adds an additional term to the index for each root form.

Reindex the documents.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
=# SELECT * FROM gptext.commit_index('demo.public.news_demo');

Lemmatization Example

The following example demonstrates how lemmatization can improve search results. The table lemm_test contains two documents that use the word "meeting".

Create a new table in the demo database.

=# CREATE TABLE lemm_test (id integer, content text) DISTRIBUTED BY (id);
=# INSERT INTO lemm_test 
   VALUES (1, 'They are meeting in London.'),
          (2, 'He held a meeting in London to discuss the problem.');

Create a VMware Greenplum Text index for the lemm_test table and add the documents to the index. The content column, in this case, will be configured with the default text_intl field type.

=# SELECT gptext.create_index('public', 'lemm_test', 'id', 'content');
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');

Search the index for the string "the meeting in London".

=# SELECT lt.id, q.score, lt.content 
FROM lemm_test lt,  
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test',
     E'\"the meeting in London\"', null, nulll) q
WHERE q.id::int8 = lt.id;

The search finds both documents:

 id |   score   |                         content
----+-----------+---------------------------------------------------------
  2 | 0.5753642 | He held a meeting in London to discuss the problem.
  1 | 0.5753642 | They are meeting in London.
(2 rows)

Using lemmatization during indexing and searching, the result can be more accurate. The row with ID 1 uses "meeting" as the present participle of the verb "meet". The second record uses "meeting" as a noun. The search query, "the meeting in London", also uses "meeting" as a noun.

Edit the managed-schema file for the index.

$ gptext-config edit -i demo.public.lemm_test -f managed-schema

Add the text_ner field type.

<fieldType name="text_ner" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="en-token.bin" sentenceModel="en-sent.bin"/>
      <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
      <filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
      <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
    </analyzer>
  </fieldType>

Find the field element for the content field and change the type attribute to text_ner.

<field name="__temp_field" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" stored="true" type="int" indexed="true"/>
<field name="__pk" stored="true" indexed="true" type="int"/>
<field name="content" stored="false" type="text_ner" indexed="true"/>

Save the edited managed-schema file and then reindex the lemm_test table.

=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');

Repeat the same search query.

=# SELECT lt.id, q.score, lt.content 
FROM lemm_test lt,  
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test',
     E'\"the meeting in London\"', null) q
WHERE q.id::int8 = lt.id;

This time the search finds only the second document:

 id |   score   | hs | rf
----+-----------+----+----
 2  | 3.6823308 |    |
(1 row)

Customizing NER Field Types

VMware Greenplum Text includes the following English language models.

en-ner-date.bin
en-ner-location.bin
en-ner-money.bin
en-ner-organization.bin
en-ner-percentage.bin
en-ner-person.bin
en-ner-time.bin

To specify the models you want to use for an index, edit the managed-schema file for the index and set the nerTaggerModels attribute of the OpenNLPNERFilterFactory filter element in the field type definition.

<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
  nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>

You can download models for other languages at Models for 1.5 Series. Upload the model to ZooKeeper using the gptext-config upload command and then update the nerTaggerModels attribute as shown. For example, to add the Spanish person model:

Download the es-ner-person.bin file from Models for 1.5 Series.

Upload the es-ner-person.bin file to ZooKeeper.

$ gptext-config upload -i demo.public.news_demo -l es-ner-person.bin -f es-ner-person.bin

Edit the managed-schema file for the index.

$ gptext-config edit -i demo.public.news_demo -f managed-schema

Add the es_ner_person model to the OpenNLPNERFilterFactory filter for the field. Spanish names will be recognized first, and then English names.

<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
  nerTaggerModels="es-ner-person.bin,en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>

Save the managed-schema file changes and reindex the documents.

¹ Princeton University "About WordNet." [WordNet](https://www.princeton.edu/wordnet). Princeton University. 2010.