Tanzu Greenplum text includes Apache OpenNLP components to allow you to use named entity recognition (NER). Named entities include the names of people, organizations, and locations. OpenNLP also recognizes parts of speech (POS). The OpenNLP libraries and models required for English language recognition are included with Tanzu Greenplum text. For non-English language documents, you can upload to ZooKeeper any of the other models available from the OpenNLP project.
A Tanzu Greenplum text index that includes NER and POS tagging must have terms enabled, using the gptext.enable_terms() function. You add a text field definition to the index's configuration, adding POS and NER filters to the analysis chain after the tokenizer. The filters use the OpenNLP models you specify to recognize entities in documents and classify parts of speech. Recognized tokens are tagged and saved as terms in the field's term vector.
NER-tagged terms have the format _ner_<entity-type>_<token>, where <entity-type> is the type of entity, for example person or location, and <token> is the text of the token, produced by the tokenizer. Terms are not case-sensitive. Examples of NER-tagged terms are _ner_person, _ner_person_Alan, and _ner_location_boston. A term like _ner_person matches any person, including a more specific term like _ner_person_alan.
The POS English language model uses part-of-speech tags from the Penn Treebank project at the University of Pennsylvania. POS-tagged terms have the format _pos_<tag>, where <tag> is a Penn Treebank part-of-speech tag. Examples of POS-tagged terms are _pos_nn, _pos_vb, and _pos_rb, for nouns, verbs, and adverbs, respectively.
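The tag formats above can be sketched in a few lines. The following Python is an illustrative toy, not part of Tanzu Greenplum text (the helper names are invented); it shows how tagged terms are built and how a bare entity term matches more specific ones.

```python
# Toy sketch of the tagged-term formats (helper names are hypothetical).
def ner_term(entity_type, token=None):
    """Build an NER term such as _ner_person or _ner_person_alan.
    Terms are stored lowercase, so matching is case-insensitive."""
    term = f"_ner_{entity_type.lower()}"
    if token is not None:
        term += f"_{token.lower()}"
    return term

def pos_term(tag):
    """Build a POS term such as _pos_nn from a Penn Treebank tag."""
    return f"_pos_{tag.lower()}"

def ner_matches(query_term, indexed_term):
    """A bare entity term like _ner_person matches any more specific
    term that extends it, such as _ner_person_alan."""
    return indexed_term == query_term or indexed_term.startswith(query_term + "_")

print(ner_term("person", "Alan"))                      # _ner_person_alan
print(pos_term("NN"))                                  # _pos_nn
print(ner_matches("_ner_person", "_ner_person_alan"))  # True
```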
The example in this section shows how to add NER support to a Tanzu Greenplum text index. The example works with a table named news_demo in the demo database.
Download the CSV data file for the table to the gpadmin home directory from this link: news_demo.csv.tgz. Extract the news_demo.csv file from the downloaded archive with the following command.
$ tar xvf news_demo.csv.tgz
Log in to the demo database with psql, then create and load the news_demo table.
=# CREATE TABLE news_demo(
id bigint,
articleid varchar(50),
news_date date,
headline text,
content text)
DISTRIBUTED BY (id);
Load data into the table from the news_demo.csv data file.
=# COPY news_demo from '/home/gpadmin/news_demo.csv' with csv header;
Create the Tanzu Greenplum text index and enable terms for the content field.
=# SELECT * FROM gptext.create_index('public', 'news_demo', 'id', 'content');
=# SELECT * FROM gptext.enable_terms('demo.public.news_demo', 'content');
Edit the managed-schema file for the demo.public.news_demo index using the gptext-config utility.
$ gptext-config edit -i demo.public.news_demo -f managed-schema
Add the following text_opennlp field type definition to the list of <fieldType> elements.
<fieldType name="text_opennlp" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-token.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
<filter class="solr.StopFilterFactory" words="stopwords-ner.txt" ignoreCase="true"/>
<filter class="com.emc.solr.analysis.opennlp.NERAndTypeAttributeAsSynonymFilterFactory" extractType="true" typePrefix="_pos_"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" pattern="^(_ner_|_pos_).+$" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
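As a rough illustration of what the index-time chain above produces, the following Python toy simulates the stage order: stop-word removal, emission of _ner_ and _pos_ tag terms alongside each token, and lowercasing (stemming omitted). This is an assumption about the filters' combined effect, not GPText or Solr code, and the real filters operate on token attributes rather than tuples.

```python
# Toy simulation (an assumption, not GPText code) of the index-time
# analysis chain: drop stop words, emit NER and POS tag terms for each
# surviving token, then lowercase the token itself.
STOPWORDS = {"the", "a", "of"}

def analyze(tokens_with_tags):
    """tokens_with_tags: list of (token, penn_pos_tag, entity_type_or_None)."""
    out = []
    for token, pos, entity in tokens_with_tags:
        if token.lower() in STOPWORDS:        # StopFilter
            continue
        if entity:                            # NER filter + synonym emission
            out.append(f"_ner_{entity}")
            out.append(f"_ner_{entity}_{token.lower()}")
        out.append(f"_pos_{pos.lower()}")     # POS tag emitted as a term
        out.append(token.lower())             # LowerCaseFilter
    return out

print(analyze([("Alan", "NNP", "person"), ("spoke", "VBD", None)]))
# ['_ner_person', '_ner_person_alan', '_pos_nnp', 'alan', '_pos_vbd', 'spoke']
```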
Find the content field and change the type attribute to "text_opennlp".
<field name="content" type="text_opennlp" indexed="true" termOffsets="true"
stored="false" termPositions="true" termPayloads="true" termVectors="true"/>
Index the documents and commit the index.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
=# SELECT * FROM gptext.commit_index('demo.public.news_demo');
Following are example queries that search for NER-tagged terms in the demo.public.news_demo index.
This query retrieves an array of match offsets for NER person terms in documents that contain NER persons.
=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'_ner_person', NULL, 'hl=true&hl.fl=content&rows=10&sort=score desc');
Following are results from this search (with some rows omitted for space).
id | score |
hs
| rf
-----------+------------+---------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------+----
842613544 | 0.7074248 | {"fieldOffsets":[{"field":"content","offsets":[{"start":14,"end":28},{"start":40,"end":44},{"start
":251,"end":261},{"start":726,"end":736},{"start":896,"end":909},{"start":1093,"end":1103},{"start":1118,"end":1133},{"start
":1184,"end":1187},{"start":1188,"end":1194},{"start":1253,"end":1258}]}]}
|
842613572 | 0.7059102 | {"fieldOffsets":[{"field":"content","offsets":[{"start":61,"end":65},{"start":500,"end":512},{"sta
rt":547,"end":563},{"start":711,"end":715},{"start":883,"end":896},{"start":965,"end":969},{"start":1065,"end":1078}]}]}
|
(ROWS OMITTED)
842613594 | 0.5854553 | {"fieldOffsets":[{"field":"content","offsets":[{"start":520,"end":533},{"start":559,"end":564},{"s
tart":968,"end":982},{"start":987,"end":1000},{"start":1509,"end":1512}]}]}
|
842614457 | 0.5810676 | {"fieldOffsets":[{"field":"content","offsets":[{"start":400,"end":423},{"start":723,"end":733},{"s
tart":812,"end":827},{"start":963,"end":970},{"start":1181,"end":1188}]}]}
|
(40 rows)
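The hs column holds a JSON fieldOffsets structure giving the start and end character offsets of each matched term in the document. A minimal Python sketch of how such offsets can be applied to the original text, similar in spirit to what gptext.highlight() does (the function below is hypothetical, not the GPText implementation):

```python
import json

# Hypothetical sketch: wrap each matched span from an hs-style
# fieldOffsets JSON structure in <em> tags.
def highlight(content, hs_json, field="content", tag="em"):
    offsets = []
    for f in json.loads(hs_json)["fieldOffsets"]:
        if f["field"] == field:
            offsets = f["offsets"]
    out, pos = [], 0
    for o in sorted(offsets, key=lambda o: o["start"]):
        out.append(content[pos:o["start"]])                       # unmatched text
        out.append(f"<{tag}>{content[o['start']:o['end']]}</{tag}>")  # matched span
        pos = o["end"]
    out.append(content[pos:])
    return "".join(out)

hs = '{"fieldOffsets":[{"field":"content","offsets":[{"start":0,"end":4}]}]}'
print(highlight("Alan spoke.", hs))  # <em>Alan</em> spoke.
```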
This query retrieves the content of documents in the news_demo table with terms tagged _ner_person highlighted.
=# SELECT
news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'{!gptextqp} _ner_person', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
Following are three rows of the output from this command with the headings and empty lines omitted.
842613544 | WASHINGTON -- <em>William Taylor</em>, President <em>Bush</em>'s nominee to run the nation's deposit insurance
system, said bank regulators could have done a better job policing the Bank of Credit & Commerce International. "I think we
have learned a series of things," <em>Mr. Taylor</em>, currently the head of bank supervision at the Federal Reserve, told t
he Senate Banking Committee at his confirmation hearing. "You shouldn't allow someone in the country that doesn't have super
vision from a strong home-country supervisor," he said. BCCI, with operations in the Middle East, Africa, Europe and the U.S
., fraudulently hid huge losses for months from regulators around the world. No U.S. depositors lost money. Questions about
BCCI's actions during <em>Mr. Taylor</em>'s tenure at the Federal Reserve were expected to be the only serious hurdle to his
confirmation. But committee members seemed satisfied with his remarks. Sen. <em>Donald Riegle</em> (D., Mich.), chairman of
the committee, said that he expects the committee will recommend the confirmation and that the Senate will vote within a fe
w weeks. If confirmed as expected, <em>Mr. Taylor</em> would succeed <em>William Seidman</em>, whose term expires next month
. In his testimony, <em>Mr.</em> <em>Taylor</em> said he remains troubled by lingering questions involving <em>BCCI.</em> In
the U.S., BCCI was able to evade government prohibitions from purchasing stock in First American Bankshares Inc. by using f
rontmen. "I really have difficulty in knowing how we're going to uncover (such) arrangements anywhere," he said. "I really t
hink it's difficult to determine when two people conspire to change the control of an organization."
| 0.7074248
842613572 | WASHINGTON -- The House, in a stunning victory for President <em>Bush</em>, agreed to cut the tax on capital ga
ins, soundly rejecting an alternative proposed by Democratic leaders. After weeks of intense lobbying by both sides, the lea
dership's plan was defeated by a larger-than-expected 239-190 vote. The convincing margin increases the likelihood that a ca
pital gains cut of some sort could become law this year. The vote was a blow to the House's newly elected Democratic leaders
hip, particularly Speaker <em>Thomas Foley</em> of Washington and Majority Leader <em>Richard Gephardt</em> of Missouri. Bot
h had put their personal prestige on the line to defeat the tax-cut measure, which represented their first major showdown wi
th the <em>Bush</em> administration. Still, fully one-quarter of their membership -- 64 Democrats -- deserted them and sided
with a near-solid phalanx of Republicans. Only one Republican, <em>Doug Bereuter</em> of Nebraska, broke ranks and voted, a
gainst the wishes of President <em>Bush</em>, for the Democratic alternative. "This was a watershed for us," glowed House Re
publican Leader <em>Robert Michel</em> of Illinois.
| 0.7059102
842613885 | <em>John Kerry</em>, seizing the chance to define his candidacy before a national television audience with his
presidential nomination acceptance speech, took the fight straight to the two areas where President <em>Bush</em> has enjoye
d his greatest political strengths: national security and social values. Rather than shying away from ground that has someti
mes been shaky for Democrats, Mr. <em>Kerry</em> planted his own flag in a forceful and at times combative speech. "Let ther
e be no mistake: I will never hesitate to use force when it is required," the Massachusetts senator told 4,000 cheering dele
gates on the final night of the Democratic convention in Boston. "Any attack will be met with a swift and certain response,"
he continued, attempting to meet widespread and persistent voter questions about whether a Democrat, even a war veteran, is
tough enough to lead the country in fighting terrorism. At one point, Mr. <em>Kerry</em> appeared to belittle Mr. <em>Bush<
/em>'s record as commander in chief, especially his justification for the war in Iraq. "Now I know there are those who criti
cize me for seeing complexities -- and I do -- because some issues just aren't all that simple," he said. "Saying there are
weapons of mass destruction in Iraq doesn't make it so." It was one of several oblique shots Mr. <em>Kerry</em> took at the
president and his advisers, even as he also called directly on President <em>Bush</em> to run a positive campaign. Confronti
ng another of his party's vulnerabilities -- a perception that Democrats are out of the cultural mainstream -- Mr. <em>Kerry
</em>'s 45- minute speech tackled President <em>Bush</em> on social issues. "It's time for those who talk about family value
s to start valuing families," he said.
| 0.7014734
This search returns the content of documents that contain the persons "Alan" and "Bush", with the names highlighted.
=# SELECT
news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'_ner_person_Alan AND _ner_person_Bush', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
This search finds documents that contain both NER organization and time terms, with the terms highlighted.
=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'_ner_organization AND _ner_time', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
Following is an example row returned by this query.
842613848 | NEW YORK--U.S. oil futures declined Tuesday as traders were reluctant to place big bets while <em>Federal Reser
ve</em> officials debated the future of the central bank's key economic stimulus program. Light, sweet crude for January del
ivery settled 26 cents, or 0.3%, lower at $97.22 a barrel on the <em>New York Mercantile Exchange</em>. Nymex prices traded
in a narrow range for most of the session as market participants chose to wait until Wednesday <em>afternoon</em> for potent
ial clarity on the <em>Fed</em>'s easy-money policies. "It's a directionless trade," said John Kilduff, founding partner of
<em>Again Capital LLC</em>, a New York hedge fund that focuses on energy, referring to the lack of significant price movemen
t. He added, "You can make a strong argument on both sides, and there's a lot of room for the <em>Fed</em> to surprise us ei
ther way." Many traders expect the <em>Fed</em> to begin scaling back its so-called quantitative-easing program, in which it
buys $85 billion each month in mortgage-backed securities and longer-term <em>Treasury</em> bonds, in the near future. The
program has boosted oil prices by weakening the dollar, making crude cheaper to buy with other currencies.
This search returns documents containing an NER person term or an NER organization term, or both, with the terms highlighted.
=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER by 1), 'demo.public.news_demo',
'_ner_person _ner_organization', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
This query performs a proximity search to find documents with a person term followed by a time term within the next seven terms.
=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'{!gptextqp} (_ner_person 7W _ner_time)', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
Like the previous example, this query performs a proximity search, but the terms can appear in the document in either order and must be within ten terms of each other.
=# SELECT news_demo.id, gptext.highlight(news_demo.content, 'content', hs) AS content,
s.score
FROM news_demo,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.news_demo',
'{!gptextqp} (_ner_person_Taylor 10N _ner_person)', Null,
'hl=true&hl.fl=content&rows=10&sort=score desc') s
WHERE news_demo.id = s.id::bigint
ORDER BY s.score desc;
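The 7W and 10N operators can be read as ordered and unordered proximity constraints on term positions. The following Python sketch states that reading explicitly; it is my interpretation of the operator semantics, not the GPText query parser.

```python
# Toy check of proximity semantics (an assumption, not GPText code):
# "a 7W b"  -> some b occurs AFTER some a, within 7 positions.
# "a 10N b" -> some a and b occur within 10 positions, either order.
def within_ordered(pos_a, pos_b, dist):
    return any(0 < pb - pa <= dist for pa in pos_a for pb in pos_b)

def within_unordered(pos_a, pos_b, dist):
    return any(pa != pb and abs(pb - pa) <= dist
               for pa in pos_a for pb in pos_b)

# Term positions in a hypothetical document.
person_positions = [3, 40]
time_positions = [8, 25]
print(within_ordered(person_positions, time_positions, 7))     # True (3 -> 8)
print(within_unordered(person_positions, time_positions, 10))  # True
```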
Tanzu Greenplum text lemmatization uses Apache OpenNLP part-of-speech (POS) tagging and an English language WordNet® dictionary to find the root (unconjugated) forms of verbs and the uninflected forms of nouns, adjectives, and adverbs. The Tanzu Greenplum text index stores the lemmatized terms for documents so that a search query can match documents that use different inflections of the query terms.
To enable lemmatization, you define a field type in the managed-schema configuration file for the index that includes the Tanzu Greenplum text WordNetLemmatizer filter in the type's analyzer chain, and then assign the field type to the fields you want to lemmatize. The field type you define must use the Solr OpenNLP tokenizer and POS filter, and the POS filter must appear before the Tanzu Greenplum text WordNetLemmatizer filter in the analysis chain.
This example defines the type text_lemm with lemmatization:
Edit the managed-schema file for the index.
$ gptext-config edit -i demo.public.news_demo -f managed-schema
Add a field type including the lemmatization filter.
<fieldType name="text_lemm" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="en-token.bin" sentenceModel="en-sent.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
<filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
</analyzer>
</fieldType>
Change the field type on fields you want to index with lemmatization and save the edited managed-schema file.
<field name="content" type="text_lemm" indexed="true" termOffsets="true" stored="false"
termPositions="true" termPayloads="true" termVectors="true"/>
Test the output of the text_lemm analysis chain using the gptext.analyzer() function. For example:
=# SELECT * FROM gptext.analyzer('demo.public.news_demo', 'index', 'text_lemm',
'The elves spoke of geese migration and the better Vivaldi libretti.');
class | tokens
-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------
OpenNLPTokenizer | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
OpenNLPPOSFilter | {{"The"},{"elves"},{"spoke"},{"of"},{"geese"},{"migration"},{"and"},{"the"},{"better"},{"Vivaldi"},{"libretti"},{"."},{},{}}
WordNetLemmatizerFilter | {{"the"},{"elf"},{"elves"},{"speak"},{"of"},{"goose"},{"migration"},{"and"},{"the"},{"better"},{"good"},{"vivaldi"},{"libretto"},{"."}}
TypeAsSynonymFilter | {{"the","@DT"},{"elf","@NNS"},{"elves","@NNS"},{"speak","@VBD"},{"of","@IN"},{"goose","@NN"},{"migration","@NN"},{"and","@CC"},{"the","@DT"},{"better","@JJR"},
{"good","@JJR"},{"vivaldi","@NNP"},{"libretto","@NNS"},{".","@."}}
(4 rows)
The OpenNLPPOSFilter determines the part of speech for each term and the WordNetLemmatizerFilter looks up the root form of the term. Depending on the part of speech, there can be more than one root form for a term. In these cases, the filter adds an additional term to the index for each root form.
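The multiple-root-form behavior can be sketched as a lookup keyed on (word, POS tag). The dictionary entries below are a tiny hand-picked subset for illustration, not the WordNet data the filter actually consults.

```python
# Toy lemmatizer lookup (an assumption, not the WordNetLemmatizerFilter):
# a surface form plus its POS tag can map to more than one base form,
# in which case each base form is emitted as an index term.
TOY_WORDNET = {
    ("better", "JJR"): ["better", "good"],  # comparative adjective: two roots
    ("geese", "NN"): ["goose"],
    ("meeting", "NN"): ["meeting"],         # noun: already a base form
    ("meeting", "VBG"): ["meet"],           # verb: lemmatize to the root
}

def lemmatize(token, pos):
    """Return all base forms for (token, pos); fall back to the token."""
    return TOY_WORDNET.get((token, pos), [token])

print(lemmatize("better", "JJR"))   # ['better', 'good']
print(lemmatize("meeting", "VBG"))  # ['meet']
```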
Reindex the documents.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM news_demo), 'demo.public.news_demo');
=# SELECT * FROM gptext.commit_index('demo.public.news_demo');
The following example demonstrates how lemmatization can improve search results. The table lemm_test contains two documents that use the word "meeting".
Create a new table in the demo database.
=# CREATE TABLE lemm_test (id integer, content text) DISTRIBUTED BY (id);
=# INSERT INTO lemm_test
VALUES (1, 'They are meeting in London.'),
(2, 'He held a meeting in London to discuss the problem.');
Create a Tanzu Greenplum text index for the lemm_test table and add the documents to the index. The content column, in this case, will be configured with the default text_intl field type.
=# SELECT gptext.create_index('public', 'lemm_test', 'id', 'content');
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');
Search the index for the string "the meeting in London".
=# SELECT lt.id, q.score, lt.content
FROM lemm_test lt,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test',
E'\"the meeting in London\"', null, null) q
WHERE q.id::int8 = lt.id;
The search finds both documents:
id | score | content
----+-----------+---------------------------------------------------------
2 | 0.5753642 | He held a meeting in London to discuss the problem.
1 | 0.5753642 | They are meeting in London.
(2 rows)
With lemmatization applied during indexing and searching, the results can be more accurate. The row with ID 1 uses "meeting" as the present participle of the verb "meet", while the second row uses "meeting" as a noun. The search query "the meeting in London" also uses "meeting" as a noun, so only the second row should match.
Edit the managed-schema file for the index.
$ gptext-config edit -i demo.public.lemm_test -f managed-schema
Add the text_lemm field type.
<fieldType name="text_lemm" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="en-token.bin" sentenceModel="en-sent.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="com.emc.solr.analysis.wordnet.WordNetLemmatizerFilterFactory"/>
<filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
</analyzer>
</fieldType>
Find the field element for the content field and change the type attribute to text_lemm.
<field name="__temp_field" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" stored="true" type="int" indexed="true"/>
<field name="__pk" stored="true" indexed="true" type="int"/>
<field name="content" stored="false" type="text_lemm" indexed="true"/>
Save the edited managed-schema file and then reindex the lemm_test table.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM lemm_test), 'demo.public.lemm_test');
=# SELECT gptext.commit_index('demo.public.lemm_test');
Repeat the same search query.
=# SELECT lt.id, q.score, lt.content
FROM lemm_test lt,
gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'demo.public.lemm_test',
E'\"the meeting in London\"', null) q
WHERE q.id::int8 = lt.id;
This time the search finds only the second document:
 id |   score   |                       content
----+-----------+-----------------------------------------------------
  2 | 3.6823308 | He held a meeting in London to discuss the problem.
(1 row)
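The reason only document 2 matches can be traced through the analysis: the verb use of "meeting" (tagged VBG) is indexed under its root "meet", while the noun use (tagged NN) stays "meeting", and the query's noun phrase analyzes to "meeting". The following Python walkthrough is a simplified model of that behavior, not the actual GPText analysis chain.

```python
# Simplified model (an assumption) of POS-aware lemmatized indexing:
# verb "meeting" -> "meet", noun "meeting" -> "meeting".
def index_terms(tagged_tokens):
    lemmas = {("meeting", "VBG"): "meet", ("meeting", "NN"): "meeting"}
    return [lemmas.get((t, p), t.lower()) for t, p in tagged_tokens]

doc1 = index_terms([("They", "PRP"), ("are", "VBP"), ("meeting", "VBG"),
                    ("in", "IN"), ("London", "NNP")])
doc2 = index_terms([("He", "PRP"), ("held", "VBD"), ("a", "DT"),
                    ("meeting", "NN"), ("in", "IN"), ("London", "NNP")])

def phrase_match(doc, phrase):
    """True if the phrase terms appear consecutively in the document terms."""
    n = len(phrase)
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

print(phrase_match(doc1, ["meeting", "in", "london"]))  # False: indexed as "meet"
print(phrase_match(doc2, ["meeting", "in", "london"]))  # True
```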
Tanzu Greenplum text includes English language models, including the sentence (en-sent.bin), tokenizer (en-token.bin), POS (en-pos-maxent.bin), and NER (en-ner-person.bin, en-ner-organization.bin, en-ner-time.bin) models used in the examples in this section.
To specify the models you want to use for an index, edit the managed-schema file for the index and set the nerTaggerModels attribute of the OpenNLPNERFilterFactory filter element in the field type definition.
<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
nerTaggerModels="en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
You can download models for other languages at Models for 1.5 Series. Upload the model to ZooKeeper using the gptext-config upload command and then update the nerTaggerModels attribute as shown. For example, to add the Spanish person model:
Download the es-ner-person.bin file from Models for 1.5 Series.
Upload the es-ner-person.bin file to ZooKeeper.
$ gptext-config upload -i demo.public.news_demo -l es-ner-person.bin -f es-ner-person.bin
Edit the managed-schema file for the index.
$ gptext-config edit -i demo.public.news_demo -f managed-schema
Add the es-ner-person.bin model to the OpenNLPNERFilterFactory filter for the field. Spanish names are recognized first, and then English names.
<filter class="com.emc.solr.analysis.opennlp.OpenNLPNERFilterFactory"
nerTaggerModels="es-ner-person.bin,en-ner-person.bin,en-ner-organization.bin,en-ner-time.bin"/>
Save the managed-schema file changes and reindex the documents.