schema_extra_types.xml
Leverage the power of Natural Language Processing (NLP) right inside Solr!
With built-in support for OpenNLP models, you can add advanced tokenization, part-of-speech tagging, named entity recognition, and much more—no PhD required.
In short: integrating NLP in your schema makes your Solr smarter and gets your users better search results.
Here’s a typical fieldType in your schema_extra_types.xml using OpenNLP:
<fieldType name="text_edge_nouns_nl" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: tokenize, POS-tag, keep whitelisted types, lowercase, then emit edge n-grams -->
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <!-- Query time: same tagging and type filtering, plus synonym expansion; no n-grams -->
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_edge_nouns_nl.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
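To see why the index analyzer ends with an edge n-gram filter while the query analyzer does not, here is a small Python sketch of what those settings (minGramSize="2", maxGramSize="25") produce. It is an illustration only, not Solr's code: the function names are made up for this example, and it assumes the token has already survived POS tagging and type filtering.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 25) -> list[str]:
    """Return the leading substrings of `token` with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(nouns: list[str]) -> set[str]:
    """Approximate the indexed terms: lowercase each kept noun, then expand to prefixes."""
    terms: set[str] = set()
    for noun in nouns:
        terms.update(edge_ngrams(noun.lower()))
    return terms


def query_terms(nouns: list[str]) -> set[str]:
    """Approximate the query terms: lowercase only, no prefix expansion."""
    return {noun.lower() for noun in nouns}


if __name__ == "__main__":
    indexed = index_terms(["Fiets"])
    print(sorted(indexed))                   # ['fi', 'fie', 'fiet', 'fiets']
    print(query_terms(["fie"]) <= indexed)   # True: a partial query hits an indexed prefix
```

Because the prefixes are stored at index time, a half-typed word can match without any wildcard query, which is what makes this kind of field useful for autocomplete.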
Model Paths:
Always reference the full absolute path for NLP model files. For example:
sentenceModel="/opt/nlp/nl-sent.bin"
tokenizerModel="/opt/nlp/nl-token.bin"
posTaggerModel="/opt/nlp/nl-pos-maxent.bin"
This ensures Solr always finds your precious language models—no “file not found” drama!
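If you want to catch a bad path before reloading a core, a quick pre-flight check can help. The sketch below is a hypothetical helper (it is not part of Solr or Opensolr) that simply verifies each configured path is absolute and points at an existing file; run it on the machine where Solr runs.

```python
from pathlib import Path


def find_model_problems(model_paths: dict[str, str]) -> dict[str, str]:
    """Map each attribute name to a problem description; an empty dict means all good."""
    problems: dict[str, str] = {}
    for attribute, raw_path in model_paths.items():
        path = Path(raw_path)
        if not path.is_absolute():
            problems[attribute] = f"not an absolute path: {raw_path}"
        elif not path.is_file():
            problems[attribute] = f"file not found: {raw_path}"
    return problems


if __name__ == "__main__":
    # Attribute names and paths taken from the fieldType above.
    models = {
        "sentenceModel": "/opt/nlp/nl-sent.bin",
        "tokenizerModel": "/opt/nlp/nl-token.bin",
        "posTaggerModel": "/opt/nlp/nl-pos-maxent.bin",
    }
    for attribute, problem in find_model_problems(models).items():
        print(f"{attribute}: {problem}")
```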
Type Token Filtering:
The TypeTokenFilterFactory with useWhitelist="true" keeps only tokens matching the allowed parts of speech (like nouns, verbs, etc.), as defined in pos_edge_nouns_nl.txt. This keeps your index tight and focused.
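The whitelist logic is easy to picture with a few lines of Python. This is a rough sketch, not Solr's implementation: it assumes the types file lists one tag per line, and the Penn Treebank style tags (NN, NNS, DT, VBZ) are used purely for illustration, since the real tag set depends on the POS model you load.

```python
def parse_types(text: str) -> set[str]:
    """Read one type per line, skipping blank lines and lines starting with '#'."""
    types: set[str] = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            types.add(line)
    return types


def filter_by_type(
    tagged: list[tuple[str, str]], types: set[str], use_whitelist: bool = False
) -> list[str]:
    """Keep tokens whose type is listed (whitelist) or not listed (the default)."""
    return [token for token, pos in tagged if (pos in types) == use_whitelist]


if __name__ == "__main__":
    # Hypothetical file content and tagger output, for illustration only.
    allowed = parse_types("# nouns only\nNN\nNNS\n")
    tagged = [("The", "DT"), ("cat", "NN"), ("chases", "VBZ"), ("mice", "NNS")]
    print(filter_by_type(tagged, allowed, use_whitelist=True))   # ['cat', 'mice']
    print(filter_by_type(tagged, allowed))                       # ['The', 'chases']
```

Note the second call: without useWhitelist="true" the same file acts as a blocklist, so forgetting the attribute would throw away exactly the tokens you wanted to keep.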
Synonym Graphs:
Add SynonymGraphFilterFactory to enable query-side expansion. This is great for handling multiple word forms, synonyms, and local lingo.
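For a feel of what query-side expansion does, here is a simplified Python sketch of the two rule styles found in a Solr synonyms file: comma-separated equivalents and explicit "=>" mappings. It handles single-word terms only (the real filter also handles multi-word synonyms as a token graph), and the Dutch sample rules are invented for this example.

```python
def parse_synonyms(text: str) -> dict[str, list[str]]:
    """Build a term -> replacement terms map from synonym rules (single words only)."""
    mapping: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent terms: each term expands to the whole group.
            sources = targets = [t.strip() for t in line.split(",") if t.strip()]
        for source in sources:
            existing = mapping.setdefault(source, [])
            for target in targets:
                if target not in existing:
                    existing.append(target)
    return mapping


def expand_query(tokens: list[str], mapping: dict[str, list[str]]) -> list[list[str]]:
    """Replace each token by its synonyms, or keep it when no rule matches."""
    return [mapping.get(token, [token]) for token in tokens]


if __name__ == "__main__":
    rules = parse_synonyms("fiets, rijwiel\ngsm => telefoon\n")
    print(expand_query(["fiets", "gsm", "band"], rules))
    # [['fiets', 'rijwiel'], ['telefoon'], ['band']]
```

Since LowerCaseFilterFactory runs before the synonym filter in the query analyzer, it is safest to write the entries in your synonyms file in lowercase.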
Duplicate Removal:
Add RemoveDuplicatesTokenFilterFactory to keep things clean and efficient.
You can set up similar analyzers for English or any other language you like. For example:
<fieldType name="text_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: tokenize, POS-tag, keep whitelisted types, lowercase -->
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- Query time: same tagging and type filtering, plus synonym expansion -->
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_nouns_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Store your model files together in one directory (for example /opt/nlp/), and keep a README so you know what’s what.
Using NLP models in your Solr analyzers will supercharge your search, make autocomplete smarter, and help users find what they’re actually looking for (even if they type like my cat walks on a keyboard).
Need more examples?
Check out the Solr Reference Guide - OpenNLP Integration or Opensolr documentation.
Happy indexing, and may your tokens always be well-typed! 😸🤓