🧠 Using NLP Models in Your Solr schema_extra_types.xml
Leverage the power of Natural Language Processing (NLP) right inside Solr!
With built-in support for OpenNLP models, you can add advanced tokenization, part-of-speech tagging, named entity recognition, and much more—no PhD required.
🚀 Why Use NLP Models in Solr?
Integrating NLP in your schema allows you to:
- Extract nouns, verbs, or any part-of-speech you fancy.
- Return more relevant results by filtering, stemming, and applying synonyms.
- Create blazing-fast autocomplete and suggestion features via EdgeNGrams.
- Support multi-language, linguistically smart queries.
In short: your Solr becomes smarter and your users get better search results.
⚙️ Example: Dutch Edge NGram Nouns Field
Here’s a typical fieldType in your schema_extra_types.xml using OpenNLP:
<fieldType name="text_edge_nouns_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_edge_nouns_nl.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
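A fieldType only takes effect once a field references it. Here is a minimal sketch of how that might look, assuming a hypothetical source field named title and a hypothetical suggestion field named title_suggest_nl. Neither name comes from the original schema, and text_general is assumed to be defined elsewhere in your config.

```xml
<!-- Hypothetical source field; "title" and "text_general" are assumed names -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- Hypothetical autocomplete field backed by the Dutch noun fieldType -->
<field name="title_suggest_nl" type="text_edge_nouns_nl" indexed="true" stored="false"/>

<!-- Copy the raw title text so it is analyzed a second time by the OpenNLP chain -->
<copyField source="title" dest="title_suggest_nl"/>
```

Only the index analyzer applies EdgeNGramFilterFactory, so a partial word in the query is compared against the stored prefixes instead of being split into grams itself. The query analyzer still runs the POS tagger and type filter, so it is worth checking in the Solr Admin Analysis screen that short prefixes survive the query chain.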
🔎 Important Details
- Model Paths:
  Always reference the full absolute path for NLP model files. For example: sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"
  This ensures Solr always finds your precious language models, with no “file not found” drama!
- Type Token Filtering:
  The TypeTokenFilterFactory with useWhitelist="true" will only keep tokens matching the allowed parts of speech (like nouns, verbs, etc.), as defined in pos_edge_nouns_nl.txt. This keeps your index tight and focused.
- Synonym Graphs:
  Add SynonymGraphFilterFactory to enable query-side expansion. This is great for handling multiple word forms, synonyms, and local lingo.
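As a rough illustration of the synonym step: SynonymGraphFilterFactory reads a plain-text file in Solr's synonym format (comma-separated equivalents, or => for one-way mappings). The Dutch entries below are made-up examples, not taken from any real synonyms_edge_nouns_nl.txt. They are shown inside an XML comment next to the filter that would load them.

```xml
<!--
  Hypothetical contents of synonyms_edge_nouns_nl.txt (illustrative entries only):

    # equivalent terms, expanded in both directions
    fiets, rijwiel
    # one-way mapping: "gsm" is rewritten to "mobiel" and "telefoon"
    gsm => mobiel, telefoon

  Entries are lowercase because LowerCaseFilterFactory runs before this filter.
-->
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_edge_nouns_nl.txt"/>
```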
🧑‍🔬 Best Practices & Gotchas
- Keep your NLP model files up to date and tested for your language version!
- If using multiple languages, make sure you have the right models for each language. (No, Dutch models won’t help with Klingon. Yet.)
- EdgeNGram and NGram fields are fantastic for autocomplete—but don’t overdo it, as they can bloat your index if not tuned.
- Use RemoveDuplicatesTokenFilterFactory to keep things clean and efficient.
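To make the index-bloat point concrete, here is a sketch of how the gram bounds change the number of tokens stored per word. The tighter values (3 and 15) are illustrative assumptions, not recommendations from the original. The right bounds depend on your data.

```xml
<!--
  With minGramSize="2" maxGramSize="25", the 7-letter token "fietsen" is indexed as 6 grams:
    fi, fie, fiet, fiets, fietse, fietsen
  With minGramSize="3" maxGramSize="15" it is indexed as 5 grams:
    fie, fiet, fiets, fietse, fietsen
  Tokens longer than maxGramSize are only indexed up to that many characters.
-->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
```

The trade-off is that with minGramSize="3", a two-letter prefix no longer matches anything, so pick the lower bound based on when you want suggestions to start appearing.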
🌍 Not Just for Dutch!
You can set up similar analyzers for English or any other language you have OpenNLP models for. For example:
<fieldType name="text_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_nouns_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
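TypeTokenFilterFactory matches on the token type, which OpenNLPPOSFilterFactory sets to the POS tag, so the types file simply lists tags, one per line. The sketch below assumes the English model emits Penn Treebank tags (as the classic en-pos-maxent.bin models do). Other models may use a different tagset, so check your model's output before copying this.

```xml
<!--
  Assumed contents of pos_nouns_en.txt, one POS tag per line (Penn Treebank noun tags):

    NN
    NNS
    NNP
    NNPS
-->
<filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
```

If the tags in the file do not match what the tagger emits, the whitelist removes every token and the field ends up empty, which is a quick thing to rule out when a new analyzer returns no results.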
📦 Keep It Organized
- Store all model files in a single, logical directory (like /opt/nlp/), and keep a README so you know what’s what.
- Protect those models! They’re your “brains” for language tasks.
🛠️ Wrap-up
Using NLP models in your Solr analyzers will supercharge your search, make autocomplete smarter, and help users find what they’re actually looking for (even if they type like my cat walks on a keyboard).
Need more examples?
Check out the OpenNLP Integration section of the Solr Reference Guide, or the Opensolr documentation.
Happy indexing, and may your tokens always be well-typed! 😸🤓

