Documentation > How to use OpenNLP (NER) with Opensolr

Here are the steps to follow in order to use the OpenNLP AI models with Opensolr.
In this example, we will cover how to extract named entities (NER) using the default OpenNLP models.

Please note that at this time, the models are only enabled in the Germany, Solr Version 9 environment, so you may want to create your index there when on the Add New Index page in your Opensolr Control Panel.
We can, however, set up your models in any region, including your own dedicated Opensolr Infrastructure for Corporate accounts. Simply drop us a note via the Support Helpdesk system, and we'll be happy to enable any models for you.

Add New Opensolr Index

You can download the default OpenNLP models from the OpenNLP website, or from Opensolr, here.

  • Create your Opensolr Index, in any region, using Solr Version 7+ (7,8,9)
  • IMPORTANT: If you create your index in the Opensolr Germany Solr 9 Web Crawler Environment, you will not have to follow any of the steps below. Otherwise, you will have to add the snippets to schema.xml and solrconfig.xml and also submit a support ticket, with your Opensolr Index Name, to ask for the NLP models to be enabled for your Opensolr Index.
  • In your schema.xml, define a destination field, where the recognized entities will be stored, and the NLP tokenizer fieldType.
    • You can do that by going to your Opensolr Control Panel, clicking on your Index Name, heading to the Configuration tab, and selecting schema.xml from the drop-down menu in order to edit your schema.xml file
      • Edit schema.xml
    • Now simply add the following snippets to your schema.xml: the first where the other field definitions are, and the second where the other fieldType definitions are.
      • <dynamicField name="*_s" type="string" multiValued="true" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true" />
      • <fieldType name="text_nlp" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
            <tokenizer class="solr.OpenNLPTokenizerFactory"
                       sentenceModel="en-sent.bin"
                       tokenizerModel="en-token.bin"/>
            <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
            <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
            <filter class="solr.TypeAsPayloadFilterFactory"/>
          </analyzer>
        </fieldType>
    • Please note that it is very important that you do not define your dynamic field with the text_nlp field type. The text_nlp field type is only used by the update processor in order to run the NLP models, as we will see in the next steps.
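To see why the single *_s dynamicField definition is enough here: every destination field used later by the NLP chain (people_s, location_s, organization_s, language_s) ends in _s, so they are all picked up by that one pattern. A minimal sketch of this glob-style matching (Solr's dynamic-field matching is illustrated here with Python's fnmatch; this is not Solr's own code):

```python
# Sketch: Solr resolves dynamic fields with simple glob-style patterns.
# All destination fields written by the NLP update chain end in "_s",
# so the single "*_s" dynamicField definition covers every one of them.
import fnmatch

dynamic_pattern = "*_s"
dest_fields = ["people_s", "location_s", "organization_s", "language_s"]

matched = [f for f in dest_fields if fnmatch.fnmatch(f, dynamic_pattern)]
print(matched)  # every destination field is covered by the pattern
```

This is also why no explicit field definitions are needed for the extracted entities.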
  • Now click SAVE, and then select solrconfig.xml from the drop-down, where we will define the updateRequestProcessorChain.
    • Save schema.xml
  • In solrconfig.xml, add the following snippet. The comments explain what each processor does, so we won't go into much detail here.
    • <requestHandler name="/update" class="solr.UpdateRequestHandler" >
          <lst name="defaults">
              <str name="update.chain">nlp</str>
          </lst>
      </requestHandler>
      <updateRequestProcessorChain name="nlp">
          <!--
          Extract English Language People Names from the fields: title and description, 
          and put them in the people_s multivalued string field
          -->
          <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
              <str name="modelFile">en-ner-person.bin</str>
              <str name="analyzerFieldType">text_nlp</str>
              <arr name="source">
                  <str>title</str>
                  <str>description</str>
              </arr>
              <str name="dest">people_s</str>
          </processor>
          <!--
          Extract Spanish Language People Names from the fields: title and description, 
          and put them in the people_s multivalued string field
          -->
          <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
              <str name="modelFile">es-ner-person.bin</str>
              <str name="analyzerFieldType">text_nlp</str>
              <arr name="source">
                  <str>title</str>
                  <str>description</str>
              </arr>
              <str name="dest">people_s</str>
          </processor>
          <!--Extract Locations-->
          <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
              <str name="modelFile">en-ner-location.bin</str>
              <str name="analyzerFieldType">text_nlp</str>
              <arr name="source">
                  <str>title</str>
                  <str>description</str>
              </arr>
              <str name="dest">location_s</str>
          </processor>
          <!--Extract Organizations-->
          <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
              <str name="modelFile">en-ner-organization.bin</str>
              <str name="analyzerFieldType">text_nlp</str>
              <arr name="source">
                  <str>title</str>
                  <str>description</str>
              </arr>
              <str name="dest">organization_s</str>
          </processor>
          <!--
          Detect the language of each Solr document, based on the data in the
          title, text, and description fields, using the NLP model: langdetect-183.bin
          -->
          <processor class="org.apache.solr.update.processor.OpenNLPLangDetectUpdateProcessorFactory">
              <str name="langid.fl">title,text,description</str>
              <str name="langid.langField">language_s</str>
              <str name="langid.model">langdetect-183.bin</str>
          </processor>
          <!--
          Run a de-duplicator on each target string field, so that we don't
          end up with duplicate extracted names, organizations, locations, etc.,
          in our string target fields.
          -->
          <processor class="solr.UniqFieldsUpdateProcessorFactory">
              <str name="fieldRegex">.*_s</str>
          </processor>
          <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>
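Because update.chain is set in the /update handler defaults above, any document posted to /update runs through the nlp chain automatically. A minimal sketch of building such a request; the base URL below is a placeholder (use your index's real endpoint from the Opensolr Control Panel), and here we only construct the URL and JSON body without sending anything:

```python
# Sketch: build the /update request that triggers the "nlp" chain.
# The base URL is a placeholder -- take the real endpoint for your
# index from the Opensolr Control Panel.
import json

base_url = "https://your-server.opensolr.com/solr/your_index_name"  # placeholder
update_url = base_url + "/update?commit=true"

doc = {
    "id": "1",
    "title": "Jack Sparrow was a pirate.",
    "description": "He used to live in downtown Las Vegas.",
}
body = json.dumps([doc])  # /update accepts a JSON array of documents

print(update_url)
print(body)
```

Sending this body to the URL (for example with curl and a Content-Type: application/json header) would index the document, and the chain would fill people_s, location_s, organization_s, and language_s.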
      
  • Now, all we need is data.
    Since we're using the Opensolr Web Crawler Environment in Germany, running Solr 9.0, we could simply run the Web Crawler to crawl and index your website's public HTML content, and then see whether the OpenNLP models can extract that information.
  • However, for this example, we will just populate some sample data by going to the Solr Admin tab and inserting test data in JSON format there.
    • Solr Admin Panel
  • Here's the sample data that you can paste into your index:
    • {
          "id": "1",
          "title": "Jack Sparrow was a pirate. Many feared him. He used to live in downtown Las Vegas.",
          "description": "Jack Sparrow and Janette Sparrowa, are now on their way to Monte Carlo for the summer vacation, after working hard for Microsoft, creating the new and exciting Windows 11 which everyone now loves. :)",
          "text": "The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component. Learn more about how you can get involved."
      }
  • Now, sure enough, if we head over to the Query tab and look at our data, we can see that the OpenNLP models have successfully extracted the names, organizations, and locations mentioned in the data above.
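The same check can be scripted instead of using the Query tab: query for all documents and restrict the field list to the fields the NLP chain produced. A sketch that only builds the query URL (the base URL is again a placeholder for your real Opensolr endpoint):

```python
# Sketch: build a /select query that returns only the fields written
# by the NLP update chain. The base URL is a placeholder.
from urllib.parse import urlencode

base_url = "https://your-server.opensolr.com/solr/your_index_name"  # placeholder
params = {
    "q": "*:*",                                            # match all documents
    "fl": "id,people_s,location_s,organization_s,language_s",  # extracted fields only
    "wt": "json",
}
query_url = base_url + "/select?" + urlencode(params)
print(query_url)
```

Fetching this URL against a live index would return each document's extracted people, locations, organizations, and detected language.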
  • Solr Query
  • Opensolr NLP End Result
  • If you need any help, please don't hesitate to contact us.