Our new AI-powered Web Crawler does the heavy lifting: it crawls your site, extracts structured data, applies NLP + NER, and feeds everything straight into Solr — fully indexed and ready to search.
No manual config. No fiddling with schemas. Just point it at your site and go.
Because if your site's content is smart, your search should be too. 🧠
schema.xml (works with the Opensolr Embed API):
<!--VECTORS-->
<field name="embeddings" type="vector" indexed="true" stored="true" multiValued="false" required="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="384" similarityFunction="cosine"/>
⚠️ Pay very close attention to the vectorDimension, as it has to match the embeddings you are creating with your LLM model. If you use the Opensolr Index Embedding API, this has to be exactly 384.
This works with the Opensolr Embed API endpoint, which uses the all-MiniLM-L6-v2 embedding model.
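As a sanity check before indexing, you can verify client-side that each vector matches the schema's vectorDimension. This is a minimal sketch; the 384 constant assumes the Opensolr Embed API / all-MiniLM-L6-v2 setup described above:

```python
# Expected dimensionality, matching vectorDimension="384" in schema.xml.
EXPECTED_DIM = 384

def validate_embedding(vector):
    """Raise early if a vector would be rejected by the Solr schema."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, schema expects {EXPECTED_DIM}"
        )
    return vector
```

Failing fast on the client side gives a clearer error than the indexing failure Solr returns when the dimensions mismatch.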
schema.xml (via the Solr Schema API):
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>solr/<OPENSOLR_INDEX_NAME>/schema/fieldtypes -H 'Content-type:application/json' -d '{
"add-field-type": {
"name": "vector",
"class": "solr.DenseVectorField",
"vectorDimension": 384,
"similarityFunction": "cosine"
}
}'
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>solr/<OPENSOLR_INDEX_NAME>/schema/fields -H 'Content-type:application/json' -d '{
"add-field": {
"name":"embeddings",
"type":"vector",
"indexed":true,
"stored":false,
"multiValued":false,
"required":false
}
}'
Note: JSON does not allow // comments, so keep them out of the request body. Set "stored":true if you want to see the vectors for debugging. The dimension and similarity function belong on the vector field type (vectorDimension and similarityFunction, defined above), so adjust the 384 there to match your embedder's size.
solrconfig.xml (for atomic updates, to use with the Opensolr Index Embedding API):
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<int name="numVersionBuckets">65536</int>
<int name="maxNumLogsToKeep">10</int>
<int name="numRecordsToKeep">10</int>
</updateLog>
.....
</updateHandler>
As much as we love innovation, vector search still has a few quirks:
Hybrid search bridges the gap, combining trusty keyword (lexical) search with smart vector (neural) search for results that are both sharp and relevant.
Contrary to the grapevine, Solr can absolutely do hybrid search, even if the docs are a little shy about it. If your schema mixes traditional fields with a solr.DenseVectorField, you're all set.
Solr's Boolean Query Parser lets you mix and match candidate sets with flair:
q={!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
Result: All unique hits from both searches. No duplicates, more to love! ❤️
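The union request above can also be assembled programmatically. A minimal Python sketch (the field names text_field and vector, and the vector values, are just the placeholders from the snippet above):

```python
from urllib.parse import urlencode

# Compose the hybrid (union) request as Solr parameters. $lexicalQuery and
# $vectorQuery are parameter references resolved by Solr, not by the client.
params = {
    "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
    "lexicalQuery": "{!type=edismax qf=text_field}term1",
    "vectorQuery": "{!knn f=vector topK=10}[0.001, -0.422, -0.284]",
}
query_string = urlencode(params)  # append to your /select handler URL
```

Building the parameters as a dict and letting urlencode handle escaping avoids hand-escaping the curly braces in the local-params syntax.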
q={!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
Result: Only the most relevant docs, where both worlds collide. 🤝
You also have to be mindful of the Solr version you are using: we were only able to make this work on Solr 9.0. Beware, this did not work on Solr 9.6! Only reranking queries worked on Solr 9.6 (as shown below).
Basically, at this point, here are all the parameters we sent Solr to make this hybrid search work on Solr 9.0:
Classic Solr Edismax Search combined with dense vector search (UNION)
{
"mm":"1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
"df":"title",
"ps":"3",
"bf":"recip(rord(timestamp),1,1500,500)^90",
"fl":"score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
"start":"0",
"fq":"+content_type:text*",
"rows":"100",
"vectorQuery":"{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
"q":"{!bool must=$lexicalQuery must=$vectorQuery}",
"qf":"title^10 description^5 uri^3 text^2 phonetic_title^0.1",
"pf":"title^15 description^7 uri^9",
"lexicalQuery":"{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
"pf3":"text^5",
"pf2":"tdescription^6"
}
Classic dense vector search reranked with Edismax (RERANK)
{
"mm":"1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
"df":"title",
"ps":"3",
"bf":"recip(rord(timestamp),1,1500,500)^90",
"fl":"score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
"start":"0",
"fq":"+content_type:text*",
"rows":"100",
"q":"{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
"rqq":"{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
"qf":"title^10 description^5 uri^3 text^2 phonetic_title^0.1",
"pf":"title^15 description^7 uri^9",
"pf3":"text^5",
"pf2":"tdescription^6",
"rq":"{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}"
}
🔹 This is based on the classic Opensolr Web Crawler Index, which does most of its work within the fields: title, description, text, uri.
📰 The index is populated with data crawled from various public news websites.
🔗 We embedded a concatenation of title, description, and the first 50 sentences of text.
We use the Opensolr Query Embed API to embed our query at search time.
🏃 You can see this search in action here.
👩‍💻 You can also see the Solr data and make your own queries on it. This index's Solr API is here.
🔑 Credentials: Username: 123 / Password: 123 -> Enjoy! 🥳
𤼠Below is a cheat-sheet, of the fields and how you’re supposed to use them if you run knn queries. Solr is very picky about what goes with knn and what doesn’t. For example, for the Union query, we were unable to use highlighting. But, if you follow the specs below, you’ll probably won’t be getting any Query can not be null
Solr errors… (or will you? đ¤)
What goes inside {!edismax} in lexicalQuery? 🧾

Parameter | Inside lexicalQuery? | Why |
---|---|---|
q | ✅ YES | Required for the subquery to function |
qf, pf, bf, bq, mm, ps | ✅ YES | All edismax features must go inside |
defType | ❌ NO | Already defined by {!edismax} |
hl, spellcheck, facet, rows, start, sort | ❌ NO | These are top-level Solr request features |
Here's how to do it right when you want all the bells and whistles (highlighting, spellcheck, deep edismax):
# TOP-LEVEL BOOLEAN QUERY COMPOSING EDISMAX AND KNN
q={!bool should=$lexicalQuery should=$vectorQuery}
# LEXICAL QUERY: ALL YOUR EDISMAX STUFF GOES HERE
&lexicalQuery={!edismax q=$qtext qf=$qf pf=$pf mm=$mm bf=$bf}
# VECTOR QUERY
&vectorQuery={!knn f=vectorField topK=10}[0.123, -0.456, ...]
# EDISMAX PARAMS
&qtext='flying machine'
&qf=title^6 description^3 text^2 uri^4
&pf=text^10
&mm=1<100% 2<75% 3<50% 6<30%
&bf=recip(ms(NOW,publish_date),3.16e-11,1,1)
# NON-QUERY STUFF
&hl=true
&hl.fl=text
&hl.q=$lexicalQuery
&spellcheck=true
&spellcheck.q=$qtext
&rows=20
&start=0
&sort=score desc
Hybrid search gives you the sharp accuracy of keywords and the deep smarts of vectors, all in one system. With Solr, you can have classic reliability and modern magic. 🦄✨
"Why choose between classic and cutting-edge, when you can have both? Double-scoop your search!"
Happy hybrid searching! 🥳
The Opensolr AI-Hints API is free to use as part of your Opensolr account.
The Opensolr AI-Hints LLM will generate a summary of the context, coming either from your Opensolr Web Crawler Index or from manually entered context.
A number of other instructions can be passed to this API, for NER and other capabilities. It is in beta at this point, but will get better with time.
Example: https://api.opensolr.com/solr_manager/api/ai_summary?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&index_name=my_crawler_solr_index&instruction=Answer%20The%20Query&query=Who%20is%20Donald%20Trump?
embed
The embed endpoint allows you to generate vector embeddings for any arbitrary text payload (up to 50,000 characters) and store those embeddings in your specified Opensolr index. This is ideal for embedding dynamic or ad-hoc content without having to pre-index data in Solr first.
https://api.opensolr.com/solr_manager/api/embed
Supports only POST requests.
Parameter | Type | Required | Description |
---|---|---|---|
email | string | Yes | Your Opensolr registration email address. |
api_key | string | Yes | Your API key from the Opensolr dashboard. |
index_name | string | Yes | Name of your Opensolr index/core to use. |
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
payload | string | Yes | - | The raw text string to embed. Maximum: 50,000 characters. |
- payload can be any UTF-8 text (e.g., a document, user input, generated content, etc.).
- If payload is missing or shorter than 2 characters, the API returns a 404 error with a JSON error response.
- Use index_name to indicate where the embedding should be stored (requires the appropriate field in your Solr schema).

To store embeddings, your Solr schema must define an appropriate vector field, for example:
<field name="embeddings" type="vector" indexed="true" stored="false" multiValued="false"/>
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="384" required="false" similarityFunction="cosine"/>
Adjust the name, type, and vectorDimension as needed to fit your use case and model.
POST https://api.opensolr.com/solr_manager/api/embed
Content-Type: application/x-www-form-urlencoded
[email protected]&api_key=YOUR_API_KEY&index_name=your_index&payload=Your text to embed here.
- Authenticates via email and api_key.
- Requires a payload parameter (must be 2-50,000 characters).

Success response:
{
"status": "success",
"embedding": [/* vector values */],
"length": 4381
}
Or, for invalid input:
{
"ERROR": "Invalid payload"
}
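As an illustration, the 2-50,000 character constraint can be enforced client-side before calling the endpoint. A minimal sketch; only the endpoint URL and parameter names come from the documentation above, the credential values are placeholders:

```python
from urllib.parse import urlencode

EMBED_URL = "https://api.opensolr.com/solr_manager/api/embed"

def build_embed_body(email, api_key, index_name, payload):
    """Return the form-encoded POST body, validating payload length first."""
    if not (2 <= len(payload) <= 50_000):
        raise ValueError("payload must be 2-50,000 characters")
    return urlencode({
        "email": email,
        "api_key": api_key,
        "index_name": index_name,
        "payload": payload,
    })

body = build_embed_body("[email protected]", "YOUR_API_KEY", "your_index",
                        "Some text to embed.")
# POST `body` to EMBED_URL with Content-Type: application/x-www-form-urlencoded
```

Checking the length locally avoids burning a request only to receive the 404 "Invalid payload" response shown above.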
For more information or help, visit Opensolr Support or use your Opensolr dashboard.
embed_opensolr_index
Using the embed_opensolr_index endpoint involves Solr atomic updates, meaning each Solr document is updated individually with the new embeddings. Atomic updates in Solr only update the fields you include in the update payload; all other fields remain unchanged. However, you cannot generate embeddings from fields that are stored=false, because Solr cannot retrieve their values for you.
You will not lose stored=false fields just by running an atomic update. Atomic updates do NOT remove or overwrite fields you do not explicitly update. Data loss of non-stored fields only happens if you replace the entire document (a full document overwrite), not during field-level atomic updates.
Because of this, it's highly recommended to clearly understand the implications of Solr atomic updates. For most users, the safer approach is to create embeddings at indexing time (using the /embed endpoint), especially if you rely on non-stored fields for downstream features.
Please review the official documentation on Solr Atomic Updates to fully understand these implications before using this endpoint.
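To make the atomic-update mechanics concrete, this sketch builds the kind of field-level update document Solr receives. The doc id and field name are placeholders, and the endpoint performs this step for you:

```python
import json

def atomic_embedding_update(doc_id, vector):
    """Build a Solr atomic update that sets only the embeddings field.

    The {"set": ...} modifier tells Solr to update this one field and
    leave the document's other stored fields unchanged.
    """
    return [{"id": doc_id, "embeddings": {"set": vector}}]

payload = json.dumps(atomic_embedding_update("doc-1", [0.1, 0.2, 0.3]))
# POST `payload` to /solr/<INDEX>/update?commit=true as application/json
```

The surrounding list is required because Solr's JSON update handler accepts an array of documents.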
schema.xml
<!--VECTORS-->
<field name="embeddings" type="vector" indexed="true" stored="true" multiValued="false" required="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="384" similarityFunction="cosine"/>
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>solr/<OPENSOLR_INDEX_NAME>/schema/fieldtypes -H 'Content-type:application/json' -d '{
"add-field-type": {
"name": "vector",
"class": "solr.DenseVectorField",
"vectorDimension": 384,
"similarityFunction": "cosine"
}
}'
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>solr/<OPENSOLR_INDEX_NAME>/schema/fields -H 'Content-type:application/json' -d '{
"add-field": {
"name":"embeddings",
"type":"vector",
"indexed":true,
"stored":false,
"multiValued":false,
"required":false
}
}'
Note: JSON does not allow // comments, so keep them out of the request body. Set "stored":true if you want to see the vectors for debugging. The dimension and similarity function belong on the vector field type (vectorDimension and similarityFunction, defined above), so adjust the 384 there to match your embedder's size.
solrconfig.xml:
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<int name="numVersionBuckets">65536</int>
<int name="maxNumLogsToKeep">10</int>
<int name="numRecordsToKeep">10</int>
</updateLog>
.....
</updateHandler>
The embed_opensolr_index endpoint allows Opensolr users to generate and store text embeddings for documents in their Opensolr indexes using a Large Language Model (LLM). These embeddings power advanced features such as semantic search, classification, and artificial-intelligence capabilities on top of your Solr data.
https://api.opensolr.com/solr_manager/api/embed_opensolr_index
Supports both GET and POST methods.
Parameter | Type | Required | Description |
---|---|---|---|
email | string | Yes | Your Opensolr registration email address. |
api_key | string | Yes | Your API key from the Opensolr dashboard. |
index_name | string | Yes | Name of your Opensolr index/core to be embedded. |
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
emb_solr_fields | string | No | title,description,text | Comma-separated list of Solr fields to embed (can be any valid fields in your index). |
emb_solr_embeddings_field_name | string | No | embeddings | Name of the Solr field to store generated embeddings. |
emb_full_solr_grab | bool/string | No | false | If "yes", embed all documents in the index; otherwise use the pagination parameters below. |
emb_solr_start | integer | No | 0 | Starting document offset (for pagination). |
emb_solr_rows | integer | No | 10 | Number of documents to process in the current request (page size). |
- Fields to embed are controlled by emb_solr_fields, which defaults to title,description,text, but you may specify any fields from your index for embedding.
- Set emb_solr_embeddings_field_name to match the embeddings field in your schema.xml. Example configuration:

<field name="embeddings" type="vector" indexed="true" stored="false" multiValued="false"/>
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="384" required="false" similarityFunction="cosine"/>

Replace embeddings and vector with your custom names if you use different field names.

Solr atomic updates update only the fields you specify in the update request. Other fields, including those defined as non-stored (stored=false), are not changed or removed by an atomic update. However, since non-stored fields cannot be retrieved from Solr, you cannot use them to generate embeddings after indexing time.
If you ever replace an entire document (full overwrite), non-stored fields will be lost unless you explicitly provide their values again.

- Set emb_full_solr_grab to yes to embed all documents in the index; otherwise, the endpoint uses pagination.

POST https://api.opensolr.com/solr_manager/api/embed_opensolr_index
Content-Type: application/x-www-form-urlencoded
[email protected]&api_key=YOUR_API_KEY&index_name=your_index
POST https://api.opensolr.com/solr_manager/api/embed_opensolr_index
Content-Type: application/x-www-form-urlencoded
[email protected]&api_key=YOUR_API_KEY&index_name=your_index&emb_solr_fields=title,content&emb_solr_embeddings_field_name=embeddings&emb_full_solr_grab=yes
GET https://api.opensolr.com/solr_manager/api/[email protected]&api_key=YOUR_API_KEY&index_name=your_index

- Authenticates via email and api_key.
- Connects to the index given by index_name.
- Embeds the fields listed in emb_solr_fields.
- Stores the generated vectors in emb_solr_embeddings_field_name.
- If emb_full_solr_grab is yes, processes all documents; otherwise uses emb_solr_start and emb_solr_rows for batch processing.

For more information or help, visit Opensolr Support or use your Opensolr dashboard.
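The batch processing described above can be sketched as a simple pagination loop. batch_offsets is a hypothetical helper, and the HTTP call per batch is elided:

```python
def batch_offsets(total_docs, rows=10):
    """Yield (emb_solr_start, emb_solr_rows) pairs covering the whole index.

    rows defaults to 10, matching the endpoint's default emb_solr_rows.
    """
    for start in range(0, total_docs, rows):
        yield start, min(rows, total_docs - start)

# For a 25-document index with the default page size of 10:
offsets = list(batch_offsets(25))
# each (start, rows) pair becomes one embed_opensolr_index request
```

Paginating like this keeps each request small, which matters when each batch triggers LLM embedding calls.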
Heads up!
Before you dive into using NLP models with your Opensolr index, please contact us to request that the NLP models be installed for your Opensolr index.
We'll reply with the correct path to use for the .bin files in your schema.xml or solrconfig.xml. Or, if you'd rather avoid all the hassle, just ask us to set it up for you. Done and done.
This is your step-by-step guide to using AI-powered OpenNLP models with Opensolr. In this walkthrough, we'll cover Named Entity Recognition (NER) using the default OpenNLP models, so you can start extracting valuable information (like people, places, and organizations) directly from your indexed data.
⚠️ Note:
Currently, these models are enabled by default only in the Germany, Solr version 9 environment. So, if you want an easy life, create your index there!
We're happy to set up the models in any region (or even on your dedicated Opensolr infrastructure, for corporate accounts) if you reach out via our Support Helpdesk.
You can also download OpenNLP default models from us or the official OpenNLP website.
Create your Opensolr Index
Edit Your schema.xml
Select schema.xml to edit, then add the following.
Dynamic Field (for storing entities):
<dynamicField name="*_s" type="string" multiValued="true" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true" />
**NLP Tokenizer fieldType:**
<fieldType name="text_nlp" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="en-sent.bin"
tokenizerModel="en-token.bin"/>
<filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
<filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
<filter class="solr.TypeAsPayloadFilterFactory"/>
</analyzer>
</fieldType>
- **Important:** Don't use the `text_nlp` type for your dynamic fields! It's only for the update processor.
Save, then Edit Your solrconfig.xml
Add the following updateRequestProcessorChain (and the corresponding requestHandler):
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">nlp</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="nlp">
<!-- Extract English People Names -->
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">en-ner-person.bin</str>
<str name="analyzerFieldType">text_nlp</str>
<arr name="source">
<str>title</str>
<str>description</str>
</arr>
<str name="dest">people_s</str>
</processor>
<!-- Extract Spanish People Names -->
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">es-ner-person.bin</str>
<str name="analyzerFieldType">text_nlp</str>
<arr name="source">
<str>title</str>
<str>description</str>
</arr>
<str name="dest">people_s</str>
</processor>
<!-- Extract Locations -->
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">en-ner-location.bin</str>
<str name="analyzerFieldType">text_nlp</str>
<arr name="source">
<str>title</str>
<str>description</str>
</arr>
<str name="dest">location_s</str>
</processor>
<!-- Extract Organizations -->
<processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
<str name="modelFile">en-ner-organization.bin</str>
<str name="analyzerFieldType">text_nlp</str>
<arr name="source">
<str>title</str>
<str>description</str>
</arr>
<str name="dest">organization_s</str>
</processor>
<!-- Language Detection -->
<processor class="org.apache.solr.update.processor.OpenNLPLangDetectUpdateProcessorFactory">
<str name="langid.fl">title,text,description</str>
<str name="langid.langField">language_s</str>
<str name="langid.model">langdetect-183.bin</str>
</processor>
<!-- Remove duplicate extracted entities -->
<processor class="solr.UniqFieldsUpdateProcessorFactory">
<str name="fieldRegex">.*_s</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Populate Test Data (for the impatient!)
Sample JSON:
{
"id": "1",
"title": "Jack Sparrow was a pirate. Many feared him. He used to live in downtown Las Vegas.",
"description": "Jack Sparrow and Janette Sparrowa, are now on their way to Monte Carlo for the summer vacation, after working hard for Microsoft, creating the new and exciting Windows 11 which everyone now loves. :)",
"text": "The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.Learn more about how you can get involved."
}
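To push the sample through the NER chain, post it to the /update handler (since update.chain defaults to nlp in the config above, no extra parameter is needed). A hedged sketch; host and credentials are placeholders following the doc's conventions:

```python
import json

# The sample document from above, serialized the way Solr's /update
# handler expects it: a JSON array of documents.
doc = {
    "id": "1",
    "title": "Jack Sparrow was a pirate. Many feared him. "
             "He used to live in downtown Las Vegas.",
}
body = json.dumps([doc])
# POST `body` to https://<OPENSOLR_INDEX_HOST>solr/<OPENSOLR_INDEX_NAME>/update?commit=true
# with Content-Type: application/json; extracted entities land in
# people_s, location_s, organization_s, and the language in language_s.
```

After the commit, querying the index should show the dynamic `*_s` fields populated by the update chain.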
See the Magic!
If any step trips you up, contact us and we'll gladly assist you, whether it's model enablement, schema help, or just a friendly chat about Solr and AI. 🤖
Happy Solr-ing & entity extracting!
If you're uploading or saving configuration files using the Opensolr Editor, you might occasionally be greeted by an error that looks a little something like this:
Error loading class ‘solr.ICUCollationField’
Don't worry; this doesn't mean the sky is falling or that your config files have started speaking in tongues.
The error above simply means the ICU (International Components for Unicode) library isn't enabled on your Opensolr server (yet!). This library is required if your configuration references classes like solr.ICUCollationField, usually for advanced language collation and sorting.
The solution is delightfully simple: contact Opensolr Support and request that we enable the ICU library for your server.
A real human (yes, a human!) will flip the right switches for your server, and you'll be back to uploading config files in no time.
If you're not sure what sort of error you're running into, or just want to peek under the hood, you can always check your Error Logs after uploading config files:
You’ll see something like this button in your dashboard:
Check the logs to spot any ICU or other config errors. If it smells like ICU, contact us; if it smells like something else, well… contact us anyway. We're here to help!
Happy indexing!