Hybrid Search in Apache Solr: Modern Power, Classic Roots
The Evolution of Search: From Keywords to Vectors 🔍➡️🧠
Important Prerequisites
First, make sure you have this embeddings field in your schema.xml:
<!--VECTORS-->
<field name="embeddings" type="vector" indexed="true" stored="true" multiValued="false" required="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="1024" similarityFunction="cosine"/>
⚠️ Pay very close attention to vectorDimension: it has to match the size of the embeddings your embedding model produces. If you use the Opensolr Index Embedding API, it must be exactly 1024.
This works with the Opensolr Embed API Endpoint which uses the BAAI/bge-m3 embedding model.
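A dimension mismatch is easy to catch before a document ever reaches Solr. The snippet below is a minimal sketch of such a pre-flight check; the function name `check_embedding` is our own invention for illustration, and the only value taken from this guide is the 1024 dimension.

```python
EXPECTED_DIM = 1024  # must equal vectorDimension in the schema


def check_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return the vector as a list of floats, or raise ValueError on a size mismatch."""
    values = [float(x) for x in vector]
    if len(values) != expected_dim:
        raise ValueError(
            f"embedding has {len(values)} dimensions, schema expects {expected_dim}"
        )
    return values


if __name__ == "__main__":
    print(len(check_embedding([0.0] * 1024)))
    try:
        check_embedding([0.1, 0.2, 0.3])
    except ValueError as err:
        print(err)
```

Running a check like this in your indexing code gives you a readable error on your side instead of a rejected update.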
Opensolr also supports the native Solr /schema API, so you can run the following two commands instead to add the field type and the field to your schema.
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>/solr/<OPENSOLR_INDEX_NAME>/schema/fieldtypes -H 'Content-type:application/json' -d '{
  "add-field-type": {
    "name": "vector",
    "class": "solr.DenseVectorField",
    "vectorDimension": 1024,
    "similarityFunction": "cosine"
  }
}'

# Set "stored" to true if you want to see the vectors for debugging.
# Adjust "dimension" to your embedder size.
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>/solr/<OPENSOLR_INDEX_NAME>/schema/fields -H 'Content-type:application/json' -d '{
  "add-field": {
    "name": "embeddings",
    "type": "vector",
    "indexed": true,
    "stored": false,
    "multiValued": false,
    "required": false,
    "dimension": 1024,
    "similarityFunction": "cosine"
  }
}'
Second, make sure you have this in solrconfig.xml, so that atomic updates work with the Opensolr Index Embedding API:
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <int name="numVersionBuckets">65536</int>
    <int name="maxNumLogsToKeep">10</int>
    <int name="numRecordsToKeep">10</int>
  </updateLog>
  .....
</updateHandler>
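The update log matters because an atomic update changes a single field of an existing document instead of re-sending the whole document. As a rough sketch, this is what a payload that sets only the embeddings field could look like, using Solr's `set` modifier. The document id `page-42`, the helper names, and the three-value vector (shortened for readability, a real one would have 1024 values) are assumptions for illustration. We also assume the uniqueKey field is `id`.

```python
import json


def atomic_set_embedding(doc_id, vector, field="embeddings"):
    """Build one atomic-update document that replaces only the vector field."""
    return {"id": doc_id, field: {"set": [float(x) for x in vector]}}


def update_body(docs):
    """Serialize a batch of update documents as a JSON array for the /update handler."""
    return json.dumps(list(docs))


if __name__ == "__main__":
    # Hypothetical document id and a shortened vector.
    doc = atomic_set_embedding("page-42", [0.25, -0.5, 0.125])
    print(update_body([doc]))
```

The resulting JSON array would be sent as the body of a POST to the index's /update endpoint with Content-type application/json.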
Already on Opensolr? Web Crawler indexes come with hybrid search ready out of the box: the embeddings field, BGE-m3 vectors, and the hybrid query pipeline are all pre-configured. No schema edits, no embedding setup. The manual setup above is for custom Opensolr indexes or self-hosted Solr.
Why Vector Search Isn’t a Silver Bullet ⚠️
As much as we love innovation, vector search still has a few quirks:
- Mystery Rankings: Why did document B leapfrog document A? Sometimes, it’s anyone’s guess. 🕳️
- Chunky Business: Embedding models are picky eaters—they work best with just the right size of text chunks.
- Keyword Nostalgia: Many users still expect the comfort of exact matches. “Where’s my keyword?” they ask. (Fair question!)
Hybrid Search: The Best of Both Worlds 🤝
Hybrid search bridges the gap—combining trusty keyword (lexical) search with smart vector (neural) search for results that are both sharp and relevant.
How It Works
- Double the Fun: Run a classic keyword query and a KNN vector search at the same time, creating two candidate lists.
- Clever Combining: Merge and rank for maximum “aha!” moments.
Tuning the Balance: On Opensolr, Search Tuning gives you a visual slider to control the balance between keyword and semantic scoring (0.0 = pure keyword, 3.0 = heavily semantic). The system also adapts dynamically based on query length — short queries lean keyword, longer queries lean semantic.
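To make the "merge and rank" step concrete, here is a minimal sketch of one common way to combine two candidate lists: rescale each list's scores to the 0..1 range, then add them with a weight on the semantic side. This is only an illustration of the idea. It is not Opensolr's actual scoring formula, and the function names and sample scores are made up.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the 0..1 range so both lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge(lexical, semantic, semantic_weight=1.0):
    """Union two candidate lists and rank by lexical + semantic_weight * semantic."""
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {
        doc: lex.get(doc, 0.0) + semantic_weight * sem.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    lexical = {"a": 12.0, "b": 7.5, "c": 3.0}      # made-up keyword scores
    semantic = {"b": 0.91, "c": 0.88, "d": 0.80}   # made-up similarity scores
    for weight in (0.0, 1.0, 3.0):
        print(weight, [doc for doc, _ in merge(lexical, semantic, weight)])
```

With a weight of 0.0 the ranking is purely lexical, and raising the weight lets documents that only the vector search found climb the list.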
Apache Solr Does Hybrid Search (Despite the Rumors) 💡
Contrary to the grapevine, Solr can absolutely do hybrid search—even if the docs are a little shy about it. If your schema mixes traditional fields with a solr.DenseVectorField, you’re all set.
Candidate Selection: Boolean Query Parser to the Rescue 🦸‍♂️
Solr’s Boolean Query Parser lets you mix and match candidate sets with flair:
Union Example
q={!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
Result: All unique hits from both searches. No duplicates, more to love! ❤️
Intersection Example
q={!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
Result: Only the most relevant docs—where both worlds collide. 🤝
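If you assemble these requests in application code, the only difference between the two examples is the clause name: should for a union, must for an intersection. Here is a small sketch that builds either variant. The helper names are ours, and the default field names (`embeddings`, `text_field`) are assumptions you should replace with the fields from your own schema.

```python
from urllib.parse import urlencode


def knn_query(vector, field="embeddings", top_k=10):
    """Format a vector as a {!knn} query string."""
    values = ", ".join(repr(float(x)) for x in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


def hybrid_params(text, vector, mode="union", text_field="text_field"):
    """Return request parameters for a union (should) or intersection (must) query."""
    clause = {"union": "should", "intersection": "must"}[mode]
    return {
        "q": f"{{!bool {clause}=$lexicalQuery {clause}=$vectorQuery}}",
        "lexicalQuery": f"{{!type=edismax qf={text_field}}}{text}",
        "vectorQuery": knn_query(vector),
    }


if __name__ == "__main__":
    params = hybrid_params("term1", [0.001, -0.422, -0.284], mode="intersection")
    print(params["q"])
    print(urlencode(params))  # URL-encoded query string for a GET request
```

The vector here has three values only to keep the output readable. A real query vector must have as many values as the schema's vectorDimension.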
Be mindful of the Solr version you are using: we were only able to make the boolean hybrid query work on Solr 9.0. It did not work on Solr 9.6, where only reranking queries worked (as shown below).
At this point, here are all the parameters we sent to Solr to make this hybrid search work on Solr 9.0:
Classic Solr Edismax Search combined with dense vector search (as written, the must clauses make this an INTERSECTION; replace must with should for a UNION)
{
  "mm": "1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
  "df": "title",
  "ps": "3",
  "bf": "recip(rord(timestamp),1,1500,500)^90",
  "fl": "score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
  "start": "0",
  "fq": "+content_type:text*",
  "rows": "100",
  "vectorQuery": "{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
  "q": "{!bool must=$lexicalQuery must=$vectorQuery}",
  "qf": "title^10 description^5 uri^3 text^2 phonetic_title^0.1",
  "pf": "title^15 description^7 uri^9",
  "lexicalQuery": "{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
  "pf3": "text^5",
  "pf2": "tdescription^6"
}
Solr 9.6 reranking query. (It also works in Solr 9.0):
{
  "mm": "1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
  "df": "title",
  "ps": "3",
  "bf": "recip(rord(timestamp),1,1500,500)^90",
  "fl": "score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
  "start": "0",
  "fq": "+content_type:text*",
  "rows": "100",
  "q": "{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
  "rqq": "{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
  "qf": "title^10 description^5 uri^3 text^2 phonetic_title^0.1",
  "pf": "title^15 description^7 uri^9",
  "pf3": "text^5",
  "pf2": "tdescription^6",
  "rq": "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}"
}
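In the reranking request, the KNN query picks the candidates and the edismax query only re-scores the top reRankDocs of them. The sketch below is a simplified model of that arithmetic, assuming Solr's default additive behavior, where the rerank query's score is multiplied by reRankWeight and added to the original score. The function name and all scores are invented for illustration, and real Solr scores will differ.

```python
def rerank(knn_hits, lexical_scores, rerank_docs=100, rerank_weight=3.0):
    """Re-score the first `rerank_docs` hits and leave the rest in their original order.

    knn_hits: list of (doc_id, score) pairs, best first.
    lexical_scores: {doc_id: score} for documents that match the rerank query.
    """
    head, tail = knn_hits[:rerank_docs], knn_hits[rerank_docs:]
    rescored = [
        (doc, score + rerank_weight * lexical_scores.get(doc, 0.0))
        for doc, score in head
    ]
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + tail


if __name__ == "__main__":
    knn_hits = [("a", 0.92), ("b", 0.90), ("c", 0.85)]  # made-up similarity scores
    lexical_scores = {"b": 1.0, "c": 4.0}               # made-up edismax scores
    print([doc for doc, _ in rerank(knn_hits, lexical_scores)])
```

Documents that do not match the keyword query keep their vector score, so they stay in the result list but usually sink below the documents that match both.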
A few remarks:
🎹 This is based on the classic Opensolr Web Crawler Index, which does most of its work within the fields: title, description, text, uri.
📰 Index is populated with data crawled from various public news websites.
🔗 We embedded a concatenation of title, description and the first 50 sentences of text.
💼 We use the Opensolr Query Embed API to embed our query at search time.
🏃🏻‍♂️ You can see this search in action, here.
👩🏻‍💻 You can also see the Solr data and make your own queries on it. This index's Solr API is here.
📦 For content the crawler can't reach, the Data Ingestion API lets you push documents via REST — each document automatically gets BGE-m3 embeddings, sentiment analysis, and language detection.
🔐 Credentials are: Username: 123 / Password: 123 -> Enjoy! 🥳
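The embedding input mentioned in the remarks above (title, description, and the first 50 sentences of text) can be approximated with a few lines of code. This is only a sketch: we do not describe here how the sentences were split or joined, so the naive regex splitter, the newline separator, and the function name `embedding_input` are all assumptions.

```python
import re

# Naive splitter (an assumption): a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def embedding_input(title, description, text, max_sentences=50):
    """Concatenate title, description and the first `max_sentences` sentences of text."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    head = " ".join(sentences[:max_sentences])
    # Skip empty parts so a missing description does not leave a blank line.
    return "\n".join(part for part in (title, description, head) if part)


if __name__ == "__main__":
    body = "First sentence. Second one! A third? And a fourth."
    print(embedding_input("A title", "A description", body, max_sentences=2))
```

Capping the text keeps the input closer to the chunk sizes embedding models handle well, which is the "Chunky Business" point from earlier.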
Cheat Sheet
🤥 Below is a cheat sheet of the parameters and how you're supposed to use them when you run knn queries. Solr is very picky about what goes with knn and what doesn't. For example, we were unable to use highlighting with the Union query. But if you follow the specs below, you probably won't get any "Query can not be null" Solr errors... (or will you? 🤭)
What Belongs Inside {!edismax} in lexicalQuery? 🧾
| Parameter | Inside lexicalQuery | Why |
|---|---|---|
| q (the query text, passed as v=... or placed after the closing brace) | ✅ YES | Required for the subquery to function |
| qf, pf, bf, bq, mm, ps | ✅ YES | All edismax features must go inside |
| defType | ❌ NO | Already defined by {!edismax} |
| hl, spellcheck, facet, rows, start, sort | ❌ NO | These are top-level Solr request features |
💡 Hybrid Query Cheat Sheet
Here’s how to do it right when you want all the bells and whistles (highlighting, spellcheck, deep edismax):
# TOP-LEVEL BOOLEAN QUERY COMPOSING EDISMAX AND KNN
q={!bool should=$lexicalQuery should=$vectorQuery}
# LEXICAL QUERY: ALL YOUR EDISMAX STUFF GOES HERE (the v local param carries the query text)
&lexicalQuery={!edismax v=$qtext qf=$qf pf=$pf mm=$mm bf=$bf}
# VECTOR QUERY
&vectorQuery={!knn f=vectorField topK=10}[0.123, -0.456, ...]
# EDISMAX PARAMS
&qtext='flying machine'
&qf=title^6 description^3 text^2 uri^4
&pf=text^10
&mm=1<100% 2<75% 3<50% 6<30%
&bf=recip(ms(NOW,publish_date),3.16e-11,1,1)
# NON-QUERY STUFF
&hl=true
&hl.fl=text
&hl.q=$lexicalQuery
&spellcheck=true
&spellcheck.q=$qtext
&rows=20
&start=0
&sort=score desc
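The same split can be automated: edismax parameters get referenced from inside the {!edismax} subquery, and everything else stays at the top level. Below is a minimal sketch of that idea. The function name and the options dictionary are hypothetical, and the query text is passed through the `v` local param, which is the key Solr local params use for the query value.

```python
from urllib.parse import urlencode

# Parameters that belong inside the {!edismax} subquery.
EDISMAX_KEYS = ("qf", "pf", "pf2", "pf3", "bf", "bq", "mm", "ps")


def build_request(qtext, vector_query, options):
    """Split a flat options dict into edismax references and top-level parameters."""
    inner = {k: v for k, v in options.items() if k in EDISMAX_KEYS}
    outer = {k: v for k, v in options.items() if k not in EDISMAX_KEYS}
    if "defType" in outer:
        raise ValueError("defType is redundant: {!edismax} already selects the parser")
    refs = "".join(f" {key}=${key}" for key in inner)
    params = {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!edismax v=$qtext" + refs + "}",
        "vectorQuery": vector_query,
        "qtext": qtext,
    }
    params.update(inner)  # the values that the $references resolve to
    params.update(outer)  # hl, spellcheck, rows, start, sort, ...
    return params


if __name__ == "__main__":
    request = build_request(
        "flying machine",
        "{!knn f=vectorField topK=10}[0.123, -0.456]",  # shortened vector
        {"qf": "title^6 text^2", "mm": "1<100% 2<75%", "hl": "true", "rows": 20},
    )
    print(request["lexicalQuery"])
    print(urlencode(request))
```

Keeping the values as separate request parameters and referencing them with $name avoids having to escape spaces and carets inside the local params block.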
In Summary
Hybrid search gives you the sharp accuracy of keywords and the deep smarts of vectors—all in one system. With Solr, you can have classic reliability and modern magic. 🍦✨
"Why choose between classic and cutting-edge, when you can have both? Double-scoop your search!"
Opensolr: Hybrid Search Without the Complexity
Everything above — schema fields, embedding pipelines, boolean query composition, reranking — is what you wire up manually on your own Solr. On Opensolr, the hard parts are handled for you:
- Web Crawler Indexes: Hybrid search works out of the box. The crawler automatically generates 1024-dim BGE-m3 embeddings for every page it crawls. No schema setup, no embedding code. Point it at a URL and you have a hybrid search engine. (Web Crawler)
- One-Click Embeddings: Have an existing index? The Embedding API generates vectors for every document in your index with one call. No external model hosting, no batch scripts. (Index Embedding API)
- Search Tuning: A visual slider controls the keyword vs semantic balance per index (0.0 = pure keyword BM25, 3.0 = heavily semantic). The system also adapts dynamically based on query length. No config files, instant effect. (Search Tuning)
- Query Elevation: When the hybrid algorithm ranks something wrong for a specific query, pin the right document to the top or exclude the wrong one. Instant, no reindexing. (Query Elevation)
- Click Analytics: See which hybrid results users actually click. High impressions with low CTR means the keyword/semantic balance needs adjusting, or the result needs pinning. (Click Analytics)
- Data Ingestion API: Push documents from databases, APIs, or internal systems. Every document gets automatic BGE-m3 embeddings, sentiment analysis, and language detection, the same enrichment pipeline as the crawler. (Data Ingestion API)
- EDisMax + Vectors: The lexicalQuery examples in this guide use EDisMax for the keyword side. Same parameters (qf, pf, mm, ps), same behavior. Opensolr just removes the infrastructure work.
- Error Audit: Hybrid queries can fail silently (wrong vector dimensions, missing fields, malformed KNN syntax). Error Audit captures every Solr error from the last 7 days, parsed and searchable, so you catch these before users do.
The hybrid search concepts in this guide apply directly on Opensolr — same Solr, same query syntax. The difference is you skip the embedding infrastructure, the schema plumbing, and the scoring guesswork.
Happy hybrid searching! 🥳