Hybrid Search

Opensolr Hybrid Search — find answers to your questions


Step-by-Step Guide
Testing Your Opensolr AI Search Engine
Four powerful features ship with every Opensolr Web Crawler index — intent-based Vector Search, instant AI Hints, one-click Document Reader, and hands-on Query Elevation.
Crawl → Index → Embed → Solr → Search
Your complete AI search pipeline — fully managed, out of the box
Intent-Based Vector Search
Instead of matching exact keywords, vector search understands what you mean. A query like "winter hat" finds wool beanies, fleece earflap caps, and knit headwear — even when those exact words aren't on the page. Opensolr uses BGE-m3 embeddings (1024 dimensions) combined with traditional BM25 scoring for the best of both worlds: semantic understanding plus keyword precision.
Example: the query "winter hat" returns Wool Winter Cap (98%), Knit Beanie Set (94%), and Fleece Earflap Hat (89%), matched via BGE-m3 1024-dimensional vector embeddings.
Hybrid Scoring (BM25 + Vectors) · BGE-m3 1024-dim · Multilingual
AI Hints — Instant Answers from Your Content
Before your users even scroll through results, AI Hints delivers a concise, AI-generated answer right at the top of the page. It uses RAG (Retrieval-Augmented Generation) — the AI retrieves the most relevant passages from YOUR indexed content, then generates a focused answer. No hallucinations, no external data — every hint is grounded in your actual pages.
Example: for "best pellet heater for garage?", the AI Hint (retrieved from YOUR indexed content via RAG) might read: look for 40,000+ BTU models with a thermostat; ventilation is required for enclosed spaces; see the top-rated pellet heaters in the results below.
RAG-Powered · Grounded in Your Data · Zero Hallucinations
Document Reader — Summarize Any Search Result
Every search result includes a "Read" button. Click it, and the AI reads the entire web page, extracts the key information, and generates a clean summary — in seconds. You can then download the summary as a PDF. No need to visit the page, skim through ads, or parse dense content yourself.
Example: a result like "Best Pellet Heaters 2026 — Expert Reviews" (heatersguide.com/pellet-heaters-2026) gets a Read button; the AI Reader's page summary might read "Top 5 pellet heaters ranked by efficiency, noise level, and value. Castle 12327 rated best overall at $1,299...", with a Download PDF option.
One-Click Summaries · PDF Export · Key Feature Extraction
Query Elevation — Pin & Exclude Search Results
Take full control of what your users see. Query Elevation lets you pin important results to the top or exclude irrelevant ones — directly from the Search UI, with zero code and no reindexing required. Perfect for promoting landing pages, burying outdated content, or curating high-value queries.
Example: in the search results, a Product Landing Page (yoursite.com/products/best-seller) can be pinned to #1, forced to the top for this query, while an excluded result is hidden from it. Drag to reorder when multiple results are pinned.
  • Pin — Force a specific result to the top for a given search query
  • Exclude — Hide a result completely so it never appears for that query
  • Exclude All — Apply the rule globally, across every search query
  • Drag & drop — Reorder pinned results to control exactly which one shows first
Zero Code Required · Exclude Irrelevant Results · Pin & Reorder

Try It Live

Test these demo search engines with real vector search. Try conceptual, intent-based queries to see how vector similarity goes beyond keyword matching:

  • climate disasters hurricanes floods wildfires
  • space exploration mars colonization economy
  • ancient microbes life beyond earth

Every demo page includes built-in dev tools — query parameter inspector, full Solr debugQuery output, crawl statistics, and search analytics.


Using the Solr API Directly

Advanced users can query the Solr API directly, using the hybrid search techniques described below.

Example Solr endpoints (credentials: 123 / 123):

https://de9.solrcluster.com/solr/vector/select?wt=json&indent=true&q=*:*&rows=2
https://fi.solrcluster.com/solr/rueb/select?wt=json&indent=true&q=*:*&rows=2
https://chicago96.solrcluster.com/solr/peilishop/select?wt=json&indent=true&q=*:*&rows=2

Simple Lexical Query

curl -u 123:123 "https://de9.solrcluster.com/solr/vector/select?q=climate+change&rows=5&wt=json"

Pure Vector Query (KNN)

curl -u 123:123 "https://de9.solrcluster.com/solr/vector/select?q={!knn%20f=embeddings%20topK=50}[0.123,0.432,0.556,...]&wt=json"

Replace the vector array with your own embedding from the OpenSolr AI NLP API.

Hybrid Query (Lexical + Vector)

curl -u 123:123 "https://de9.solrcluster.com/solr/vector/select?q={!bool%20should=$lexicalQuery%20should=$vectorQuery}&lexicalQuery={!edismax%20qf=content}climate+change&vectorQuery={!knn%20f=embeddings%20topK=50}[0.12,0.43,0.66,...]&wt=json"

Combines traditional keyword scoring with semantic vector similarity — best of both worlds.


Getting Embeddings via OpenSolr API

Generate vector embeddings for any text using the Opensolr embed endpoint:

function postEmbeddingRequest($email, $api_key, $core_name, $payload) {
    $apiUrl = "https://api.opensolr.com/solr_manager/api/embed";
    $postFields = http_build_query([
        'email'      => $email,
        'api_key'    => $api_key,
        'index_name' => $core_name,
        'payload'    => is_array($payload) ? json_encode($payload) : $payload
    ]);

    $ch = curl_init($apiUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postFields,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/x-www-form-urlencoded'],
        CURLOPT_TIMEOUT        => 30,
    ]);

    $response = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on transport errors
    return $response === false ? null : json_decode($response, true);
}

The response includes the vector embedding array you can pass directly to Solr.
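The same round trip can be sketched in Python using only the standard library: take the embedding array from the embed response and wrap it in a `{!knn}` query string. This is an illustrative sketch, not part of any Opensolr SDK; `knn_param` and `solr_knn_search` are helper names of our own.

```python
import json
import urllib.parse
import urllib.request

def knn_param(embedding, field="embeddings", top_k=50):
    """Format an embedding list as a Solr {!knn} query parser string."""
    vector = ",".join(f"{v:g}" for v in embedding)
    return f"{{!knn f={field} topK={top_k}}}[{vector}]"

def solr_knn_search(solr_url, user, password, embedding, rows=5):
    """POST a pure-vector KNN query to Solr using HTTP basic auth."""
    data = urllib.parse.urlencode(
        {"q": knn_param(embedding), "rows": rows, "wt": "json"}
    ).encode()
    req = urllib.request.Request(solr_url, data=data)
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, solr_url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr)
    )
    with opener.open(req) as resp:
        return json.load(resp)
```

For example, `solr_knn_search("https://de9.solrcluster.com/solr/vector/select", "123", "123", embedding)` would run the pure vector query shown above, where `embedding` is the 1024-dimension array returned by the embed endpoint.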


Code Examples

PHP

<?php
$url = 'https://de9.solrcluster.com/solr/vector/select?wt=json';
$params = [
    'q'            => '{!bool should=$lexicalQuery should=$vectorQuery}',
    'lexicalQuery' => '{!edismax qf=content}climate disasters',
    'vectorQuery'  => '{!knn f=embeddings topK=50}[0.12,0.43,0.56,0.77]'
];

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERPWD, '123:123');
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($params));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

echo $response;

Python

import requests
from requests.auth import HTTPBasicAuth

url = "https://de9.solrcluster.com/solr/vector/select"
params = {
    'q': '{!bool should=$lexicalQuery should=$vectorQuery}',
    'lexicalQuery': '{!edismax qf=content}climate disasters',
    'vectorQuery': '{!knn f=embeddings topK=50}[0.12,0.43,0.56,0.77]',
    'wt': 'json'
}

response = requests.post(url, data=params, auth=HTTPBasicAuth('123', '123'))
print(response.json())

JavaScript (AJAX)

<script>
fetch('https://de9.solrcluster.com/solr/vector/select?wt=json&q={!knn%20f=embeddings%20topK=10}[0.11,0.22,0.33]', {
    headers: { 'Authorization': 'Basic ' + btoa('123:123') }
})
.then(r => r.json())
.then(console.log);
</script>

Quick Reference

  • Adjust topK to control how many similar results to retrieve (usually 20-100).
  • Use {!bool should=...} for softer relevance mixing — vector similarity has more influence on ranking.
  • For best hybrid results, always combine both lexical and vector queries.
  • All demo search pages include built-in query inspector, debugQuery, crawl stats, and search analytics.
Ready to Add AI Search to Your Site?
Get a fully managed vector search engine with AI Hints and Document Reader — set up in minutes.

Search Tuning — Per-Index Relevancy Controls

Search Relevancy
Search Tuning — Fine-Tune How Your Search Ranks Results
Every index is different. A news site needs freshness. A product catalog needs exact matches. A knowledge base needs semantic understanding. Search Tuning gives you visual controls to shape relevancy per index — no code, no config files, instant effect.
Example — default settings vs. after tuning, for the query "wireless headphones review":

Default settings:
  1. Sony WH-1000XM5 Review — posted yesterday, comprehensive review
  2. Best Wireless Headphones 2024 — old roundup from 2 years ago
  3. Headphone Buying Guide

After tuning:
  1. Sony WH-1000XM5 Review (FRESH) — posted yesterday, boosted by freshness
  2. Headphone Buying Guide — semantically relevant, matched by meaning
  3. Best Wireless Headphones 2024 — older content ranked lower

Freshness boost + semantic balance pushed the new review up and demoted stale content.

Where to Find It
Search Tuning lives inside Index Settings in your Opensolr dashboard. Open your index, click the gear icon or Index Settings, and expand the Search Tuning section. Every change saves automatically — move a slider, and your very next search uses the new settings.

The Six Controls

Field Weights
Control how much each field contributes to relevancy ranking. The four searchable fields are Title, Description, URI, and Text (full body). Use the master slider to quickly shift between title-focused ranking (great for navigational queries) and text-focused ranking (great for deep content search).
Default: Title 5.0, Description 4.0, URI 0.5, Text 0.01 — title-heavy. Drag the master slider right to give body text more influence, or type exact values into each field.
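These weights correspond directly to Solr's edismax `qf` parameter (the hybrid examples further down this page use the same syntax). A tiny illustrative helper, not Opensolr's internal code:

```python
def qf_param(weights):
    """Render a mapping of field weights as a Solr edismax qf parameter,
    e.g. {"title": 5.0} -> "title^5"."""
    return " ".join(f"{field}^{boost:g}" for field, boost in weights.items())

# The documented defaults would render as:
# qf_param({"title": 5.0, "description": 4.0, "uri": 0.5, "text": 0.01})
```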
Freshness Boost
How much newer content is preferred over older content. Higher values push recently published or updated pages toward the top. Uses the document's creation_date field with a time-decay curve — recent documents get the biggest boost, which fades over days and weeks.
Range: 10 (barely noticeable) to 1000 (aggressively fresh). Default: 100. Only applies when search mode is set to "Fresh" — standard search ignores this setting.
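Solr's `recip()` function is the usual way to express such a time decay: `recip(x,m,a,b) = a / (m*x + b)`. A sketch of the shape of the curve — the constants below are illustrative only, not Opensolr's actual tuning:

```python
def recip(x, m, a, b):
    """Solr's recip() function query: a / (m*x + b)."""
    return a / (m * x + b)

def freshness_boost(age_days, strength=100):
    """Illustrative decay over document age in days: a document published
    today gets the full boost, and the boost fades as age grows."""
    return strength * recip(age_days, 1, 1, 1)
```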
Minimum Match
How many of the user's search words must appear in a document for it to be considered a match. Three presets:
  • Flexible — some words can be missing; shows more results
  • Balanced — most words must match; a good middle ground
  • Strict — all words must match; fewest but most precise results
Default: System-managed (adapts automatically for vector indexes). Choose a preset to override.
Semantic vs Keyword Balance
Controls how much weight goes to semantic (vector) understanding versus exact keyword matching. Only available on vector-enabled indexes (those with embeddings in the schema). Move left for keyword-heavy results, right for semantic-heavy.
Range: 0.0 (pure keyword) to 3.0 (heavily semantic). Default: 1.5 — balanced. The system also adapts dynamically based on query length (longer queries get more semantic weight), but your override takes priority.
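Conceptually the balance is a weighted blend of the two scores. An illustrative sketch — `hybrid_score` and `adaptive_weight` are our own names, not Opensolr internals:

```python
def hybrid_score(lexical, vector, semantic_weight=1.5):
    """Blend a BM25 (lexical) score with a vector-similarity score.
    semantic_weight mirrors the 0.0-3.0 slider: 0.0 ignores vectors,
    higher values let semantic similarity dominate."""
    return lexical + semantic_weight * vector

def adaptive_weight(query, base=1.5):
    """Illustrative heuristic only: longer queries lean more semantic,
    capped at the slider's 3.0 maximum."""
    return min(3.0, base + 0.1 * max(0, len(query.split()) - 3))
```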
Result Quality Threshold
The minimum relevance score a document must reach to appear in results. Raise it to filter out weak matches and show only highly relevant results. Lower it to be more inclusive and show everything that has some match.
Range: 0.0 (show everything) to 1.0 (only near-perfect matches). Default: 0.60 — filters out low-relevance noise while keeping useful results.
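In effect the threshold is a post-filter on normalized scores; an illustrative sketch:

```python
def apply_threshold(results, threshold=0.60):
    """Drop results whose relevance score falls below the threshold.
    Scores here are assumed already normalized to the 0.0-1.0 range."""
    return [r for r in results if r["score"] >= threshold]
```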
Results Per Page
How many search results are returned in each page. Applies to both the Opensolr Search UI and API responses. Higher values show more results but increase response size.
Range: 10 to 200. Default: 50. Adjust based on your UI layout — grid layouts work well with 20-30, list layouts with 50+.

How It Works — Under the Hood

  1. You move a slider — change any control in the Search Tuning panel. The value saves automatically after a 400ms debounce; no Save button needed.
  2. Stored per index — your custom value is saved to your index configuration. A NULL value means "use system defaults," so resetting a control removes the override entirely.
  3. Applied on next search — when a search request comes in, the engine loads your custom values and applies them as overrides on top of the system defaults. No reindexing, no restart. The very next query uses your tuning.

Reset Behavior

Every control has its own Reset button that restores it to the system default. There's also a Reset All to Defaults button at the bottom of the panel that clears all customizations at once.

Reset Individual Control
Click the Reset button next to any control. The value goes back to system default and the override is removed from your index. System defaults include adaptive behavior — for example, vector indexes automatically adjust semantic weight based on query length.
Reset All to Defaults
Clears every custom value at once. Your index goes back to behaving exactly like it did before you opened Search Tuning. All adaptive behaviors are restored.

Quick Recipes

News Site Prioritize fresh articles
Set Freshness Boost to 500-800. Set Minimum Match to Flexible. Leave field weights at defaults — titles already have the highest weight, and news articles have strong titles.
Knowledge Base Semantic understanding first
Set Semantic vs Keyword to 2.0-2.5 (more semantic). Set Minimum Match to Flexible. Set Field Weights — increase Text weight to 1.0+ so body content has more influence. Freshness doesn't matter for evergreen docs, keep it low (10-30).
E-Commerce Exact product matches
Set Minimum Match to Strict — users searching for "blue wireless headphones" should see results with all three words. Keep Semantic at 1.0-1.5 so typos still work. Set Result Quality Threshold to 0.70+ to cut weak matches. Results Per Page at 20-30 for grid layouts.
Blog / Content Site Deep content discovery
Increase Text field weight to 0.5-1.0 (use the master slider toward "Text-focused"). Set Freshness at 100-200 for moderate recency bias. Minimum Match on Balanced. Semantic at 2.0 for natural-language queries that blog readers tend to use.

Defaults at a Glance

  • Title Weight: default 5.0, range 0 – 20
  • Description Weight: default 4.0, range 0 – 20
  • URI Weight: default 0.5, range 0 – 20
  • Text Weight: default 0.01, range 0 – 20
  • Freshness Boost: default 100, range 10 – 1,000
  • Minimum Match: default System-managed; options Flexible / Balanced / Strict
  • Semantic vs Keyword: default 1.5, range 0.0 – 3.0
  • Result Quality Threshold: default 0.60, range 0.0 – 1.0
  • Results Per Page: default 50, range 10 – 200

FAQ

Do I need to reindex after changing tuning settings?
No. Search Tuning controls are applied at query time, not index time. Your changes take effect on the very next search request.
What happens if I don't customize anything?
Everything stays at system defaults. The search engine uses battle-tested defaults that work well for most use cases, including adaptive behavior for vector indexes that adjusts parameters based on query length.
Does Semantic vs Keyword show up for all indexes?
No. It only appears on vector-enabled indexes — those using the embeddings field for semantic search. Non-vector indexes use pure keyword search, so the control isn't shown.
Does Freshness Boost always apply?
Only when the user searches with Fresh mode enabled (the "Fresh" toggle on the search UI, or fresh=yes in the API). Standard search does not apply freshness boosting regardless of this setting.
Can I set different tuning for different indexes?
Yes — that's the whole point. Every index has its own Search Tuning settings. A news index can have high freshness and flexible matching, while a product index on the same account has strict matching and low freshness. Each index is tuned independently.

Ready to Tune Your Search?
Open Index Settings in your dashboard and expand Search Tuning. Changes take effect on the very next search.


Comparison
Opensolr Web Crawler vs Algolia
You need search on your website — or you have data that needs to be searchable. Here's what that actually takes with each platform, and why Opensolr gives you a complete search engine while Algolia gives you an API and a to-do list.
How Opensolr Works — From Zero to Full Search in Minutes
  1. Create Index: name your index with the __dense suffix for vector search support
  2. Paste Your URL: enter your website URL; configure scope, depth and follow rules (or just use defaults)
  3. Start Crawl: click Start; the multi-threaded crawler with JS rendering handles HTML, PDF, DOCX, Excel, PPT and more
  4. Monitor: crawl stats show progress, pages crawled, errors and status codes

What You Get — Included, No Extra Cost
  • Hybrid Search: Vector + Lexical + RRF, 3 tunable modes
  • AI Search Hints: streaming LLM answers from your own content
  • Full Search UI: dark/light, facets, infinite scroll, mobile-ready, embed code
  • Query Elevation: pin and exclude results per query or globally
  • Analytics: top queries, zero-results, click tracking and CTR
  • JS Rendering: auto-detects React, Next.js, Angular, Vue and more
  • 21 File Formats: PDF, DOCX, XLSX, PPTX, ODT, RTF, MSG and more
  • Price Extraction: auto-extracts prices with range slider
  • Spellcheck: "Did you mean?" plus vector semantic understanding
  • Data Ingestion API: push JSON or upload files, up to 100 docs per batch
  • Dedup Protection: URI-based document identity auto-rejects duplicates
  • Rich Text Extraction: PDF, DOCX, PPTX, ODT auto-extracted via API

All included. Fixed monthly price. No per-query charges. No per-record fees. Crawl your website, push data via API, or both. Same index, same search, same everything.

What It Actually Takes to Get Search Working
The real comparison isn't features — it's effort.
Algolia
Steps to get search on your website
  1. Sign up for an Algolia account
  2. Read their API documentation
  3. Structure your content as JSON records
  4. Write code to push records via their API
  5. Build a frontend UI with InstantSearch widgets
  6. Configure ranking and relevance rules
  7. Set up analytics (paid add-on on some plans)
  8. Write update scripts when content changes
  9. Hope your bill doesn't spike next month
Developer required. Weeks of integration work.
Opensolr Web Crawler
Steps to get search on your website
  1. Create an Opensolr Index (add the __dense suffix)
  2. Paste your website URL
  3. Click Start Crawl
Done.
Full hybrid search, AI hints, analytics, elevation — all live.
No developer needed. Minutes, not weeks.
Optional: tune scope, schedule recrawls, customize embed code, pin results, read analytics — but none of that is required to be up and running.
Fixed price. No surprises. No per-query tax.
Feature-by-Feature Breakdown
Everything Opensolr includes out of the box — no add-ons, no extra cost.
1 Zero-Code Web Crawling
Opensolr
  • Paste your URL, click Start — that's it
  • Multi-threaded crawler with intelligent JS rendering
  • Three-tier rendering pipeline: curl-cffi, httpx, Playwright headless Chromium
  • Auto-detects SPAs (React, Next.js, Angular, Vue, Nuxt, SvelteKit, Gatsby)
  • Crawls 21 MIME types: HTML, PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, MSG, and more
  • Robots.txt obedience, spider trap detection, sitemap following
  • Configurable scope: domain, subdomain, path, or full web
  • Scheduled recrawls (hourly, daily, weekly)
Algolia
  • No built-in crawler — you must write code to push records via their API
  • Algolia Crawler exists as a separate paid product with limited features
  • You structure your data as JSON and maintain push scripts
  • When your content changes, you update and re-push — manually or via custom scripts
  • No document format extraction — want PDF search? Build your own pipeline
2 True Hybrid Search — Three Modes, Full Control
Opensolr
  • Vector mode: Normalized weighted sum of lexical + vector scores with tunable weights and log normalization
  • RRF mode: Reciprocal Rank Fusion — two separate requests merged mathematically for the best of both worlds
  • Solr mode: Lexical-first search with vector reranking — precision-focused
  • 1024-dimensional multilingual embeddings (50+ languages) on title and description fields
  • KNN cosine similarity on dense vectors
  • Per-field boost weights, phrase matching multipliers, minimum match tuning
  • Typos? The vector model understands what you meant, not just what you typed — semantic understanding makes traditional typo tolerance look primitive
  • Plus spellcheck with "Did you mean?" suggestions on top of that
Algolia
  • "NeuralSearch" exists but it's a black box — no control over modes, weights, or normalization
  • No user-tunable hybrid parameters
  • No choice between search strategies
  • Typo tolerance is good, but it only handles character-level errors — it doesn't understand meaning
  • Cannot tune field weights, phrase boosting, or minimum match
3 AI-Powered Search Summaries
Opensolr
  • Streaming AI hints powered by a GPU-accelerated LLM
  • Context-aware: sends top results (title + description + content) to the LLM
  • Real-time Server-Sent Events streaming directly in the search UI
  • Answers appear as the user searches — no extra clicks, no separate page
  • Built on your own indexed content, not hallucinated from training data
Algolia
  • No built-in LLM integration
  • To get AI summaries, you'd build your own RAG pipeline on top of Algolia
  • That means another service, another API, another bill
4 Complete Search UI — Ready to Embed
Opensolr
  • Full themed search page with dark and light modes
  • Infinite scroll or traditional pagination
  • Faceted navigation (language, locale, source, custom facets)
  • OG image previews, favicons, content type icons
  • Mobile-responsive out of the box
  • Configurable via URL parameters — no code needed
  • One-line embed code: drop an iframe and you're done
Algolia
  • Provides InstantSearch.js widget library for React, Vue, Angular
  • YOU assemble the UI from components
  • More flexible for developers, but far more work for everyone else
  • No ready-to-embed, zero-code search page
5 Full Analytics Suite — Built In, Not Upsold
Opensolr
  • Query Analytics: Top queries, daily trends, query length distribution, CSV export
  • Zero-Results Dashboard: Every zero-result query tracked by unique IP — find your content gaps
  • Click Analytics with CTR: Track which results get clicked, click-through rates per query, detect low-CTR queries that need better results
  • Bulk management: Select and delete junk/test queries across all tabs
  • All included in every plan
Algolia
  • Analytics exists but is a paid add-on on higher tiers
  • Click analytics requires additional client-side integration code
  • No zero-result tracking out of the box
  • You pay more to understand how your own users search
6 Query Elevation — Pin, Exclude and Curate Results
Opensolr
  • Pin specific documents to the top for specific queries
  • Exclude documents from appearing for specific queries
  • Global wildcard rules that apply to ALL queries
  • Visual elevation bar directly on the search results page
  • One-click pin/exclude while browsing results — no context switching
Algolia
  • Has "Rules" for pinning and hiding results
  • But the Rules UI is separate from search results — you can't pin while searching
  • No global wildcard rules
  • More cumbersome workflow for result curation
7 Automatic Price Extraction and Filtering
Opensolr
  • Crawler automatically extracts prices from JSON-LD, microdata, and meta tags
  • Price range slider in the search UI (no code needed)
  • Sort by price (ascending, descending, or by relevance)
  • Currency detection and display
  • Works for e-commerce sites out of the box
Algolia
  • You must manually structure price data in your JSON records
  • No automatic extraction from web pages
  • Price faceting available but requires manual schema design
8 21 Document Formats — Crawled and Indexed Automatically
Opensolr
  • HTML, PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, MSG (Outlook email), plain text, XML, RSS, JSON
  • Full text extraction with metadata preservation
  • Document reader lets users view extracted content inline without leaving search results
  • The crawler handles everything — you don't convert, parse, or pre-process anything
Algolia
  • Only indexes JSON records you push via API
  • Want to search PDFs? Build a PDF extraction pipeline yourself
  • Want to search Word documents? Same story
  • Every non-HTML format is your problem to solve
9 Sentiment Analysis and Language Detection
Opensolr
  • VADER sentiment scoring on every crawled page (positive, negative, neutral, compound)
  • Language detection via langid (50+ languages)
  • Language and locale facets in the search UI
  • All automatic — no configuration needed
Algolia
  • No sentiment analysis
  • Basic language detection but nothing automatic or enriching
10 Spellcheck, Stemming and Text Analysis
Opensolr
  • "Did you mean?" spellcheck suggestions
  • Edge n-grams for instant prefix matching (autocomplete)
  • ASCII folding for accent-insensitive search (cafe = café)
  • Stemming and synonym support
  • And on top of all that, vector search that understands meaning regardless of exact spelling
Algolia
  • Typo tolerance is solid (one of Algolia's strengths)
  • But it's a black box — no tuning available
  • No vector-level semantic understanding of typos
11 URL Exclusion and Content Control
Opensolr
  • Exclude specific URL patterns from search results via search.xml config
  • Regex-based exclusion patterns
  • Combined with Query Elevation for full result curation
Algolia
  • No URL exclusion mechanism — you'd remove records via API calls
  • Content control is code-driven, not configuration-driven
12 Predictable Pricing — No Per-Query Tax
Opensolr
  • Fixed monthly pricing — search all you want
  • No per-search-request charges
  • No per-record-per-month charges
  • Everything included: crawling, hybrid search, AI hints, analytics, elevation, search UI
  • Your bill this month is the same as next month
Algolia
  • Charges per search request ($1/1,000 searches on some plans)
  • Charges per record per month
  • Analytics, AI, and advanced features are paid add-ons
  • A traffic spike can make your bill jump 5-10x overnight
  • Algolia's pricing page is deliberately confusing — good luck figuring out your actual cost before you're committed
See our plans: Opensolr Pricing
13 Your Data, Your Infrastructure
Opensolr
  • Data lives on dedicated Solr clusters
  • No vendor lock-in — standard Apache Solr under the hood
  • Master-replica architecture with automatic failover
  • Full Solr API access to your index
  • Can migrate to self-hosted Solr at any time — your schema, your data, your rules
Algolia
  • Proprietary engine — your data is in their cloud, in their format
  • Migration out requires rebuilding everything from scratch
  • No standard API compatibility with anything else
  • You're locked in the moment you integrate
14 Data Ingestion API — Push Any Data Into Your Index Live
Opensolr
  • POST JSON payloads with up to 100 documents per batch
  • Or upload a .json file — ideal for large batches from CMS exports or data pipelines
  • URI-based document identity: every document needs a URI, and the document ID is always md5(uri). Same URI = same document. Resubmit to update.
  • Automatic dedup protection: duplicate URIs already in the queue are rejected before processing
  • Rich text extraction: set rtf: true and pass a URL to a PDF, DOCX, PPTX, ODT, XLSX, or RTF — Opensolr extracts the text for you
  • Full Solr error reporting per document — know exactly which document failed and why
  • Returns doc_ids array in every response for tracking
  • All the same features: dense vectors, hybrid search, AI hints, query elevation, analytics, the full search UI
  • Complete PHP, Python and cURL examples in the documentation
Algolia
  • JSON record push via API — this is their only method of getting data in
  • No file upload — you must always construct and send JSON programmatically
  • No built-in rich text extraction — want to index a PDF? Extract it yourself first
  • No URI-based dedup — you manage document identity and deduplication in your own code
  • No integrated queue or job status — you build your own pipeline
  • Per-record-per-month charges on top of everything else
The Bottom Line
Algolia is an API. You still need to build everything around it — the crawling, the extraction, the UI, the AI layer. Opensolr is the entire search engine, ready to go. Crawl your website with zero code, push structured data via the Data Ingestion API, or do both at once — and you get hybrid vector search, AI summaries, analytics, query elevation, rich text extraction, dedup protection, and a complete search UI — all for a fixed monthly price with no per-query surprises.

For the price of a pizza, you get what would take a team of developers weeks to build on top of Algolia.

Hybrid Search in Opensolr: A Modern Approach

Hybrid Search in Apache Solr: Modern Power, Classic Roots

The Evolution of Search: From Keywords to Vectors 🔍➡️🧠

Important Prerequisites

First, make sure you have this embeddings field in your schema.xml:
<!--VECTORS-->
<field name="embeddings" type="vector" indexed="true" stored="true" multiValued="false" required="false" />
<fieldType name="vector" class="solr.DenseVectorField" vectorDimension="1024" similarityFunction="cosine"/>

⚠️ Pay very close attention to vectorDimension: it must match the dimension of the embeddings produced by your model. If you use the Opensolr Index Embedding API, it must be exactly 1024, since the Opensolr Embed API endpoint uses the BAAI/bge-m3 embedding model.
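Once the schema has the field, a document carrying a vector indexes like any other. An illustrative Python helper (a real vector must have exactly 1024 values; 3 are shown for brevity):

```python
import json

def solr_update_payload(doc_id, title, embedding):
    """Build a JSON update body for Solr's /update endpoint.
    The embeddings list length must equal the schema's vectorDimension
    (1024 for the Opensolr embed API)."""
    return json.dumps([{
        "id": doc_id,
        "title": title,
        "embeddings": embedding,  # list of floats
    }])
```

POST the resulting body to your index's /update?commit=true endpoint with Content-type: application/json.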


Opensolr also supports the native Solr /schema API, so you can run these two requests to add the field type and field to your schema.xml:
$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>/solr/<OPENSOLR_INDEX_NAME>/schema -H 'Content-type:application/json' -d '{
  "add-field-type": {
    "name": "vector",
    "class": "solr.DenseVectorField",
    "vectorDimension": 1024,
    "similarityFunction": "cosine"
  }
}'

$ curl -u <INDEX_USERNAME>:<INDEX_PASSWORD> https://<OPENSOLR_INDEX_HOST>/solr/<OPENSOLR_INDEX_NAME>/schema -H 'Content-type:application/json' -d '{
  "add-field": {
    "name": "embeddings",
    "type": "vector",
    "indexed": true,
    "stored": false,
    "multiValued": false,
    "required": false
  }
}'

Set "stored": true if you want to see the vectors for debugging. Note that vectorDimension and similarityFunction belong on the field type, not on the field itself; adjust vectorDimension in the field type to match your embedder's output size.

Second, make sure you have this in solrconfig.xml, so that atomic updates work with the Opensolr Index Embedding API:
<!-- The default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">
      
        <updateLog>
          <int name="numVersionBuckets">65536</int>
          <int name="maxNumLogsToKeep">10</int>
          <int name="numRecordsToKeep">10</int>
        </updateLog>

.....

</updateHandler>
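With the updateLog in place, an embedding can be attached to an existing document via a Solr atomic update. A minimal sketch (the field name follows the schema above; note that atomic updates require the document's other fields to be stored or have docValues):

```python
import json

def atomic_embedding_update(doc_id, embedding):
    """Build a Solr atomic-update body that replaces only the embeddings
    field via the "set" modifier, leaving other stored fields intact."""
    return json.dumps([{"id": doc_id, "embeddings": {"set": embedding}}])
```

POST this body to the index's /update?commit=true endpoint, just like a normal document update.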

Why Vector Search Isn’t a Silver Bullet ⚠️

As much as we love innovation, vector search still has a few quirks:

  • Mystery Rankings: Why did document B leapfrog document A? Sometimes, it’s anyone’s guess. 🕳️
  • Chunky Business: Embedding models are picky eaters—they work best with just the right size of text chunks.
  • Keyword Nostalgia: Many users still expect the comfort of exact matches. “Where’s my keyword?” they ask. (Fair question!)

Hybrid Search: The Best of Both Worlds 🤝

Hybrid search bridges the gap—combining trusty keyword (lexical) search with smart vector (neural) search for results that are both sharp and relevant.

How It Works

  1. Double the Fun: Run a classic keyword query and a KNN vector search at the same time, creating two candidate lists.
  2. Clever Combining: Merge and rank for maximum “aha!” moments.
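One common way to do the "clever combining" step is Reciprocal Rank Fusion, the RRF mode mentioned earlier on this page. A sketch of the textbook algorithm, not Opensolr's internal implementation:

```python
def rrf_merge(lexical_ids, vector_ids, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank).
    k=60 is the constant commonly used in the RRF literature. Documents
    appearing in both lists accumulate score from each, so agreement
    between keyword and vector ranking pushes a document up."""
    scores = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked #2 lexically and #1 by vector similarity outranks a document that only tops one of the two lists.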

Apache Solr Does Hybrid Search (Despite the Rumors) 💡

Contrary to the grapevine, Solr can absolutely do hybrid search—even if the docs are a little shy about it. If your schema mixes traditional fields with a solr.DenseVectorField, you’re all set.


Candidate Selection: Boolean Query Parser to the Rescue 🦸‍♂️

Solr’s Boolean Query Parser lets you mix and match candidate sets with flair:

Union Example

q={!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]

Result: All unique hits from both searches. No duplicates, more to love! ❤️

Intersection Example

q={!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery={!type=edismax qf=text_field}term1&
vectorQuery={!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]

Result: Only the most relevant docs—where both worlds collide. 🤝


Also be mindful of the Solr version you are using: we were only able to make this work on Solr version 9.0. Beware, it did not work on Solr 9.6, where only reranking queries worked (as shown below).

Here are all the parameters we sent to Solr to make this hybrid search work on Solr version 9.0:

Classic Solr Edismax Search combined with dense vector search (UNION)

{
  "mm":"1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
  "df":"title",
  "ps":"3",
  "bf":"recip(rord(timestamp),1,1500,500)^90",
  "fl":"score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
  "start":"0",
  "fq":"+content_type:text*",
  "rows":"100",
  "vectorQuery":"{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
  "q":"{!bool must=$lexicalQuery must=$vectorQuery}",
  "qf":"title^10 description^5 uri^3 text^2 phonetic_title^0.1",
  "pf":"title^15 description^7 uri^9",
  "lexicalQuery":"{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
  "pf3":"text^5",
  "pf2":"tdescription^6"
}

Solr 9.6 reranking query (it also works on Solr 9.0):

{
  "mm":"1<100% 2<70% 3<45% 5<30% 7<20% 10<10%",
  "df":"title",
  "ps":"3",
  "bf":"recip(rord(timestamp),1,1500,500)^90",
  "fl":"score,meta_file_modification_date*,score,og_image,id,uri,description,title,meta_icon,content_type,creation_date,timestamp,meta_robots,content_type,meta_domain,meta_*,text",
  "start":"0",
  "fq":"+content_type:text*",
  "rows":"100",
  "q":"{!knn f=embeddings topK=100}[-0.024160323664546,...,0.031963128596544]",
  "rqq":"{!edismax qf=$qf bf=$bf ps=$ps pf=$pf pf2=$pf2 pf3=$pf3 mm=$mm}trump tariffs",
  "qf":"title^10 description^5 uri^3 text^2 phonetic_title^0.1",
  "pf":"title^15 description^7 uri^9",
  "pf3":"text^5",
  "pf2":"tdescription^6",
  "rq":"{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}"
}
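If you assemble these requests in code, the moving parts are just three query parameters referencing each other. Here is a minimal Python sketch (field names and boosts are taken from the example above; this is illustrative, not the Opensolr client):

```python
def hybrid_params(user_query, vector, top_k=100):
    """Build the Solr params for the bool-query hybrid search (Solr 9.0 style).

    `vector` is the query embedding, already computed by your embedder.
    """
    vec = "[" + ",".join(f"{v:.6f}" for v in vector) + "]"
    return {
        # top-level boolean query composing the two candidate generators
        "q": "{!bool must=$lexicalQuery must=$vectorQuery}",
        # lexical side: all edismax knobs stay inside the subquery
        "lexicalQuery": "{!edismax qf=$qf mm=$mm}" + user_query,
        # vector side: knn over the dense vector field
        "vectorQuery": f"{{!knn f=embeddings topK={top_k}}}{vec}",
        "qf": "title^10 description^5 uri^3 text^2",
        "mm": "1<100% 2<70% 3<45%",
        "rows": 100,
    }
```

POST the resulting dict as form parameters (or inside the JSON Request API's params block); swap must for should to get the union instead of the intersection.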

A few remarks:

🎹 This is based on the classic Opensolr Web Crawler Index, which does most of its work within the fields: title, description, text, uri.

📰 Index is populated with data crawled from various public news websites.

🔗 We embedded a concatenation of title, description and the first 50 sentences of text.

💼 We use the Opensolr Query Embed API to embed our query at search-time.

🏃🏻‍♂️ You can see this search in action here.

👩🏻‍💻 You can also see the Solr data and make your own queries on it. This index's Solr API is here.

🔐 Credentials are: Username: 123 / Password: 123 -> Enjoy! 🥳
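For the curious, building the string we embed per document (title + description + the first 50 sentences of text) can be sketched like this (a naive sentence splitter; the production crawler may differ):

```python
import re

def text_to_embed(doc, max_sentences=50):
    """Concatenate title, description and the first N sentences of the body text."""
    sentences = re.split(r"(?<=[.!?])\s+", doc.get("text", "").strip())
    head = " ".join(sentences[:max_sentences])
    parts = [doc.get("title", ""), doc.get("description", ""), head]
    return " ".join(p for p in parts if p)
```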


Cheat Sheet

🤥 Below is a cheat sheet of the parameters and where they belong when you run knn queries. Solr is very picky about what goes with knn and what doesn't. For example, for the union query we were unable to use highlighting. But if you follow the specs below, you probably won't be getting any "Query can not be null" Solr errors... (or will you? 🤭)


What Belongs Inside {!edismax} in lexicalQuery? 🧾

| Parameter                                 | Inside lexicalQuery? | Why                                      |
|-------------------------------------------|----------------------|------------------------------------------|
| q                                         | ✅ YES               | Required for the subquery to function    |
| qf, pf, bf, bq, mm, ps                    | ✅ YES               | All edismax features must go inside      |
| defType                                   | ❌ NO                | Already defined by {!edismax}            |
| hl, spellcheck, facet, rows, start, sort  | ❌ NO                | These are top-level Solr request features |

💡 Hybrid Query Cheat Sheet

Here’s how to do it right when you want all the bells and whistles (highlighting, spellcheck, deep edismax):

# TOP-LEVEL BOOLEAN QUERY COMPOSING EDISMAX AND KNN
q={!bool should=$lexicalQuery should=$vectorQuery}

# LEXICAL QUERY: ALL YOUR EDISMAX STUFF GOES HERE
&lexicalQuery={!edismax v=$qtext qf=$qf pf=$pf mm=$mm bf=$bf}

# VECTOR QUERY
&vectorQuery={!knn f=vectorField topK=10}[0.123, -0.456, ...]

# EDISMAX PARAMS
&qtext='flying machine'
&qf=title^6 description^3 text^2 uri^4
&pf=text^10
&mm=1<100% 2<75% 3<50% 6<30%
&bf=recip(ms(NOW,publish_date),3.16e-11,1,1)

# NON-QUERY STUFF
&hl=true
&hl.fl=text
&hl.q=$lexicalQuery
&spellcheck=true
&spellcheck.q=$qtext
&rows=20
&start=0
&sort=score desc

In Summary

Hybrid search gives you the sharp accuracy of keywords and the deep smarts of vectors—all in one system. With Solr, you can have classic reliability and modern magic. 🍦✨

“Why choose between classic and cutting-edge, when you can have both? Double-scoop your search!”

Happy hybrid searching! 🥳

Read Full Answer

Using NLP Models

🧠 Using NLP Models in Your Solr schema_extra_types.xml

Leverage the power of Natural Language Processing (NLP) right inside Solr!
With built-in support for OpenNLP models, you can add advanced tokenization, part-of-speech tagging, named entity recognition, and much more—no PhD required.


🚀 Why Use NLP Models in Solr?

Integrating NLP in your schema allows you to:

  • Extract nouns, verbs, or any part-of-speech you fancy.
  • Perform more relevant searches by filtering, stemming, and synonymizing.
  • Create blazing-fast autocomplete and suggestion features via EdgeNGrams.
  • Support multi-language, linguistically smart queries.

In short: your Solr becomes smarter and your users get better search results.


⚙️ Example: Dutch Edge NGram Nouns Field

Here’s a typical fieldType in your schema_extra_types.xml using OpenNLP:

<fieldType name="text_edge_nouns_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/nl-sent.bin" tokenizerModel="/opt/nlp/nl-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/nl-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_edge_nouns_nl.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_edge_nouns_nl.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
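Before indexing anything, you can sanity-check the chain through Solr's field analysis endpoint (core name and sample text are placeholders):

```
GET /solr/<core>/analysis/field
    ?analysis.fieldtype=text_edge_nouns_nl
    &analysis.fieldvalue=De snelle bruine vos springt over de luie hond
```

The response shows the token stream after each tokenizer and filter, so you can verify that only nouns survive the whitelist.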

🔎 Important Details

  • Model Paths:
    Always reference the full absolute path for NLP model files. For example:

    sentenceModel="/opt/nlp/nl-sent.bin"
    tokenizerModel="/opt/nlp/nl-token.bin"
    posTaggerModel="/opt/nlp/nl-pos-maxent.bin"
    

    This ensures Solr always finds your precious language models—no “file not found” drama!

  • Type Token Filtering:
    The TypeTokenFilterFactory with useWhitelist="true" will only keep tokens matching the allowed parts of speech (like nouns, verbs, etc.), as defined in pos_edge_nouns_nl.txt. This keeps your index tight and focused.

  • Synonym Graphs:
    Add SynonymGraphFilterFactory to enable query-side expansion. This is great for handling multiple word forms, synonyms, and local lingo.
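For reference, the two sidecar files above are plain text, one entry per line. The entries below are purely illustrative: the POS tag names must match whatever tagset your Dutch model actually emits, and synonyms use the standard Solr syntax:

```
# pos_edge_nouns_nl.txt — token types to keep (whitelist)
N
NN
NNS

# synonyms_edge_nouns_nl.txt — standard Solr synonym syntax
fiets, rijwiel
auto, wagen
```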


🧑‍🔬 Best Practices & Gotchas

  • Keep your NLP model files up to date and tested for your language version!
  • If using multiple languages, make sure you have the right models for each language. (No, Dutch models won’t help with Klingon. Yet.)
  • EdgeNGram and NGram fields are fantastic for autocomplete—but don’t overdo it, as they can bloat your index if not tuned.
  • Use RemoveDuplicatesTokenFilterFactory to keep things clean and efficient.

🌍 Not Just for Dutch!

You can set up similar analyzers for English, or for any other language you like. For example:

<fieldType name="text_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="/opt/nlp/en-sent.bin" tokenizerModel="/opt/nlp/en-token.bin"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="/opt/nlp/en-pos-maxent.bin"/>
    <filter class="solr.TypeTokenFilterFactory" types="pos_nouns_en.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_nouns_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

📦 Keep It Organized

  • Store all model files in a single, logical directory (like /opt/nlp/), and keep a README so you know what’s what.
  • Protect those models! They’re your “brains” for language tasks.

🛠️ Wrap-up

Using NLP models in your Solr analyzers will supercharge your search, make autocomplete smarter, and help users find what they’re actually looking for (even if they type like my cat walks on a keyboard).

Need more examples?
Check out the Solr Reference Guide - OpenNLP Integration or Opensolr documentation.


Happy indexing, and may your tokens always be well-typed! 😸🤓

Read Full Answer

How to use OpenNLP (NER) with Opensolr

UPDATE Oct 29, 2024: OpenNLP + Opensolr Integration Guide

Heads up!
Before you dive into using NLP models with your Opensolr index, please contact us to request the NLP models to be installed for your Opensolr index.
We'll reply with the correct path to use for the .bin files in your schema.xml or solrconfig.xml. Or, if you'd rather avoid all the hassle, just ask us to set it up for you—done and done.


What’s this all about?

This is your step-by-step guide to using AI-powered OpenNLP models with Opensolr. In this walkthrough, we’ll cover Named Entity Recognition (NER) using default OpenNLP models, so you can start extracting valuable information (like people, places, and organizations) directly from your indexed data.

⚠️ Note:
Currently, these models are enabled by default only in the Germany, Solr Version 9 environment. So, if you want an easy life, create your index there!
We’re happy to set up the models in any region (or even your dedicated Opensolr infrastructure for corporate accounts) if you reach out via our Support Helpdesk.

Add New Opensolr Index

You can also download OpenNLP default models from us or the official OpenNLP website.


🛠️ Step-by-Step: Enable NLP Entity Extraction

  1. Create your Opensolr Index

    • Use this guide to create your Opensolr index (Solr 7, 8, or 9).
    • Pro Tip: Creating your index in the Germany Solr 9 Web Crawler Environment skips most of the manual steps below.
  2. Edit Your schema.xml

    • Go to the Opensolr Control Panel.
    • Click your Index Name → Configuration tab → select schema.xml to edit.
    Edit schema.xml
    • Add these snippets:

      Dynamic Field (for storing entities):

<dynamicField name="*_s" type="string" multiValued="true" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" storeOffsetsWithPositions="true" />
      NLP Tokenizer fieldType:
<fieldType name="text_nlp" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
            sentenceModel="en-sent.bin"
            tokenizerModel="en-token.bin"/>
         <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
         <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="en-chunker.bin"/>
         <filter class="solr.TypeAsPayloadFilterFactory"/>
     </analyzer>
 </fieldType>
    • Important: Don’t use the text_nlp type for your dynamic fields! It’s only for the update processor.
  3. Save, then Edit Your solrconfig.xml

    Save schema.xml
    • Add the following updateRequestProcessorChain (and corresponding requestHandler):
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
    <lst name="defaults">
        <str name="update.chain">nlp</str>
    </lst>
</requestHandler>
<updateRequestProcessorChain name="nlp">
    <!-- Extract English People Names -->
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-ner-person.bin</str>
        <str name="analyzerFieldType">text_nlp</str>
        <arr name="source">
            <str>title</str>
            <str>description</str>
        </arr>
        <str name="dest">people_s</str>
    </processor>
    <!-- Extract Spanish People Names -->
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">es-ner-person.bin</str>
        <str name="analyzerFieldType">text_nlp</str>
        <arr name="source">
            <str>title</str>
            <str>description</str>
        </arr>
        <str name="dest">people_s</str>
    </processor>
    <!-- Extract Locations -->
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-ner-location.bin</str>
        <str name="analyzerFieldType">text_nlp</str>
        <arr name="source">
            <str>title</str>
            <str>description</str>
        </arr>
        <str name="dest">location_s</str>
    </processor>
    <!-- Extract Organizations -->
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
        <str name="modelFile">en-ner-organization.bin</str>
        <str name="analyzerFieldType">text_nlp</str>
        <arr name="source">
            <str>title</str>
            <str>description</str>
        </arr>
        <str name="dest">organization_s</str>
    </processor>
    <!-- Language Detection -->
    <processor class="org.apache.solr.update.processor.OpenNLPLangDetectUpdateProcessorFactory">
        <str name="langid.fl">title,text,description</str>
        <str name="langid.langField">language_s</str>
        <str name="langid.model">langdetect-183.bin</str>
    </processor>
    <!-- Remove duplicate extracted entities -->
    <processor class="solr.UniqFieldsUpdateProcessorFactory">
        <str name="fieldRegex">.*_s</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
  4. Populate Test Data (for the impatient!)

    • If you’re using the Germany Solr 9 Web Crawler, you can crawl your site and extract all the juicy entities automatically.
    • Or, insert a sample doc via Solr Admin:
    Solr Admin Panel Add Docs to Solr Index

    Sample JSON:

{
    "id": "1",
    "title": "Jack Sparrow was a pirate. Many feared him. He used to live in downtown Las Vegas.",
    "description": "Jack Sparrow and Janette Sparrowa, are now on their way to Monte Carlo for the summer vacation, after working hard for Microsoft, creating the new and exciting Windows 11 which everyone now loves. :)",
    "text": "The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.Learn more about how you can get involved."
}
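If you'd rather skip the Admin UI, a document can also be posted straight to the update handler from the command line; the nlp chain runs automatically because it's wired into /update (host and core are placeholders):

```
curl -X POST -H 'Content-Type: application/json' \
  'https://<your-solr-host>/solr/<core>/update?commit=true' \
  --data-binary '[{"id":"1","title":"Jack Sparrow was a pirate. He used to live in downtown Las Vegas."}]'
```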
  5. See the Magic!

    • Visit the query tab to see extracted entities in action!
    Solr Query Opensolr NLP End Result
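Because the extracted entities end up in ordinary multi-valued string fields, they also make great facets, so you can build an entity overview in a single request (host and core are placeholders):

```
GET /solr/<core>/select
    ?q=*:*
    &rows=0
    &facet=true
    &facet.field=people_s
    &facet.field=location_s
    &facet.field=organization_s
```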

Need a hand?

If any step trips you up, contact us and we'll gladly assist you—whether it's model enablement, schema help, or just a friendly chat about Solr and AI. 🤝


Happy Solr-ing & entity extracting!

Read Full Answer