Opensolr Web Crawler — Site Search Solution
The Opensolr Web Crawler crawls your entire site, extracts structured data, generates vector embeddings, and feeds everything into Solr — fully indexed and ready to search. No manual config. No fiddling with schemas. Just point it at your site and go.
For setup details, assistance, or pricing: support@opensolr.com
Key Features
Search Analytics — Understand Your Users
URL Exclusion via Config — Hide Pages from Search
Exclude pages from search results by editing the search.xml configuration file — no code changes, no reindexing required. The filter applies to every search query automatically.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- /select request handler in search.xml -->
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- Default query parameters -->
  <lst name="defaults">
    <str name="df">text</str>
    <str name="echoParams">explicit</str>
  </lst>
  <!-- Appended filter queries — added to EVERY search automatically -->
  <!-- Users cannot override appended params, making them permanent -->
  <lst name="appends">
    <str name="fq">-uri_s:*/checkout/*</str>
    <str name="fq">-uri_s:*/thank-you*</str>
    <str name="fq">-uri_s:*/admin/*</str>
    <str name="fq">-uri_s:*/product-category/*</str>
    <str name="fq">-uri_s:*/login*</str>
  </lst>
  <!-- Spellcheck component -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```
Add `-uri_s:*/pattern/*` filter queries inside the `<lst name="appends">` block. Common exclusions:

- Checkout / cart pages — `-uri_s:*/checkout/*`
- Thank-you / confirmation pages — `-uri_s:*/thank-you*`
- Admin / internal pages — `-uri_s:*/admin/*`
- Category / tag listing pages — `-uri_s:*/product-category/*`
- Login / registration pages — `-uri_s:*/login*`
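To reason about which pages a pattern will hide, it helps to see the wildcard semantics: `-uri_s:*/checkout/*` excludes any document whose `uri_s` matches that glob, where `*` matches any run of characters. A rough Python sketch of the matching logic (an illustration of the pattern semantics, not Opensolr code):

```python
from fnmatch import fnmatchcase

# The same glob patterns used in the appends block of search.xml.
EXCLUDED_PATTERNS = [
    "*/checkout/*",
    "*/thank-you*",
    "*/admin/*",
    "*/product-category/*",
    "*/login*",
]

def is_excluded(uri: str) -> bool:
    """True if the URI matches any excluded pattern,
    i.e. the appended fq would filter it out of results."""
    return any(fnmatchcase(uri, p) for p in EXCLUDED_PATTERNS)

print(is_excluded("https://shop.example.com/checkout/step-1"))  # True
print(is_excluded("https://www.example.com/blog/my-post"))      # False
```

Matching a new pattern is therefore just a matter of adding one more `<str name="fq">` line to the `appends` block.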
Data Ingestion API — Push Your Own Documents
The Web Crawler handles automated site indexing, but sometimes you need to push content that the crawler can't reach — internal databases, CMS drafts, gated content, product feeds, or any structured data from your application. The Data Ingestion API is live and ready to use.
Required Config Set: Data Ingestion requires the Web Crawler config set. For Solr 9, download the mandatory config set and upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.
The Data Ingestion API lets you POST documents directly into your Web Crawler index. Every document goes through the same enrichment pipeline as crawled pages: vector embeddings, sentiment analysis, language detection, and all derived search fields are generated automatically. Documents from both sources coexist seamlessly in the same index.
- Push up to 100 documents per batch via JSON body or .json file upload
- Update existing documents — same URI = same document (the document ID is always the MD5 hash of the URI)
- Extract text from PDFs, DOCX, PPTX, XLSX, ODT, RTF via `rtf:true`
- Automatic dedup protection — duplicate URIs already queued are rejected
- Returns a `doc_ids` array for tracking every document in the batch
- Detailed per-document Solr error reporting in the job queue
- Monitor job progress, pause, resume, or cancel

Every ingested document is automatically enriched with:

- 1024-dim BGE-m3 vector embeddings
- Sentiment scores (VADER)
- Language detection
- Autocomplete tags, phonetic fields, spellcheck
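The URI-to-ID rule and the batch limits above can be sketched in a few lines. This is a minimal illustration of preparing a batch body, assuming a simple `{"documents": [...]}` JSON shape — the exact field names and endpoint are in the full API reference, so treat everything except the MD5 rule as a placeholder:

```python
import hashlib
import json

def doc_id(uri: str) -> str:
    """Document IDs are the MD5 hash of the URI, so re-pushing
    the same URI updates the existing document instead of duplicating it."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

def build_batch(docs):
    """Build a JSON batch body; the API accepts up to 100 docs per batch."""
    if len(docs) > 100:
        raise ValueError("maximum 100 documents per batch")
    seen, batch = set(), []
    for d in docs:
        if d["uri"] in seen:   # duplicate URIs would be rejected anyway
            continue
        seen.add(d["uri"])
        batch.append({"id": doc_id(d["uri"]), **d})
    return json.dumps({"documents": batch})

payload = build_batch([
    {"uri": "https://www.example.com/pricing", "title": "Pricing"},
    {"uri": "https://www.example.com/pricing", "title": "Duplicate"},
])
```

POSTing the resulting payload to your ingestion endpoint then queues the batch, and the returned `doc_ids` can be matched back against `doc_id(uri)` for tracking.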
Full Data Ingestion API Reference → | Manage Your Ingestion Queue →
Live Demos
Vector Search (AI-powered)
Keyword Search Demos
- Stiri (RO) | Nyheter (SV) | Fresh News (EN) | Tech News (EN)
- Full Documentation & Testing Guide | Hybrid Search Deep-Dive
Crawl Modes
The crawl mode controls how far the crawler follows links from your starting URL. There are six modes: three scope types, each available in a full-depth and a shallow (depth 1) variant.
Mode 1 — Follow Domain Links (full depth)
Crawls all pages across the entire domain, including all subdomains.
Example: Start URL is `https://www.example.com/blog`. The crawler will follow links to `www.example.com`, `shop.example.com`, `help.example.com` — anything on `example.com`.
Best for: Indexing an entire website including all its subdomains.
Mode 2 — Follow Host Links (full depth)
Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.
Example: Start URL is `https://www.example.com/blog`. The crawler will follow links on `www.example.com` only. Links to `shop.example.com` or `help.example.com` are ignored.
Best for: Indexing one specific subdomain without pulling in content from other parts of the site.
Mode 3 — Follow Path Links (full depth)
Crawls only pages that start with the same URL path on the same host.
Example: Start URL is `https://www.example.com/blog/`. The crawler will follow `www.example.com/blog/2024/my-post` and `www.example.com/blog/categories`, but will skip `www.example.com/about` or `www.example.com/shop/`.
Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
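The three scopes boil down to an in-scope test applied to every discovered link. A rough Python sketch of that test (an illustration of the scope rules above, not the crawler's actual code — in particular, the domain check naively takes the last two host labels, where a real crawler would consult the public-suffix list):

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str, mode: str) -> bool:
    """Return True if `link` falls inside the crawl scope of `start_url`."""
    start, found = urlparse(start_url), urlparse(link)
    # Naive registrable domain: last two host labels (e.g. example.com).
    domain = ".".join(start.hostname.split(".")[-2:])
    if mode == "domain":   # Mode 1: anything on example.com
        return found.hostname == domain or found.hostname.endswith("." + domain)
    if mode == "host":     # Mode 2: the exact same hostname only
        return found.hostname == start.hostname
    if mode == "path":     # Mode 3: same host AND same path prefix
        return found.hostname == start.hostname and found.path.startswith(start.path)
    raise ValueError(f"unknown mode: {mode}")

print(in_scope("https://www.example.com/blog/", "https://shop.example.com/x", "domain"))   # True
print(in_scope("https://www.example.com/blog/", "https://www.example.com/about", "path"))  # False
```

The shallow modes below reuse the same scope test; they only change how deep link discovery goes.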
Mode 4 — Shallow Domain Crawl (depth 1)
Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.
Example: Start URL is `https://www.example.com`. The crawler reads the homepage, finds 50 links, crawls those 50 pages — but does not follow any links found on those 50 pages.
Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.
Mode 5 — Shallow Host Crawl (depth 1)
Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.
Best for: A quick, shallow index of a single subdomain.
Mode 6 — Shallow Path Crawl (depth 1)
Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.
Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
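The shallow variants can be pictured as a breadth-first crawl with a cap on link-discovery depth. A simplified sketch over an in-memory link graph (real crawling fetches pages over HTTP; `link_graph` here is a stand-in for the links found on each page):

```python
from collections import deque

def shallow_crawl(start: str, link_graph: dict, max_depth: int = 1):
    """BFS crawl: pages at max_depth are still crawled,
    but their own links are not followed."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)            # "crawl" (index) this page
        if depth >= max_depth:
            continue                 # don't expand links any deeper
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

graph = {
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],   # found on a depth-1 page -> never discovered
}
print(shallow_crawl("/", graph))  # ['/', '/a', '/b']
```

With `max_depth=1` this reproduces the Mode 4 example: the start page's links are crawled, but nothing they link to is.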
Embedding & Customization
Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters.
Important: To embed the search UI on your website, contact us to have your domain whitelisted and approved for iframe embedding.
| Parameter | Description |
|---|---|
| `&topbar=off` | Hide the top search bar |
| `&q=SEARCH_QUERY` | Set the initial search query |
| `&in=web/media/images` | Filter by content type |
| `&og=yes/no` | Show or hide OG images per result |
| `&source=WEBSITE` | Restrict results to a single domain |
| `&fresh=...` | Apply result freshness or sentiment bias |
| `&lang=en` | Filter by language |
| `&pagination_style=scroll/pages` | Infinite scroll (default) or numbered pages |
| `&ui_theme=light/dark` | Color theme |
| `&layout=default/fullwidth` | Container width |
| `&locale=en_us/de_de/ro_ro` | Filter by OG locale metadata |
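Several of the parameters above can be combined on one embed URL. A sketch of assembling the query string (the base URL is a placeholder — substitute the search URL of your own whitelisted Opensolr index):

```python
from urllib.parse import urlencode

# Placeholder base URL -- replace with your own whitelisted search endpoint.
BASE = "https://your-index.opensolr.com/search"

def embed_url(query: str, **options) -> str:
    """Build an embeddable search URL from the parameter table above."""
    params = {"q": query, **options}
    return f"{BASE}?{urlencode(params)}"

url = embed_url("solar panels", topbar="off", ui_theme="dark",
                pagination_style="pages", lang="en")
print(url)
```

The resulting URL can then be used as the `src` of the approved iframe.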
What's New
- Data Ingestion API — push documents directly from your CMS, app, or data pipeline. JSON body or file upload, up to 100 docs per batch, with URI-based dedup and rich text extraction. Full reference →
- Query Elevation — pin results to the top or exclude them from any query, directly from the Search UI. No code needed. Full guide →
- AI-Hints enabled by default for every crawler index.
- Automatic Language Detection for every indexed document.
- Customizable for any language and analysis pipeline.
- Full support for spellcheck, autocomplete, backup, and replication.
- Live SEO & crawling stats and sentiment analysis.
- Pause & Resume with schedule management via UI or REST API.
- Schedule Optimize — set your index to auto-optimize on a recurring schedule.
Solr Configuration
To enable smooth crawling and full feature support, use the ready-made Solr configs:
To ensure all features work as designed, do not manually modify schema.xml for crawler indexes.