Opensolr Web Crawler - Site Search Solution
Opensolr Web Crawler
A fully managed platform to crawl, index, enrich, and search your web content — automatically. Learn more about Hybrid Search
For setup details, assistance, or pricing information, contact us at:
What is the Opensolr Web Crawler?
The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size. It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.
- Instantly searchable — all content becomes available via a fully responsive, embeddable search UI.
- AI-driven enrichment — named entities, sentiment, language detection, and more are extracted on the fly.
- Get started in minutes — launch a powerful, custom search engine on your data without manual setup.
Key Features
- Full NLP and NER — extract people, locations, organizations, and more using OpenNLP.
- Comprehensive Metadata Extraction — collects meta tags, page structure, creation dates, and document fields.
- AI-Hints — Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.
- Automatic Language Detection — indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.
- Responsive, Embeddable Search UI — integrate Opensolr search into your site, customize top bar, filters, and behavior.
- Scheduled Recrawling & Live Stats — only new and updated content is fetched, with live stats for crawling and SEO.
- Secure & Flexible — supports HTTP Auth for protected content, robust backup and replication, and fully managed by API or UI.
- Rich Content Support — indexes and analyzes HTML, doc, docx, xls, PDF, and most image formats — extracting content, meta, GPS/location data, and sentiment.
- Pause & Resume — pause and resume crawls anytime; supports cron jobs and incremental indexing.
Live Demos
Vector Search (AI-powered)
Keyword Search Demos
- Stiri (RO) | Nyheter (SV) | Fresh News (EN) | Tech News (EN)
- ProLabs (EN) | Australian News (EN) | Times of India (EN)
- Italy News (IT) | Germany News (DE) | Alternative News (EN)
Or try the Solr API for a live crawl.
Crawl Modes
The crawl mode controls how far the crawler follows links from your starting URL. Choose the mode that best fits your use case.
Mode 1 — Follow Domain Links (full depth)
Crawls all pages across the entire domain, including all subdomains.
Example: Start URL is https://www.example.com/blog
The crawler will follow links to www.example.com, shop.example.com, help.example.com, and anything else on example.com.
Best for: Indexing an entire website including all its subdomains.
Mode 2 — Follow Host Links (full depth)
Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.
Example: Start URL is https://www.example.com/blog
The crawler will follow links on www.example.com only. Links to shop.example.com or help.example.com are ignored.
Best for: Indexing one specific subdomain without pulling in content from other parts of the site.
Mode 3 — Follow Path Links (full depth)
Crawls only pages that start with the same URL path on the same host.
Example: Start URL is https://www.example.com/blog/
The crawler will follow www.example.com/blog/2024/my-post and www.example.com/blog/categories, but will skip www.example.com/about or www.example.com/shop/.
Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
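The three scopes described so far (domain, host, path) can be pictured as a single URL check. The following is a minimal sketch, not the crawler's actual implementation: the names `in_scope` and `registered_domain` are hypothetical, treating the last two hostname labels as the domain is a simplifying assumption (it breaks for suffixes such as .co.uk), and the path test is assumed to be a plain prefix match.

```python
from urllib.parse import urlsplit


def registered_domain(host: str) -> str:
    # Assumption: the domain is the last two labels of the hostname.
    # A production crawler would consult a public-suffix list instead.
    return ".".join(host.lower().split(".")[-2:])


def in_scope(start_url: str, candidate_url: str, scope: str) -> bool:
    """Return True if candidate_url falls inside the given scope of start_url."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    s_host = (start.hostname or "").lower()
    c_host = (cand.hostname or "").lower()

    if scope == "domain":  # Modes 1 and 4: any subdomain of the same domain
        return registered_domain(c_host) == registered_domain(s_host)
    if scope == "host":    # Modes 2 and 5: exact hostname only
        return c_host == s_host
    if scope == "path":    # Modes 3 and 6: same host and same path prefix
        return c_host == s_host and cand.path.startswith(start.path)
    raise ValueError(f"unknown scope: {scope}")
```

For instance, `in_scope("https://www.example.com/blog", "https://shop.example.com/cart", "domain")` is true, while the same pair with `"host"` is false.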
Mode 4 — Shallow Domain Crawl (depth 1)
Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.
Example: Start URL is https://www.example.com
The crawler reads the homepage, finds 50 links, and crawls those 50 pages, but does not follow any links found on those 50 pages.
Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.
Mode 5 — Shallow Host Crawl (depth 1)
Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.
Example: Start URL is https://docs.example.com
The crawler stays on docs.example.com, reads the start page and its direct children, but does not discover new links beyond that.
Best for: A quick, shallow index of a single subdomain.
Mode 6 — Shallow Path Crawl (depth 1)
Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.
Example: Start URL is https://www.example.com/products/
The crawler stays within /products/, reads the start page and its direct children, but stops discovering new links after that.
Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
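The difference between the full-depth and shallow modes comes down to where link discovery stops. The sketch below illustrates this with a breadth-first walk over a toy in-memory site. Everything in it is an assumption made for illustration: `crawl_order`, the `links` mapping, and the `discover_depth` parameter are hypothetical names, not part of the Opensolr API. With `discover_depth=0`, only the start page contributes links, which matches the 50-page homepage example under Mode 4; `None` means full depth.

```python
from collections import deque
from typing import Dict, List, Optional


def crawl_order(links: Dict[str, List[str]], start: str,
                discover_depth: Optional[int] = None) -> List[str]:
    """Return pages in the order they would be fetched.

    links maps each page to its outgoing links (a stand-in for real fetching).
    Pages deeper than discover_depth are still fetched, but their links
    are not followed. None disables the limit (full-depth crawl).
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)  # the page itself is always crawled
        if discover_depth is not None and depth > discover_depth:
            continue        # too deep: do not discover links here
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": ["/a/1/x"]}
print(crawl_order(site, "/", discover_depth=0))  # ['/', '/a', '/b']
print(crawl_order(site, "/"))                    # ['/', '/a', '/b', '/a/1', '/a/1/x']
```

In a real crawler, the scope rule (domain, host, or path) would also be applied to each `target` before it is queued.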
Embedding & Customization
Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters:
| Parameter | Description |
|---|---|
| &topbar=off | Hide the top search bar |
| &q=SEARCH_QUERY | Set the initial search query |
| &in=web/media/images | Filter by content type |
| &og=yes/no | Show or hide OG images per result |
| &source=WEBSITE | Restrict results to a single domain |
| &fresh=... | Apply result freshness or sentiment bias |
| &lang=en | Filter by language |
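As an illustration of how these parameters combine, the following sketch builds an embed URL from a dictionary of options. The base URL is a hypothetical placeholder (substitute the address of your own search page), and `embed_url` is an illustrative helper, not an Opensolr function. Only the parameter names come from the table above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the URL of your own search page.
BASE_URL = "https://search.example.com/my-index"


def embed_url(base: str, params: dict) -> str:
    """Append the given search UI parameters to base as a query string."""
    query = urlencode(params)  # percent-encodes values; spaces become '+'
    if not query:
        return base
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}{query}"


url = embed_url(BASE_URL, {"q": "solr backup", "in": "web", "lang": "en", "topbar": "off"})
print(url)
# https://search.example.com/my-index?q=solr+backup&in=web&lang=en&topbar=off
```

Building the query string with `urlencode` rather than by hand keeps special characters in the search query from breaking the URL.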
What's New
- AI-Hints enabled by default for every crawler index.
- Automatic Language Detection and advanced NER via OpenNLP.
- Customizable for any language and analysis pipeline.
- Full support for spellcheck, autocomplete, backup, and replication.
- Live SEO & crawling stats and sentiment analysis.
- Pause & Resume with schedule management via UI or REST API.
- Schedule Optimize — set your index to auto-optimize on a recurring schedule.
Solr Configuration for Crawling
To enable smooth crawling and full feature support, use the ready-made Solr configs:
To ensure all features work as designed, do not manually modify the schema.xml of a crawler index.