Opensolr Web Crawler - Site Search Solution

A fully managed platform to crawl, index, enrich, and search your web content automatically.

For setup details, assistance, or pricing information, contact us at:

support@opensolr.com

What is the Opensolr Web Crawler?

The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size. It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.

Instantly searchable — all content becomes available via a fully responsive, embeddable search UI.
AI-driven enrichment — named entities, sentiment, language detection, and more are extracted on the fly.
Get started in minutes — launch a powerful, custom search engine on your data without manual setup.

Key Features

Full NLP and NER — extract people, locations, organizations, and more using OpenNLP.
Comprehensive Metadata Extraction — collects meta tags, page structure, creation dates, and document fields.
AI-Hints — Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.
Automatic Language Detection — indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.
Responsive, Embeddable Search UI — integrate Opensolr search into your site and customize the top bar, filters, and behavior.
Scheduled Recrawling & Live Stats — only new and updated content is fetched, with live stats for crawling and SEO.
Secure & Flexible — supports HTTP Auth for protected content, provides robust backup and replication, and can be fully managed through the API or the UI.
Rich Content Support — indexes and analyzes HTML, DOC, DOCX, XLS, PDF, and most image formats, extracting content, metadata, GPS/location data, and sentiment.
Pause & Resume — pause and resume crawls anytime; supports cron jobs and incremental indexing.

Live Demos

Vector Search (AI-powered)

Keyword Search Demos

Or try the Solr API for a live crawl.

Full Documentation & Examples

Crawl Modes

The crawl mode controls how far the crawler follows links from your starting URL. Choose the mode that best fits your use case.

Mode 1 — Follow Domain Links (full depth)

Crawls all pages across the entire domain, including all subdomains.

Example: Start URL is https://www.example.com/blog. The crawler will follow links to www.example.com, shop.example.com, help.example.com, and anything else on example.com.

Best for: Indexing an entire website including all its subdomains.

Mode 2 — Follow Host Links (full depth)

Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.

Example: Start URL is https://www.example.com/blog. The crawler will follow links on www.example.com only. Links to shop.example.com or help.example.com are ignored.

Best for: Indexing one specific subdomain without pulling in content from other parts of the site.

Mode 3 — Follow Path Links (full depth)

Crawls only pages on the same host whose URL path starts with the path of the start URL.

Example: Start URL is https://www.example.com/blog/. The crawler will follow www.example.com/blog/2024/my-post and www.example.com/blog/categories, but will skip www.example.com/about and www.example.com/shop/.

Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
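
The three scopes above can be sketched as a single link filter. This is only an illustration of the rules as described, not the crawler's actual implementation: the function names `in_scope` and `registrable_domain` are hypothetical, and the domain is guessed naively from the last two hostname labels, which would be wrong for suffixes such as co.uk.

```python
from urllib.parse import urlsplit


def registrable_domain(host: str) -> str:
    """Naive guess at the domain: the last two labels of the hostname.

    Assumption for illustration only. Handling suffixes such as co.uk
    correctly would require the Public Suffix List.
    """
    return ".".join(host.lower().split(".")[-2:])


def in_scope(start_url: str, link: str, mode: int) -> bool:
    """Return True if `link` falls inside the scope implied by `start_url`."""
    start, target = urlsplit(start_url), urlsplit(link)
    start_host = (start.hostname or "").lower()
    target_host = (target.hostname or "").lower()
    if not target_host:
        return False  # relative or malformed link; resolve it before checking

    if mode == 1:  # Follow Domain Links: any host on the same domain
        return registrable_domain(target_host) == registrable_domain(start_host)
    if mode == 2:  # Follow Host Links: exact hostname only
        return target_host == start_host
    if mode == 3:  # Follow Path Links: same host and same path prefix
        return target_host == start_host and target.path.startswith(start.path)
    raise ValueError(f"unsupported mode: {mode}")


start = "https://www.example.com/blog/"
print(in_scope(start, "https://shop.example.com/cart", 1))              # True
print(in_scope(start, "https://shop.example.com/cart", 2))              # False
print(in_scope(start, "https://www.example.com/blog/2024/my-post", 3))  # True
print(in_scope(start, "https://www.example.com/about", 3))              # False
```

Note that with a plain prefix test, a start path without a trailing slash (/blog) would also match /blogger, which is one reason the Mode 3 example uses /blog/.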

Mode 4 — Shallow Domain Crawl (depth 1)

Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.

Example: Start URL is https://www.example.com. The crawler reads the homepage, finds 50 links, and crawls those 50 pages, but does not follow any links found on those 50 pages.

Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.

Mode 5 — Shallow Host Crawl (depth 1)

Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.

Example: Start URL is https://docs.example.com. The crawler stays on docs.example.com, reads the start page and its direct children, but does not discover new links beyond that.

Best for: A quick, shallow index of a single subdomain.

Mode 6 — Shallow Path Crawl (depth 1)

Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.

Example: Start URL is https://www.example.com/products/. The crawler stays within /products/, reads the start page and its direct children, but stops discovering new links after that.

Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
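
The depth-limited link discovery used by Modes 4 to 6 can be pictured as a breadth-first crawl with a discovery cutoff. The sketch below is an assumption about the general technique, not Opensolr's code: `shallow_crawl` and `discover_depth` are hypothetical names, the `links` dictionary stands in for fetching and parsing real pages, and the scope check (domain, host, or path) is left out for brevity. With `discover_depth=1`, links are collected from the start page and its direct children; `discover_depth=0` would collect them from the start page only.

```python
from collections import deque


def shallow_crawl(start, links, discover_depth=1):
    """Breadth-first crawl where only pages at depth <= discover_depth
    contribute new links. Deeper pages are still crawled.

    `links` maps a URL to the list of URLs found on that page.
    """
    crawled = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth > discover_depth:
            continue  # crawled, but its links are not followed
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return crawled


site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": [],
}
print(shallow_crawl("/", site))                    # ['/', '/a', '/b', '/a/1']
print(shallow_crawl("/", site, discover_depth=0))  # ['/', '/a', '/b']
```

In the first call, /a/1 is crawled because it was linked from a direct child of the start page, but /a/1/x is never reached because /a/1 itself sits beyond the discovery cutoff.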

Embedding & Customization

Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters:

Parameter	Description
`&topbar=off`	Hide the top search bar
`&q=SEARCH_QUERY`	Set the initial search query
`&in=web/media/images`	Filter by content type
`&og=yes/no`	Show or hide OG images per result
`&source=WEBSITE`	Restrict results to a single domain
`&fresh=...`	Apply result freshness or sentiment bias
`&lang=en`	Filter by language
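
As a rough illustration of how these parameters combine, the snippet below assembles a customized search URL. The base URL is a hypothetical placeholder (substitute the search page URL of your own index), `embed_url` is an illustrative helper rather than part of any Opensolr library, and the parameter values are examples taken from the table.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the search page URL of your index.
BASE_URL = "https://search.example.com/my-index"


def embed_url(base_url, params):
    """Append URL parameters to the search page URL, URL-encoding the values."""
    query = urlencode(params)
    if not query:
        return base_url
    # Use "&" if the base URL already carries a query string, else "?".
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


url = embed_url(
    BASE_URL,
    {
        "topbar": "off",             # hide the top search bar
        "q": "solr hybrid search",   # initial search query
        "in": "web",                 # content type filter
        "og": "no",                  # hide OG images
        "lang": "en",                # language filter
    },
)
print(url)
```

A dictionary is used instead of keyword arguments because `in` is a reserved word in Python. The `fresh` parameter is omitted here since its accepted values are not listed in the table.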

What's New

AI-Hints enabled by default for every crawler index.
Automatic Language Detection and advanced NER via OpenNLP.
Customizable for any language and analysis pipeline.
Full support for spellcheck, autocomplete, backup, and replication.
Live SEO & crawling stats and sentiment analysis.
Pause & Resume with schedule management via UI or REST API.
Schedule Optimize — set your index to auto-optimize on a recurring schedule.

Solr Configuration for Crawling

To enable smooth crawling and full feature support, use the ready-made Solr configs:

Solr 9 Config Zip Archive

To ensure all features work as designed, do not manually modify the schema.xml of a crawler index.