Opensolr Web Crawler - Site Search Solution


Opensolr Web Crawler

A fully managed platform to crawl, index, enrich, and search your web content — automatically. Learn more about Hybrid Search


For setup details, assistance, or pricing information, contact the Opensolr team.


What is the Opensolr Web Crawler?

The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size. It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.

  • Instantly searchable — all content becomes available via a fully responsive, embeddable search UI.
  • AI-driven enrichment — named entities, sentiment, language detection, and more are extracted on the fly.
  • Get started in minutes — launch a powerful, custom search engine on your data without manual setup.

Key Features

  • Full NLP and NER — extract people, locations, organizations, and more using OpenNLP.
  • Comprehensive Metadata Extraction — collects meta tags, page structure, creation dates, and document fields.
  • AI-Hints: Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.
  • Automatic Language Detection — indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.
  • Responsive, Embeddable Search UI — integrate Opensolr search into your site, customize top bar, filters, and behavior.
  • Scheduled Recrawling & Live Stats — only new and updated content is fetched, with live stats for crawling and SEO.
  • Secure & Flexible — supports HTTP Auth for protected content, robust backup and replication, and fully managed by API or UI.
  • Rich Content Support — indexes and analyzes HTML, doc, docx, xls, PDF, and most image formats — extracting content, meta, GPS/location data, and sentiment.
  • Pause & Resume — pause and resume crawls anytime; supports cron jobs and incremental indexing.

Live Demos

Vector Search (AI-powered)

Keyword Search Demos

Or try the Solr API for a live crawl.


Crawl Modes

The crawl mode controls how far the crawler follows links from your starting URL. Choose the mode that best fits your use case.

Mode 1 — Follow Domain Links (full depth)

Crawls all pages across the entire domain, including all subdomains.

Example: With the start URL https://www.example.com/blog, the crawler follows links to www.example.com, shop.example.com, help.example.com, and any other host on example.com.

Best for: Indexing an entire website including all its subdomains.

Mode 2 — Follow Host Links (full depth)

Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.

Example: With the start URL https://www.example.com/blog, the crawler follows links on www.example.com only. Links to shop.example.com or help.example.com are ignored.

Best for: Indexing one specific subdomain without pulling in content from other parts of the site.

Mode 3 — Follow Path Links (full depth)

Crawls only pages that start with the same URL path on the same host.

Example: With the start URL https://www.example.com/blog/, the crawler follows www.example.com/blog/2024/my-post and www.example.com/blog/categories, but skips www.example.com/about and www.example.com/shop/.

Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
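
The three scopes above (domain, host, path) can be sketched as a single URL filter. This is only an illustration of the rules as described, not the crawler's actual implementation: the function names and the "domain" / "host" / "path" labels are hypothetical, and the domain check naively keeps the last two hostname labels, where a production crawler would more likely consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def registrable_domain(host: str) -> str:
    """Naive assumption: the domain is the last two labels (example.com)."""
    return ".".join(host.lower().split(".")[-2:])


def in_scope(start_url: str, candidate_url: str, scope: str) -> bool:
    """Decide whether candidate_url may be followed, given the start URL.

    scope is a hypothetical label: "domain", "host", or "path".
    """
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    s_host = (start.hostname or "").lower()
    c_host = (cand.hostname or "").lower()
    if not s_host or not c_host:
        return False  # relative or malformed URLs are out of scope here
    if scope == "domain":
        # Same domain, any subdomain.
        return registrable_domain(c_host) == registrable_domain(s_host)
    if scope == "host":
        # Exact hostname only.
        return c_host == s_host
    if scope == "path":
        # Exact hostname and a path that starts with the start URL's path.
        return c_host == s_host and cand.path.startswith(start.path)
    raise ValueError(f"unknown scope: {scope!r}")
```

With the start URL https://www.example.com/blog/, a link to https://shop.example.com/cart passes the "domain" check but fails "host" and "path", while https://www.example.com/about passes "domain" and "host" but fails "path".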

Mode 4 — Shallow Domain Crawl (depth 1)

Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.

Example: With the start URL https://www.example.com, the crawler reads the homepage, finds 50 links, and crawls those 50 pages, but does not follow any links found on those 50 pages.

Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.

Mode 5 — Shallow Host Crawl (depth 1)

Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.

Example: With the start URL https://docs.example.com, the crawler stays on docs.example.com, reads the start page and its direct children, but does not discover new links beyond that.

Best for: A quick, shallow index of a single subdomain.

Mode 6 — Shallow Path Crawl (depth 1)

Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.

Example: With the start URL https://www.example.com/products/, the crawler stays within /products/, reads the start page and its direct children, but stops discovering new links after that.

Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
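
The difference between the full-depth modes (1 to 3) and the shallow modes (4 to 6) can be sketched as a breadth-first traversal with an optional limit on link discovery. This is a minimal sketch, not the crawler's real code: the `crawl` function and the `max_link_depth` parameter are hypothetical, the site is modeled as an in-memory dictionary instead of real HTTP fetches, and the depth-1 behavior follows the worked example under Mode 4 (links on the start page are followed, links on those pages are not).

```python
from collections import deque


def crawl(start, links, max_link_depth=None):
    """Return pages in the order they would be crawled.

    links: dict mapping a page to the links found on it (stand-in for
           fetching and parsing a real page).
    max_link_depth: pages at this depth or deeper are still crawled, but
           the links on them are not followed. None means full depth.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)  # the page itself is always crawled and indexed
        if max_link_depth is not None and depth >= max_link_depth:
            continue  # shallow mode: do not discover links from this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


site = {"/": ["/a", "/b"], "/a": ["/a/deep"], "/b": ["/"]}
full = crawl("/", site)                       # full depth
shallow = crawl("/", site, max_link_depth=1)  # depth 1
```

On this toy site the full-depth crawl reaches `/a/deep`, while the depth-1 crawl stops after `/a` and `/b`. In practice a scope filter (domain, host, or path) would also be applied to each link before it is queued.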


Embedding & Customization

Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters:

Parameter               Description
&topbar=off             Hide the top search bar
&q=SEARCH_QUERY         Set the initial search query
&in=web/media/images    Filter by content type
&og=yes/no              Show or hide OG images per result
&source=WEBSITE         Restrict results to a single domain
&fresh=...              Apply result freshness or sentiment bias
&lang=en                Filter by language
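
As an illustration of how these parameters combine, the sketch below assembles an embed URL with Python's standard library. Several things here are assumptions rather than documented behavior: the base URL is a placeholder (use the search page URL of your own index), the `embed_url` helper and its argument names are hypothetical, and `in` is assumed to take exactly one of web, media, or images. The `fresh` parameter is left out because its accepted values are not listed above.

```python
from urllib.parse import urlencode

# Placeholder only: replace with the search page URL of your own index.
BASE_URL = "https://search.example.com/my-index"


def embed_url(base_url, query=None, topbar=True, content_type=None,
              og_images=None, source=None, lang=None):
    """Build an embed URL from the parameters in the table above."""
    params = {}
    if not topbar:
        params["topbar"] = "off"
    if query:
        params["q"] = query
    if content_type is not None:
        # Assumption: one value out of web, media, images.
        if content_type not in {"web", "media", "images"}:
            raise ValueError(f"unsupported content type: {content_type!r}")
        params["in"] = content_type
    if og_images is not None:
        params["og"] = "yes" if og_images else "no"
    if source:
        params["source"] = source
    if lang:
        params["lang"] = lang
    if not params:
        return base_url
    # Append with "&" if the base URL already carries a query string.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


url = embed_url(BASE_URL, query="solr crawler", topbar=False, lang="en")
```

The resulting URL can then be used as the `src` of an iframe or as a plain link. `urlencode` takes care of escaping spaces and special characters in the query.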

What's New

  • AI-Hints enabled by default for every crawler index.
  • Automatic Language Detection and advanced NER via OpenNLP.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Pause & Resume with schedule management via UI or REST API.
  • Schedule Optimize — set your index to auto-optimize on a recurring schedule.

Solr Configuration for Crawling

To enable smooth crawling and full feature support, use the ready-made Solr configs:

To ensure all features work as designed, do not manually modify schema.xml on crawler indexes.


Quick Video Demo