Opensolr Web Crawler — Site Search Solution

A fully managed platform to crawl, index, enrich, and search your web content — automatically. Point it at your site and get a production-ready AI search engine in minutes.
Pipeline: YOUR URL (yoursite.com) → CRAWL PAGES (HTML, PDF, docs) → NLP + EMBED (NER, vectors, AI) → SOLR INDEX (structured & searchable) → SEARCH LIVE (AI Hints, Vector Search, Reader)
Fully automated, from URL to production search in minutes.

The Opensolr Web Crawler crawls your entire site, extracts structured data, applies NLP + NER, generates vector embeddings, and feeds everything into Solr — fully indexed and ready to search. No manual config. No fiddling with schemas. Just point it at your site and go.

For setup details, assistance, or pricing: support@opensolr.com


Key Features
Every crawler index includes these out of the box:
Full-Site Crawling
Crawls every page, PDF, document, and image. Scheduled recrawling, pause/resume, and incremental indexing built in.
NLP & Entity Recognition
OpenNLP extracts people, locations, organizations. Automatic sentiment analysis, language detection, and AI enrichment.
Hybrid Vector Search
BGE-m3 embeddings (1024-dim) combined with BM25 scoring. Users search by intent, not just keywords. Multilingual.
AI Hints & Document Reader
RAG-powered instant answers. Click "Read" on any result for an AI summary with PDF export. Enabled by default.
Rich Content Support
HTML, PDF, Word, Excel, images — including metadata, GPS coordinates, and structured data (JSON-LD, microdata).
Live Stats & Scheduling
Live crawl stats, search analytics, query inspector, and Solr debugQuery. Schedule recrawls and index optimization.
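
To make the Hybrid Vector Search feature above concrete, here is a minimal sketch of blending a keyword score with a vector similarity. The linear blend, the `alpha` weight, the function names, and the toy 2-dimensional vectors are all assumptions for illustration; the source does not say how Opensolr combines BM25 with the BGE-m3 embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_rank(bm25, vectors, query_vec, alpha=0.5):
    """Rank documents by alpha * normalized BM25 + (1 - alpha) * cosine.

    Hypothetical blend, not Opensolr's formula. `bm25` maps doc id to a
    keyword score; documents without a keyword match simply score 0 there.
    """
    top = max(bm25.values(), default=0.0) or 1.0  # avoid dividing by zero
    scored = {
        doc: alpha * (bm25.get(doc, 0.0) / top)
        + (1 - alpha) * cosine(vec, query_vec)
        for doc, vec in vectors.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Toy data: real BGE-m3 vectors have 1024 dimensions, these have 2.
bm25_scores = {"a": 8.0, "b": 2.0}           # "c" has no keyword match
doc_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
query_vector = [0.0, 1.0]

balanced = hybrid_rank(bm25_scores, doc_vectors, query_vector)
keyword_only = hybrid_rank(bm25_scores, doc_vectors, query_vector, alpha=1.0)
vector_only = hybrid_rank(bm25_scores, doc_vectors, query_vector, alpha=0.0)
```

In this toy setup, document "c" contains none of the query keywords but still ranks above "a" once the vector score carries all the weight, which is the "search by intent" behavior the feature describes.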


Crawl Modes

The crawl mode controls how far the crawler follows links from your starting URL. There are three scope types, each available in a full-depth and a shallow (depth 1) variant.

Scope          Modes    Example                                     Result
Domain Scope   1 & 4    example.com (www.*, shop.*, help.*)         All subdomains crawled
Host Scope     2 & 5    example.com (www.* only)                    Single hostname only
Path Scope     3 & 6    www.example.com/blog/* (not /about, /shop)  Specific path section

Mode 1 — Follow Domain Links (full depth)

Crawls all pages across the entire domain, including all subdomains.

Example: with the start URL https://www.example.com/blog, the crawler follows links to www.example.com, shop.example.com, and help.example.com, that is, anything on example.com.

Best for: Indexing an entire website including all its subdomains.

Mode 2 — Follow Host Links (full depth)

Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.

Example: with the start URL https://www.example.com/blog, the crawler follows links on www.example.com only. Links to shop.example.com or help.example.com are ignored.

Best for: Indexing one specific subdomain without pulling in content from other parts of the site.

Mode 3 — Follow Path Links (full depth)

Crawls only pages that start with the same URL path on the same host.

Example: with the start URL https://www.example.com/blog/, the crawler follows www.example.com/blog/2024/my-post and www.example.com/blog/categories, but skips www.example.com/about and www.example.com/shop/.

Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
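
The three scopes described in Modes 1 to 3 can be sketched as a single URL check. This is only an illustration of the rules above, not the crawler's actual code: the function names are hypothetical, and the "last two labels" domain rule is a simplifying assumption (it would be wrong for suffixes such as co.uk).

```python
from urllib.parse import urlsplit

def registrable_domain(host: str) -> str:
    """Naive assumption: the domain is the last two labels of the host."""
    return ".".join(host.lower().split(".")[-2:])

def in_scope(start_url: str, candidate_url: str, scope: str) -> bool:
    """Return True if candidate_url falls inside the chosen scope.

    scope is one of "domain", "host", or "path" (hypothetical names
    for the scopes used by Modes 1/4, 2/5, and 3/6).
    """
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    start_host = (start.hostname or "").lower()
    cand_host = (cand.hostname or "").lower()
    if scope == "domain":
        return registrable_domain(cand_host) == registrable_domain(start_host)
    if scope == "host":
        return cand_host == start_host
    if scope == "path":
        # Plain prefix match: start with a trailing slash ("/blog/")
        # so that "/blogger" is not treated as part of "/blog".
        return cand_host == start_host and cand.path.startswith(start.path)
    raise ValueError(f"unknown scope: {scope!r}")

start = "https://www.example.com/blog/"
print(in_scope(start, "https://shop.example.com/cart", "domain"))
print(in_scope(start, "https://shop.example.com/cart", "host"))
print(in_scope(start, "https://www.example.com/blog/2024/my-post", "path"))
```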

Mode 4 — Shallow Domain Crawl (depth 1)

Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.

Example: with the start URL https://www.example.com, the crawler reads the homepage, finds 50 links, and crawls those 50 pages, but does not follow any links found on those 50 pages.

Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.

Mode 5 — Shallow Host Crawl (depth 1)

Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.

Best for: A quick, shallow index of a single subdomain.

Mode 6 — Shallow Path Crawl (depth 1)

Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.

Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
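
The difference between the full-depth and shallow variants can be sketched as a breadth-first walk with an optional depth limit. The sketch follows the Mode 4 example (pages linked from the start page are crawled, but their own links are not followed). The function name, the `max_link_depth` parameter, and the toy link graph are assumptions for illustration; a real crawl would also apply the scope rules of the chosen mode.

```python
from collections import deque

def crawl_order(links, start, max_link_depth=None):
    """Return pages in the order a breadth-first crawl would visit them.

    links: dict mapping a page to the pages it links to (toy stand-in
           for fetching and parsing real pages).
    max_link_depth: None for a full-depth crawl; 1 means only the start
           page contributes links, as in the Mode 4 example.
    """
    seen = {start}
    queue = deque([(start, 0)])
    crawled = []
    while queue:
        page, depth = queue.popleft()
        crawled.append(page)
        if max_link_depth is not None and depth >= max_link_depth:
            continue  # the page is crawled, but its links are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled

site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/"],
    "/a/1": [],
}
full = crawl_order(site, "/")
shallow = crawl_order(site, "/", max_link_depth=1)
print(full)
print(shallow)
```

Here the full-depth crawl reaches /a/1 through /a, while the shallow crawl stops after the pages linked directly from the start page.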


Embedding & Customization

Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters.

Important: To embed the search UI on your website, contact us to have your domain whitelisted and approved for iframe embedding.

Parameter                         Description
&topbar=off                       Hide the top search bar
&q=SEARCH_QUERY                   Set the initial search query
&in=web/media/images              Filter by content type
&og=yes/no                        Show or hide OG images per result
&source=WEBSITE                   Restrict results to a single domain
&fresh=...                        Apply result freshness or sentiment bias
&lang=en                          Filter by language
&pagination_style=scroll/pages    Infinite scroll (default) or numbered pages
&ui_theme=light/dark              Color theme
&layout=default/fullwidth         Container width
&locale=en_us/de_de/ro_ro         Filter by OG locale metadata
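
As a sketch of how these parameters might be combined, the snippet below builds an embed URL and an iframe tag with Python's standard library. The base URL is a hypothetical placeholder (use the Search UI URL of your own index), the helper names are made up for this example, and the iframe attributes are assumptions. Only the parameter names come from the table above.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical placeholder, not a real Opensolr address.
SEARCH_UI_URL = "https://search.example.invalid/my_index"

def build_embed_url(base_url: str, params: dict) -> str:
    """Append URL-encoded search UI parameters to base_url."""
    query = urlencode(params)
    if not query:
        return base_url
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"

def iframe_tag(url: str, height: int = 800) -> str:
    """Wrap the URL in an iframe; the URL is HTML-escaped for the attribute."""
    return f'<iframe src="{escape(url)}" width="100%" height="{height}"></iframe>'

embed_url = build_embed_url(
    SEARCH_UI_URL,
    {"topbar": "off", "q": "solr backup", "in": "web", "ui_theme": "dark"},
)
print(embed_url)
print(iframe_tag(embed_url))
```

A dict is used rather than keyword arguments because `in` is a reserved word in Python. Remember that the iframe only loads once your domain has been whitelisted.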

What's New

  • AI-Hints enabled by default for every crawler index.
  • Automatic Language Detection and advanced NER via OpenNLP.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Pause & Resume with schedule management via UI or REST API.
  • Schedule Optimize — set your index to auto-optimize on a recurring schedule.

Solr Configuration

To enable smooth crawling and full feature support, use the ready-made Solr configs:

To ensure all features work as designed, do not manually modify the schema.xml of a crawler index.


Ready to Crawl Your Site?
We handle everything — from crawling to deployment. You just provide the URL.

Query Elevation — Pin & Exclude Search Results

Take full control of what your users see. Query Elevation lets you pin important results to the top or exclude irrelevant ones — directly from the Search UI, with zero code and no reindexing required.

  • Pin — Force a specific result to the top of the list for a given search query
  • Exclude — Hide a result completely so it never appears for that query
  • Pin All / Exclude All — Apply the rule globally, across every search query
  • Drag & drop — Reorder pinned results to control exactly which one shows first

Enable it from your index settings panel, then open the Search UI — every result gets an elevation toolbar. Perfect for promoting landing pages, burying outdated content, or curating high-value queries.
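
The effect of these rules on a result list can be sketched in a few lines. This is an illustration of the behavior described above, not Opensolr's implementation: the rule layout, the "*" key for Pin All / Exclude All, and the choice to list query-specific pins before global ones are all assumptions.

```python
def apply_elevation(query, results, rules):
    """Reorder `results` (a list of document ids) for `query`.

    rules maps a query string, or "*" for rules that apply to every
    query, to {"pin": [...], "exclude": [...]}. Pin order is kept, so
    reordering the "pin" list mirrors drag-and-drop in the UI.
    """
    pins, excluded = [], set()
    for key in (query, "*"):  # query-specific rules first, then global
        rule = rules.get(key, {})
        pins += [doc for doc in rule.get("pin", []) if doc not in pins]
        excluded |= set(rule.get("exclude", []))
    # Assumption: a pin only moves a document that is already a result.
    top = [doc for doc in pins if doc in results and doc not in excluded]
    rest = [doc for doc in results if doc not in excluded and doc not in top]
    return top + rest

rules = {
    "pricing": {"pin": ["/pricing"], "exclude": ["/old-pricing"]},
    "*": {"pin": [], "exclude": ["/internal"]},
}
results = ["/blog/pricing-tips", "/old-pricing", "/pricing", "/internal"]

for_pricing = apply_elevation("pricing", results, rules)
for_contact = apply_elevation("contact", results, rules)
print(for_pricing)
print(for_contact)
```

For the query "pricing", the pinned page moves to the top and both excluded pages disappear. For any other query, only the global exclusion applies.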