📖 Full Step-by-Step Guide with Screenshots
Includes server selection, crawler setup, URL verification, search tuning and more.
What is the Opensolr Web Crawler?
The Opensolr Web Crawler is a fully managed, automated content indexing system built on top of the Opensolr Solr hosting platform. Point it at your website — or any website — and it automatically crawls pages, extracts text (including PDFs, DOCX, and other documents), generates 1024-dimensional BGE-m3 vector embeddings, runs sentiment analysis and language detection, and indexes everything into your Solr index. No code, no pipelines, no infrastructure to manage.
The result is a fully functional hybrid search engine — combining BM25 keyword relevance with dense vector semantic search — accessible at search.opensolr.com/YOUR_INDEX_NAME and embeddable on any website with a single script tag.
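To make the idea of "hybrid" ranking concrete, here is a minimal sketch of one common way to blend a keyword score with a vector similarity score. It is an illustration only, not Opensolr's actual scoring formula: the function names, the min-max normalization, and the `alpha` weight are all assumptions, and the toy vectors have 2 dimensions rather than the 1024 produced by BGE-m3.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(bm25_scores, doc_vectors, query_vector, alpha=0.5):
    """Return document indices sorted by blended score, best first.

    alpha is a hypothetical weight: 1.0 means keyword-only,
    0.0 means vector-only.
    """
    lexical = min_max(bm25_scores)
    semantic = min_max([cosine(query_vector, v) for v in doc_vectors])
    blended = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)


# Toy data: three documents with made-up BM25 scores and 2-D embeddings.
bm25 = [2.0, 1.0, 0.0]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 1.0]
ranking = hybrid_rank(bm25, vectors, query, alpha=0.5)
```

With `alpha=1.0` the ranking follows the keyword scores alone, and with `alpha=0.0` it follows vector similarity alone. Values in between trade exact term matches against semantic closeness.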
What Happens Automatically
Follows links from your seed URL or sitemap. Respects robots.txt, stays on your domain, handles redirects.
Extracts clean text from HTML pages, PDFs, DOCX, XLSX, ODT and more. Removes nav, ads, boilerplate.
Each page is embedded with BGE-m3 (1024 dimensions, GPU-accelerated) for semantic, intent-based search. Requires an index whose name ends in __dense.
Auto-detects the language (50+ languages) and computes sentiment scores (positive/negative/neutral) for each document.
Runs on a background schedule. New and updated pages are discovered and re-indexed automatically. Your index stays fresh.
Optionally uses a headless Chromium renderer for JavaScript-heavy pages (SPAs, React apps) that curl can't see.
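The link-following rules described above (stay on the seed domain, respect robots.txt) can be sketched with Python's standard library. This is a simplified illustration under stated assumptions, not the crawler's real implementation: `should_crawl`, `build_robots`, and the `ExampleCrawler` user agent are hypothetical names, and the robots.txt content is passed in as a string rather than fetched over the network.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(seed_url, link, robots, user_agent="ExampleCrawler"):
    """Decide whether a discovered link is eligible for crawling.

    The user agent name is a placeholder for this sketch.
    """
    absolute = urljoin(seed_url, link)  # resolve relative links
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, etc.
    if parsed.netloc != urlparse(seed_url).netloc:
        return False  # stay on the seed's domain
    return robots.can_fetch(user_agent, absolute)


robots = build_robots("User-agent: *\nDisallow: /private/\n")
seed = "https://example.com/"
allowed = should_crawl(seed, "/docs/page.html", robots)
blocked = should_crawl(seed, "/private/report.pdf", robots)
off_site = should_crawl(seed, "https://other.example.org/x", robots)
```

A production crawler would also need to handle redirects, deduplicate URLs, and rate-limit requests, none of which this sketch attempts.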
At a Glance — How to Get Started
__dense suffix for vector search
search.opensolr.com/YOUR_INDEX_NAME
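As a small convenience, the two naming conventions mentioned above (the __dense suffix and the search page address) can be captured in a pair of helpers. This is a sketch only: the function names are hypothetical, `mysite__dense` is a made-up index name, and the `https://` scheme is an assumption, since the page gives the address without one.

```python
from urllib.parse import quote

DENSE_SUFFIX = "__dense"
SEARCH_HOST = "search.opensolr.com"


def supports_vector_search(index_name: str) -> bool:
    """True if the index name carries the suffix needed for embeddings."""
    return index_name.endswith(DENSE_SUFFIX)


def search_page_url(index_name: str) -> str:
    """Build the hosted search page address for an index.

    The https scheme is assumed; the source lists only host and path.
    """
    return f"https://{SEARCH_HOST}/{quote(index_name, safe='')}"


index_name = "mysite__dense"  # hypothetical index name
url = search_page_url(index_name)
```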