Getting Started: Web Crawler Site Search

Web Crawler Overview

📖 Full Step-by-Step Guide with Screenshots

Includes server selection, crawler setup, URL verification, search tuning and more.

Full Platform Guide →

What is the Opensolr Web Crawler?

The Opensolr Web Crawler is a fully managed, automated content indexing system built on top of the Opensolr Solr hosting platform. Point it at your website — or any website — and it automatically crawls pages, extracts text (including PDFs, DOCX, and other documents), generates 1024-dimensional BGE-m3 vector embeddings, runs sentiment analysis and language detection, and indexes everything into your Solr index. No code, no pipelines, no infrastructure to manage.

The result is a fully functional hybrid search engine — combining BM25 keyword relevance with dense vector semantic search — accessible at search.opensolr.com/YOUR_INDEX_NAME and embeddable on any website with a single script tag.
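To make "hybrid" concrete, here is a minimal sketch of one common way to blend keyword and vector scores: rescale each score list to [0, 1] and take a weighted sum. This is an illustration only. The guide does not say how Opensolr fuses the two signals, and the function names, the `alpha` weight, and the sample scores are all hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(doc_ids, bm25_scores, vector_sims, alpha=0.5):
    """Blend normalized keyword and vector scores.

    alpha is the weight of the keyword (BM25) side; 1 - alpha goes to
    the vector side. Returns (doc_id, score) pairs, best first.
    """
    keyword = min_max(bm25_scores)
    semantic = min_max(vector_sims)
    blended = [alpha * k + (1 - alpha) * v for k, v in zip(keyword, semantic)]
    return sorted(zip(doc_ids, blended), key=lambda pair: pair[1], reverse=True)


# Made-up scores for three pages (not real Opensolr output).
ranked = hybrid_rank(
    ["pricing", "faq", "blog"],
    bm25_scores=[12.0, 3.0, 7.5],
    vector_sims=[0.40, 0.90, 0.80],
)
```

With equal weights, a page that scores reasonably well on both signals can outrank a page that is strong on only one, which is the practical appeal of hybrid ranking.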

What Happens Automatically

🕷️ Page Discovery

Follows links from your seed URL or sitemap. Respects robots.txt, stays on your domain, handles redirects.
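As a rough sketch of what "respects robots.txt, stays on your domain" means in practice, the snippet below filters a list of discovered links using only the Python standard library. It is not Opensolr's crawler code. The `allowed_links` helper, the `example-crawler` user agent, and the sample URLs are assumptions for illustration, and redirect handling is left out.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_links(seed_url, hrefs, robots_txt, user_agent="example-crawler"):
    """Keep links that are on the seed's host and permitted by robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse rules from text, no network call
    seed_host = urlparse(seed_url).netloc

    kept = []
    for href in hrefs:
        absolute = urljoin(seed_url, href)  # resolve relative links
        if urlparse(absolute).netloc != seed_host:
            continue  # off-domain link: skip
        if not rules.can_fetch(user_agent, absolute):
            continue  # disallowed by robots.txt: skip
        kept.append(absolute)
    return kept


robots = "User-agent: *\nDisallow: /private/\n"
links = allowed_links(
    "https://example.com/",
    ["/about", "/private/report", "https://other.example.org/page"],
    robots,
)
```

Here only `https://example.com/about` survives: the second link is blocked by the `Disallow` rule and the third is on a different host.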

📄 Content Extraction

Extracts clean text from HTML pages, PDFs, DOCX, XLSX, ODT and more. Removes nav, ads, boilerplate.
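The idea of boilerplate removal can be sketched with Python's built-in `html.parser`: collect text, but ignore anything nested inside tags that usually hold navigation or scripts. This is a simplified illustration, not the extractor Opensolr runs. The `SKIPPED_TAGS` set and the `extract_text` helper are assumptions, and document formats such as PDF or DOCX would need separate tooling.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch (an assumption).
SKIPPED_TAGS = {"nav", "script", "style", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping content inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><nav>Home | Docs</nav><h1>Pricing</h1>"
    "<p>Plans start small.</p><script>track()</script></body></html>"
)
clean = extract_text(page)
```

For the sample page, `clean` holds only the heading and paragraph text; the navigation bar and the script body are dropped.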

🔢 Vector Embeddings

Each page is embedded with BGE-m3 (1024 dimensions, GPU-accelerated) for semantic, intent-based search. The index name must end with the __dense suffix.
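Semantic search over embeddings usually comes down to comparing a query vector with each document vector. The sketch below shows cosine similarity on 1024-dimensional vectors in plain Python. The vectors are toy values rather than real BGE-m3 output, and `cosine_similarity` is a hypothetical helper. Whether Opensolr uses cosine or another similarity measure is not stated in this guide.

```python
import math

EMBEDDING_DIMS = 1024  # dimensionality quoted in this guide


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    if len(a) != EMBEDDING_DIMS or len(b) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS}-dimensional vectors")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors: mostly zeros so the arithmetic is easy to follow.
query = [1.0] + [0.0] * 1023
doc_a = [1.0, 1.0] + [0.0] * 1022  # partly aligned with the query
doc_b = [0.0, 1.0] + [0.0] * 1022  # orthogonal to the query

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
```

A higher value means the document points in a direction closer to the query, so `doc_a` would rank above `doc_b` here.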

🌍 Language & Sentiment

Auto-detects the language (50+ languages) and computes a sentiment score (positive, negative, or neutral) for each document.

🔄 Scheduled Re-crawl

Runs on a background schedule. New and updated pages are discovered and re-indexed automatically. Your index stays fresh.

🤖 JS Rendering

Optionally uses a headless Chromium renderer for JavaScript-heavy pages (SPAs, React apps) whose content a plain HTTP fetch such as curl cannot see.

At a Glance — How to Get Started

1. Register at opensolr.com and go to Solr → Add New Index.
2. Set the Crawler filter to YES, pick your region, and name your index with the __dense suffix to enable vector search.
   Need a crawler server in your region? We deploy dedicated web crawler servers wherever you need them. Contact us to get one in your own region.
3. Open your index, click WebCrawler in the sidebar, choose Add URL, and verify ownership.
4. Click Start Crawl Schedule. Your search is then live at search.opensolr.com/YOUR_INDEX_NAME.
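The two naming conventions in these steps (the __dense suffix and the public search URL) can be captured in a couple of small helpers. This is a convenience sketch, not an Opensolr API. The function names and the `my_site` index name are hypothetical, and the `https://` scheme is assumed because the guide gives only the host and path.

```python
from urllib.parse import quote

SEARCH_BASE = "https://search.opensolr.com"  # host from this guide; scheme assumed
DENSE_SUFFIX = "__dense"  # suffix the guide requires for vector search


def dense_index_name(name):
    """Append the __dense suffix unless the name already ends with it."""
    return name if name.endswith(DENSE_SUFFIX) else name + DENSE_SUFFIX


def search_page_url(index_name):
    """Build the hosted search page URL for an index."""
    return f"{SEARCH_BASE}/{quote(index_name, safe='')}"


index = dense_index_name("my_site")  # hypothetical index name
url = search_page_url(index)
```

Checking the suffix before creating the index is worthwhile because, per the steps above, vector search depends on the index name.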