The Opensolr Web Crawler is a fully managed, automated content indexing system built on top of the Opensolr Solr hosting platform. Point it at your website — or any website — and it automatically crawls pages, extracts text (including PDFs, DOCX, and other documents), generates 1024-dimensional BGE-m3 vector embeddings, runs sentiment analysis and language detection, and indexes everything into your Solr index. No code, no pipelines, no infrastructure to manage.
The result is a fully functional hybrid search engine — combining BM25 keyword relevance with dense vector semantic search — accessible at search.opensolr.com/YOUR_INDEX_NAME and embeddable on any website with a single script tag.
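Under the hood, a hybrid request against such an index can be sketched as an ordinary Solr query that combines a BM25 keyword clause with a dense-vector kNN clause. The sketch below is an illustration only, not the documented Opensolr API: the `text` and `embeddings` field names come from the crawler's schema described later on this page, but the query structure and parameter names are assumptions.

```python
# Sketch of a hybrid Solr query: a BM25 (edismax) keyword clause combined
# with a {!knn} dense-vector clause via Solr's {!bool} query parser.
# Field names ("text", "embeddings") follow the crawler schema; everything
# else here is illustrative, not Opensolr's documented request format.
def hybrid_query_params(keywords: str, query_vector: list[float], k: int = 10) -> dict:
    vec = "[" + ",".join(f"{x:.4f}" for x in query_vector) + "]"
    return {
        "q": "{!bool should=$keywordQuery should=$vectorQuery}",
        "keywordQuery": f"{{!edismax qf=text v='{keywords}'}}",
        "vectorQuery": f"{{!knn f=embeddings topK={k}}}{vec}",
        "fl": "uri,title,score",
    }

params = hybrid_query_params("solar panels", [0.12, -0.03, 0.88])
print(params["vectorQuery"])
```

In a real deployment the 1024-dimensional query vector would come from embedding the user's query with the same BGE-m3 model used at index time, so that keyword and semantic scores describe the same request.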
What Happens Automatically
🕷️ Page Discovery
Follows links from your seed URL or sitemap. Respects robots.txt, stays on your domain, handles redirects.
📄 Content Extraction
Extracts clean text from HTML pages, PDFs, DOCX, XLSX, ODT and more. Removes nav, ads, boilerplate.
🔢 Vector Embeddings
Each page is embedded with BGE-m3 (1024 dims, GPU-accelerated) for semantic/intent-based search. Requires __dense index suffix.
🌍 Language & Sentiment
Auto-detects language (50+ langs) and computes sentiment scores (positive/negative/neutral) for each document.
🔄 Scheduled Re-crawl
Runs on a background schedule. New and updated pages are discovered and re-indexed automatically. Your index stays fresh.
🤖 JS Rendering
Optionally uses a headless Chromium renderer for JavaScript-heavy pages (SPAs, React apps) that curl can't see.
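The sentiment scores mentioned above are VADER-style values (see the field table further down: sent_pos, sent_neg, sent_neu, sent_com). A minimal sketch of turning the compound score into a label is shown below; the ±0.05 cutoffs are the VADER authors' conventional thresholds, and how Opensolr itself labels documents is an assumption here.

```python
# Hypothetical sketch: mapping a VADER-style compound score (sent_com)
# to a positive/negative/neutral label. The +/-0.05 thresholds are the
# conventional VADER cutoffs, not a documented Opensolr rule.
def sentiment_label(sent_com: float) -> str:
    if sent_com >= 0.05:
        return "positive"
    if sent_com <= -0.05:
        return "negative"
    return "neutral"

print(sentiment_label(0.72))   # → positive
print(sentiment_label(-0.40))  # → negative
```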
At a Glance — How to Get Started
1. Register at opensolr.com and go to Solr → Add New Index
2. Filter Crawler = YES, pick your region, name your index with the __dense suffix for vector search
Need a crawler server in your region? We deploy dedicated web crawler servers wherever you need them. Contact us to get one in your own region.
3. Open your index → click WebCrawler in the sidebar → Add URL → verify ownership
4. Click Start Crawl Schedule — your search is live at search.opensolr.com/YOUR_INDEX_NAME
Once your Opensolr Web Crawler index is created and you have entered a starting URL, there are several things worth knowing about how the crawler operates and how to get the best results.
🔧 Mandatory Configuration Set (Required)
Before you start crawling, your Solr index must use the Opensolr Web Crawler configuration set. This config set defines all the dynamic fields, field types, and analyzers that the crawler needs to properly index your content — including vector embeddings, sentiment fields, phonetic matching, autocomplete edge ngrams, and all meta_* dynamic fields.
Without this configuration set, crawling will fail or produce incomplete results.
New indexes created from the Opensolr dashboard — The config set is applied automatically when you create a new Web Crawler index. No manual step needed.
Existing indexes — If you have an existing Solr 9.x index and want to enable the Web Crawler on it, upload this config set from the Index Settings panel → Upload Config section. This will overwrite your current schema and solrconfig.xml, so make sure to back up any custom configurations first.
Note: This config set is specifically for Solr 9.x. If you are running an older Solr version, contact support@opensolr.com for the correct version.
📄 Content Types the Crawler Handles
The crawler does not just handle plain HTML pages. It can extract text and metadata from:
HTML pages — The primary content type. Extracts title, description, body text, meta tags, OG tags, author, and more.
PDF documents — Full text extraction from PDF files linked on your site.
DOCX / ODT — Microsoft Word and OpenDocument text files.
XLSX — Excel spreadsheets (extracts cell data as text).
Images — Extracts EXIF metadata, GPS coordinates, and alt text.
If your site links to downloadable documents, the crawler will follow those links and index the document contents just like it indexes HTML pages.
⚡ Renderer Setting: Curl vs Chrome
The crawler offers two rendering modes, selectable from the Renderer dropdown in the Web Crawler settings (or via the renderer API parameter):
Curl (Fast) — Default
Uses a fast HTTP client (curl) to fetch pages. No browser is launched. This is the right choice for the vast majority of websites — server-rendered HTML, static sites, WordPress, Drupal, and any site that returns its content in the initial HTTP response. Pages are fetched in ~0.2 seconds each.
Chrome (JS Rendering)
Every page is fetched with curl first, then rendered through a headless Chromium browser (Playwright). Use this mode for JavaScript-heavy single-page applications built with React, Vue, Angular, Next.js, Gatsby, Nuxt, or similar frameworks — sites where the raw HTML returned by the server is empty or contains only a loading spinner, and the actual content is rendered client-side by JavaScript.
Chrome mode adds ~0.5–1 second per page. The browser automatically blocks ads, analytics, fonts, and social media embeds for faster rendering.
Tips for JS-Heavy Sites
Make sure your site does not block headless browsers in its server configuration
If you use client-side routing, ensure that direct URL access (not just SPA navigation) works for each page
If you are unsure whether your site needs Chrome mode, try Curl first — if the crawled content looks complete, stick with Curl
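The last tip can be automated with a rough heuristic: fetch the raw HTML yourself and check whether any meaningful text survives once tags are stripped. The helper below is a hypothetical sketch, and the 200-character cutoff is an arbitrary illustrative threshold, not anything Opensolr uses internally.

```python
import re

# Heuristic sketch: if the server's raw HTML contains almost no visible
# text, the page is likely client-rendered and a candidate for Chrome
# mode. The 200-character threshold is an arbitrary illustrative cutoff.
def needs_js_rendering(raw_html: str, min_text_chars: int = 200) -> bool:
    no_scripts = re.sub(r"(?s)<(script|style).*?</\1>", " ", raw_html)
    visible = re.sub(r"<[^>]+>", " ", no_scripts)
    visible = re.sub(r"\s+", " ", visible).strip()
    return len(visible) < min_text_chars

spa_shell = "<html><body><div id='root'></div><script>/* app */</script></body></html>"
print(needs_js_rendering(spa_shell))  # → True: only an empty root div
```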
🤖 robots.txt
The crawler fully respects your robots.txt file. Before crawling any page, it checks the robots.txt rules for that domain.
If you want to allow the Opensolr crawler while blocking other bots, you can add user-agent-specific rules to your robots.txt. A few things to keep in mind:
If your robots.txt blocks / entirely, the crawler cannot crawl anything
Some CMS platforms (WordPress, Drupal) have robots.txt rules that may block crawlers from CSS/JS files — this can prevent proper rendering of JS-heavy pages
Check your robots.txt at https://yoursite.com/robots.txt before starting the crawler
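A robots.txt that admits one crawler while turning everything else away can be sketched as below, verified with Python's standard robots.txt parser. Note that "OpensolrBot" is a hypothetical user-agent token used purely for illustration; check with Opensolr support for the crawler's actual User-Agent string before relying on rules like these.

```python
import urllib.robotparser

# Example robots.txt allowing one crawler while blocking all others.
# "OpensolrBot" is a HYPOTHETICAL user-agent token; confirm the real
# User-Agent string with Opensolr support before using rules like this.
robots_txt = """\
User-agent: OpensolrBot
Allow: /

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("OpensolrBot", "https://yoursite.com/page"))   # → True
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/page"))  # → False
```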
URL Validation & Quality Checks
The crawler is strict about URL quality, similar to how Google's crawler operates:
Only valid HTTP/HTTPS URLs are followed
Fragment-only links (#section) are ignored (they point to the same page)
Duplicate URLs are detected and skipped (normalized by removing trailing slashes, sorting query parameters, etc.)
Redirect chains are followed (301, 302) up to a reasonable limit
Broken links (404, 500, etc.) are logged but not indexed
External links (pointing to a different domain) are not followed by default — the crawler stays within the domain(s) you specify
Content deduplication — if two different URLs serve the exact same content (detected via MD5 hash), only one copy is indexed
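The normalization and dedup checks above can be sketched as follows. Opensolr's exact rules are not published; this simply mirrors the listed steps: strip fragments, drop trailing slashes, sort query parameters, and fingerprint page bodies with MD5.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Sketch of URL normalization as described above: lowercase the host,
# drop the fragment and trailing slash, and sort query parameters so
# that equivalent URLs compare equal.
def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))

# Content fingerprint for dedup: MD5 over the extracted body text.
def content_fingerprint(body_text: str) -> str:
    return hashlib.md5(body_text.encode("utf-8")).hexdigest()

a = normalize_url("https://Example.com/docs/?b=2&a=1#intro")
b = normalize_url("https://example.com/docs?a=1&b=2")
print(a == b)  # → True: both normalize to the same URL
```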
🔄 Recrawling & Scheduling
After the initial crawl completes, your pages are indexed and searchable. But websites change — new pages are added, existing pages are updated, some pages are removed.
You can recrawl your site at any time from the Web Crawler section of your Index Control Panel. The crawler will:
Discover and index new pages
Update content on pages that changed since the last crawl
Handle pages that no longer exist
You can also schedule automatic recrawls to keep your index fresh without manual intervention.
Keep Index Fresh — In the Index Settings modal under your Web Crawler panel, the Keep Index Fresh section lets you set an automatic refresh interval (every 7–365 days). When triggered, all previously crawled pages are re-queued for re-crawling. The normal crawl schedule picks them up and re-fetches each page — updating content, prices, and metadata in your search index. Pages that now return 404 or 500 errors are automatically removed from your Solr index, keeping your search results clean and accurate. The minimum interval scales with your index size (e.g. 14 days for 10K+ pages, 30 days for 50K+) to protect server resources.
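The size-scaled minimum interval can be pictured with a small lookup like the one below. Only the two tiers quoted above (14 days at 10K+ pages, 30 days at 50K+) come from this page; the 7-day floor and the tier boundaries as exact cutoffs are assumptions for illustration.

```python
# Illustrative sketch of a size-scaled minimum refresh interval, built
# from the two data points given above (14 days for 10K+ pages, 30 days
# for 50K+). The 7-day floor and exact cutoffs are assumptions.
def min_refresh_interval_days(page_count: int) -> int:
    if page_count >= 50_000:
        return 30
    if page_count >= 10_000:
        return 14
    return 7  # assumed floor: the smallest interval the UI offers

print(min_refresh_interval_days(12_000))  # → 14
```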
Crawl Depth & Coverage
The crawler starts from the URL you provide and follows links recursively. The crawl depth determines how many link-hops away from the starting URL the crawler will go:
Depth 1 — Only the starting page
Depth 2 — The starting page plus all pages linked from it
Depth 3 — Two hops away from the starting page
And so on...
For most websites, the default depth setting covers the entire site. If your site has a deep link structure (many clicks from the homepage to reach some pages), you may need to increase the crawl depth or provide additional starting URLs.
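The hop counting above amounts to a depth-limited breadth-first traversal, sketched below. The `links` dict stands in for the real fetch-and-extract step; function and variable names are illustrative.

```python
from collections import deque

# Depth-limited BFS sketch matching the hop counting above: depth 1 is
# only the start page, depth 2 adds pages one link-hop away, and so on.
# `links` maps each URL to the URLs found on that page.
def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> set[str]:
    seen, queue = {start}, deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # at the depth limit: index, but follow no links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

site = {"/": ["/about", "/blog"], "/blog": ["/blog/post-1"]}
print(sorted(crawl("/", site, max_depth=2)))  # → ['/', '/about', '/blog']
```

With max_depth=3 the same graph would also surface /blog/post-1, two hops from the start page.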
Tips for Better Coverage
Provide a sitemap URL as the starting point if your site has one (https://yoursite.com/sitemap.xml) — the crawler can parse XML sitemaps and discover all listed URLs directly
If some pages are not being discovered, it usually means they are not linked from anywhere that the crawler can reach — add internal links or provide the URL directly
Pages behind login/authentication walls cannot be crawled
What Gets Indexed
For every successfully crawled page, the following data is extracted and indexed:
Data | Source | Index Field
Page URL | The crawled URL | uri, uri_s
Page Title | <title> tag | title, title_s
Meta Description | <meta name="description"> | description
Full Body Text | Visible text content | text
Author | <meta name="author"> | author
OG Image | <meta property="og:image"> | og_image
All Meta Tags | All <meta> tags | meta_* dynamic fields
Content Type | HTTP Content-Type header | content_type
HTTP Status | HTTP response code | content_status
Publish Date | Various date sources | creation_date, timestamp
Content Hash | MD5 of body text | signature, meta_md5
Language | AI language detection | meta_detected_language
Sentiment | VADER sentiment analysis | sent_pos, sent_neg, sent_neu, sent_com
Vector Embedding | AI embedding model (1024d) | embeddings
Prices | JSON-LD, microdata, meta tags | price_f, currency_s
GPS Coordinates | EXIF, structured data | lat, lon, coords
Query Elevation — Pin & Exclude Search Results
Take full control of what your users see. Query Elevation lets you pin important results to the top or exclude irrelevant ones — directly from the Search UI, with zero code and no reindexing required.
Pin — Force a specific result to the top of the list for a given search query
Exclude — Hide a result completely so it never appears for that query
Pin All / Exclude All — Apply the rule globally, across every search query
Drag & drop — Reorder pinned results to control exactly which one shows first
Enable it from your index settings panel, then open the Search UI — every result gets an elevation toolbar. Perfect for promoting landing pages, burying outdated content, or curating high-value queries.