Getting Started with the Opensolr Web Crawler

What is the Opensolr Web Crawler?

The Opensolr Web Crawler is an AI-powered web crawling and indexing engine that automatically crawls your website, extracts content from every page, enriches it with NLP analysis (sentiment, language detection, named entities), generates vector embeddings for semantic search, and stores everything in a high-performance Apache Solr 9.x index — ready to search immediately.

Think of it as your own private Google, but for your website. You point it at your site, it crawls every page, and within minutes you have a fully searchable index with all the bells and whistles: full-text search, autocomplete, spell checking, AI-powered semantic search, and more.


Step 1: Create an Opensolr Account

Head over to opensolr.com and create a free account. No credit card required to get started.


Step 2: Add a New Index on a Web Crawler Server

Once logged in, go to Control Panel → Add New Index. You will see a list of available Solr servers across different regions.

Important: You need to select a Web Crawler server. These are the servers that have the crawling engine built in. You can easily spot them — they are marked with a small spider icon next to the server name.

You can also use the Crawler filter dropdown at the top of the server list to show only Web Crawler-enabled servers. Select "Yes" in that dropdown and you will only see the crawler-capable servers.

Currently available Web Crawler regions:

  • EU-NORTH (Helsinki, Finland) — FINLAND9
  • US-EAST (Chicago) — CHICAGO-96
  • More regions may be added in the future

Pick the region closest to your website audience for the best performance, give your index a name, and click Create.


Step 3: Start Crawling Your Website

Once your index is created, go into the Index Control Panel and click on "Web Crawler" in the left sidebar menu.

Here you can:

  1. Enter your starting URL — this is typically your homepage (e.g., https://yoursite.com)
  2. Click Start to begin the crawling process
  3. Monitor progress in real-time — you will see pages being discovered, crawled, and indexed live

The crawler will automatically:

  • Follow all internal links on your site
  • Extract the page title, meta description, full body text, author info, and OG images
  • Detect the language of each page
  • Run sentiment analysis on the content
  • Generate 1024-dimensional vector embeddings for AI/semantic search
  • Store all extracted metadata (Open Graph tags, Twitter cards, icons, etc.)
  • Deduplicate content using MD5 signatures
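The deduplication step above can be sketched in a few lines. This is an illustration of the general MD5-signature technique, not the crawler's actual implementation (its exact normalization rules are not documented here):

```python
import hashlib

def content_signature(body_text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # differences don't defeat deduplication (an assumption)
    normalized = " ".join(body_text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(body_text: str) -> bool:
    sig = content_signature(body_text)
    if sig in seen:
        return True
    seen.add(sig)
    return False
```

Two pages whose body text differs only in whitespace or letter case would then collapse into a single indexed copy.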

How the Crawler Works

The Opensolr Web Crawler works similarly to how Google and other search engines crawl the web:

  • It starts from the URL you provide and follows every link it finds on your pages
  • It respects your robots.txt file — if you have blocked certain paths in robots.txt, the crawler will honor those rules
  • It checks for valid URLs, proper HTTP status codes, and well-formed HTML
  • It will not index pages that return errors (404, 500, etc.)
  • It handles JavaScript-rendered pages — if your site uses React, Vue, Angular, Next.js, or similar frameworks, the crawler can render pages with a headless browser (Playwright/Chromium) to get the fully rendered content
  • It extracts content from HTML pages, PDFs, DOCX, ODT, XLSX, and more
  • It detects and handles Cloudflare and other anti-bot protections
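The link-following behavior described above (follow internal links, skip fragment-only anchors, stay on the starting domain) can be approximated with Python's standard library. A simplified sketch of the technique, not the crawler itself:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith("#"):
            return  # fragment-only links point to the same page
        url = urljoin(self.base, href)
        if urlparse(url).netloc == urlparse(self.base).netloc:
            self.links.add(url)  # keep only same-domain links

def extract_internal_links(html: str, base_url: str) -> set:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would feed each discovered link back into a queue and repeat until no new internal URLs remain.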

Make Sure Your Site is Crawl-Friendly

Before starting the crawler, make sure that:

  • The crawler is not blocked by your robots.txt or firewall rules
  • Your pages have proper <title> tags and <meta name="description"> tags — these become the most important search fields
  • Pages you want indexed return an HTTP 200 status code
  • Internal links are clean and working (no broken links)

Note that pages behind authentication (login required) cannot be accessed by the crawler.
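You can spot-check a page against this list before crawling. A minimal sketch using only the standard library (the regex checks are approximate; a real HTML parser would be more robust):

```python
import re

def preflight(html: str, status: int) -> list:
    """Return a list of crawlability warnings for one page."""
    problems = []
    if status != 200:
        problems.append(f"non-200 status: {status}")
    # Require a <title> with at least one visible character
    if not re.search(r"<title[^>]*>\s*[^<\s]", html, re.I | re.S):
        problems.append("missing or empty <title>")
    if not re.search(r'<meta[^>]+name=["\']description["\']', html, re.I):
        problems.append('missing <meta name="description">')
    return problems
```

Fetch the page with any HTTP client, then pass the body and status code; an empty list means the basic checks pass.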

What Happens Next?

Once the crawling is complete, your Opensolr Index is fully populated and ready to use. You have two options:

Option A: Use the Built-in Search UI (Embed Code)

Opensolr provides a ready-made, responsive search interface that you can embed on your website with just two lines of HTML. See the next article: Embedding the Opensolr Search UI.

Option B: Build Your Own Custom Search UI

If you want full control over the look and feel, you can query the Solr index directly using the native Solr /select API and build your own frontend. This developer guide covers everything you need. See: Web Crawler Index Field Reference.
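As a starting point for a custom frontend, here is a minimal query against the /select endpoint using only the Python standard library. The endpoint URL is a placeholder (copy the real one from your Opensolr dashboard), and the qf boosts are one reasonable choice, not a prescribed setting; field names follow the Web Crawler Index Field Reference:

```python
import json
import urllib.parse
import urllib.request

# Placeholder — use the index URL shown in your Opensolr dashboard
SOLR_URL = "https://your-node.opensolr.com/solr/your_index/select"

def build_params(query: str, rows: int = 10) -> dict:
    # edismax with boosts: title matches rank above description, then body text
    return {
        "q": query,
        "defType": "edismax",
        "qf": "title^3 description^2 text",
        "fl": "uri,title,description",
        "rows": rows,
        "wt": "json",
    }

def search(query: str) -> list:
    url = SOLR_URL + "?" + urllib.parse.urlencode(build_params(query))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["docs"]
```

Each returned doc is a dict with the fields requested in fl, ready to render in whatever frontend you build.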


Your site's content is smart — your search should be too.

Crawler Configuration & Best Practices

Once your Opensolr Web Crawler index is created and you have entered a starting URL, there are several things worth knowing about how the crawler operates and how to get the best results.


📄 Content Types the Crawler Handles

The crawler does not just handle plain HTML pages. It can extract text and metadata from:

  • HTML pages — The primary content type. Extracts title, description, body text, meta tags, OG tags, author, and more.
  • PDF documents — Full text extraction from PDF files linked on your site.
  • DOCX / ODT — Microsoft Word and OpenDocument text files.
  • XLSX — Excel spreadsheets (extracts cell data as text).
  • Images — Extracts EXIF metadata, GPS coordinates, and alt text.

If your site links to downloadable documents, the crawler will follow those links and index the document contents just like it indexes HTML pages.


⚡ JavaScript-Rendered Pages (SPA / React / Vue / Angular)

Many modern websites are built with JavaScript frameworks like React, Vue, Angular, Next.js, Gatsby, or Nuxt. These sites often render their content dynamically in the browser — the raw HTML that a simple HTTP request returns may be empty or contain only a loading spinner.

The Opensolr Web Crawler handles this automatically:

  1. It first tries a fast, lightweight HTTP fetch (no browser)
  2. If it detects that the page likely needs JavaScript rendering (empty body, SPA framework markers, <noscript> warnings, high script-to-content ratio), it automatically switches to a headless Chromium browser (Playwright)
  3. The browser renders the full page, waits for JavaScript to execute, and then extracts the fully rendered content

This means your React or Angular single-page application will be crawled correctly — no special configuration needed.
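The detection heuristics in step 2 can be illustrated with a simplified check. This is a sketch of the general idea, not Opensolr's actual logic; the 200-character threshold and the SPA marker strings are made-up examples:

```python
import re

# Hypothetical markers that hint at a client-rendered shell page
SPA_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__", "ng-version")

def needs_js_rendering(html: str) -> bool:
    # Strip scripts, then strip all tags to estimate visible text
    no_scripts = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    visible = " ".join(re.sub(r"<[^>]+>", " ", no_scripts).split())
    if len(visible) >= 200:
        return False  # enough server-rendered text; no browser needed
    has_spa_marker = any(m in html for m in SPA_MARKERS)
    has_noscript = "<noscript" in html.lower()
    return has_spa_marker or has_noscript or len(visible) == 0
```

A page that trips this check would be re-fetched with the headless browser; everything else stays on the fast HTTP path.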

Tips for JS-Heavy Sites

  • Make sure your site does not block headless browsers in its server configuration
  • If you use client-side routing, ensure that direct URL access (not just SPA navigation) works for each page
  • The crawler blocks ads, analytics, fonts, and social media embeds automatically for faster rendering

🤖 robots.txt

The crawler fully respects your robots.txt file. Before crawling any page, it checks the robots.txt rules for that domain.

If you want to allow the Opensolr crawler while blocking other bots, you can add specific rules:

User-agent: *
Disallow: /admin/
Disallow: /private/

# Allow Opensolr crawler everywhere
User-agent: OpensolrBot
Allow: /
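You can verify how these rules behave with Python's built-in robots.txt parser before pointing the crawler at your site — here parsing the example rules above directly instead of fetching them:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally: rp.set_url("https://yoursite.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: OpensolrBot
Allow: /
""".splitlines())

rp.can_fetch("OpensolrBot", "https://yoursite.com/admin/page")   # allowed by its own group
rp.can_fetch("SomeOtherBot", "https://yoursite.com/admin/page")  # blocked by the * group
```

Because "OpensolrBot" has its own User-agent group, it never falls through to the stricter wildcard rules.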

Common Pitfalls

  • If your robots.txt blocks / entirely, the crawler cannot crawl anything
  • Some CMS platforms (WordPress, Drupal) have robots.txt rules that may block crawlers from CSS/JS files — this can prevent proper rendering of JS-heavy pages
  • Check your robots.txt at https://yoursite.com/robots.txt before starting the crawler

URL Validation & Quality Checks

The crawler is strict about URL quality, similar to how Google's crawler operates:

  • Only valid HTTP/HTTPS URLs are followed
  • Fragment-only links (#section) are ignored (they point to the same page)
  • Duplicate URLs are detected and skipped (normalized by removing trailing slashes, sorting query parameters, etc.)
  • Redirect chains are followed (301, 302) up to a reasonable limit
  • Broken links (404, 500, etc.) are logged but not indexed
  • External links (pointing to a different domain) are not followed by default — the crawler stays within the domain(s) you specify
  • Content deduplication — if two different URLs serve the exact same content (detected via MD5 hash), only one copy is indexed
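The URL normalization mentioned above (trailing slashes, query-parameter order) can be sketched like this — an illustration of the technique, not the crawler's exact rules:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    parts = urlsplit(url)
    # Drop trailing slashes (but keep the root path itself)
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 compare equal
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Lowercase scheme/host and drop the fragment entirely
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

Two URLs that normalize to the same string are treated as one page, so only one copy is fetched and indexed.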

🔄 Recrawling & Scheduling

After the initial crawl completes, your pages are indexed and searchable. But websites change — new pages are added, existing pages are updated, some pages are removed.

You can recrawl your site at any time from the Web Crawler section of your Index Control Panel. The crawler will:

  • Discover and index new pages
  • Update content on pages that changed since the last crawl
  • Handle pages that no longer exist

You can also schedule automatic recrawls to keep your index fresh without manual intervention.


Crawl Depth & Coverage

The crawler starts from the URL you provide and follows links recursively. The crawl depth determines how many link-hops away from the starting URL the crawler will go:

  • Depth 1 — Only the starting page
  • Depth 2 — The starting page plus all pages linked from it
  • Depth 3 — Two hops away from the starting page
  • And so on...

For most websites, the default depth setting covers the entire site. If your site has a deep link structure (many clicks from the homepage to reach some pages), you may need to increase the crawl depth or provide additional starting URLs.

Tips for Better Coverage

  • Provide a sitemap URL as the starting point if your site has one (https://yoursite.com/sitemap.xml) — the crawler can parse XML sitemaps and discover all listed URLs directly
  • If some pages are not being discovered, it usually means they are not linked from anywhere that the crawler can reach — add internal links or provide the URL directly
  • Pages behind login/authentication walls cannot be crawled
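If you want to see exactly which URLs a sitemap exposes before handing it to the crawler, the standard sitemap XML format is easy to parse. A quick sketch (fetch the file with any HTTP client first):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace from sitemaps.org
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    # Each <url><loc>...</loc></url> entry lists one crawlable page
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS)]
```

Any page listed here but missing from your index is worth investigating (blocked by robots.txt, returning a non-200 status, etc.).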

What Gets Indexed

For every successfully crawled page, the following data is extracted and indexed:

  Data              | Source                         | Index Field
  ------------------|--------------------------------|----------------------------------------
  Page URL          | The crawled URL                | uri, uri_s
  Page Title        | <title> tag                    | title, title_s
  Meta Description  | <meta name="description">      | description
  Full Body Text    | Visible text content           | text
  Author            | <meta name="author">           | author
  OG Image          | <meta property="og:image">     | og_image
  All Meta Tags     | All <meta> tags                | meta_* dynamic fields
  Content Type      | HTTP Content-Type header       | content_type
  HTTP Status       | HTTP response code             | content_status
  Publish Date      | Various date sources           | creation_date, timestamp
  Content Hash      | MD5 of body text               | signature, meta_md5
  Language          | AI language detection          | meta_detected_language
  Sentiment         | VADER sentiment analysis       | sent_pos, sent_neg, sent_neu, sent_com
  Vector Embedding  | AI embedding model (1024d)     | embeddings
  Prices            | JSON-LD, microdata, meta tags  | price_f, currency_s
  GPS Coordinates   | EXIF, structured data          | lat, lon, coords
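These fields can be combined in ordinary Solr filter queries. For example, the parameters below restrict results to positive-sentiment English pages, newest first (field names are from the table above; VADER's compound score ranges from -1 to 1, and 0.5 is an arbitrary threshold chosen for illustration):

```python
params = {
    "q": "open source search",
    "fq": [
        "meta_detected_language:en",   # AI-detected page language
        "sent_com:[0.5 TO *]",         # VADER compound sentiment score
    ],
    "sort": "timestamp desc",          # newest pages first
    "fl": "uri,title,description,sent_com",
}
```

Pass these to the /select endpoint of your index with any HTTP client.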