Crawler Configuration & Best Practices
Once your Opensolr Web Crawler index is created and you have entered a starting URL, there are several things worth knowing about how the crawler operates and how to get the best results.
Content Types the Crawler Handles
The crawler does not just handle plain HTML pages. It can extract text and metadata from:
- HTML pages – The primary content type. Extracts title, description, body text, meta tags, OG tags, author, and more.
- PDF documents – Full text extraction from PDF files linked on your site.
- DOCX / ODT – Microsoft Word and OpenDocument text files.
- XLSX – Excel spreadsheets (extracts cell data as text).
- Images – Extracts EXIF metadata, GPS coordinates, and alt text.
If your site links to downloadable documents, the crawler will follow those links and index the document contents just like it indexes HTML pages.
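One common way to route a fetched resource to the right parser is to branch on its Content-Type header. The sketch below is illustrative only (the extractor names and the fallback choice are assumptions, not the crawler's actual internals):

```python
# Sketch: map a raw Content-Type header to a text extractor.
# Extractor names here are hypothetical, for illustration only.
EXTRACTORS = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.oasis.opendocument.text": "odt",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "image/jpeg": "image",
    "image/png": "image",
}

def pick_extractor(content_type: str) -> str:
    """Strip any charset suffix and look up the extractor for the MIME type."""
    mime = content_type.split(";")[0].strip().lower()
    return EXTRACTORS.get(mime, "html")  # assume HTML handling as a fallback

print(pick_extractor("application/pdf"))           # pdf
print(pick_extractor("text/html; charset=utf-8"))  # html
```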
JavaScript-Rendered Pages (SPA / React / Vue / Angular)
Many modern websites are built with JavaScript frameworks like React, Vue, Angular, Next.js, Gatsby, or Nuxt. These sites often render their content dynamically in the browser, so the raw HTML that a simple HTTP request returns may be empty or contain only a loading spinner.
The Opensolr Web Crawler handles this automatically:
- It first tries a fast, lightweight HTTP fetch (no browser)
- If it detects that the page likely needs JavaScript rendering (empty body, SPA framework markers, `<noscript>` warnings, high script-to-content ratio), it automatically switches to a headless Chromium browser (Playwright)
- The browser renders the full page, waits for JavaScript to execute, and then extracts the fully rendered content
This means your React or Angular single-page application will be crawled correctly – no special configuration needed.
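To make the detection step concrete, here is a minimal sketch of such a heuristic. The exact markers and thresholds are assumptions for illustration, not the crawler's real logic:

```python
import re

def looks_like_spa(html: str) -> bool:
    """Heuristic guess (illustrative, not the crawler's actual rules) that a
    page needs JavaScript rendering: an empty SPA mount point, a <noscript>
    warning, or many scripts with almost no visible text."""
    # Empty root containers used by React/Vue/Next.js mount points
    if re.search(r'id=["\'](root|app|__next)["\']\s*>\s*<', html):
        return True
    if "<noscript>" in html.lower():
        return True
    scripts = len(re.findall(r"<script\b", html, re.I))
    visible = len(re.sub(r"<[^>]+>", "", html).strip())
    return scripts > 5 and visible < 200  # high script-to-content ratio

print(looks_like_spa('<div id="root"></div><script src="/bundle.js"></script>'))  # True
```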
Tips for JS-Heavy Sites
- Make sure your site does not block headless browsers in its server configuration
- If you use client-side routing, ensure that direct URL access (not just SPA navigation) works for each page
- The crawler blocks ads, analytics, fonts, and social media embeds automatically for faster rendering
robots.txt
The crawler fully respects your robots.txt file. Before crawling any page, it checks the robots.txt rules for that domain.
If you want to allow the Opensolr crawler while blocking other bots, you can add specific rules:
```
User-agent: *
Disallow: /admin/
Disallow: /private/

# Allow Opensolr crawler everywhere
User-agent: OpensolrBot
Allow: /
```
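You can verify how rules like these will be interpreted using Python's built-in robots.txt parser, which follows the same per-user-agent group matching:

```python
from urllib.robotparser import RobotFileParser

# The same rules as above: block /admin/ and /private/ for everyone,
# but allow the OpensolrBot user-agent everywhere.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: OpensolrBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("OpensolrBot", "https://yoursite.com/admin/"))   # True
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/admin/"))  # False
```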
Common Pitfalls
- If your robots.txt blocks `/` entirely, the crawler cannot crawl anything
- Some CMS platforms (WordPress, Drupal) have robots.txt rules that may block crawlers from CSS/JS files – this can prevent proper rendering of JS-heavy pages
- Check your robots.txt at `https://yoursite.com/robots.txt` before starting the crawler
URL Validation & Quality Checks
The crawler is strict about URL quality, similar to how Google's crawler operates:
- Only valid HTTP/HTTPS URLs are followed
- Fragment-only links (`#section`) are ignored (they point to the same page)
- Duplicate URLs are detected and skipped (normalized by removing trailing slashes, sorting query parameters, etc.)
- Redirect chains are followed (301, 302) up to a reasonable limit
- Broken links (404, 500, etc.) are logged but not indexed
- External links (pointing to a different domain) are not followed by default – the crawler stays within the domain(s) you specify
- Content deduplication – if two different URLs serve the exact same content (detected via MD5 hash), only one copy is indexed
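The normalization and deduplication steps described above can be sketched with Python's standard library. This is a simplified version under stated assumptions (the crawler's exact normalization rules may differ):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Sketch of URL normalization: drop the fragment, strip trailing
    slashes, lowercase the host, and sort query parameters."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))

def content_signature(body_text: str) -> str:
    """MD5 hash of the body text, used for content deduplication."""
    return hashlib.md5(body_text.encode("utf-8")).hexdigest()

a = normalize_url("https://Example.com/docs/?b=2&a=1")
b = normalize_url("https://example.com/docs?a=1&b=2#intro")
print(a == b)  # True: both normalize to the same URL
```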
Recrawling & Scheduling
After the initial crawl completes, your pages are indexed and searchable. But websites change – new pages are added, existing pages are updated, some pages are removed.
You can recrawl your site at any time from the Web Crawler section of your Index Control Panel. The crawler will:
- Discover and index new pages
- Update content on pages that changed since the last crawl
- Handle pages that no longer exist
You can also schedule automatic recrawls to keep your index fresh without manual intervention.
Crawl Depth & Coverage
The crawler starts from the URL you provide and follows links recursively. The crawl depth determines how many link-hops away from the starting URL the crawler will go:
- Depth 1 – Only the starting page
- Depth 2 – The starting page plus all pages linked from it
- Depth 3 – Two hops away from the starting page
- And so on...
For most websites, the default depth setting covers the entire site. If your site has a deep link structure (many clicks from the homepage to reach some pages), you may need to increase the crawl depth or provide additional starting URLs.
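The depth semantics above amount to a depth-limited breadth-first traversal of the link graph. Here is a minimal sketch over a toy link map (the real crawler also fetches, normalizes, and filters URLs):

```python
from collections import deque

def crawl_bfs(start: str, links: dict, max_depth: int) -> set:
    """Depth-limited BFS over a link graph. `links` maps each URL to the
    URLs it links to. Depth 1 visits only the starting page, matching
    the depth semantics described above."""
    seen = {start}
    queue = deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # at the depth limit: do not follow this page's links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

site = {"/": ["/a", "/b"], "/a": ["/deep"], "/deep": ["/deeper"]}
print(sorted(crawl_bfs("/", site, 2)))  # ['/', '/a', '/b']
print(sorted(crawl_bfs("/", site, 3)))  # ['/', '/a', '/b', '/deep']
```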
Tips for Better Coverage
- Provide a sitemap URL as the starting point if your site has one (`https://yoursite.com/sitemap.xml`) – the crawler can parse XML sitemaps and discover all listed URLs directly
- If some pages are not being discovered, it usually means they are not linked from anywhere that the crawler can reach – add internal links or provide the URL directly
- Pages behind login/authentication walls cannot be crawled
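Parsing a sitemap is straightforward: it is an XML file of `<url><loc>` entries in the sitemaps.org namespace. A minimal extraction sketch with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A tiny example sitemap in the standard sitemaps.org format.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc></url>
  <url><loc>https://yoursite.com/docs/</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
# Collect every <loc> inside every <url> entry
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['https://yoursite.com/', 'https://yoursite.com/docs/']
```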
What Gets Indexed
For every successfully crawled page, the following data is extracted and indexed:
| Data | Source | Index Field |
|---|---|---|
| Page URL | The crawled URL | `uri`, `uri_s` |
| Page Title | `<title>` tag | `title`, `title_s` |
| Meta Description | `<meta name="description">` | `description` |
| Full Body Text | Visible text content | `text` |
| Author | `<meta name="author">` | `author` |
| OG Image | `<meta property="og:image">` | `og_image` |
| All Meta Tags | All `<meta>` tags | `meta_*` dynamic fields |
| Content Type | HTTP Content-Type header | `content_type` |
| HTTP Status | HTTP response code | `content_status` |
| Publish Date | Various date sources | `creation_date`, `timestamp` |
| Content Hash | MD5 of body text | `signature`, `meta_md5` |
| Language | AI language detection | `meta_detected_language` |
| Sentiment | VADER sentiment analysis | `sent_pos`, `sent_neg`, `sent_neu`, `sent_com` |
| Vector Embedding | AI embedding model (1024d) | `embeddings` |
| Prices | JSON-LD, microdata, meta tags | `price_f`, `currency_s` |
| GPS Coordinates | EXIF, structured data | `lat`, `lon`, `coords` |
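To show how these fields fit together, here is what a single crawled page might look like as an indexed document. All values are made up for illustration; only the field names come from the table above:

```python
# Illustrative indexed document using the field names from the table.
# Every value here is invented; a subset of fields is shown.
doc = {
    "uri": "https://yoursite.com/pricing",
    "title": "Pricing",
    "description": "Plans and pricing for our product.",
    "text": "Choose the plan that fits your team...",
    "content_type": "text/html",
    "content_status": 200,
    "meta_detected_language": "en",
    "signature": "9e107d9d372bb6826bd81d3542a419d6",  # MD5 of body text
    "sent_pos": 0.42, "sent_neg": 0.0, "sent_neu": 0.58, "sent_com": 0.66,
    "price_f": 29.0,
    "currency_s": "USD",
}
print(doc["uri"], doc["content_status"])
```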