Data Crawler – Web Crawling & Scheduling

Configure and manage the Web Crawler

Data Crawler

The Opensolr Web Crawler fetches your pages automatically from your XML sitemap, extracts structured content (text, titles, meta tags, JSON-LD, images), and indexes everything into your Opensolr index. No code and no API calls from your server: the crawler does all the work.

Register Your Sitemap

The plugin auto-generates an XML sitemap at /opensolr-sitemap.xml on your site. This sitemap includes all published content from the content types you selected. On the Data Crawler tab, register this URL so the crawler knows where to find your content.

The sitemap URL is pre-filled for you. Click Register Sitemap to send it to the Opensolr crawler. This only needs to be done once; the crawler remembers your sitemap URL for all future crawls.
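
The sitemap is standard XML, so a quick way to sanity-check what the crawler will see is to list its <loc> entries. A minimal Python sketch (the sample document and URLs below are illustrative, not output of the plugin):

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap; the plugin serves the real one at /opensolr-sitemap.xml.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/hello-world/</loc></url>
  <url><loc>https://example.com/sample-page/</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_urls(SITEMAP_XML))
```

Every URL printed here is one the crawler will attempt to fetch; anything missing from the sitemap is never visited.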

Starting a Crawl

Three buttons control your crawl operations:

Reindex Everything

Starts a fresh crawl of all URLs in your sitemap. Existing documents in the index are kept; new and updated pages are added or overwritten. Use this for regular re-crawls.

Resume

Resumes a stopped or interrupted crawl from where it left off. URLs already crawled are skipped. Useful if you paused a large crawl or it timed out.

Reindex From Scratch

Completely resets your index and starts over. All existing documents are deleted first, then the crawler re-fetches everything from scratch. Use this when you have made major structural changes to your site.

Reindex From Scratch deletes your index

"Reindex From Scratch" wipes your entire search index before re-crawling. Your site's search will return no results until the crawl completes. Use "Reindex Everything" instead for routine re-crawls: it adds and updates documents without deleting anything.

Crawl Status Modal

After starting a crawl (or clicking Check Crawl Status), a live status dialog opens showing real-time progress:

Progress Bar

Visual percentage bar showing crawl completion. Shows "Crawling..." (blue) when active or "Idle" (green) when finished.

Stats Grid

Two-column layout with: pages crawled, pages in the Solr buffer, remaining in queue, errors, oversize pages, noindex pages, MB downloaded, dynamic pages, pages blocked by scope, and redirects outside your domain.

Auto-Polling

Refreshes automatically at a configurable interval: 5 seconds (default), 10s, 30s, 60s, or off. Your preference is remembered across sessions.
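
The polling behaviour amounts to a simple loop. A hedged Python sketch (the status strings and interval values mirror the UI, but `fetch_status` is a stand-in for the plugin's actual status request):

```python
import time

# Interval choices as shown in the dialog; "off" disables auto-polling.
POLL_INTERVALS = {"5s": 5, "10s": 10, "30s": 30, "60s": 60, "off": None}

def poll_until_idle(fetch_status, interval_key="5s", sleep=time.sleep):
    """Poll a status callable until the crawl reports 'Idle'.

    With the interval set to 'off', the status is checked exactly once.
    Returns the number of polls performed.
    """
    interval = POLL_INTERVALS[interval_key]
    polls = 0
    while True:
        polls += 1
        if fetch_status() == "Idle" or interval is None:
            return polls
        sleep(interval)

# Stub: the crawl reports "Crawling..." once, then "Idle".
statuses = iter(["Crawling...", "Idle"])
print(poll_until_idle(statuses.__next__, "5s", sleep=lambda s: None))  # 2
```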

Collapsible Details

Expandable sections for error URLs, noindex pages, pages blocked by scope, and redirects that went outside your domain. Click to view the full list of affected URLs.

Crawler Settings

Fine-tune how the crawler operates on the Crawler Settings panel (collapsible at the top of the Data Crawler tab):

  • Max Threads (1-10): how many pages the crawler fetches simultaneously. Higher values crawl faster but put more load on your server. Default: 3.
  • Relax Time (0-10 seconds): pause between requests to be gentle on your server. Use 1-2 seconds for shared hosting, 0 for powerful servers.
  • Follow Documents: when enabled, the crawler also follows and indexes linked PDF, DOCX, and other document files found on your pages.
  • Renderer: controls how pages are fetched:
    • Auto (recommended): tries fast HTTP first, falls back to a headless browser for JavaScript-heavy pages
    • Curl: fast HTTP fetching only; works for most WordPress sites
    • Browser: always uses a headless Chromium browser; required for heavily JavaScript-rendered content
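
To see how Max Threads and Relax Time interact, here is a hedged Python sketch of a polite concurrent fetch loop. This illustrates the two settings only; it is not the crawler's actual implementation, and `fetch` is a placeholder for a real HTTP request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, max_threads=3, relax_time=0.0):
    """Fetch pages with at most `max_threads` concurrent workers,
    pausing `relax_time` seconds between submissions to spare the server."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            if relax_time:
                time.sleep(relax_time)
        # Collect each page's result once all fetches complete.
        return {url: fut.result() for url, fut in futures.items()}

# Stub fetch: pretend each page's "content" is its URL upper-cased.
pages = crawl(["https://example.com/a", "https://example.com/b"],
              fetch=str.upper, max_threads=3, relax_time=0)
print(pages["https://example.com/a"])  # HTTPS://EXAMPLE.COM/A
```

Raising `max_threads` increases parallelism; raising `relax_time` spaces requests out, which is why 1-2 seconds is suggested for shared hosting.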

Crawl mode is Shallow Host

The crawler always operates in Shallow Host mode (Mode 5): it only crawls URLs listed in your sitemap, staying strictly on your domain. It will never follow links to external sites or crawl pages you have not included in your sitemap. This is hardcoded for safety and cannot be changed.
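
Conceptually, the Shallow Host rule is a two-part scope check: same host and listed in the sitemap. A minimal Python illustration of that rule (the function name and example URLs are hypothetical):

```python
from urllib.parse import urlparse

def in_scope(url, site_host, sitemap_urls):
    """Shallow Host sketch: a URL is crawled only if it is on your
    own host AND explicitly listed in the sitemap."""
    return urlparse(url).netloc == site_host and url in set(sitemap_urls)

allowed = {"https://example.com/post/"}
print(in_scope("https://example.com/post/", "example.com", allowed))   # True
print(in_scope("https://external.org/page/", "example.com", allowed))  # False
```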

Scheduled Crawls

Set up automatic re-crawls to keep your index fresh without manual intervention:

  • Frequency: choose Daily, Weekly, or Monthly
  • Keep Fresh Days: the number of days before a page is considered stale and re-crawled. Pages crawled within this window are skipped to save bandwidth.

Scheduled crawls run as "Reindex Everything": they add and update documents without deleting anything from your index.
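
The Keep Fresh Days logic reduces to a date comparison. A hedged Python sketch of that check (the function name and dates are illustrative):

```python
from datetime import datetime, timedelta

def needs_recrawl(last_crawled, now, keep_fresh_days):
    """Keep Fresh Days sketch: a page is re-fetched only when its last
    crawl falls outside the freshness window; otherwise it is skipped."""
    return now - last_crawled > timedelta(days=keep_fresh_days)

now = datetime(2024, 6, 15)
print(needs_recrawl(datetime(2024, 6, 1), now, keep_fresh_days=7))   # True  (stale)
print(needs_recrawl(datetime(2024, 6, 12), now, keep_fresh_days=7))  # False (still fresh)
```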

Stopping a Crawl

Click Stop at any time to halt a running crawl. Already-indexed pages stay in your index; nothing is lost. You can resume later from where it stopped.

What the Crawler Extracts

The Opensolr crawler extracts structured data from every page it visits:

Text & Metadata

Page title, meta description, full body text, Open Graph tags, Twitter cards, canonical URL, author, publication date

Structured Data

JSON-LD (Article, Product, BreadcrumbList), Microdata, and RDFa: prices, categories, ratings, and taxonomy hierarchies
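
JSON-LD lives in <script type="application/ld+json"> tags in your page markup, which is how a crawler can pick it up. A minimal Python sketch of that extraction, using only the standard library (the sample HTML is illustrative and this is not the Opensolr crawler's code):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True

    def handle_data(self, data):
        if self._in_ld and data.strip():
            self.blocks.append(json.loads(data))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False

page = '<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>'
parser = JsonLdExtractor()
parser.feed(page)
print(parser.blocks[0]["name"])  # Widget
```

Plugins and themes that emit valid JSON-LD (most SEO plugins do) therefore feed structured fields like prices and ratings straight into your search index.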

Images

Featured images, Open Graph images, and product gallery images, stored with full URLs for search result thumbnails

AI Enrichment

On vector-enabled plans: BGE-m3 embeddings (1024 dimensions), sentiment analysis, and automatic language detection, all computed server-side during the crawl
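
Vector search compares such embeddings by similarity, with cosine similarity being the usual measure. A toy Python illustration using 3-dimensional vectors rather than real 1024-dimensional BGE-m3 output (this shows the general technique, not Opensolr's internal scoring):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```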

Crawl + Ingest = best of both worlds

Use the crawler for comprehensive, scheduled indexing and add Data Ingestion for instant updates when you publish or edit a post. Both methods produce identical documents: no duplicates, just faster freshness.