Opensolr Changelog — Recent Updates & Improvements

Web Crawler Apr 4, 2026

Improved Better Web Crawler date extraction for sites that lack JSON-LD or meta tags. New targeted extraction looks for dates inside HTML elements with date-related CSS classes (e.g. .date, .posted, .info), which is much safer than scanning all page text.
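
As a rough illustration of the class-targeted approach described above, the sketch below collects text only from elements whose class attribute contains one of the names mentioned in this entry, then searches that text for a date. The parser class, the function name, and the two supported date formats are assumptions made for the example, not the crawler's actual implementation.

```python
import re
from datetime import date, datetime
from html.parser import HTMLParser
from typing import Optional

# Class names taken from the changelog entry; the crawler's real list is unknown.
DATE_CLASSES = {"date", "posted", "info"}

# Hypothetical supported formats: ISO dates and "Apr 4, 2026" style dates.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b"), "%b %d, %Y"),
]

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DateClassParser(HTMLParser):
    """Collects text found inside elements carrying a date-related class."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # > 0 while inside a date-classed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self._depth or classes & DATE_CLASSES:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_date(html: str) -> Optional[date]:
    """Return the first date found inside a date-classed element, else None."""
    parser = DateClassParser()
    parser.feed(html)
    text = " ".join(parser.chunks)
    for pattern, fmt in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            try:
                return datetime.strptime(match.group(0), fmt).date()
            except ValueError:
                continue  # looked like a date but was not valid
    return None
```

Because text outside the targeted elements is never collected, a date-like string elsewhere on the page (a version number, a product code) cannot be mistaken for the publication date.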

Improved Web Crawler now respects the package's vector access flag — indexes without AI features skip embedding generation during crawls, saving GPU resources and speeding up indexing.

Improved Core-wide thread limit enforcement: the max_threads setting now controls the total number of concurrent crawler processes across all start URLs for an index, not per URL. Setting threads to 1 means exactly 1 process at a time.

Improved Smarter description extraction: the crawler no longer picks up CSS, JavaScript, or theme builder garbage as page descriptions. Description priority: meta tags → JSON-LD structured data → first two sentences of extracted text.

Improved Crawler settings changes now take effect immediately. When you save new thread count, crawl mode, renderer, or pause settings, the active crawl schedule is automatically updated, with no need to stop and restart.
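
The description priority listed above (meta tags, then JSON-LD, then the first two sentences of extracted text) can be sketched as a simple fallback chain. The function names and the naive sentence splitter are assumptions made for illustration; the crawler's own logic is not published here.

```python
import re
from typing import Optional


def first_two_sentences(text: str) -> str:
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])


def pick_description(
    meta_description: Optional[str],
    json_ld_description: Optional[str],
    extracted_text: str,
) -> str:
    """Return the first non-empty candidate in priority order."""
    for candidate in (meta_description, json_ld_description):
        if candidate and candidate.strip():
            return candidate.strip()
    # Last resort: derive a description from the extracted page text.
    return first_two_sentences(extracted_text)
```

Treating whitespace-only values as missing matters in practice, since many templates emit an empty description meta tag.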

Improved Web Crawler now automatically removes documents from the search index when their pages return non-200 status codes (404, 500, etc.) during crawling. Previously, dead pages could remain in search results indefinitely.
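
A minimal sketch of the cleanup rule, assuming the indexed document's id is the page URL and that the status code is the final one after any redirects (both are assumptions, as is the function name). It only builds the delete command in Solr's JSON update format; sending it to the index is left out.

```python
import json
from typing import Optional


def delete_payload_for(url: str, status_code: int) -> Optional[str]:
    """Return a Solr JSON delete command for a dead page, or None if the page is fine."""
    if status_code == 200:
        return None
    # Assumption: the Solr document id is the page URL.
    return json.dumps({"delete": {"id": url}})
```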

Improved Faster Playwright rendering in Chrome mode. Pages now complete in ~0.5–1s instead of 2–25s. The old approach waited for all network activity to stop (analytics, trackers, ad pixels), which stalled on busy pages. Now it waits for the DOM, gives JS 500ms to hydrate, and grabs the content.
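
The wait strategy described above might look roughly like this with Playwright's Python sync API. The function name and the constant are hypothetical; only the 500ms delay and the "wait for the DOM, not the network" idea come from this entry.

```python
HYDRATION_DELAY_MS = 500  # from the entry above: "gives JS 500ms to hydrate"


def fetch_rendered_html(url: str, hydration_delay_ms: int = HYDRATION_DELAY_MS) -> str:
    """Render a page in Chromium and return its HTML without waiting for network idle."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # "domcontentloaded" returns once the DOM is parsed, unlike
            # "networkidle", which waits for all network traffic to go quiet.
            page.goto(url, wait_until="domcontentloaded")
            # Fixed pause so client-side frameworks can hydrate.
            page.wait_for_timeout(hydration_delay_ms)
            return page.content()
        finally:
            browser.close()
```

The trade-off is that content loaded later than the fixed delay is missed, which is the price of not stalling on analytics and ad requests that never settle.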

Improved Solr batch indexing is now more reliable during crawls. When a batch insert to Solr fails (e.g. temporary overload or timeout), the documents are kept in the local buffer and retried on the next flush cycle, instead of being silently lost.
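
The retry behaviour can be sketched as a buffer that is cleared only after a successful send. The class below is a hypothetical illustration: `send_batch` stands in for whatever call posts documents to Solr, and the batch size is arbitrary.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


class BufferedIndexer:
    """Buffers documents and drops them only after a successful batch send."""

    def __init__(self, send_batch: Callable[[List[Doc]], None], batch_size: int = 100):
        self._send_batch = send_batch  # hypothetical sender; raises on failure
        self._batch_size = batch_size
        self._buffer: List[Doc] = []

    @property
    def pending(self) -> int:
        return len(self._buffer)

    def add(self, doc: Doc) -> None:
        self._buffer.append(doc)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to send everything buffered; on failure keep it for the next flush."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        try:
            self._send_batch(batch)
        except Exception:
            return False  # documents stay buffered and are retried later
        del self._buffer[: len(batch)]
        return True
```

A production version would likely cap the buffer size or the retry count so that a long outage cannot grow memory without bound.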

Improved Smarter Resume for the Web Crawler. Clicking Resume now always launches the crawler, even when the queue appears empty. Previously, the UI would refuse to resume if there were no pages left in the queue — but that is exactly the scenario where Resume needs to work, because the crawler re-discovers new content by re-reading your sitemaps. No more misleading "nothing to resume" messages.

Improved Web Crawler indexing is now faster: crawled pages are sent to Solr in larger batches instead of one at a time, reducing round-trip overhead and significantly speeding up the overall indexing process.

Improved Clicking Resume when the crawler queue is empty now shows a clear message explaining there are no more pages to process, instead of silently doing nothing. It suggests stopping the cron schedule and starting a fresh crawl.

Improved The crawler status badge now distinguishes between Running (green), Paused (blue), and Stopped (red). When the cron schedule is active but no crawler processes are running, the dashboard shows Paused instead of Running, so you always know the actual state of your crawl.
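
The badge logic can be expressed as a small decision function. The Paused case follows the entry above; treating any running process as Running, and the absence of both schedule and processes as Stopped, are assumptions, as are all names in the sketch.

```python
from enum import Enum


class CrawlStatus(Enum):
    RUNNING = "green"
    PAUSED = "blue"
    STOPPED = "red"


def crawl_status(cron_active: bool, running_processes: int) -> CrawlStatus:
    """Derive the badge from the cron schedule state and the live process count."""
    if running_processes > 0:
        return CrawlStatus.RUNNING  # assumption: any live process means Running
    if cron_active:
        return CrawlStatus.PAUSED  # schedule active, but nothing is crawling
    return CrawlStatus.STOPPED  # assumption: no schedule and no processes
```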

Improved Smarter content extraction — the web crawler now uses a dual-extraction strategy that runs two independent text extraction engines and picks whichever captures more real content. Pages with heavy JavaScript, complex layouts, or framework-rendered content (React, Next.js, Angular, Vue) are now detected and rendered automatically. The result: richer, more complete text in your Opensolr Index, especially for modern web applications.
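
One way to picture the dual-extraction idea is to run each extractor and keep the result that scores highest on some content measure. The scoring function here (a count of words with three or more characters) is a stand-in chosen for the example; how the crawler actually judges "real content" is not described in this entry.

```python
import re
from typing import Callable, Iterable


def content_score(text: str) -> int:
    """Hypothetical metric: number of words with at least three characters."""
    return len(re.findall(r"\w{3,}", text))


def best_extraction(html: str, extractors: Iterable[Callable[[str], str]]) -> str:
    """Run every extractor on the page and return the highest-scoring text."""
    results = [extract(html) for extract in extractors]
    return max(results, key=content_score, default="")
```

For example, calling `best_extraction(html, [engine_a, engine_b])` with two extractor functions of your own returns whichever output contains more countable words.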