Opensolr Changelog — Recent Updates & Improvements

Web Crawler Mar 28, 2026

Improved Core-wide thread limit enforcement — the max_threads setting now controls the total number of concurrent crawler processes across all start URLs for an index, not per URL. Setting threads to 1 means exactly 1 process at a time.
Fix Fixed a bug where the Solr buffer could get permanently stuck after stopping a crawl mid-run. Documents with oversized embedding payloads no longer block the entire batch — payloads are automatically capped, and each batch flush succeeds independently.
Improved Smarter description extraction — the crawler no longer picks up CSS, JavaScript, or theme builder garbage as page descriptions. Description priority: meta tags → JSON-LD structured data → first two sentences of extracted text.
Improved Crawler settings changes now take effect immediately. When you save new thread count, crawl mode, renderer, or pause settings, the active crawl schedule is automatically updated — no need to stop and restart.
New Stop Crawl button in the Web Crawler panel. It immediately stops all running crawler processes without removing the crawl schedule, and the schedule resumes automatically on the next cycle.
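The description priority listed above (meta tags, then JSON-LD, then the first two sentences of extracted text) can be sketched roughly as follows. This is an illustration only, not the crawler's actual code: the names `pick_description`, `_HeadParser`, and `_jsonld_description` are hypothetical, and the sentence splitter is deliberately naive.

```python
import json
import re
from html.parser import HTMLParser


class _HeadParser(HTMLParser):
    """Collects the first meta description and any JSON-LD blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.jsonld_blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key in ("description", "og:description") and content and not self.meta_description:
                self.meta_description = content
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self.jsonld_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Only text inside a JSON-LD script is kept; CSS and other JS are ignored.
        if self._in_jsonld:
            self.jsonld_blocks[-1] += data


def _jsonld_description(blocks):
    """Return the first non-empty top-level "description" found in the JSON-LD blocks."""
    for raw in blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                desc = item.get("description")
                if isinstance(desc, str) and desc.strip():
                    return desc.strip()
    return None


def pick_description(html, extracted_text):
    """Priority: meta tag, then JSON-LD, then first two sentences of the text."""
    parser = _HeadParser()
    parser.feed(html)
    parser.close()
    if parser.meta_description:
        return parser.meta_description
    desc = _jsonld_description(parser.jsonld_blocks)
    if desc:
        return desc
    sentences = re.split(r"(?<=[.!?])\s+", extracted_text.strip())
    return " ".join(sentences[:2])
```

Because the fallback works on already-extracted text rather than raw markup, inline CSS or script content never becomes a candidate description in this sketch.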

New Reindex All button in the Web Crawler panel. One click moves all previously crawled URLs back into the crawl queue for a full re-crawl — useful after schema changes, config updates, or when you need to refresh your entire index on demand.

Improved Web Crawler now automatically removes documents from the search index when their pages return non-200 status codes (404, 500, etc.) during crawling. Previously, dead pages could remain in search results indefinitely.
New Keep Index Fresh for the Web Crawler. Automatically re-crawl all previously indexed pages on a configurable schedule (every 7–365 days) to refresh content and prices and to detect broken links. Pages that return 404 or 500 are automatically removed from your search index. Available in the Crawler Settings panel under Index Settings.
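As a rough illustration of the dead-page removal described above, the sketch below turns a list of crawl results into a Solr JSON delete-by-ID request body. It assumes, purely for the example, that each document's ID is its URL and that the recorded status is the final one after any redirects; `build_delete_payload` is a hypothetical name, not part of the Opensolr API.

```python
import json


def build_delete_payload(crawl_results):
    """Build a Solr JSON delete body for every page whose final status is not 200.

    crawl_results: iterable of (doc_id, http_status) pairs.
    Returns a JSON string, or None when there is nothing to delete.
    """
    dead_ids = [doc_id for doc_id, status in crawl_results if status != 200]
    if not dead_ids:
        return None
    # Solr's JSON update format accepts a list of IDs under the "delete" key.
    return json.dumps({"delete": dead_ids})
```

The resulting string would be sent to the index's update handler, followed by a commit, for the deletions to show up in search results.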

New Bulk query deletion in Query Analytics — select multiple queries with checkboxes and delete them in one click. Available on the Queries, No Results, and Click Analytics tabs. Useful for cleaning out junk, test queries, or inappropriate search terms from your analytics history.
New Click Analytics with CTR tracking — see which search results users actually click. Three views in the new Click Analytics tab: Top Clicked documents, By Query with click-through rates, and Low CTR to find queries where users search but never click. All click data is IP-deduplicated and rate-limited to prevent bot noise.
New No-Results Dashboard — a new tab in Query Analytics that tracks every search returning zero results. Each zero-result query is counted by unique IP (not raw page views), so the numbers reflect real users, not refreshes. Use it to spot content gaps, missing synonyms, or pages your crawler hasn't reached yet.
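To make the IP-deduplicated counting concrete, here is a minimal sketch of a per-query click-through rate. The changelog does not give the exact formula, so this assumes CTR means unique IPs that clicked divided by unique IPs that searched; the function name and the `(query, ip)` event shape are hypothetical.

```python
from collections import defaultdict


def ctr_by_query(search_events, click_events):
    """Compute a click-through rate per query from (query, ip) event tuples.

    Repeated events from the same IP for the same query count once,
    so page refreshes do not inflate either side of the ratio.
    """
    searchers = defaultdict(set)
    clickers = defaultdict(set)
    for query, ip in search_events:
        searchers[query].add(ip)
    for query, ip in click_events:
        clickers[query].add(ip)
    return {
        query: len(clickers.get(query, set()) & ips) / len(ips)
        for query, ips in searchers.items()
    }
```

Queries whose value is 0.0 correspond to the Low CTR case: users searched but nobody clicked.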

Fix Document indexing now works reliably for all major office formats. DOCX, DOC, XLSX, XLS, and PPTX files are fully supported with proper text extraction — including tables, headers, footers, and speaker notes. Previously, many documents were indexed with empty text due to format misdetection and encoding issues.

New Clear button on Crawl Stats for 4xx and 5xx errors. Click Clear next to Client Errors or Server Errors to delete those entries from the crawl database. Useful for cleaning up old 404s before resuming a crawl so they get retried on the next run.
Improved Faster Playwright rendering in Chrome mode. Pages now complete in ~0.5–1s instead of 2–25s. The old approach waited for all network activity to stop (analytics, trackers, ad pixels), which stalled on busy pages. Now it waits for the DOM, gives JS 500ms to hydrate, and grabs the content.
New Renderer setting in the Web Crawler. Choose between Curl (Fast), the default, which fetches pages in ~0.2s each, and Chrome (JS Rendering), intended for JavaScript SPAs built with React, Vue, or Angular where content is rendered client-side. Chrome runs every page through a headless Chromium browser. Available in the UI dropdown and the REST API (renderer parameter), and persists across cron restarts.
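The faster Chrome-mode wait strategy described above (wait for the DOM, give JS 500ms, grab the content) might look roughly like this with Playwright's Python sync API. This is a sketch under assumptions, not the crawler's real code: `render_page` and `hydrate_ms` are hypothetical names, and the slower older behaviour resembles waiting with `wait_until="networkidle"`.

```python
def render_page(page, url, hydrate_ms=500):
    """Render a URL with a Playwright sync-API Page and return its HTML.

    page: a Page object, e.g. from browser.new_page().
    """
    # Return as soon as the DOM is parsed instead of waiting for
    # analytics, trackers, and ad pixels to go quiet.
    page.goto(url, wait_until="domcontentloaded")
    # Give client-side frameworks a short, fixed window to hydrate.
    page.wait_for_timeout(hydrate_ms)
    return page.content()


# Example usage (requires the playwright package and an installed Chromium):
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       page = browser.new_page()
#       html = render_page(page, "https://example.com")
#       browser.close()
```

The trade-off is that content loaded more than `hydrate_ms` after the DOM is ready would be missed, which is why the delay is a parameter in this sketch.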

Improved Solr batch indexing is now more reliable during crawls. When a batch insert to Solr fails (e.g. temporary overload or timeout), the documents are kept in the local buffer and retried on the next flush cycle, instead of being silently lost.
Fix Fixed tag field generation during crawl. The tags and title_tags fields used for autocomplete and spellcheck were being stored with raw special characters intact, which could produce noisy or broken suggestions. They are now properly cleaned — special characters stripped, whitespace normalized — so autocomplete and spellcheck results are cleaner.
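A minimal sketch of the tag cleanup described above, assuming "special characters" means anything that is not a letter, digit, underscore, or whitespace. The exact character set the crawler strips is not stated, and `clean_tag` is a hypothetical name.

```python
import re


def clean_tag(text):
    """Strip special characters and normalize whitespace in a tag value."""
    # Replace every character that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

For instance, `clean_tag("  Hello,   World!! (2026) ")` yields `"Hello World 2026"`, and accented letters such as those in "Café" are preserved because `\w` is Unicode-aware in Python 3.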

Fix Automatic cleanup of stale crawler lock files. If a previous crawl crashed or was interrupted, leftover lock files could silently prevent the next run from starting — the crawler would launch but do nothing. Resume now detects and removes stale lock files before starting, so scheduled cron runs and manual resumes always work reliably.
New Sitemap re-discovery on Resume. When you resume a finished crawl, the crawler now re-fetches all XML sitemaps — not just the top-level sitemap index, but every child sitemap too (e.g. sitemap-products1.xml through sitemap-products22.xml). Any new URLs found in those sitemaps get queued and crawled automatically. This means your index stays up to date as your site adds new pages, without needing a full re-crawl.
Improved Smarter Resume for the Web Crawler. Clicking Resume now always launches the crawler, even when the queue appears empty. Previously, the UI would refuse to resume if there were no pages left in the queue — but that is exactly the scenario where Resume needs to work, because the crawler re-discovers new content by re-reading your sitemaps. No more misleading "nothing to resume" messages.
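The sitemap re-discovery described above (walk the sitemap index, then every child sitemap, and queue only URLs not seen before) can be sketched as below. This is an assumption-laden illustration: `discover_urls`, `new_urls`, and the `fetch` callable are hypothetical, and `fetch` stands in for whatever HTTP client returns the sitemap XML as text.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def discover_urls(sitemap_url, fetch, seen=None):
    """Return all page URLs reachable from a sitemap or sitemap index.

    fetch: callable(url) -> XML text (hypothetical; wraps an HTTP client).
    seen: set of sitemap URLs already visited, to avoid loops.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    if root.tag.endswith("sitemapindex"):
        # A sitemap index: recurse into every child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                urls.extend(discover_urls(child, fetch, seen))
    else:
        # A regular sitemap: collect its page URLs.
        for loc in root.findall("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page:
                urls.append(page)
    return urls


def new_urls(sitemap_url, fetch, already_crawled):
    """URLs found in the sitemaps that are not yet in the crawl database."""
    return [u for u in discover_urls(sitemap_url, fetch) if u not in already_crawled]
```

Only the result of `new_urls` would need to be queued on Resume, which is why an empty queue is not a reason to refuse a resume.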

Fix Fixed the Flush to Solr button in the Web Crawler always reporting that the buffer is empty, even when documents were actually flushed. It now correctly reports the number of documents flushed and automatically commits after flushing, so your documents become searchable immediately without waiting for the next auto-commit cycle.
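The buffer behaviour described in this changelog (each batch flushes independently, failed batches stay in the buffer for the next cycle, and a flush reports its count and commits) could be modelled along these lines. Everything here is a hypothetical sketch: `SolrBuffer`, `send_batch`, and `commit` are illustrative names, not Opensolr or Solr APIs.

```python
class SolrBuffer:
    """Local document buffer that retries failed batches on the next flush."""

    def __init__(self, send_batch, commit, batch_size=100):
        self.send_batch = send_batch  # hypothetical callable(list_of_docs); raises on failure
        self.commit = commit          # hypothetical callable() that commits the index
        self.batch_size = batch_size
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)

    def flush(self):
        """Send buffered docs in batches and return how many were flushed."""
        flushed = 0
        remaining = []
        for start in range(0, len(self.docs), self.batch_size):
            batch = self.docs[start:start + self.batch_size]
            try:
                self.send_batch(batch)
                flushed += len(batch)
            except Exception:
                # Keep the failed batch for the next flush instead of dropping it;
                # other batches are unaffected.
                remaining.extend(batch)
        self.docs = remaining
        if flushed:
            # Commit so the flushed documents become searchable right away.
            self.commit()
        return flushed
```

Returning the count from `flush` is what lets a UI report the real number of documents flushed rather than inferring it from the buffer state afterwards.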