Opensolr Changelog — Recent Updates & Improvements

Web Crawler Apr 4, 2026

Improved Better web crawler date extraction for sites that lack JSON-LD or meta tags. The new targeted extraction looks for dates inside HTML elements with date-related CSS classes (e.g. .date, .posted, .info), which is much safer than scanning all page text.
Fix Fixed date indexing errors caused by timezone offsets (e.g. +03:00) in date fields. All dates are now strictly converted to UTC before sending to Solr. Added a final safety gate — any date that doesn't match the exact Solr format is dropped rather than causing an indexing error.
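
The UTC conversion and final safety gate described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the function name to_solr_date is hypothetical, it assumes ISO 8601 input, it assumes that timestamps without an offset are already in UTC, and it targets the second-precision form of Solr's date format.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Second-precision Solr date form: YYYY-MM-DDThh:mm:ssZ (always UTC).
SOLR_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def to_solr_date(raw: str) -> Optional[str]:
    """Return a UTC Solr date string, or None if the value should be dropped."""
    try:
        parsed = datetime.fromisoformat(raw.strip())
    except ValueError:
        return None  # unparseable input is dropped, not indexed
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    candidate = parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Final safety gate: anything that is not an exact match is dropped.
    return candidate if SOLR_DATE_RE.match(candidate) else None
```

For example, to_solr_date("2026-04-04T10:30:00+03:00") yields "2026-04-04T07:30:00Z", while an unparseable value yields None so the document can be indexed without that field.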

Web Crawler Apr 3, 2026

Improved Web Crawler now respects the package's vector access flag — indexes without AI features skip embedding generation during crawls, saving GPU resources and speeding up indexing.

Web Crawler Mar 28, 2026

Improved Core-wide thread limit enforcement — the max_threads setting now controls the total number of concurrent crawler processes across all start URLs for an index, not per URL. Setting threads to 1 means exactly 1 process at a time.
Fix Fixed a bug where the Solr buffer could get permanently stuck after stopping a crawl mid-run. Documents with oversized embedding payloads no longer block the entire batch — payloads are automatically capped, and each batch flush succeeds independently.
Improved Smarter description extraction — the crawler no longer picks up CSS, JavaScript, or theme builder garbage as page descriptions. Description priority: meta tags → JSON-LD structured data → first two sentences of extracted text.
Improved Crawler settings changes now take effect immediately. When you save new thread count, crawl mode, renderer, or pause settings, the active crawl schedule is automatically updated — no need to stop and restart.
New Stop Crawl button in the Web Crawler panel. It immediately stops all running crawler processes without removing the crawl schedule. The schedule resumes automatically on the next cycle.
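
The description priority from the Mar 28 entries above (meta tags, then JSON-LD structured data, then the first two sentences of extracted text) could look something like this. It is a sketch under assumptions: pick_description is a hypothetical name, the JSON-LD data is assumed to be an already parsed dictionary with a "description" key, and the sentence splitter is deliberately naive.

```python
import re
from typing import Optional


def pick_description(meta: Optional[str], json_ld: Optional[dict], text: str) -> str:
    """Choose a page description using the meta > JSON-LD > text fallback order."""
    # 1. Prefer the meta description when it is present and non-empty.
    if meta and meta.strip():
        return meta.strip()
    # 2. Fall back to a JSON-LD "description" value (assumed key name).
    if json_ld:
        value = json_ld.get("description")
        if isinstance(value, str) and value.strip():
            return value.strip()
    # 3. Last resort: the first two sentences of the extracted text.
    #    Naive split on '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:2])
```

Because the fallback only ever sees extracted text, CSS or JavaScript source never becomes a candidate in this sketch.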

Web Crawler Mar 23, 2026

New Reindex All button in the Web Crawler panel. One click moves all previously crawled URLs back into the crawl queue for a full re-crawl — useful after schema changes, config updates, or when you need to refresh your entire index on demand.

Web Crawler Mar 20, 2026

Improved Web Crawler now automatically removes documents from the search index when their pages return non-200 status codes (404, 500, etc.) during crawling. Previously, dead pages could remain in search results indefinitely.
New Web Crawler: Keep Index Fresh. Automatically re-crawl all previously indexed pages on a configurable schedule (every 7–365 days) to update content and prices and to detect broken links. Pages that return 404 or 500 are automatically removed from your search index. Available in the Crawler Settings panel under Index Settings.

Web Crawler Mar 7, 2026

New Bulk query deletion in Query Analytics — select multiple queries with checkboxes and delete them in one click. Available on the Queries, No Results, and Click Analytics tabs. Useful for cleaning out junk, test queries, or inappropriate search terms from your analytics history.
New Click Analytics with CTR tracking — see which search results users actually click. Three views in the new Click Analytics tab: Top Clicked documents, By Query with click-through rates, and Low CTR to find queries where users search but never click. All click data is IP-deduplicated and rate-limited to prevent bot noise.
New No-Results Dashboard — a new tab in Query Analytics that tracks every search returning zero results. Each zero-result query is counted by unique IP (not raw page views), so the numbers reflect real users, not refreshes. Use it to spot content gaps, missing synonyms, or pages your crawler hasn't reached yet.
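
To illustrate why counting by unique IP differs from counting raw page views in the No-Results Dashboard, here is a small sketch. The event tuple layout and the function name are assumptions made for the example, as is the lowercase normalization of queries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_zero_result_queries(events: Iterable[Tuple[str, str, int]]) -> Dict[str, int]:
    """Count zero-result queries by unique client IP.

    Each event is a hypothetical (query, client_ip, num_results) tuple.
    """
    ips_by_query = defaultdict(set)
    for query, ip, num_results in events:
        if num_results == 0:
            # Assumption: queries are compared case-insensitively.
            ips_by_query[query.strip().lower()].add(ip)
    # A refresh from the same IP adds nothing, because sets ignore duplicates.
    return {query: len(ips) for query, ips in ips_by_query.items()}
```

Three refreshes of the same failed search from one IP count once, while the same search from two different IPs counts twice.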

Web Crawler Mar 6, 2026

Fix Document indexing now works reliably for all major office formats. DOCX, DOC, XLSX, XLS, and PPTX files are fully supported with proper text extraction — including tables, headers, footers, and speaker notes. Previously, many documents were indexed with empty text due to format misdetection and encoding issues.

Web Crawler Mar 5, 2026

New Clear button on Crawl Stats for 4xx and 5xx errors. Click Clear next to Client Errors or Server Errors to delete those entries from the crawl database. Useful for cleaning up old 404s before resuming a crawl so they get retried on the next run.
Improved Faster Playwright rendering in Chrome mode. Pages now complete in ~0.5–1s instead of 2–25s. The old approach waited for all network activity to stop (analytics, trackers, ad pixels), which stalled on busy pages. Now it waits for the DOM, gives JS 500ms to hydrate, and grabs the content.
New Renderer setting in the Web Crawler. Choose between Curl (Fast), the default, which fetches pages in ~0.2s each, and Chrome (JS Rendering), for JavaScript SPAs built with React, Vue, or Angular where content is rendered client-side. Chrome runs every page through a headless Chromium browser. The setting is available in the UI dropdown and the REST API (renderer parameter), and persists across cron restarts.
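
The faster Chrome-mode wait strategy described in the Mar 5 entries above (wait for the DOM, give JS 500ms to hydrate, grab the content) maps onto Playwright's Python sync API roughly as shown here. This is an illustrative sketch, not the crawler's code: render_html and fetch_with_chromium are hypothetical names, and it assumes the playwright package and a Chromium build are installed.

```python
def render_html(page, url: str, hydrate_ms: int = 500) -> str:
    """Load a URL on a Playwright page and return the rendered HTML."""
    # Wait for DOMContentLoaded instead of "networkidle", so analytics,
    # trackers and ad pixels cannot stall the page load.
    page.goto(url, wait_until="domcontentloaded")
    # Give client-side JavaScript a fixed window to hydrate the page.
    page.wait_for_timeout(hydrate_ms)
    return page.content()


def fetch_with_chromium(url: str) -> str:
    """Render one URL in headless Chromium (requires the playwright package)."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            return render_html(browser.new_page(), url)
        finally:
            browser.close()
```

The trade-off is that content loaded later than the hydration window is missed, which is the price of not waiting for the network to go quiet.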

Web Crawler Mar 4, 2026

Improved Solr batch indexing is now more reliable during crawls. When a batch insert to Solr fails (e.g. temporary overload or timeout), the documents are kept in the local buffer and retried on the next flush cycle, instead of being silently lost.
Fix Fixed tag field generation during crawl. The tags and title_tags fields used for autocomplete and spellcheck were being stored with raw special characters intact, which could produce noisy or broken suggestions. They are now properly cleaned — special characters stripped, whitespace normalized — so autocomplete and spellcheck results are cleaner.
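
The retry behavior for failed batch inserts described in the Mar 4 entries above can be pictured with a small buffer class. This is a sketch under assumptions: RetryingBuffer and send_batch are hypothetical names, and send_batch stands in for whatever call actually posts documents to Solr and raises on failure.

```python
from typing import Callable, List


class RetryingBuffer:
    """Local document buffer that keeps failed batches for the next flush."""

    def __init__(self, send_batch: Callable[[List[dict]], None], batch_size: int = 100):
        self.send_batch = send_batch  # hypothetical sender; raises on failure
        self.batch_size = batch_size
        self.pending: List[dict] = []

    def add(self, doc: dict) -> None:
        self.pending.append(doc)

    def flush(self) -> int:
        """Send pending documents in batches and return how many were accepted."""
        sent = 0
        remaining: List[dict] = []
        for start in range(0, len(self.pending), self.batch_size):
            batch = self.pending[start:start + self.batch_size]
            try:
                self.send_batch(batch)
                sent += len(batch)
            except Exception:
                # Overload or timeout: keep the batch instead of losing it.
                remaining.extend(batch)
        self.pending = remaining
        return sent
```

Each batch succeeds or fails independently, so one failing batch does not block the batches that follow it in the same flush.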

Web Crawler Mar 2, 2026

Fix Automatic cleanup of stale crawler lock files. If a previous crawl crashed or was interrupted, leftover lock files could silently prevent the next run from starting — the crawler would launch but do nothing. Resume now detects and removes stale lock files before starting, so scheduled cron runs and manual resumes always work reliably.
New Sitemap re-discovery on Resume. When you resume a finished crawl, the crawler now re-fetches all XML sitemaps — not just the top-level sitemap index, but every child sitemap too (e.g. sitemap-products1.xml through sitemap-products22.xml). Any new URLs found in those sitemaps get queued and crawled automatically. This means your index stays up to date as your site adds new pages, without needing a full re-crawl.
Improved Smarter Resume for the Web Crawler. Clicking Resume now always launches the crawler, even when the queue appears empty. Previously, the UI would refuse to resume if there were no pages left in the queue — but that is exactly the scenario where Resume needs to work, because the crawler re-discovers new content by re-reading your sitemaps. No more misleading "nothing to resume" messages.
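
The sitemap re-discovery described in the Mar 2 entries above amounts to walking a sitemap index recursively and collecting every page URL from its child sitemaps. The sketch below shows one way to do that with the standard sitemaps.org XML namespace. The function name collect_urls and the injected fetch callable are assumptions for illustration, and gzip-compressed sitemaps are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url: str, fetch: Callable[[str], str],
                 seen: Optional[Set[str]] = None) -> Set[str]:
    """Return all page URLs reachable from a sitemap or sitemap index."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return set()  # guard against sitemaps that reference each other
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    urls: Set[str] = set()
    if root.tag.endswith("sitemapindex"):
        # Sitemap index: descend into every child sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = (loc.text or "").strip()
            if child:
                urls |= collect_urls(child, fetch, seen)
    else:
        # Regular sitemap: collect the page URLs.
        for loc in root.findall("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page:
                urls.add(page)
    return urls
```

Subtracting the set of already crawled URLs from the result gives the new pages to queue, which is why Resume is useful even when the queue looks empty.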

Web Crawler Mar 1, 2026

Fix Fixed the Flush to Solr button in the Web Crawler always reporting that the buffer is empty, even when documents were actually flushed. It now correctly reports the number of documents flushed and automatically commits after flushing, so your documents become searchable immediately, with no more waiting for the next auto-commit cycle.

Web Crawler Feb 27, 2026

Improved Web Crawler indexing is now faster — crawled pages are sent to Solr in larger batches instead of one at a time, reducing round-trip overhead and significantly speeding up the overall indexing process.
Improved Clicking Resume when the crawler queue is empty now shows a clear message explaining there are no more pages to process, instead of silently doing nothing. It suggests stopping the cron schedule and starting a fresh crawl.
Improved The crawler status badge now distinguishes between Running (green), Paused (blue), and Stopped (red). When the cron schedule is active but no crawler processes are running, the dashboard shows Paused instead of Running, so you always know the actual state of your crawl.
New Pause and Resume controls for the Web Crawler. You can now temporarily pause a running crawl without losing your cron schedule: the crawler automatically picks back up on the next scheduled tick, or you can hit Resume to restart it immediately. The Stop button has been renamed to Stop Cron Schedule to make it clear that it permanently removes the schedule.

Web Crawler Feb 25, 2026

Improved Smarter content extraction — the web crawler now uses a dual-extraction strategy that runs two independent text extraction engines and picks whichever captures more real content. Pages with heavy JavaScript, complex layouts, or framework-rendered content (React, Next.js, Angular, Vue) are now detected and rendered automatically. The result: richer, more complete text in your Opensolr Index, especially for modern web applications.
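
The idea of running two extraction engines and keeping whichever captures more content can be reduced to a few lines. This is only a sketch: best_extraction is a hypothetical name, the extractors are passed in as plain callables, and a simple word count stands in for whatever measure of "real content" the crawler actually uses.

```python
from typing import Callable, Sequence


def word_count(text: str) -> int:
    """Crude content score: number of whitespace-separated words."""
    return len(text.split())


def best_extraction(html: str, extractors: Sequence[Callable[[str], str]]) -> str:
    """Run every extractor on the same HTML and keep the richest result."""
    if not extractors:
        raise ValueError("at least one extractor is required")
    results = [extract(html) for extract in extractors]
    # Ties go to the extractor listed first, because max keeps the first maximum.
    return max(results, key=word_count)
```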