Web Crawler
The Opensolr Web Crawler automatically visits your website, reads the content of every page, and adds it to your search index. Think of it as your own personal Google — it crawls your site so your visitors can search everything on it. For a full walkthrough, see the main crawler guide or jump straight to the getting started tutorial.
How the Web Crawler Works
The Opensolr Web Crawler works just like Google's crawler, but for your own search index. Here is the simple version:
- You give it a starting URL (for example, https://www.example.com).
- The crawler visits that page, reads the content, and finds all the links on it.
- It follows those links to discover more pages on your site.
- For each page, it extracts the title, description, main text, images, and metadata.
- All of that content gets added to your Opensolr index, making it searchable.
The result: your visitors can search your entire website using fast, relevant Opensolr-powered search. See the crawler standards reference for technical details on how the crawler respects robots.txt, sitemaps, and other web standards.
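The steps above boil down to a breadth-first crawl loop. Here is an illustrative sketch in Python; it is not Opensolr's actual implementation, and the toy `SITE` dict stands in for real HTTP fetches and HTML parsing:

```python
from collections import deque

# Toy site: URL -> (page text, outgoing links). Stands in for real HTTP fetches.
SITE = {
    "https://www.example.com": ("Home page", ["https://www.example.com/about",
                                              "https://www.example.com/blog"]),
    "https://www.example.com/about": ("About us", []),
    "https://www.example.com/blog": ("Blog index", ["https://www.example.com/about"]),
}

def crawl(start_url):
    """Breadth-first crawl: visit a page, record its content, queue its links."""
    index, queue, seen = {}, deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        text, links = SITE[url]      # in a real crawler: fetch + parse the page
        index[url] = text            # in Opensolr: content goes to the staging area
        for link in links:
            if link not in seen:     # skip pages already discovered
                seen.add(link)
                queue.append(link)
    return index

index = crawl("https://www.example.com")
```

The `seen` set is what keeps the crawler from visiting the same page twice, even when multiple pages link to it.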
Adding URLs to Crawl
Before the crawler can do its job, you need to tell it where to start. You can add one or more URLs.
- Open your Index Dashboard and click on the index you want to crawl content into.
- Go to the "Web Crawler" tab in the tools section.
- Enter a URL in the "Add URL" field. This is your starting point. For most sites, enter your homepage (e.g., https://www.example.com).
- Click "Add" — the URL will appear in the crawl queue.
You can add as many URLs as you want. This is useful if your site has sections that are not linked from the homepage, or if you want to crawl multiple subdomains. For tips on choosing the right starting URLs and scope, see Crawler Configuration Best Practices.
URL Verification (Proving Ownership)
To prevent abuse, Opensolr needs to verify that you own or control the website you want to crawl. There are two ways to prove ownership:
Verification File Upload
Download a small verification file from Opensolr and upload it to the root of your website. The crawler checks for this file to confirm you have access to the server. This works much like Google Search Console's file-based site verification.
HMAC-Signed API
If you prefer automation, you can use the Opensolr API to add URLs programmatically. The API uses HMAC signatures (a tamper-proof digital stamp) to verify that the request comes from your account. No file upload needed.
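As a general illustration of how HMAC request signing works (the header names, payload shape, and credentials below are hypothetical; consult the Opensolr API reference for the real request format):

```python
import hmac
import hashlib
import json

# Hypothetical credentials -- your real key and secret come from your Opensolr account.
API_KEY = "my-api-key"
API_SECRET = b"my-shared-secret"

def sign_request(payload: dict) -> dict:
    """Attach an HMAC-SHA256 signature so the server can verify the sender.

    The server recomputes the same HMAC over the body with its copy of the
    secret; if the signatures match, the request is authentic and untampered.
    """
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(API_SECRET, body, hashlib.sha256).hexdigest()
    return {"X-Api-Key": API_KEY, "X-Signature": signature}

headers = sign_request({"url": "https://www.example.com", "action": "add"})
```

Because the signature covers the request body, changing even one character of the payload invalidates it, which is what makes the stamp tamper-proof.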
Crawler Settings
The crawler has several settings that let you control exactly how it behaves. Here is what each one does:
CPU Threads (Parallel Requests)
Controls how many pages the crawler fetches at the same time. More threads = faster crawling, but also more load on your website. Start with a lower number (2-4) and increase if your server handles it well.
Crawl Mode: Domain vs. Host
Domain mode crawls all subdomains (www, blog, shop, etc.). Host mode sticks to the exact hostname you provide. Use Host mode if you only want to index one part of your site.
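The difference between the two modes amounts to a scope check on each discovered link. A minimal sketch, assuming a simple two-label domain like example.com (real domain matching is more involved):

```python
from urllib.parse import urlparse

def in_scope(url, start="https://www.example.com", mode="host"):
    """Host mode: exact hostname match only.
    Domain mode: any subdomain of the start URL's domain is allowed.
    Illustrative logic only, not the crawler's actual rules."""
    host = urlparse(url).hostname
    start_host = urlparse(start).hostname
    if mode == "host":
        return host == start_host
    # Naive domain check: compare against the last two labels (example.com).
    domain = ".".join(start_host.split(".")[-2:])
    return host == domain or host.endswith("." + domain)
```

So with a start URL of https://www.example.com, a link to https://blog.example.com is followed in Domain mode but skipped in Host mode.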
Content Types
HTML Only — crawls regular web pages (HTML). This is the default and covers most websites. HTML + Documents — also extracts text from PDFs, DOCX, XLSX, PPTX, and other document formats found on your site. Great if your site hosts downloadable files that should be searchable.
Renderer: Curl vs. Chrome
Use Curl for regular websites where content is in the HTML source code. Use Chrome if your site uses JavaScript frameworks (React, Vue, Angular) that load content dynamically after the page loads.
Resume vs. Start Over
Resume continues from where the crawler last stopped — it skips pages already crawled and only processes new ones. Start Over clears the crawl history and visits every page fresh, as if crawling for the first time.
Pause Between Requests (Politeness Delay)
Adds a short delay between page requests (from 0.1 seconds to several seconds). This prevents the crawler from overwhelming your web server with too many requests at once. A polite crawler is a good crawler. Increase this value if your server is under heavy load.
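The politeness delay is simply a sleep between consecutive fetches. A sketch of the idea, where `fetch` is a stand-in for the real HTTP request:

```python
import time

def polite_crawl(urls, pause=0.1, fetch=lambda u: f"<html>{u}</html>"):
    """Fetch URLs one at a time, sleeping `pause` seconds between requests
    so the target server is never hammered."""
    pages = {}
    for i, url in enumerate(urls):
        if i:                    # no need to wait before the very first request
            time.sleep(pause)
        pages[url] = fetch(url)
    return pages

start = time.monotonic()
pages = polite_crawl(["https://www.example.com/a",
                      "https://www.example.com/b",
                      "https://www.example.com/c"], pause=0.1)
elapsed = time.monotonic() - start   # at least 0.2s for three pages
```

Note the trade-off with CPU threads above: more parallel requests multiply the load, while a longer pause divides it, so tune the two together.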
HTTP Authentication for Protected Sites
If your website requires a username and password to access (HTTP Basic Auth), you can enter those credentials here. The crawler will use them when visiting your pages. This is useful for crawling staging environments or password-protected areas.
Crawl Controls
Once your URLs and settings are configured, you control the crawler with these buttons:
Start Crawl
Begins crawling your website. The crawler will visit pages, extract content, and queue it for indexing.
Pause Crawl
Temporarily stops crawling. The crawler remembers where it left off so you can resume later.
Stop Crawl
Completely stops the current crawl session. Use this if you need to change settings or start over.
Flush to Solr
Pushes all crawled content from the staging area into your Opensolr index immediately, making it searchable right away.
The crawler first stores extracted content in a staging area. Flushing sends that content into your actual Opensolr index. Think of it as pressing "publish" — the content goes live for search.
All of these controls are also available as API endpoints: Start, Stop, Pause, and Flush to Solr. Use them to integrate crawler operations into your CI/CD pipeline or scheduling scripts.
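A minimal sketch of scripting those controls from a pipeline. The base URL and endpoint paths below are placeholders, not Opensolr's real endpoints; look them up in the API documentation before wiring this into automation:

```python
# Hypothetical base URL -- substitute the real Opensolr API host and paths.
BASE = "https://api.example.com/crawler"

def control_url(action: str, index_name: str) -> str:
    """Build the URL for one crawler control action (start, stop, pause, flush)."""
    actions = {"start", "stop", "pause", "flush"}
    if action not in actions:
        raise ValueError(f"unknown action: {action}")
    return f"{BASE}/{action}?index={index_name}"

# e.g. a nightly CI job could call control_url("start", "my_index"),
# poll the live stats endpoint, then call control_url("flush", "my_index").
url = control_url("flush", "my_index")
```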
Scheduling Recurring Crawls
You can set the crawler to run automatically on a schedule, so your index always stays up to date without manual work.
- Go to the Scheduling section in the Web Crawler tab.
- Choose a frequency — daily, weekly, or a custom interval.
- Save your schedule — the crawler will automatically start at the scheduled time and crawl your site.
Once scheduled, the crawler runs on autopilot. New pages on your site will be discovered and indexed automatically on your chosen schedule.
Keep Your Index Fresh
Websites change over time — pages get updated, new pages are added, old pages are removed. The "Keep Index Fresh" feature automatically re-crawls your site every N days to make sure your search index reflects the current state of your website.
When enabled, the crawler periodically revisits all known pages and updates the index with any changes. Pages that no longer exist are removed from the index.
Reindex All vs. Reindex From Scratch
Reindex All
Re-visits every known URL and updates the content in your index. Does NOT delete anything first. Faster because the crawler already knows what pages exist. Use this when your content has changed and you want to refresh everything.
Reindex From Scratch
Deletes ALL existing crawl data and starts fresh, as if crawling for the very first time. Use this when you have restructured your site, changed domains, or want a completely clean slate.
Reindex All is usually what you want — it is faster and does not cause any downtime in your search. Reindex From Scratch temporarily removes all content from your index while the crawler works, so your search results may be empty or incomplete until the crawl finishes.
Monitoring Your Crawl
While the crawler is running (or after it finishes), you can monitor its progress in real time:
- Pages Crawled — how many pages the crawler has visited so far.
- Pages Queued — how many pages are waiting to be crawled.
- Errors — pages that could not be crawled (404 not found, 403 forbidden, server errors, etc.).
- Crawl Speed — how many pages per second the crawler is processing.
The Crawl Log gives you a detailed breakdown of every URL the crawler visited, what happened (success or error), and the HTTP status code returned. This is invaluable for troubleshooting issues like broken links or pages that block the crawler. You can also retrieve live crawl statistics programmatically via the Get Live Crawler Stats API.
- 200 = Success (page crawled normally)
- 301/302 = Page redirected
- 403 = Access forbidden
- 404 = Page not found
- 500 = Server error

If you see many errors, check your website's health or adjust crawler settings.
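When a crawl log gets long, it helps to bucket entries by status class before digging into individual URLs. A small sketch with toy log data:

```python
from collections import Counter

# Toy crawl-log entries: (url, http_status). A real log line also carries
# timestamps and error details.
log = [
    ("https://www.example.com/", 200),
    ("https://www.example.com/old", 301),
    ("https://www.example.com/missing", 404),
    ("https://www.example.com/admin", 403),
    ("https://www.example.com/about", 200),
]

def summarize(entries):
    """Group log entries into successes (2xx), redirects (3xx), and errors (4xx/5xx)."""
    buckets = Counter()
    for _, status in entries:
        if status < 300:
            buckets["success"] += 1
        elif status < 400:
            buckets["redirect"] += 1
        else:
            buckets["error"] += 1
    return dict(buckets)

summary = summarize(log)
```

A spike in the error bucket is your cue to open the detailed log and check which URLs are failing and why.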
Related Documentation
Data Ingestion API
Push documents directly via API instead of crawling — useful for databases, apps, and custom content.
Search & Embed
Set up a search page for your crawled content and embed it on your website.
AI & Vector Search
Crawled content is automatically enriched with AI embeddings for semantic search.
Crawler Field Reference
Complete reference of all fields the web crawler populates in your index, including types and descriptions.