Web Crawler - Site Crawling & Indexing

Crawl and index your website automatically

Web Crawler

The Opensolr Web Crawler automatically visits your website, reads the content of every page, and adds it to your search index. Think of it as your own personal Google — it crawls your site so your visitors can search everything on it. For a full walkthrough, see the main crawler guide or jump straight to the getting started tutorial.

How the Web Crawler Works

Diagram: the crawler spider visits every page of your website (Crawl), extracts each page's title, text, and links (Extract), and sends the content to your Opensolr index as searchable data (Index). The crawler does all of this automatically — you just provide the starting URL.

How It Works

The Opensolr Web Crawler works just like Google's crawler, but for your own search index. Here is the simple version:

  1. You give it a starting URL (for example, https://www.example.com).
  2. The crawler visits that page, reads the content, and finds all the links on it.
  3. It follows those links to discover more pages on your site.
  4. For each page, it extracts the title, description, main text, images, and metadata.
  5. All of that content gets added to your Opensolr index, making it searchable.

The result: your visitors can search your entire website using fast, relevant Opensolr-powered search. See the crawler standards reference for technical details on how the crawler respects robots.txt, sitemaps, and other web standards.
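The five-step loop above can be sketched in a few lines of Python. This is an illustrative model only, not Opensolr's actual implementation: the "website" here is an in-memory dictionary standing in for real HTTP fetches.

```python
from collections import deque

# Toy "website": each URL maps to (title, body_text, outgoing_links).
SITE = {
    "https://www.example.com/": ("Home", "Welcome to Example.",
                                 ["https://www.example.com/about"]),
    "https://www.example.com/about": ("About", "We make examples.",
                                      ["https://www.example.com/"]),
}

def crawl(start_url):
    """Visit every reachable page once and build an index of extracted content."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        page = SITE.get(url)      # step 2: visit the page and read its content
        if page is None:
            continue              # broken link: skipped (would be logged as an error)
        title, text, links = page
        index[url] = {"title": title, "text": text}   # steps 4-5: extract and index
        for link in links:        # step 3: follow links to discover more pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Starting from the homepage, this discovers and indexes both pages, visiting each exactly once.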

Adding URLs to Crawl

Before the crawler can do its job, you need to tell it where to start. You can add one or more URLs.

  1. Open your Index Dashboard and click on the index you want to crawl content into.
  2. Go to the "Web Crawler" tab in the tools section.
  3. Enter a URL in the "Add URL" field. This is your starting point. For most sites, enter your homepage (e.g., https://www.example.com).
  4. Click "Add" — the URL will appear in the crawl queue.

Multiple starting URLs

You can add as many URLs as you want. This is useful if your site has sections that are not linked from the homepage, or if you want to crawl multiple subdomains. For tips on choosing the right starting URLs and scope, see Crawler Configuration Best Practices.

URL Verification (Proving Ownership)

To prevent abuse, Opensolr needs to verify that you own or control the website you want to crawl. There are two ways to prove ownership:

Verification File Upload

Download a small verification file from Opensolr and upload it to the root of your website. The crawler checks for this file to confirm you have access to the server. This is similar to how Google Search Console verification works.

HMAC-Signed API

If you prefer automation, you can use the Opensolr API to add URLs programmatically. The API uses HMAC signatures (a tamper-proof digital stamp) to verify that the request comes from your account. No file upload needed.
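The HMAC mechanism can be illustrated with Python's standard library. The payload shape shown here is hypothetical — consult the Opensolr API reference for the real request format — but the signing principle is standard HMAC-SHA256:

```python
import hashlib
import hmac
import json

def sign_request(api_secret: str, payload: dict) -> dict:
    """Attach an HMAC-SHA256 signature to a request body.

    The signature is a keyed hash of the payload, so the server can verify
    both who sent the request (only you know the secret) and that the body
    was not altered in transit. The field names here are hypothetical.
    """
    # Canonical serialization: sorted keys, no whitespace, so both sides
    # hash byte-identical input.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(api_secret.encode(), body.encode(),
                         hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}

# Example: signing a hypothetical "add URL to crawl" request.
signed = sign_request("my-secret-key",
                      {"url": "https://www.example.com", "ts": 1700000000})
```

Because the serialization is canonical, the same payload always produces the same signature, and any change to the payload or the key produces a different one.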

Crawler Settings

The crawler has several settings that let you control exactly how it behaves. Here is what each one does:

CPU Threads (Parallel Requests)

Controls how many pages the crawler fetches at the same time. More threads = faster crawling, but also more load on your website. Start with a lower number (2-4) and increase if your server handles it well.
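The effect of the thread count can be sketched with Python's thread pool. This is an illustration of parallel fetching, not Opensolr's internals; `fetch` is a stand-in for a real HTTP request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    """Stand-in for an HTTP fetch; a real crawler would request the URL here."""
    return f"<html>content of {url}</html>"

urls = [f"https://www.example.com/page{i}" for i in range(8)]

# "CPU Threads = 4": up to 4 pages are in flight at the same time.
# More workers finish the queue faster but put more simultaneous load
# on the web server being crawled.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
```

Raising `max_workers` is exactly the trade-off described above: more throughput for the crawler, more concurrent requests for your server to absorb.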

Crawl Mode: Domain vs. Host

Diagram: Domain Mode crawls ALL subdomains of example.com (www, blog, shop); Host Mode crawls only the one hostname you provide.

Domain mode crawls all subdomains (www, blog, shop, etc.). Host mode sticks to the exact hostname you provide. Use Host mode if you only want to index one part of your site.
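The difference between the two modes comes down to a hostname comparison, roughly like this. This is a simplified sketch (it takes the last two labels as the "domain" and ignores public-suffix edge cases like `.co.uk`); the crawler's real scope rules may be more involved:

```python
from urllib.parse import urlsplit

def in_scope(url: str, seed: str, mode: str) -> bool:
    """Decide whether a discovered URL should be crawled.

    Host mode:   the hostname must match the seed's hostname exactly.
    Domain mode: the hostname must be the seed's domain or any subdomain
                 of it (simplified: no public-suffix handling).
    """
    host = urlsplit(url).hostname or ""
    seed_host = urlsplit(seed).hostname or ""
    if mode == "host":
        return host == seed_host
    # Simplified "registered domain": the last two labels of the seed hostname.
    domain = ".".join(seed_host.split(".")[-2:])
    return host == domain or host.endswith("." + domain)
```

With a seed of `https://www.example.com`, domain mode accepts `blog.example.com` while host mode rejects anything that is not exactly `www.example.com`.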

Content Types

  • HTML Only: crawls regular web pages (HTML). This is the default and covers most websites.
  • HTML + Documents: also extracts text from PDFs, DOCX, XLSX, PPTX, and other document formats found on your site. Great if your site hosts downloadable files that should be searchable.

Renderer: Curl vs. Chrome

  • Curl (Fast): very fast and lightweight, with low resource use. Great for static sites, but it cannot run JavaScript.
  • Chrome (Full Browser): slower but thorough. Runs JavaScript fully and sees dynamic content (React, etc.), at the cost of speed and resources.

Use Curl for regular websites where content is in the HTML source code. Use Chrome if your site uses JavaScript frameworks (React, Vue, Angular) that load content dynamically after the page loads.

Resume vs. Start Over

Resume continues from where the crawler last stopped — it skips pages already crawled and only processes new ones. Start Over clears the crawl history and visits every page fresh, as if crawling for the first time.

Pause Between Requests (Politeness Delay)

Adds a short delay between page requests (from 0.1 seconds to several seconds). This prevents the crawler from overwhelming your web server with too many requests at once. A polite crawler is a good crawler. Increase this value if your server is under heavy load.
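The mechanism is simple: enforce a minimum gap between consecutive fetches. A minimal sketch of such a rate limiter:

```python
import time

class PolitenessDelay:
    """Ensure at least `delay` seconds pass between consecutive requests."""

    def __init__(self, delay: float):
        self.delay = delay
        self._last = None

    def wait(self):
        """Block until the politeness window since the last request has passed."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)   # be polite: let the server breathe
        self._last = time.monotonic()

# Two fetches with a 0.1 s pause between them:
limiter = PolitenessDelay(0.1)
limiter.wait()   # first request goes through immediately
limiter.wait()   # second request waits the remaining ~0.1 s
```

Raising `delay` is exactly what the "Pause Between Requests" setting does: fewer requests per second, less load on your server.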

HTTP Authentication for Protected Sites

If your website requires a username and password to access (HTTP Basic Auth), you can enter those credentials here. The crawler will use them when visiting your pages. This is useful for crawling staging environments or password-protected areas.
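Under the hood, HTTP Basic Auth is simply a base64-encoded `username:password` pair sent in the `Authorization` header, which is what the crawler attaches to each request on your behalf:

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used for HTTP Basic Auth.

    Note: base64 is encoding, not encryption, so Basic Auth should only
    be used over HTTPS (or for non-sensitive staging credentials).
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```

For example, `basic_auth_header("alice", "s3cret")` yields the same header a browser would send after you fill in a Basic Auth login prompt.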

Crawl Controls

Once your URLs and settings are configured, you control the crawler with these buttons:

Start Crawl

Begins crawling your website. The crawler will visit pages, extract content, and queue it for indexing.

Pause Crawl

Temporarily stops crawling. The crawler remembers where it left off so you can resume later.

Stop Crawl

Completely stops the current crawl session. Use this if you need to change settings or start over.

Flush to Solr

Pushes all crawled content from the staging area into your Opensolr index immediately, making it searchable right away.

What does "Flush to Solr" mean?

The crawler first stores extracted content in a staging area. Flushing sends that content into your actual Opensolr index. Think of it as pressing "publish" — the content goes live for search.

Automate crawler controls via API

All of these controls are also available as API endpoints: Start, Stop, Pause, and Flush to Solr. Use them to integrate crawler operations into your CI/CD pipeline or scheduling scripts.
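A deploy script might drive these controls roughly as follows. The base URL and endpoint paths below are hypothetical placeholders; substitute the real ones from the Opensolr API reference:

```python
# Hypothetical base URL -- replace with the real Opensolr API endpoint.
BASE = "https://api.opensolr.example/crawler"

ACTIONS = {"start", "stop", "pause", "flush"}

def control_url(action: str, index_name: str) -> str:
    """Build the control-endpoint URL for a crawler action (sketch only)."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"{BASE}/{action}?index={index_name}"

# In a CI/CD pipeline you might kick off a crawl after each deploy, e.g.:
#   urllib.request.urlopen(control_url("start", "my_index"))
```

Pairing `start` after a deploy with a later `flush` is one way to make new content searchable without any manual dashboard clicks.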

Scheduling Recurring Crawls

You can set the crawler to run automatically on a schedule, so your index always stays up to date without manual work.

  1. Go to the Scheduling section in the Web Crawler tab.
  2. Choose a frequency — daily, weekly, or a custom interval.
  3. Save your schedule — the crawler will automatically start at the scheduled time and crawl your site.

Set it and forget it

Once scheduled, the crawler runs on autopilot. New pages on your site will be discovered and indexed automatically on your chosen schedule.

Keep Your Index Fresh

Websites change over time — pages get updated, new pages are added, old pages are removed. The "Keep Index Fresh" feature automatically re-crawls your site every N days to make sure your search index reflects the current state of your website.

When enabled, the crawler periodically revisits all known pages and updates the index with any changes. Pages that no longer exist are removed from the index.

Reindex All vs. Reindex From Scratch

Reindex All

Re-visits every known URL and updates the content in your index. Does NOT delete anything first. Faster because the crawler already knows what pages exist. Use this when your content has changed and you want to refresh everything.

Reindex From Scratch

Deletes ALL existing crawl data and starts fresh, as if crawling for the very first time. Use this when you have restructured your site, changed domains, or want a completely clean slate.

Choose wisely

Reindex All is usually what you want — it is faster and does not cause any downtime in your search. Reindex From Scratch temporarily removes all content from your index while the crawler works, so your search results may be empty or incomplete until the crawl finishes.

Monitoring Your Crawl

While the crawler is running (or after it finishes), you can monitor its progress in real time:

  • Pages Crawled — how many pages the crawler has visited so far.
  • Pages Queued — how many pages are waiting to be crawled.
  • Errors — pages that could not be crawled (404 not found, 403 forbidden, server errors, etc.).
  • Crawl Speed — how many pages per second the crawler is processing.

The Crawl Log gives you a detailed breakdown of every URL the crawler visited, what happened (success or error), and the HTTP status code returned. This is invaluable for troubleshooting issues like broken links or pages that block the crawler. You can also retrieve live crawl statistics programmatically via the Get Live Crawler Stats API.
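A monitoring script can poll those stats and turn them into a one-line progress report. The JSON field names below are hypothetical — check the Get Live Crawler Stats API for the actual schema:

```python
import json

# Example payload shaped like live crawler stats (field names hypothetical).
sample = '{"crawled": 1240, "queued": 310, "errors": 12, "pages_per_sec": 4.2}'

def summarize(stats_json: str) -> str:
    """Turn a stats payload into a one-line progress summary."""
    s = json.loads(stats_json)
    total = s["crawled"] + s["queued"]          # pages known so far
    pct = 100 * s["crawled"] / total if total else 0.0
    return (f"{s['crawled']}/{total} pages ({pct:.0f}% done), "
            f"{s['errors']} errors, {s['pages_per_sec']} pages/sec")
```

Run against the sample payload, this reports 1240 of 1550 known pages crawled (80% done) at 4.2 pages/sec.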

Common status codes in the crawl log

  • 200 = Success (page crawled normally).
  • 301/302 = Page redirected.
  • 403 = Access forbidden.
  • 404 = Page not found.
  • 500 = Server error.

If you see many errors, check your website's health or adjust the crawler settings.
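When scripting against the crawl log, it helps to bucket status codes the same way you would read them by eye. A small sketch:

```python
def triage(status: int) -> str:
    """Classify an HTTP status code the way you'd read it in the crawl log."""
    if 200 <= status < 300:
        return "success"
    if status in (301, 302):
        return "redirect"
    if status == 403:
        return "blocked"       # access forbidden: check robots.txt or auth settings
    if status == 404:
        return "missing"       # likely a broken link on your site
    if status >= 500:
        return "server-error"  # points at your website's health, not the crawler
    return "other"
```

Counting log entries per bucket quickly shows whether a failed crawl is mostly broken links (404s) or a struggling server (5xx).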

Related Documentation