Getting Started with the Opensolr Web Crawler
What is the Opensolr Web Crawler?
The Opensolr Web Crawler is an AI-powered web crawling and indexing engine that automatically crawls your website, extracts content from every page, enriches it with AI analysis (sentiment, language detection), generates vector embeddings for semantic search, and stores everything in a high-performance Apache Solr index — ready to search immediately.
Think of it as your own private Google, but for your website. You point it at your site, it crawls every page, and within minutes you have a fully searchable index with all the bells and whistles: full-text search, autocomplete, spell checking, AI-powered semantic search, and more. For a complete overview of everything the crawler does — crawl modes, JS rendering, analytics, query elevation, URL exclusion, and more — see the Opensolr Web Crawler — Site Search Solution.
Step 1: Create an Opensolr Account
Head over to opensolr.com and create a free account. No credit card required to get started.
Step 2: Add a New Index on a Web Crawler Server
Once logged in, go to Control Panel → Add New Index. You will see a list of available Solr servers across different regions.
Important: You need to select a Web Crawler server. These are the servers that have the crawling engine built in. You can easily spot them — they are marked with a small spider icon next to the server name.
You can also use the Crawler filter dropdown at the top of the server list to show only Web Crawler-enabled servers. Select "Yes" in that dropdown and you will only see the crawler-capable servers.
Currently available Web Crawler regions:
- EU-NORTH (Helsinki, Finland) — FINLAND9
- US-EAST (Chicago) — CHICAGO-96
- More regions may be added in the future
Pick the region closest to your website audience for the best performance, give your index a name, and click create.
Step 3: Start Crawling Your Website
Once your index is created, go into the Index Control Panel and click on "Web Crawler" in the left sidebar menu.
Here you can:
- Enter your starting URL — this is typically your homepage (e.g., https://yoursite.com)
- Click Start to begin the crawling process
- Monitor progress in real-time — you will see pages being discovered, crawled, and indexed live
The crawler will automatically:
- Follow all internal links on your site
- Extract the page title, meta description, full body text, author info, and OG images
- Detect the language of each page
- Run sentiment analysis on the content
- Generate 1024-dimensional vector embeddings for AI/semantic search
- Store all extracted metadata (Open Graph tags, Twitter cards, icons, etc.)
- Deduplicate content using MD5 signatures
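The deduplication step above can be sketched in a few lines: a content signature is an MD5 hash over normalized page text, so two pages whose body text differs only in whitespace or casing map to the same signature. This is a minimal illustration of the idea, not Opensolr's exact normalization rules.

```python
import hashlib

def content_signature(text: str) -> str:
    """MD5 signature over normalized text: lowercase, collapsed whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Pages differing only in whitespace/casing produce the same signature:
a = content_signature("Welcome to   Our Site\n")
b = content_signature("welcome to our site")
assert a == b
```

A crawler can keep a set of seen signatures and skip indexing any page whose signature it has already stored.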
How the Crawler Works
The Opensolr Web Crawler works similarly to how Google and other search engines crawl the web:
- It starts from the URL you provide and follows every link it finds on your pages
- It respects your robots.txt file — if you have blocked certain paths in robots.txt, the crawler will honor those rules
- It checks for valid URLs, proper HTTP status codes, and well-formed HTML
- It will not index pages that return errors (404, 500, etc.)
- It supports JavaScript-rendered pages — if your site uses React, Vue, Angular, Next.js, or similar frameworks, you can select Chrome (JS Rendering) from the Renderer dropdown in the crawler settings. This tells the crawler to render every page through a headless Chromium browser (Playwright) so JavaScript-generated content is fully captured. By default, the crawler uses Curl (Fast) — a lightweight HTTP fetch with no browser overhead, which is the right choice for most websites
- It extracts content from HTML pages, PDFs, DOCX, ODT, XLSX, and more
- It detects and handles Cloudflare and other anti-bot protections
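You can check for yourself which paths a robots.txt-respecting crawler will skip, using Python's standard library. This is a local check independent of Opensolr; swap in your site's actual robots.txt content.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt rules (replace with your site's actual file,
# or fetch it via rp.set_url(...) + rp.read())
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A robots.txt-respecting crawler skips /private/ and crawls the rest
print(rp.can_fetch("*", "/private/report.html"))
print(rp.can_fetch("*", "/blog/post-1"))
```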
Make Sure Your Site is Crawl-Friendly
Before starting the crawler, make sure:
- The crawler is not blocked by your robots.txt or firewall rules
- Your pages have proper <title> tags and <meta name="description"> tags — these become the most important search fields
- Your site returns HTTP 200 status codes for pages you want indexed
- Internal links should be clean and working (no broken links)
- If your site is behind authentication (login required), the crawler will not be able to access those pages
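Before starting a crawl, you can sanity-check the title/description requirement on your own pages. Below is a small sketch using only the standard library; feed it the HTML of any page you expect to be indexed.

```python
from html.parser import HTMLParser

class SEOCheck(HTMLParser):
    """Flags whether a page has a <title> and a <meta name="description">."""
    def __init__(self):
        super().__init__()
        self.has_title = False
        self.has_description = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.has_title = True
        if tag == "meta" and dict(attrs).get("name") == "description":
            self.has_description = True

def crawl_ready(html: str) -> bool:
    checker = SEOCheck()
    checker.feed(html)
    return checker.has_title and checker.has_description

page = '<html><head><title>Home</title><meta name="description" content="Hi"></head></html>'
print(crawl_ready(page))  # True
```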
What Happens Next?
Once the crawling is complete, your Opensolr Index is fully populated and ready to use. Your index now includes hybrid vector search, AI hints, spellcheck, analytics, and query elevation — all active by default. See the full feature overview for details on everything that ships with your index.

You have three options:
Option A: Use the Built-in Search UI (Embed Code)
Opensolr provides a ready-made, responsive search interface that you can embed on your website with just two lines of HTML. See the next article: Embedding the Opensolr Search UI.
Option B: Build Your Own Custom Search UI
If you want full control over the look and feel, you can query the Solr index directly using the native Solr /select API and build your own frontend. This developer guide covers everything you need. See: Web Crawler Index Field Reference.
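A custom frontend boils down to building /select query URLs and parsing the JSON response. The sketch below assumes a placeholder host and index name, and the field names and weights (title, description, content) are illustrative — check the Web Crawler Index Field Reference for the actual schema.

```python
from urllib.parse import urlencode

# Placeholder host/index — use the values from your Opensolr dashboard
SOLR_BASE = "https://your-server.opensolr.com/solr/your_index"

def build_select_url(query: str, rows: int = 10) -> str:
    """Build a standard Solr /select query URL with highlighting enabled."""
    params = {
        "q": query,                             # the user's search terms
        "defType": "edismax",                   # parser that supports field weighting
        "qf": "title^5 description^2 content",  # illustrative field weights
        "hl": "true",                           # return highlighted snippets
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"

print(build_select_url("solar panels"))
# Fetch this URL with any HTTP client, then read the hits from
# response["response"]["docs"] in the returned JSON.
```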
Option C: Push Content via the Data Ingestion API
Need to add content the crawler can't reach? Internal databases, gated pages, CMS drafts, product feeds — you can push documents directly into your index via the Data Ingestion API. Every document goes through the same enrichment pipeline: vector embeddings, sentiment analysis, language detection, autocomplete tags — all generated automatically. Documents from the API and the crawler coexist seamlessly in the same index.
You can monitor and manage all your ingestion jobs from the Ingestion Queue in your control panel.
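The push workflow can be sketched as a JSON POST carrying your documents. The endpoint URL, authentication, and document fields below are hypothetical placeholders — consult the Data Ingestion API documentation for the real URL, credentials, and schema.

```python
import json
from urllib import request

# Hypothetical endpoint — replace with the real Data Ingestion API URL
# and add whatever authentication the API requires.
INGEST_URL = "https://your-server.opensolr.com/ingest"

def push_document(doc: dict) -> request.Request:
    """Build a JSON POST request that pushes one document into the index."""
    body = json.dumps([doc]).encode("utf-8")
    return request.Request(
        INGEST_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = push_document({
    "id": "internal-doc-001",
    "title": "Q3 Product Catalog",
    "content": "Full text the crawler could not reach...",
})
# request.urlopen(req) would submit it; per the docs, the enrichment
# pipeline (embeddings, sentiment, language) runs server-side afterward.
```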
Your site's content is smart — your search should be too.
Want to see everything the Web Crawler can do? Check the complete Web Crawler overview — crawl modes, live demos, analytics dashboards, URL exclusion, and more.
Query Elevation — Pin & Exclude Results
Once your search is live, you can take direct control of what appears in results. Query Elevation lets you pin important pages to the top for specific queries, or exclude irrelevant results entirely — all from the Search UI, with zero code.
- Pin — force a result to position #1 for a specific search query
- Exclude — hide a result for a specific query
- Exclude All — hide a result from every search globally
Enable it in your index settings and you're good to go.
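The Opensolr UI handles all of this without code, but for context: standard Solr exposes pin/exclude behavior through its Query Elevation Component, which accepts elevateIds and excludeIds per request when the component is enabled on the handler. A sketch of what such a request looks like, with a placeholder host and made-up document IDs:

```python
from urllib.parse import urlencode

# Placeholder base URL and document IDs for illustration only
SOLR_BASE = "https://your-server.opensolr.com/solr/your_index"

params = {
    "q": "pricing",
    "enableElevation": "true",
    "elevateIds": "doc-42",        # force this document to the top
    "excludeIds": "doc-7,doc-13",  # hide these results for this query
    "wt": "json",
}
url = f"{SOLR_BASE}/select?{urlencode(params)}"
print(url)
```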