Getting Started with the Opensolr Web Crawler

What is the Opensolr Web Crawler?

The Opensolr Web Crawler is an AI-powered web crawling and indexing engine that automatically crawls your website, extracts content from every page, enriches it with NLP analysis (sentiment, language detection, named entities), generates vector embeddings for semantic search, and stores everything in a high-performance Apache Solr 9.x index — ready to search immediately.

Think of it as your own private Google, but for your website. You point it at your site, it crawls every page, and within minutes you have a fully searchable index with all the bells and whistles: full-text search, autocomplete, spell checking, AI-powered semantic search, and more.


Step 1: Create an Opensolr Account

Head over to opensolr.com and create a free account. No credit card required to get started.


Step 2: Add a New Index on a Web Crawler Server

Once logged in, go to Control Panel → Add New Index. You will see a list of available Solr servers across different regions.

Important: You need to select a Web Crawler server. These are the servers that have the crawling engine built in. You can easily spot them — they are marked with a small spider icon next to the server name.

You can also use the Crawler filter dropdown at the top of the server list to show only Web Crawler-enabled servers. Select "Yes" in that dropdown and you will only see the crawler-capable servers.

Currently available Web Crawler regions:

  • EU-NORTH (Helsinki, Finland) — FINLAND9
  • US-EAST (Chicago) — CHICAGO-96
  • More regions may be added in the future

Pick the region closest to your website audience for the best performance, give your index a name, and click Create.


Step 3: Start Crawling Your Website

Once your index is created, go into the Index Control Panel and click on "Web Crawler" in the left sidebar menu.

Here you can:

  1. Enter your starting URL — this is typically your homepage (e.g., https://yoursite.com)
  2. Click Start to begin the crawling process
  3. Monitor progress in real-time — you will see pages being discovered, crawled, and indexed live

The crawler will automatically:

  • Follow all internal links on your site
  • Extract the page title, meta description, full body text, author info, and OG images
  • Detect the language of each page
  • Run sentiment analysis on the content
  • Generate 1024-dimensional vector embeddings for AI/semantic search
  • Store all extracted metadata (Open Graph tags, Twitter cards, icons, etc.)
  • Deduplicate content using MD5 signatures
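The deduplication step above can be sketched in a few lines. This is an illustration of the general MD5-signature technique, not the crawler's actual implementation (its exact normalization rules are not documented here):

```python
import hashlib

def content_signature(body_text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # differences don't defeat deduplication (an assumption)
    normalized = " ".join(body_text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(body_text: str) -> bool:
    sig = content_signature(body_text)
    if sig in seen:
        return True
    seen.add(sig)
    return False
```

Two pages whose body text differs only in whitespace or letter case would then collapse into a single indexed copy.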

How the Crawler Works

The Opensolr Web Crawler works similarly to how Google and other search engines crawl the web:

  • It starts from the URL you provide and follows every link it finds on your pages
  • It respects your robots.txt file — if you have blocked certain paths in robots.txt, the crawler will honor those rules
  • It checks for valid URLs, proper HTTP status codes, and well-formed HTML
  • It will not index pages that return errors (404, 500, etc.)
  • It handles JavaScript-rendered pages — if your site uses React, Vue, Angular, Next.js, or similar frameworks, the crawler can render pages with a headless browser (Playwright/Chromium) to get the fully rendered content
  • It extracts content from HTML pages, PDFs, DOCX, ODT, XLSX, and more
  • It detects and handles Cloudflare and other anti-bot protections
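The link-following behavior described above (follow internal links, skip fragment-only anchors, stay on the starting domain) can be approximated with Python's standard library. A simplified sketch of the technique, not the crawler itself:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith("#"):
            return  # fragment-only links point to the same page
        url = urljoin(self.base, href)
        if urlparse(url).netloc == urlparse(self.base).netloc:
            self.links.add(url)  # keep only same-domain links

def extract_internal_links(html: str, base_url: str) -> set:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A real crawler would feed each discovered link back into a queue and repeat until no new internal URLs remain.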

Make Sure Your Site is Crawl-Friendly

Before starting the crawler, make sure that:

  • The crawler is not blocked by your robots.txt or firewall rules
  • Your pages have proper <title> tags and <meta name="description"> tags — these become the most important search fields
  • Pages you want indexed return an HTTP 200 status code
  • Internal links are clean and working (no broken links)

Note that pages behind authentication (login required) cannot be accessed by the crawler.
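You can spot-check a page against this list before crawling. A minimal sketch using only the standard library (the regex checks are approximate; a real HTML parser would be more robust):

```python
import re

def preflight(html: str, status: int) -> list:
    """Return a list of crawlability warnings for one page."""
    problems = []
    if status != 200:
        problems.append(f"non-200 status: {status}")
    # Require a <title> with at least one visible character
    if not re.search(r"<title[^>]*>\s*[^<\s]", html, re.I | re.S):
        problems.append("missing or empty <title>")
    if not re.search(r'<meta[^>]+name=["\']description["\']', html, re.I):
        problems.append('missing <meta name="description">')
    return problems
```

Fetch the page with any HTTP client, then pass the body and status code; an empty list means the basic checks pass.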

What Happens Next?

Once the crawling is complete, your Opensolr Index is fully populated and ready to use. You have two options:

Option A: Use the Built-in Search UI (Embed Code)

Opensolr provides a ready-made, responsive search interface that you can embed on your website with just two lines of HTML. See the next article: Embedding the Opensolr Search UI.

Option B: Build Your Own Custom Search UI

If you want full control over the look and feel, you can query the Solr index directly using the native Solr /select API and build your own frontend. This developer guide covers everything you need. See: Web Crawler Index Field Reference.
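As a starting point for a custom frontend, here is a minimal query against the /select endpoint using only the Python standard library. The endpoint URL is a placeholder (copy the real one from your Opensolr dashboard), and the qf boosts are one reasonable choice, not a prescribed setting; field names follow the Web Crawler Index Field Reference:

```python
import json
import urllib.parse
import urllib.request

# Placeholder — use the index URL shown in your Opensolr dashboard
SOLR_URL = "https://your-node.opensolr.com/solr/your_index/select"

def build_params(query: str, rows: int = 10) -> dict:
    # edismax with boosts: title matches rank above description, then body text
    return {
        "q": query,
        "defType": "edismax",
        "qf": "title^3 description^2 text",
        "fl": "uri,title,description",
        "rows": rows,
        "wt": "json",
    }

def search(query: str) -> list:
    url = SOLR_URL + "?" + urllib.parse.urlencode(build_params(query))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["docs"]
```

Each returned doc is a dict with the fields requested in fl, ready to render in whatever frontend you build.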


Your site's content is smart — your search should be too.

Crawler Configuration & Best Practices

Once your Opensolr Web Crawler index is created and you have entered a starting URL, there are several things worth knowing about how the crawler operates and how to get the best results.


📄 Content Types the Crawler Handles

The crawler does not just handle plain HTML pages. It can extract text and metadata from:

  • HTML pages — The primary content type. Extracts title, description, body text, meta tags, OG tags, author, and more.
  • PDF documents — Full text extraction from PDF files linked on your site.
  • DOCX / ODT — Microsoft Word and OpenDocument text files.
  • XLSX — Excel spreadsheets (extracts cell data as text).
  • Images — Extracts EXIF metadata, GPS coordinates, and alt text.

If your site links to downloadable documents, the crawler will follow those links and index the document contents just like it indexes HTML pages.


⚡ JavaScript-Rendered Pages (SPA / React / Vue / Angular)

Many modern websites are built with JavaScript frameworks like React, Vue, Angular, Next.js, Gatsby, or Nuxt. These sites often render their content dynamically in the browser — the raw HTML that a simple HTTP request returns may be empty or contain only a loading spinner.

The Opensolr Web Crawler handles this automatically:

  1. It first tries a fast, lightweight HTTP fetch (no browser)
  2. If it detects that the page likely needs JavaScript rendering (empty body, SPA framework markers, <noscript> warnings, high script-to-content ratio), it automatically switches to a headless Chromium browser (Playwright)
  3. The browser renders the full page, waits for JavaScript to execute, and then extracts the fully rendered content

This means your React or Angular single-page application will be crawled correctly — no special configuration needed.
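The detection heuristics in step 2 can be illustrated with a simplified check. This is a sketch of the general idea, not Opensolr's actual logic; the 200-character threshold and the SPA marker strings are made-up examples:

```python
import re

# Hypothetical markers that hint at a client-rendered shell page
SPA_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__", "ng-version")

def needs_js_rendering(html: str) -> bool:
    # Strip scripts, then strip all tags to estimate visible text
    no_scripts = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    visible = " ".join(re.sub(r"<[^>]+>", " ", no_scripts).split())
    if len(visible) >= 200:
        return False  # enough server-rendered text; no browser needed
    has_spa_marker = any(m in html for m in SPA_MARKERS)
    has_noscript = "<noscript" in html.lower()
    return has_spa_marker or has_noscript or len(visible) == 0
```

A page that trips this check would be re-fetched with the headless browser; everything else stays on the fast HTTP path.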

Tips for JS-Heavy Sites

  • Make sure your site does not block headless browsers in its server configuration
  • If you use client-side routing, ensure that direct URL access (not just SPA navigation) works for each page
  • The crawler blocks ads, analytics, fonts, and social media embeds automatically for faster rendering

🤖 robots.txt

The crawler fully respects your robots.txt file. Before crawling any page, it checks the robots.txt rules for that domain.

If you want to allow the Opensolr crawler while blocking other bots, you can add specific rules:

User-agent: *
Disallow: /admin/
Disallow: /private/

# Allow Opensolr crawler everywhere
User-agent: OpensolrBot
Allow: /
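You can verify how these rules behave with Python's built-in robots.txt parser before pointing the crawler at your site — here parsing the example rules above directly instead of fetching them:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally: rp.set_url("https://yoursite.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: OpensolrBot
Allow: /
""".splitlines())

rp.can_fetch("OpensolrBot", "https://yoursite.com/admin/page")   # allowed by its own group
rp.can_fetch("SomeOtherBot", "https://yoursite.com/admin/page")  # blocked by the * group
```

Because "OpensolrBot" has its own User-agent group, it never falls through to the stricter wildcard rules.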

Common Pitfalls

  • If your robots.txt blocks / entirely, the crawler cannot crawl anything
  • Some CMS platforms (WordPress, Drupal) have robots.txt rules that may block crawlers from CSS/JS files — this can prevent proper rendering of JS-heavy pages
  • Check your robots.txt at https://yoursite.com/robots.txt before starting the crawler

URL Validation & Quality Checks

The crawler is strict about URL quality, similar to how Google's crawler operates:

  • Only valid HTTP/HTTPS URLs are followed
  • Fragment-only links (#section) are ignored (they point to the same page)
  • Duplicate URLs are detected and skipped (normalized by removing trailing slashes, sorting query parameters, etc.)
  • Redirect chains are followed (301, 302) up to a reasonable limit
  • Broken links (404, 500, etc.) are logged but not indexed
  • External links (pointing to a different domain) are not followed by default — the crawler stays within the domain(s) you specify
  • Content deduplication — if two different URLs serve the exact same content (detected via MD5 hash), only one copy is indexed
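The URL normalization mentioned above (trailing slashes, query-parameter order) can be sketched like this — an illustration of the technique, not the crawler's exact rules:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    parts = urlsplit(url)
    # Drop trailing slashes (but keep the root path itself)
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 compare equal
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Lowercase scheme/host and drop the fragment entirely
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

Two URLs that normalize to the same string are treated as one page, so only one copy is fetched and indexed.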

🔄 Recrawling & Scheduling

After the initial crawl completes, your pages are indexed and searchable. But websites change — new pages are added, existing pages are updated, some pages are removed.

You can recrawl your site at any time from the Web Crawler section of your Index Control Panel. The crawler will:

  • Discover and index new pages
  • Update content on pages that changed since the last crawl
  • Handle pages that no longer exist

You can also schedule automatic recrawls to keep your index fresh without manual intervention.


Crawl Depth & Coverage

The crawler starts from the URL you provide and follows links recursively. The crawl depth determines how many link-hops away from the starting URL the crawler will go:

  • Depth 1 — Only the starting page
  • Depth 2 — The starting page plus all pages linked from it
  • Depth 3 — Two hops away from the starting page
  • And so on...

For most websites, the default depth setting covers the entire site. If your site has a deep link structure (many clicks from the homepage to reach some pages), you may need to increase the crawl depth or provide additional starting URLs.

Tips for Better Coverage

  • Provide a sitemap URL as the starting point if your site has one (https://yoursite.com/sitemap.xml) — the crawler can parse XML sitemaps and discover all listed URLs directly
  • If some pages are not being discovered, it usually means they are not linked from anywhere that the crawler can reach — add internal links or provide the URL directly
  • Pages behind login/authentication walls cannot be crawled
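If you want to see exactly which URLs a sitemap exposes before handing it to the crawler, the standard sitemap XML format is easy to parse. A quick sketch (fetch the file with any HTTP client first):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace from sitemaps.org
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    # Each <url><loc>...</loc></url> entry lists one crawlable page
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS)]
```

Any page listed here but missing from your index is worth investigating (blocked by robots.txt, returning a non-200 status, etc.).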

What Gets Indexed

For every successfully crawled page, the following data is extracted and indexed:

  Data              | Source                         | Index Field
  ------------------|--------------------------------|----------------------------------------
  Page URL          | The crawled URL                | uri, uri_s
  Page Title        | <title> tag                    | title, title_s
  Meta Description  | <meta name="description">      | description
  Full Body Text    | Visible text content           | text
  Author            | <meta name="author">           | author
  OG Image          | <meta property="og:image">     | og_image
  All Meta Tags     | All <meta> tags                | meta_* dynamic fields
  Content Type      | HTTP Content-Type header       | content_type
  HTTP Status       | HTTP response code             | content_status
  Publish Date      | Various date sources           | creation_date, timestamp
  Content Hash      | MD5 of body text               | signature, meta_md5
  Language          | AI language detection          | meta_detected_language
  Sentiment         | VADER sentiment analysis       | sent_pos, sent_neg, sent_neu, sent_com
  Vector Embedding  | AI embedding model (1024d)     | embeddings
  Prices            | JSON-LD, microdata, meta tags  | price_f, currency_s
  GPS Coordinates   | EXIF, structured data          | lat, lon, coords
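These fields can be combined in ordinary Solr filter queries. For example, the parameters below restrict results to positive-sentiment English pages, newest first (field names are from the table above; VADER's compound score ranges from -1 to 1, and 0.5 is an arbitrary threshold chosen for illustration):

```python
params = {
    "q": "open source search",
    "fq": [
        "meta_detected_language:en",   # AI-detected page language
        "sent_com:[0.5 TO *]",         # VADER compound sentiment score
    ],
    "sort": "timestamp desc",          # newest pages first
    "fl": "uri,title,description,sent_com",
}
```

Pass these to the /select endpoint of your index with any HTTP client.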