Opensolr Web Crawler — Site Search Solution
A fully managed platform to crawl, index, enrich, and search your web content — automatically. Point it at your site and get a production-ready AI search engine in minutes.
[Pipeline: Your URL (yoursite.com) → Crawl Pages (HTML, PDF, docs) → AI + Embed (vectors, AI enrichment) → Solr Index (structured & searchable) → Search Live (AI Hints, Vector Search, Reader)]
Fully automated — from URL to production search in minutes

The Opensolr Web Crawler crawls your entire site, extracts structured data, generates vector embeddings, and feeds everything into Solr — fully indexed and ready to search. No manual config. No fiddling with schemas. Just point it at your site and go.

For setup details, assistance, or pricing: support@opensolr.com


Key Features

Every crawler index includes these out of the box:
Full-Site Crawling
Crawls every page, PDF, document, and image. Scheduled recrawling, pause/resume, and incremental indexing built in.
AI Enrichment
Automatic sentiment analysis, language detection, vector embeddings, and AI-powered search enrichment for every document.
Hybrid Vector Search
BGE-m3 embeddings (1024-dim) combined with BM25 scoring. Users search by intent, not just keywords. Multilingual.
AI Hints & Document Reader
RAG-powered instant answers. Click "Read" on any result for an AI summary with PDF export. Enabled by default.
Rich Content Support
HTML, PDF, Word, Excel, images — including metadata, GPS coordinates, and structured data (JSON-LD, microdata).
Query Elevation
Pin results to the top or exclude them from any query — directly from the Search UI, no code needed. Full guide →
Query Analytics
Full search analytics — top queries, daily trends, query length distribution, and CSV export.
No-Results Dashboard
Track searches that return zero results. Find content gaps and fix them. Learn more →
Click Analytics & CTR
Track result clicks and click-through rates. Find low-CTR queries that need better results. Learn more →
Data Ingestion API
Push documents directly from your CMS, app, or data pipeline. Same enrichment as the crawler — embeddings, sentiment, language detection. Full API reference →

Search Analytics — Understand Your Users

Every Opensolr Web Crawler index includes a full analytics suite built right into the Query Analytics dashboard. No third-party tools, no extra setup — it starts working the moment your search goes live.
[Query Analytics Dashboard — Total Searches: 12,847 · Unique Queries: 3,291 · Zero-Result Queries: 47. Tabs: Overview, Queries, No Results, Click Analytics, Elevation Rules. Example click-through rates: winter boots 68%, product manual 31%, return policy 8%.]
No-Results Dashboard
Tracks every search that returns zero results — by unique IP, so one frustrated user refreshing doesn't inflate the count. Spot content gaps, missing synonyms, or pages you haven't crawled yet. Each zero-result query can be resolved by adding content, configuring synonyms, or creating an elevation rule.
Click Analytics and CTR
See which results users actually click — tracked per query, per document, per position. Three views: Top Clicked documents across all queries, By Query with click-through rates, and Low CTR to find high-impression queries where nobody clicks (= relevance problem). All IP-deduplicated and rate-limited.

URL Exclusion via Config — Hide Pages from Search

Beyond per-result exclusion with Query Elevation, you can permanently block entire URL patterns from ever appearing in search results. This is done by adding filter queries to your index's search.xml configuration file — no code changes, no reindexing required. The filter applies to every search query automatically.
search.xml
<?xml version="1.0" encoding="UTF-8" ?>

<!-- /select request handler in search.xml -->
<requestHandler name="/select" class="solr.SearchHandler">

    <!-- Default query parameters -->
    <lst name="defaults">
        <str name="df">text</str>
        <str name="echoParams">explicit</str>
    </lst>

    <!-- Appended filter queries — added to EVERY search automatically -->
    <!-- Users cannot override appended params, making them permanent -->
    <lst name="appends">
        <str name="fq">-uri_s:*/checkout/*</str>
        <str name="fq">-uri_s:*/thank-you*</str>
        <str name="fq">-uri_s:*/admin/*</str>
        <str name="fq">-uri_s:*/product-category/*</str>
        <str name="fq">-uri_s:*/login*</str>
    </lst>

    <!-- Spellcheck component -->
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>

</requestHandler>
Add -uri_s:*/pattern/* filter queries inside the <lst name="appends"> block. Common exclusions:
  • Checkout / cart pages: -uri_s:*/checkout/*
  • Thank-you / confirmation pages: -uri_s:*/thank-you*
  • Admin / internal pages: -uri_s:*/admin/*
  • Category / tag listing pages: -uri_s:*/product-category/*
  • Login / registration pages: -uri_s:*/login*
These filters are appended to every query, so the excluded pages never appear in any search results. To edit your search.xml, open your Opensolr index control panel and navigate to Config Files.
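Before editing search.xml, it can help to sanity-check candidate patterns against known URLs. The sketch below mimics the wildcard patterns with Python glob matching — an illustrative approximation, not Solr's actual query matching:

```python
from fnmatch import fnmatchcase

# The exclusion patterns exactly as they appear in the appended fq
# entries, without the leading "-uri_s:" prefix.
EXCLUDE_PATTERNS = [
    "*/checkout/*",
    "*/thank-you*",
    "*/admin/*",
    "*/product-category/*",
    "*/login*",
]

def is_excluded(uri: str) -> bool:
    """Return True if the URI matches any exclusion pattern."""
    return any(fnmatchcase(uri, p) for p in EXCLUDE_PATTERNS)

print(is_excluded("https://example.com/checkout/step-1"))  # True
print(is_excluded("https://example.com/login"))            # True
print(is_excluded("https://example.com/blog/2024/post"))   # False
```

If a URL you expect in search results matches one of your patterns here, tighten the pattern before adding it to the appends block.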

Data Ingestion API — Push Your Own Documents

The Web Crawler handles automated site indexing, but sometimes you need to push content that the crawler can't reach — internal databases, CMS drafts, gated content, product feeds, or any structured data from your application. The Data Ingestion API is live and ready to use.

Required Config Set: Data Ingestion requires the Web Crawler config set. For Solr 9, download the mandatory config set for Solr 9 and upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

The Data Ingestion API lets you POST documents directly into your Web Crawler index. Every document goes through the same enrichment pipeline as crawled pages: vector embeddings, sentiment analysis, language detection, and all derived search fields are generated automatically. Documents from both sources coexist seamlessly in the same index.

What You Can Do
  • Push up to 100 documents per batch via JSON body or .json file upload
  • Update existing documents — same URI = same document (ID is always md5 of URI)
  • Extract text from PDFs, DOCX, PPTX, XLSX, ODT, RTF via rtf:true
  • Automatic dedup protection — duplicate URIs already queued are rejected
  • Returns doc_ids array for tracking every document in the batch
  • Detailed per-document Solr error reporting in the job queue
  • Monitor job progress, pause, resume, or cancel
Auto-Generated for Every Document
  • 1024-dim BGE-m3 vector embeddings
  • Sentiment scores (VADER)
  • Language detection
  • Autocomplete tags, phonetic fields, spellcheck
Web Crawler indexes only. The Data Ingestion API is available exclusively for indexes created on Web Crawler servers. Standard Solr indexes use the native Solr update API instead.

Full Data Ingestion API Reference → | Manage Your Ingestion Queue →
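As a sketch of preparing a batch push: the endpoint URL and field names below are assumptions (see the full API reference for the real contract), while the md5-of-URI document ID, the 100-document batch limit, and URI-based dedup come from the description above:

```python
import hashlib
import json

MAX_BATCH = 100  # the API accepts up to 100 documents per batch

def build_batch(docs):
    """Validate a batch and derive each document's ID (md5 of its URI).

    Fields other than the URI-derived ID are illustrative — consult the
    Data Ingestion API reference for the actual document schema.
    """
    if len(docs) > MAX_BATCH:
        raise ValueError(f"batch too large: {len(docs)} > {MAX_BATCH}")
    seen = set()
    batch = []
    for doc in docs:
        uri = doc["uri"]
        if uri in seen:  # same URI = same document; skip in-batch duplicates
            continue
        seen.add(uri)
        batch.append({**doc, "id": hashlib.md5(uri.encode()).hexdigest()})
    return batch

payload = build_batch([
    {"uri": "https://example.com/faq", "title": "FAQ"},
    {"uri": "https://example.com/faq", "title": "FAQ (duplicate)"},
])
print(json.dumps(payload, indent=2))
# The actual POST goes to your index's ingestion endpoint — placeholder here:
# requests.post("https://<your-server>/api/ingest/<index>", json=payload)
```

The enrichment pipeline (embeddings, sentiment, language detection) runs server-side after the POST, so the payload only needs your raw fields.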


Live Demos

Vector Search (AI-powered)

Keyword Search Demos


Crawl Modes

The crawl mode controls how far the crawler follows links from your starting URL. There are three scope types — each available in full depth or shallow (depth 1) variants.

Domain Scope (Modes 1 & 4): all subdomains of example.com are crawled (www.*, shop.*, help.*).
Host Scope (Modes 2 & 5): a single hostname only.
Path Scope (Modes 3 & 6): a specific path section only (e.g., www.example.com/blog).

Mode 1 — Follow Domain Links (full depth)

Crawls all pages across the entire domain, including all subdomains.

Example: Start URL is https://www.example.com/blog. The crawler will follow links to www.example.com, shop.example.com, help.example.com — anything on example.com.

Best for: Indexing an entire website including all its subdomains.

Mode 2 — Follow Host Links (full depth)

Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.

Example: Start URL is https://www.example.com/blog. The crawler will follow links on www.example.com only. Links to shop.example.com or help.example.com are ignored.

Best for: Indexing one specific subdomain without pulling in content from other parts of the site.

Mode 3 — Follow Path Links (full depth)

Crawls only pages that start with the same URL path on the same host.

Example: Start URL is https://www.example.com/blog/. The crawler will follow www.example.com/blog/2024/my-post and www.example.com/blog/categories, but will skip www.example.com/about and www.example.com/shop/.

Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.

Mode 4 — Shallow Domain Crawl (depth 1)

Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.

Example: Start URL is https://www.example.com. The crawler reads the homepage, finds 50 links, and crawls those 50 pages, but does not follow any links found on those 50 pages.

Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.

Mode 5 — Shallow Host Crawl (depth 1)

Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.

Best for: A quick, shallow index of a single subdomain.

Mode 6 — Shallow Path Crawl (depth 1)

Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.

Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
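The three scope types above can be sketched as predicates over the start URL. This is an illustrative model of the rules, not the crawler's actual implementation (for instance, a real domain check would consult the public-suffix list):

```python
from urllib.parse import urlparse

def in_scope(start_url: str, candidate: str, scope: str) -> bool:
    """Decide whether a candidate link falls inside the crawl scope.

    scope: "domain" (modes 1 & 4), "host" (modes 2 & 5),
           or "path" (modes 3 & 6).
    """
    start, cand = urlparse(start_url), urlparse(candidate)
    if scope == "domain":
        # Naive registered-domain comparison: last two hostname labels.
        root = ".".join(start.hostname.split(".")[-2:])
        return cand.hostname == root or cand.hostname.endswith("." + root)
    if scope == "host":
        return cand.hostname == start.hostname
    if scope == "path":
        return (cand.hostname == start.hostname
                and cand.path.startswith(start.path))
    raise ValueError(f"unknown scope: {scope}")

start = "https://www.example.com/blog/"
print(in_scope(start, "https://shop.example.com/x", "domain"))        # True
print(in_scope(start, "https://shop.example.com/x", "host"))          # False
print(in_scope(start, "https://www.example.com/about", "path"))       # False
```

The shallow variants (modes 4-6) apply the same scope predicate but stop discovering new links after depth 1, as described above.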


Embedding & Customization

Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters.

Important: To embed the search UI on your website, contact us to have your domain whitelisted and approved for iframe embedding.

Parameter Description
&topbar=off Hide the top search bar
&q=SEARCH_QUERY Set the initial search query
&in=web/media/images Filter by content type
&og=yes/no Show or hide OG images per result
&source=WEBSITE Restrict results to a single domain
&fresh=... Apply result freshness or sentiment bias
&lang=en Filter by language
&pagination_style=scroll/pages Infinite scroll (default) or numbered pages
&ui_theme=light/dark Color theme
&layout=default/fullwidth Container width
&locale=en_us/de_de/ro_ro Filter by OG locale metadata
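These options combine as ordinary URL query parameters. A small sketch of assembling an embed URL — the base URL is a placeholder, so substitute the embed URL provided for your whitelisted domain:

```python
from urllib.parse import urlencode

# Placeholder base URL — use the embed URL you receive once your
# domain is whitelisted for iframe embedding.
BASE = "https://your-crawler.opensolr.com/search"

params = {
    "q": "winter boots",          # initial search query
    "ui_theme": "dark",           # color theme
    "pagination_style": "pages",  # numbered pages instead of infinite scroll
    "lang": "en",                 # filter by language
}
embed_url = f"{BASE}?{urlencode(params)}"
print(embed_url)
```

Any of the parameters from the table can be added to the dict; omitted parameters fall back to their defaults.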

What's New

  • Data Ingestion API — push documents directly from your CMS, app, or data pipeline. JSON body or file upload, up to 100 docs per batch, with URI-based dedup and rich text extraction. Full reference →
  • Query Elevation — pin results to the top or exclude them from any query, directly from the Search UI. No code needed. Full guide →
  • AI-Hints enabled by default for every crawler index.
  • Automatic Language Detection for every indexed document.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Pause & Resume with schedule management via UI or REST API.
  • Schedule Optimize — set your index to auto-optimize on a recurring schedule.

Solr Configuration

To enable smooth crawling and full feature support, use the ready-made Solr configs:

Do not manually modify your schema.xml for crawler indexes; this ensures all features work as designed.


Ready to Crawl Your Site?
We handle everything — from crawling to deployment. You just provide the URL.

Opensolr Web Crawler Standards

1. A page has to respond within 5 seconds (that's the page/website response time, not the page download time); otherwise it will be omitted from indexing.

2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query string in the URL), such as https://website.com?query=value.

3. In order to be indexed, pages must not contain a meta tag of the form:

<meta name="robots" content="noindex" />

4. In order to be followed for further links, pages must not contain a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As with #3 and #4, any page that should appear in search results must not include noindex, nofollow, or none in its robots meta tag.

6. Pages that should be crawled, indexed, and shown in search results must not be disallowed in the site's robots.txt file (website.tld/robots.txt).

7. Pages should have a clear, concise title, and duplicate titles should be avoided where possible. Pages without any title will always be omitted from indexing.

8. Article pages should present a creation date via either of the following meta tags:

article:published_time

or

og:updated_time

9. As a best practice, #8 also applies to all other pages, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.

10. The presence of an author, og:author, or article:creator meta tag is a best practice, even if the value is something generic such as "Admin", because it provides better data structure for search in the future.

11. The presence of a category or og:category tag will also help with faceting and a more consistent data structure.

12. If two or more pages residing at different URLs present the same actual content, each should have a canonical meta tag (e.g. <link rel="canonical" href="https://website.tld/preferred-url" />) indicating which of the URLs should be indexed. Otherwise, the search API will present duplicates in the results.
