Opensolr Web Crawler – Smarter AI-Powered Site Search 🚀

For setup, integration, and pricing, reach out anytime at support@opensolr.com.


The Complete Web Search Solution 🔍

The Opensolr Web Crawler delivers a fully automated, AI-driven site search platform built for speed, accuracy, and ease of use. In just minutes, it can index your entire website — HTML pages, PDFs, images, and documents — then power it all with Apache Solr for fast and intelligent results.

👉 Learn more: Complete Web Crawler Solution


Smarter Search with AI & NLP 🤖

Opensolr brings AI Vector Search and Hybrid Search to life — combining traditional Solr precision with powerful machine learning context.

  • Vector Search: Understands meaning, not just keywords.
    🔗 AI Vector Search Made Easy
  • Hybrid Search: Blends vector and keyword results for superior relevance.
    🔗 Opensolr + AI Hybrid Search Explained
  • AI Summaries: Automatically condenses long text and extracts entities (names, places, topics) for cleaner, more focused results.
  • Sentiment Analysis: Detects tone and flags negative or biased content.

Key Features

  • Automated Crawling: Fully scheduled, resumable, and SEO-aware.
  • Rich UI Options: Embed a sleek, responsive search interface anywhere.
  • Geo & Metadata Support: Leverage GPS data from images for location-based queries.
  • Full Solr API Access: Integrate easily with existing apps and workflows.
  • Scalable: Handles everything from small blogs to massive enterprise datasets.

Built for Real-World Use 🌐

From basic site indexing to deep semantic AI-driven discovery, Opensolr Web Crawler gives you the control and performance you need — out of the box.

Experience the next generation of intelligent search today.
Start with Opensolr →


Opensolr Web Crawler — Site Search Solution
A fully managed platform to crawl, index, enrich, and search your web content — automatically. Point it at your site and get a production-ready AI search engine in minutes.
Pipeline: YOUR URL (yoursite.com) → CRAWL PAGES (HTML, PDFs, docs) → NLP + EMBED (NER, vectors, AI) → SOLR INDEX (structured & searchable) → SEARCH LIVE (AI Hints, Vector Search, Reader). Fully automated: from URL to production search in minutes.

The Opensolr Web Crawler crawls your entire site, extracts structured data, applies NLP + NER, generates vector embeddings, and feeds everything into Solr — fully indexed and ready to search. No manual config. No fiddling with schemas. Just point it at your site and go.
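Once the crawler has populated the index, results come back through the standard Solr API. A minimal sketch of building a /select query from the consumer side; the endpoint and index name below are placeholders, not real Opensolr URLs:

```python
from urllib.parse import urlencode

# Placeholder endpoint and index name -- substitute the values from
# your Opensolr control panel.
SOLR_BASE = "https://your-node.opensolr.com/solr/your_crawler_index"

def build_select_url(query: str, rows: int = 10) -> str:
    """Build a standard Solr /select URL for a keyword query."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"

print(build_select_url("opensolr web crawler"))
```

Any HTTP client can then fetch that URL and parse the JSON response.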

For setup details, assistance, or pricing: support@opensolr.com


Key Features
Every crawler index includes these out of the box:
Full-Site Crawling
Crawls every page, PDF, document, and image. Scheduled recrawling, pause/resume, and incremental indexing built in.
NLP & Entity Recognition
OpenNLP extracts people, locations, organizations. Automatic sentiment analysis, language detection, and AI enrichment.
Hybrid Vector Search
BGE-m3 embeddings (1024-dim) combined with BM25 scoring. Users search by intent, not just keywords. Multilingual.
AI Hints & Document Reader
RAG-powered instant answers. Click "Read" on any result for an AI summary with PDF export. Enabled by default.
Rich Content Support
HTML, PDF, Word, Excel, images — including metadata, GPS coordinates, and structured data (JSON-LD, microdata).
Live Stats & Scheduling
Live crawl stats, search analytics, query inspector, and Solr debugQuery. Schedule recrawls and index optimization.
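As a rough illustration of how hybrid search blends the two signals described above, here is a minimal score-fusion sketch. The actual server-side fusion in Opensolr may differ; scores here are assumed to be pre-normalized to [0, 1]:

```python
def hybrid_rank(bm25: dict, vector: dict, alpha: float = 0.5) -> list:
    """Blend keyword (BM25) and vector-similarity scores per document.

    Illustrative only: alpha weights keyword precision against semantic
    similarity; missing scores default to 0.
    """
    docs = set(bm25) | set(vector)
    blended = {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * vector.get(d, 0.0)
               for d in docs}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# "doc2" wins on meaning even though "doc1" wins on exact keywords
print(hybrid_rank({"doc1": 0.9, "doc2": 0.4}, {"doc1": 0.2, "doc2": 0.95}))
```

Tuning alpha trades exact-match precision for semantic recall, which is the core idea behind combining BM25 with BGE-m3 embeddings.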

Live Demos

Vector Search (AI-powered)

Keyword Search Demos


Crawl Modes

The crawl mode controls how far the crawler follows links from your starting URL. There are three scope types — each available in full depth or shallow (depth 1) variants.

  • Domain Scope (Modes 1 & 4): example.com and all its subdomains (www.*, shop.*, help.*) are crawled.
  • Host Scope (Modes 2 & 5): a single hostname only.
  • Path Scope (Modes 3 & 6): a specific path section, e.g. www.example.com/blog/*.

Mode 1 — Follow Domain Links (full depth)

Crawls all pages across the entire domain, including all subdomains.

Example: Start URL is https://www.example.com/blog. The crawler will follow links to www.example.com, shop.example.com, help.example.com — anything on example.com.

Best for: Indexing an entire website including all its subdomains.

Mode 2 — Follow Host Links (full depth)

Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.

Example: Start URL is https://www.example.com/blog. The crawler will follow links on www.example.com only. Links to shop.example.com or help.example.com are ignored.

Best for: Indexing one specific subdomain without pulling in content from other parts of the site.

Mode 3 — Follow Path Links (full depth)

Crawls only pages that start with the same URL path on the same host.

Example: Start URL is https://www.example.com/blog/. The crawler will follow www.example.com/blog/2024/my-post and www.example.com/blog/categories, but will skip www.example.com/about and www.example.com/shop/.

Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.

Mode 4 — Shallow Domain Crawl (depth 1)

Same domain-level scope as Mode 1, but link discovery stops at depth 1: the crawler follows only the links found on the start page, and the resulting pages are crawled without contributing new links.

Example: Start URL is https://www.example.com The crawler reads the homepage, finds 50 links, crawls those 50 pages — but does not follow any links found on those 50 pages.

Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.

Mode 5 — Shallow Host Crawl (depth 1)

Same host-level scope as Mode 2, combined with depth-1 link discovery: the crawler stays on the exact hostname and crawls only the start page and the pages it links to directly.

Best for: A quick, shallow index of a single subdomain.

Mode 6 — Shallow Path Crawl (depth 1)

Same path-level scope as Mode 3, combined with depth-1 link discovery: the crawler stays within the URL path and crawls only the start page and the pages it links to directly.

Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
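The three scope types above can be sketched as a simple URL filter. This is an illustrative approximation, not the crawler's actual implementation (for instance, the domain check uses a naive two-label suffix rather than a public-suffix list):

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str, mode: str) -> bool:
    """Return True if `link` falls within the crawl scope of `start_url`.

    mode: "domain" (Modes 1/4), "host" (Modes 2/5), or "path" (Modes 3/6).
    Depth handling (full vs. shallow) is a separate concern, not modeled here.
    """
    start, target = urlparse(start_url), urlparse(link)
    if mode == "domain":
        # Naive registrable domain: last two labels (fine for example.com)
        base = ".".join(start.hostname.split(".")[-2:])
        return target.hostname is not None and (
            target.hostname == base or target.hostname.endswith("." + base))
    if mode == "host":
        return target.hostname == start.hostname
    if mode == "path":
        return (target.hostname == start.hostname
                and target.path.startswith(start.path))
    raise ValueError(f"unknown mode: {mode}")

# Mode 1 (domain): other subdomains stay in scope
print(in_scope("https://www.example.com/blog", "https://shop.example.com/x", "domain"))  # True
# Mode 2 (host): other subdomains are ignored
print(in_scope("https://www.example.com/blog", "https://shop.example.com/x", "host"))    # False
# Mode 3 (path): only /blog/... stays in scope
print(in_scope("https://www.example.com/blog/", "https://www.example.com/about", "path"))  # False
```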


Embedding & Customization

Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters.

Important: To embed the search UI on your website, contact us to have your domain whitelisted and approved for iframe embedding.

Parameter Description
&topbar=off Hide the top search bar
&q=SEARCH_QUERY Set the initial search query
&in=web/media/images Filter by content type
&og=yes/no Show or hide OG images per result
&source=WEBSITE Restrict results to a single domain
&fresh=... Apply result freshness or sentiment bias
&lang=en Filter by language
&pagination_style=scroll/pages Infinite scroll (default) or numbered pages
&ui_theme=light/dark Color theme
&layout=default/fullwidth Container width
&locale=en_us/de_de/ro_ro Filter by OG locale metadata
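A minimal sketch of composing an iframe src from the parameters above; the base URL is a placeholder, since the real widget URL is issued once your domain is whitelisted:

```python
from urllib.parse import urlencode

# Placeholder -- the actual embed URL is provided by Opensolr
# after your domain is approved for iframe embedding.
EMBED_BASE = "https://opensolr.com/your-search-widget"

def embed_url(**params) -> str:
    """Compose the iframe src with UI parameters from the table above."""
    return f"{EMBED_BASE}?{urlencode(params)}" if params else EMBED_BASE

src = embed_url(q="solr hosting", ui_theme="dark", pagination_style="pages", lang="en")
print(f'<iframe src="{src}" width="100%" height="600"></iframe>')
```

All parameters are optional; omitting them keeps the defaults (light theme, infinite scroll, all languages).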

What's New

  • AI-Hints enabled by default for every crawler index.
  • Automatic Language Detection and advanced NER via OpenNLP.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Pause & Resume with schedule management via UI or REST API.
  • Schedule Optimize — set your index to auto-optimize on a recurring schedule.

Solr Configuration

To enable smooth crawling and full feature support, use the ready-made Solr configs.

To ensure all features work as designed, do not manually modify the schema.xml of a crawler index.


Quick Video Demo

Ready to Crawl Your Site?
We handle everything — from crawling to deployment. You just provide the URL.

Opensolr Web Crawler Standards


1. A page must respond within 5 seconds (this is the server response time, not the page download time); otherwise the page will be omitted from indexing.

2. The web crawler will follow, but never index, dynamic pages (pages with a ? query string in the URL), such as: https://website.com?query=value

3. To be indexed, a page must not contain a meta tag of the form:

<meta name="robots" content="noindex" />

4. To be followed for further links, a page must not contain a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As in #3 and #4, any page that should appear in search results must not include noindex, nofollow, or none in its robots meta tag.

6. Pages that should be crawled, indexed, and shown in search results must not be restricted in the site's robots.txt file (website.tld/robots.txt).

7. Pages should have a clear, concise title, avoiding duplicate titles where possible. Pages without any title will always be omitted from indexing.

8. Article pages should present a creation date via one of the following meta tags:

article:published_time

or

og:updated_time

9. As a best practice, #8 also applies to any other pages, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.

10. Including an author, og:author, or article:creator meta tag is a best practice, even if the value is generic (e.g. "Admin"), as it provides better data structure for future search features.

11. A category or og:category tag also helps with faceting and a more consistent data structure.

12. If two or more pages at different URLs present the same content, each should include a canonical meta tag indicating which URL should be indexed. Otherwise, the search API will return duplicates in the results.
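Standards #3–#5 can be checked programmatically before a crawl. A minimal sketch using only the Python standard library (an illustrative pre-flight check, not part of the Opensolr crawler itself):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect <meta name="robots" content="..."> directives from a page."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives |= {d.strip().lower()
                                for d in a.get("content", "").split(",")}

def is_indexable(html: str) -> bool:
    """Apply standards #3-#5: reject pages with noindex / none directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)

print(is_indexable('<meta name="robots" content="noindex" />'))        # False
print(is_indexable('<meta name="robots" content="index, follow" />'))  # True
```

Running a check like this across your templates before requesting a crawl helps avoid pages silently dropping out of the index.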
