Opensolr Web Crawler – Smarter AI-Powered Site Search 🚀

For setup, integration, and pricing, reach out anytime at [email protected].


The Complete Web Search Solution 🔍

The Opensolr Web Crawler delivers a fully automated, AI-driven site search platform built for speed, accuracy, and ease of use. In just minutes, it can index your entire website — HTML pages, PDFs, images, and documents — then power it all with Apache Solr for fast and intelligent results.

👉 Learn more: Complete Web Crawler Solution


Smarter Search with AI & NLP 🤖

Opensolr brings AI Vector Search and Hybrid Search to life — combining traditional Solr precision with powerful machine learning context.

  • Vector Search: Understands meaning, not just keywords.
    🔗 AI Vector Search Made Easy
  • Hybrid Search: Blends vector and keyword results for superior relevance.
    🔗 Opensolr + AI Hybrid Search Explained
  • AI Summaries: Automatically condenses long text and extracts entities (names, places, topics) for cleaner, more focused results.
  • Sentiment Analysis: Detects tone and flags negative or biased content.
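
How Opensolr blends vector and keyword results internally is not described here. As a rough illustration of the general idea behind hybrid search, the sketch below uses reciprocal rank fusion (RRF), a common technique swapped in for demonstration. The document IDs and both input rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword query and a vector query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one method still appear further down.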

Key Features

  • Automated Crawling: Fully scheduled, resumable, and SEO-aware.
  • Rich UI Options: Embed a sleek, responsive search interface anywhere.
  • Geo & Metadata Support: Leverage GPS data from images for location-based queries.
  • Full Solr API Access: Integrate easily with existing apps and workflows.
  • Scalable: Handles everything from small blogs to massive enterprise datasets.

Built for Real-World Use 🌐

From basic site indexing to deep semantic AI-driven discovery, Opensolr Web Crawler gives you the control and performance you need — out of the box.

Experience the next generation of intelligent search today.
Start with Opensolr →

🤖 Opensolr Web Crawler

Discover a seamless, AI-powered way to index, enrich, and search your web content—automatically.
Learn even more, here.


What is the Opensolr Web Crawler?

The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size. It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.

  • 🚀 Instantly searchable: All content becomes instantly searchable via a fully responsive, embeddable search UI.
  • 🤖 AI-driven enrichment: Named entities, sentiment, language detection, and more are extracted on the fly.
  • 🕑 Get started in minutes: Launch a powerful, custom search engine on your data without manual setup.

💻 Test it out:

🔎 See It In Action

Or try the Solr API for a live crawl.
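
Once content has been crawled into an index, it can be queried like any other Solr index. The sketch below builds a request URL for Solr's standard select handler using the common `q`, `rows`, `start`, and `wt` parameters. The host name, the index name, and the `/solr/<index>/select` path layout are assumptions for illustration; use the connection details shown for your own index.

```python
from urllib.parse import urlencode


def build_select_url(base_url, index_name, query, rows=10, start=0):
    """Build a URL for Solr's standard /select request handler."""
    params = {
        "q": query,      # the search query
        "rows": rows,    # number of results per page
        "start": start,  # offset of the first result (for paging)
        "wt": "json",    # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/solr/{index_name}/select?{urlencode(params)}"


# Hypothetical host and index name.
url = build_select_url("https://solr.example.com", "my_crawler_index", "site search")
print(url)
```

The resulting URL can be fetched with any HTTP client; indexes protected by HTTP Auth also need credentials on the request.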


⚡ Key Features

  • Full NLP and NER:
    Extract people, locations, organizations, and more using OpenNLP.

  • Comprehensive Metadata Extraction:
    Collects meta tags, page structure, creation dates, and document fields.

  • AI-Hints:
    Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.

  • Automatic Content Language Detection:
    Indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.

  • Responsive, Embeddable Search UI:
    Integrate Opensolr search into your site and customize the top bar, filters, and behavior.

  • Scheduled Recrawling & Live Stats:
    Only new and updated content is fetched, with live stats for crawling and SEO.

  • Secure & Flexible:
    Supports HTTP Auth for protected content and robust backup and replication, and can be fully managed via API or UI.

  • Rich Content Support:
    Indexes and analyzes HTML, doc, docx, xls, PDF, and most image formats—extracting content, meta, GPS/location data, and sentiment.

  • Crawl Resume:
    Pause and resume crawls anytime; supports cron jobs and incremental indexing.


⚙️ Embedding & Customization

You can embed your Opensolr Web Crawler Search Engine on any website.
Customize your search experience with parameters such as:

  • &topbar=off – Hide the top search tool
  • &q=SEARCH_QUERY – Set the initial search
  • &in=web/media/images – Filter by content type
  • &og=yes/no – Show/hide OG images per result
  • &source=WEBSITE – Restrict to a single domain
  • &fresh=... – Apply result freshness or sentiment bias
  • &lang=en – Filter by language
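
The parameters above are ordinary URL query parameters, so an embed URL can be assembled programmatically. The following is a minimal sketch: the base search-page URL is hypothetical, and reading `web/media/images` as three alternative values for `in` is an assumption.

```python
from urllib.parse import urlencode

# Parameter names taken from the list above.
EMBED_PARAMS = ("topbar", "q", "in", "og", "source", "fresh", "lang")


def build_embed_url(base_url, options):
    """Append embed options to a search-page URL, rejecting unknown names."""
    unknown = sorted(set(options) - set(EMBED_PARAMS))
    if unknown:
        raise ValueError(f"unsupported parameter(s): {', '.join(unknown)}")
    # Use "&" if the base URL already carries a query string, else "?".
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(options)


# Hypothetical search-page URL; replace with the one for your index.
embed_url = build_embed_url(
    "https://search.example.com/my_index",
    {"q": "solr crawler", "topbar": "off", "in": "web", "lang": "en"},
)
print(embed_url)
```

The options are passed as a dictionary rather than keyword arguments because `in` is a reserved word in Python.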

🚀 What's New

  • AI-Hints: Enabled by default for every crawler index.
  • Automatic Language Detection and advanced NER via OpenNLP.
  • Customizable for any language and analysis pipeline.
  • Full support for spellcheck, autocomplete, backup, and replication.
  • Live SEO & crawling stats and sentiment analysis.
  • Automated scheduling and easy management via UI or REST API.

📥 Solr Configuration for Crawling

To enable smooth crawling and full feature support, use our ready-made Solr configs:

To ensure all features work as designed, do not manually modify the schema.xml of crawler indexes.



Opensolr Web Crawler Standards

1. Each page must respond in under 5 seconds. This refers to the page or website response time, not the page download time. Pages that respond more slowly are omitted from indexing.

2. The crawler follows dynamic pages (URLs containing a ? query string, such as https://website.com?query=value) but never indexes them.

3. To be indexed, a page must not contain a meta tag of the form:

<meta name="robots" content="noindex" />

4. To have its links followed, a page must not contain a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As in #3 and #4, any page that should appear in search results must not include noindex, nofollow, or none in its robots meta tag.

6. Pages that should be crawled, indexed, and shown in search results must not be disallowed in the site's website.tld/robots.txt file.

7. Pages should have a clear, concise title, and duplicate titles should be avoided where possible. Pages with no title at all are always omitted from indexing.

8. Article pages should declare a creation date using either of the following meta tags:

article:published_time

or

og:updated_time

9. As a best practice, #8 also applies to all other pages, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.

10. Including an author, og:author, or article:creator meta tag is a best practice, even if the value is something generic such as "Admin", because it provides better data structure for future search features.

11. Including a category or og:category tag also helps with faceting and a more consistent data structure.

12. If two or more pages at different URLs present the same content, each should carry a canonical tag indicating which of the URLs should be indexed. Otherwise, the search API will return duplicates in the results.
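
Several of the standards above can be checked against a page before it is submitted for crawling. The sketch below is an illustrative pre-flight check written with Python's standard library, not the crawler's own logic. It covers standards 2, 3, 4, 5, 7, and 8; the function name, the issue messages, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadScanner(HTMLParser):
    """Collect the <title> text and all <meta> name/property values."""

    def __init__(self):
        super().__init__()
        self.meta = {}   # lower-cased meta name or property -> content
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(url, html):
    """Return a list of issues found for one page (empty if none)."""
    scanner = HeadScanner()
    scanner.feed(html)
    scanner.close()

    issues = []
    if urlparse(url).query:  # standard 2
        issues.append("dynamic URL: followed but not indexed")
    if not scanner.title.strip():  # standard 7
        issues.append("missing title: omitted from indexing")
    robots = {v.strip().lower() for v in scanner.meta.get("robots", "").split(",")}
    if robots & {"noindex", "none"}:  # standards 3 and 5
        issues.append("robots meta tag blocks indexing")
    if robots & {"nofollow", "none"}:  # standards 4 and 5
        issues.append("robots meta tag blocks link following")
    date_tags = ("article:published_time", "og:updated_time")
    if not any(tag in scanner.meta for tag in date_tags):  # standard 8
        issues.append("no creation date meta tag")
    return issues


# Hypothetical page used only to exercise the checks.
sample = """<html><head><title>Pricing</title>
<meta name="robots" content="noindex, follow" />
<meta property="article:published_time" content="2024-01-15T09:00:00Z" />
</head><body></body></html>"""

for issue in check_page("https://website.com/pricing?ref=home", sample):
    print(issue)
```

Response time (standard 1), robots.txt rules (standard 6), and duplicate content (standard 12) are not covered, since they need live requests or a comparison across pages.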





