For setup, integration, and pricing, reach out anytime at [email protected].
The Opensolr Web Crawler delivers a fully automated, AI-driven site search platform built for speed, accuracy, and ease of use. In just minutes, it can index your entire website — HTML pages, PDFs, images, and documents — then power it all with Apache Solr for fast and intelligent results.
👉 Learn more: Complete Web Crawler Solution
Opensolr brings AI Vector Search and Hybrid Search to life — combining traditional Solr precision with powerful machine learning context.
From basic site indexing to deep semantic AI-driven discovery, Opensolr Web Crawler gives you the control and performance you need — out of the box.
Experience the next generation of intelligent search today.
Start with Opensolr →
Discover a seamless, AI-powered way to index, enrich, and search your web content—automatically.
Learn even more here.
For setup details, assistance, or pricing information, contact us at [email protected].
The Opensolr Web Crawler is a robust platform for crawling, indexing, and enriching websites of any size. It automatically extracts key meta-information, applies Natural Language Processing (NLP) and Named Entity Recognition (NER), and injects all content and structure directly into your Solr index.
Or try the Solr API for a live crawl.
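As an illustration only (the host, index name, and field names such as title and url below are assumptions, not a documented schema), content the crawler has injected into your index can be queried through Solr's standard /select handler:
https://your-opensolr-host/solr/your_index/select?q=solar+panels&fl=title,url&rows=10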
Full NLP and NER:
Extract people, locations, organizations, and more using OpenNLP.
Comprehensive Metadata Extraction:
Collects meta tags, page structure, creation dates, and document fields.
AI-Hints:
Opensolr AI-Hints are enabled by default for all crawler indexes, delivering rich context and smart search assistance.
Automatic Content Language Detection:
Indexes and searches in any language, with built-in stopword, synonym, and spellcheck support.
Responsive, Embeddable Search UI:
Integrate Opensolr search into your site and customize the top bar, filters, and behavior.
Scheduled Recrawling & Live Stats:
Only new and updated content is fetched, with live stats for crawling and SEO.
Secure & Flexible:
Supports HTTP Auth for protected content, robust backup and replication, and fully managed by API or UI.
Rich Content Support:
Indexes and analyzes HTML, DOC, DOCX, XLS, PDF, and most image formats, extracting content, metadata, GPS/location data, and sentiment.
Crawl Resume:
Pause and resume crawls anytime; supports cron jobs and incremental indexing.
You can embed your Opensolr Web Crawler Search Engine on any website.
Customize your search experience with parameters such as:
&topbar=off – Hide the top search tool
&q=SEARCH_QUERY – Set the initial search
&in=web/media/images – Filter by content type
&og=yes/no – Show/hide OG images per result
&source=WEBSITE – Restrict to a single domain
&fresh=... – Apply result freshness or sentiment bias
&lang=en – Filter by language
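For example, assuming an iframe-style embed (the src below is a placeholder for the embed URL provided in your Opensolr account), the parameters are combined as a standard query string:
<iframe src="https://YOUR_OPENSOLR_EMBED_URL/?q=solr&in=web&source=website.com&topbar=off&lang=en" width="100%" height="600" style="border:0;"></iframe>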
To enable smooth crawling and full feature support, use our ready-made Solr configs. To ensure all features work as designed, do not manually modify your schema.xml for crawler indexes.
1. A page must respond within 5 seconds (this refers to the page/website response time, not the page download time); otherwise, the page in question will be omitted from indexing.
2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query string in the URL), such as: https://website.com?query=value
3. To be indexed, pages must not include a meta tag of the form:
<meta name="robots" content="noindex" />
4. To be followed for further links, pages must not include a meta tag of the form:
<meta name="robots" content="nofollow" />
5. As with #3 and #4, any page that should appear in search results must never include "noindex", "nofollow", or "none" in its robots meta tag.
6. Pages that should be crawled, indexed, and shown in search results must never be restricted in the site's generic website.tld/robots.txt file.
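For illustration (the path is hypothetical), a robots.txt entry that would block such pages looks like:
User-agent: *
Disallow: /members-only/
Any page whose URL falls under a Disallow rule like this will not be crawled or indexed.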
7. Pages should have a clear, concise title, and duplicate titles should be avoided wherever possible. Pages without any title at all will always be omitted from indexing.
8. Article pages should expose a creation date via one of the following meta tags:
article:published_time
or
og:updated_time
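For illustration (the dates are examples only), these can be provided as standard Open Graph-style meta tags:
<meta property="article:published_time" content="2024-05-01T10:00:00+00:00" />
<meta property="og:updated_time" content="2024-05-02T08:30:00+00:00" />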
9. As a best practice, #8 also applies to any other pages, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.
10. Including an author, og:author, or article:creator meta tag is a best practice, even if the value is something generic such as "Admin", as it provides better data structure for search in the future.
11. Including a category or og:category meta tag will also help with faceting and a more consistent data structure.
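For illustration (the values are examples only), #10 and #11 can be satisfied with meta tags such as:
<meta name="author" content="Admin" />
<meta property="og:category" content="News" />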
12. If two or more pages residing at different URLs present the same actual content, each of them should include a canonical tag indicating which of the URLs should be indexed; otherwise, the Search API will return duplicates in the results.
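For illustration (the URL is an example only), both duplicate pages would point to the preferred URL with a standard canonical link element:
<link rel="canonical" href="https://website.com/preferred-page/" />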