Opensolr Web Crawler – Smarter AI-Powered Site Search 🚀
For setup, integration, and pricing, reach out anytime at support@opensolr.com.
The Complete Web Search Solution 🔍
The Opensolr Web Crawler delivers a fully automated, AI-driven site search platform built for speed, accuracy, and ease of use. In just minutes, it can index your entire website — HTML pages, PDFs, images, and documents — then power it all with Apache Solr for fast and intelligent results.
A fully managed platform to crawl, index, enrich, and search your web content, automatically. The crawler walks your entire site, extracts structured data, applies NLP and NER, generates vector embeddings, and feeds everything into Solr, fully indexed and ready to search. No manual configuration and no fiddling with schemas: just point it at your site and get a production-ready AI search engine in minutes.
The crawl mode controls how far the crawler follows links from your starting URL. There are three scope types — each available in full depth or shallow (depth 1) variants.
Mode 1 — Follow Domain Links (full depth)
Crawls all pages across the entire domain, including all subdomains.
Example: Start URL is https://www.example.com/blog
The crawler will follow links to www.example.com, shop.example.com, help.example.com — anything on example.com.
Best for: Indexing an entire website including all its subdomains.
Mode 2 — Follow Host Links (full depth)
Crawls only pages on the exact same hostname. Subdomains are treated as separate sites.
Example: Start URL is https://www.example.com/blog
The crawler will follow links on www.example.com only. Links to shop.example.com or help.example.com are ignored.
Best for: Indexing one specific subdomain without pulling in content from other parts of the site.
Mode 3 — Follow Path Links (full depth)
Crawls only pages that start with the same URL path on the same host.
Example: Start URL is https://www.example.com/blog/
The crawler will follow www.example.com/blog/2024/my-post and www.example.com/blog/categories, but will skip www.example.com/about or www.example.com/shop/.
Best for: Indexing a specific section of a website, like a blog, documentation area, or product category.
Mode 4 — Shallow Domain Crawl (depth 1)
Same domain-level scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links.
Example: Start URL is https://www.example.com
The crawler reads the homepage, finds 50 links, crawls those 50 pages — but does not follow any links found on those 50 pages.
Best for: A shallow crawl of top-level content — landing pages, product listings, or news homepages where you only want the first layer.
Mode 5 — Shallow Host Crawl (depth 1)
Same host-level scope as Mode 2, combined with depth-1 link discovery. Stays on the exact hostname and only follows links from the start page and its direct children.
Best for: A quick, shallow index of a single subdomain.
Mode 6 — Shallow Path Crawl (depth 1)
Same path-level scope as Mode 3, combined with depth-1 link discovery. Stays within the URL path and only follows links from the start page and its direct children.
Best for: A focused, shallow crawl of a specific section — useful for quickly indexing a product catalog or documentation area without going deep.
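The three scope types above can be sketched as URL predicates. This is a hypothetical illustration of the scope rules, not Opensolr's actual implementation; the `registrable_domain` helper is a deliberately naive assumption for the example.

```python
from urllib.parse import urlparse

def registrable_domain(host: str) -> str:
    # Naive two-label heuristic for illustration only; real crawlers
    # consult the Public Suffix List to find the registrable domain.
    return ".".join(host.split(".")[-2:])

def in_scope(start_url: str, candidate_url: str, mode: str) -> bool:
    """Decide whether candidate_url falls inside the crawl scope.

    mode is one of "domain", "host", or "path", matching Modes 1-3 above.
    """
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if mode == "domain":   # Mode 1: any host on the same registrable domain
        return registrable_domain(cand.hostname or "") == registrable_domain(start.hostname or "")
    if mode == "host":     # Mode 2: the exact same hostname only
        return cand.hostname == start.hostname
    if mode == "path":     # Mode 3: same host AND the path shares the start prefix
        return cand.hostname == start.hostname and cand.path.startswith(start.path)
    raise ValueError(f"unknown mode: {mode}")

start = "https://www.example.com/blog/"
print(in_scope(start, "https://shop.example.com/cart", "domain"))  # True  (same domain)
print(in_scope(start, "https://shop.example.com/cart", "host"))    # False (different host)
print(in_scope(start, "https://www.example.com/about", "path"))    # False (outside /blog/)
```

The shallow variants (Modes 4-6) apply the same predicates but additionally stop discovering new links beyond the start page and its direct children.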
Embedding & Customization
Embed your Opensolr Web Crawler search on any website. Customize behavior with URL parameters.
Important: To embed the search UI on your website, contact us to have your domain whitelisted and approved for iframe embedding.
Parameter – Description
&topbar=off – Hide the top search bar
&q=SEARCH_QUERY – Set the initial search query
&in=web/media/images – Filter by content type
&og=yes/no – Show or hide OG images per result
&source=WEBSITE – Restrict results to a single domain
&fresh=... – Apply result freshness or sentiment bias
&lang=en – Filter by language
&pagination_style=scroll/pages – Infinite scroll (default) or numbered pages
&ui_theme=light/dark – Color theme
&layout=default/fullwidth – Container width
&locale=en_us/de_de/ro_ro – Filter by OG locale metadata
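The parameters above are plain query-string arguments, so an embed URL can be assembled with any URL library. A minimal sketch, assuming a hypothetical base URL (use the actual URL Opensolr provides once your domain is whitelisted):

```python
from urllib.parse import urlencode

# Hypothetical base URL for your embedded search page; replace with
# the URL Opensolr gives you after your domain is approved for iframes.
BASE = "https://yourindex.opensolr.com/search"

params = {
    "q": "solar panels",         # initial search query
    "in": "web",                 # content type: web / media / images
    "ui_theme": "dark",          # light or dark color theme
    "pagination_style": "pages", # numbered pages instead of infinite scroll
}
embed_url = BASE + "?" + urlencode(params)
print(embed_url)
```

The resulting URL can then be used as the `src` of an iframe on your whitelisted site.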
What's New
AI-Hints enabled by default for every crawler index.
Automatic Language Detection and advanced NER via OpenNLP.
Customizable for any language and analysis pipeline.
Full support for spellcheck, autocomplete, backup, and replication.
Live SEO & crawling stats and sentiment analysis.
Pause & Resume with schedule management via UI or REST API.
Schedule Optimize — set your index to auto-optimize on a recurring schedule.
Solr Configuration
Opensolr provides ready-made Solr configs for the crawler. To enable smooth crawling and full feature support, also observe the following requirements and best practices:
1. Pages must respond within 5 seconds (that is the server response time, not the full page download time); otherwise the page in question is omitted from indexing.
2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query string in the URL), such as https://website.com?query=value
3. To be indexed, a page must not include a meta tag of the form:
<meta name="robots" content="noindex" />
4. To have its links followed, a page must not include a meta tag of the form:
<meta name="robots" content="nofollow" />
5. As in #3 and #4, any page that should appear in search results must never include noindex, nofollow, or none in its robots meta tag.
6. Pages that should appear in search results, and are meant to be crawled and indexed, must not be disallowed in the site's website.tld/robots.txt file.
7. Pages should have a clear, concise title, and duplicate titles should be avoided wherever possible. Pages without any title at all are always omitted from indexing.
8. Article pages should declare a creation date via one of the following meta tags:
article:published_time
or
og:updated_time
9. As a best practice, #8 also applies to any other page type, so that fresh content can be ranked correctly and consistently at the top of the search results for any given query.
10. Including an author, og:author, or article:creator meta tag is a best practice, even if the value is something generic such as "Admin", because it provides better data structure for search in the future.
11. A category or og:category meta tag also helps with faceting and a more consistent data structure.
12. When two or more pages at different URLs present the same actual content, each should include a canonical meta tag indicating which URL should be indexed. Otherwise, the search API will return duplicates in the results.
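Several of the rules above can be checked before submitting a page for indexing. Below is a simplified, hypothetical sketch of such an eligibility check (the regexes assume attribute order `name` before `content` and are for illustration only, not Opensolr's actual logic):

```python
import re
from urllib.parse import urlparse

def indexable(url: str, html: str, response_seconds: float) -> tuple[bool, str]:
    """Check a fetched page against the crawling rules above (simplified sketch)."""
    if response_seconds >= 5:                 # rule 1: must respond within 5 seconds
        return False, "slow response"
    if urlparse(url).query:                   # rule 2: dynamic URLs are followed, never indexed
        return False, "dynamic URL"
    robots = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if robots and re.search(r'noindex|none', robots.group(1), re.IGNORECASE):
        return False, "robots meta forbids indexing"  # rules 3 and 5
    title = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    if not title or not title.group(1).strip():       # rule 7: a title is required
        return False, "missing title"
    return True, "ok"

page = '<html><head><title>My Post</title></head><body>...</body></html>'
print(indexable("https://www.example.com/blog/my-post", page, 0.8))  # (True, 'ok')
```

A real pre-flight check would also verify robots.txt rules (rule 6) and canonical tags (rule 12), which need the site's full context rather than a single page.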