The Opensolr Web Crawler follows and indexes the HTML and plain-text content of your entire website. It also creates a sample embedded search engine that you can plug into your website or application.
Update - Nov 16 2021
Important: Only works with Solr version 7+
To see which fields are indexed, create a new Opensolr index, open the Config Files Editor, and select schema.xml.
To preserve your Web Crawler's functionality, please do not edit the fields in schema.xml, or any other configuration file.
To make sure crawling works correctly, use only our Solr 7+ servers and apply the Solr configuration archive that corresponds to your Solr version, below:
Important: The out-of-the-box version of the Web Crawler is only available for small to mid-sized websites, under 200 pages.
For larger websites, please Contact Us to get a quote for an Opensolr Indexing and Search Resilient Cluster.
1. A page has to respond in less than 5 seconds (this is the page or website response time, not the page download time); otherwise the page will be omitted from indexing.
2. Our web crawler will follow, but never index, dynamic pages (pages with a ? query in the URL), such as: https://website.com?query=value
3. To be indexed, a page must not contain a meta tag of the form:
<meta name="robots" content="noindex" />
4. To have its links followed, a page must not contain a meta tag of the form:
<meta name="robots" content="nofollow" />
5. As with #3 and #4, any page that should appear in search results must not include "noindex", "nofollow", or "none" in a robots meta tag.
6. Pages that should be crawled, indexed, and shown in search results must not be restricted in the website's robots.txt file (website.tld/robots.txt).
7. Pages should have a clear, concise title, and duplicate titles should be avoided where possible. Pages without any title will always be omitted from indexing.
8. Article pages should present a creation date, using one of the following meta tags:
9. As a best practice, #8 also applies to any other page, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.
10. Including an author, og:author, or article:creator meta tag is a best practice, even if its value is something generic such as "Admin", because it provides better data structure for search in the future.
11. Including a category or og:category tag also helps with faceting and a more consistent data structure.
12. If two or more pages at different URLs present the same content, each of them should have a canonical tag indicating which URL should be indexed. Otherwise, the search API will return duplicates in the results.
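
Several of the rules above (#2, #3, #4, #6 and #7) can be checked on your own pages before you start a crawl. The following is a minimal Python sketch of such a pre-flight check, using only the standard library. It is not the Opensolr crawler's actual implementation: the names HeadScanner and check_page are hypothetical, and the "*" user agent is an assumption, since the crawler's user agent name is not stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class HeadScanner(HTMLParser):
    """Collects the page title and any robots meta directives."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.robots = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # content may hold several directives, e.g. "noindex, nofollow"
            self.robots.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(url, html, robots_txt="", agent="*"):
    """Return a list of rule violations for one page (hypothetical helper).

    agent="*" is an assumption; substitute the crawler's real user agent
    if you know it.
    """
    problems = []
    if urlparse(url).query:
        problems.append("rule 2: dynamic URL, followed but not indexed")
    scanner = HeadScanner()
    scanner.feed(html)
    scanner.close()
    if scanner.robots & {"noindex", "none"}:
        problems.append("rule 3: robots meta tag blocks indexing")
    if scanner.robots & {"nofollow", "none"}:
        problems.append("rule 4: robots meta tag blocks link following")
    if not scanner.title.strip():
        problems.append("rule 7: missing title, page is omitted")
    if robots_txt:
        rules = RobotFileParser()
        rules.parse(robots_txt.splitlines())
        if not rules.can_fetch(agent, url):
            problems.append("rule 6: URL is restricted by robots.txt")
    return problems


# Example: a page with a query string and a noindex directive.
page = ('<html><head><title>About us</title>'
        '<meta name="robots" content="noindex" /></head></html>')
for problem in check_page("https://website.com/about?ref=1", page):
    print(problem)
```

An empty list from check_page means none of these five rules was violated. The response-time, date, author, category and canonical rules are not covered by this sketch.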