
What is the Opensolr Web Crawler?

The Opensolr Web Crawler indexes websites and enriches their content with Natural Language Processing (NLP) and Named Entity Recognition (NER). It crawls every page, extracts comprehensive meta-information, and inserts it directly into the Solr index. The content is then searchable through a fully responsive, embeddable search engine UI, so you can build a tailored search experience within minutes.

Click here for an example Solr API for one of our Demo Web Crawl projects

Search Engine Demos:

Fresh News Search Engine

Tech News Search Engine

Romanian News Search Engine

German News Search Engine

Swedish News Search Engine

India News Search Engine

Opensolr Search Engine

Documents Search Engine

What's new:

  • Automatic Content Language Detection via OpenNLP.
  • Automatic NER via integration with OpenNLP.
    • Implemented recognition for people, locations, and organizations.
  • Can be customised for analysis in any language, with stopwords, synonyms, spellcheck, etc.
  • Fully responsive, embeddable Search Engine UI
  • Automatic, scheduled re-crawling of fresh content only.
  • HTTP Auth support, so that the crawler can follow your protected documents/pages.
  • Full support for spellcheck and autocomplete.
  • Follows and indexes the full content and metadata of the following rich text formats: doc, docx, xls, pdf, and most image file formats.
  • Adds content sentiment to each indexed page/document, in order to identify potentially hateful content.
  • Adds the GPS position found in image file metadata, which can be used as a location field in Solr to perform geo-location radius search requests.
  • Full live crawling stats that also serve as an SEO tool while crawling.
  • Smartly collects page/document creation date and includes the date in the search scoring function for fresh results elevation.
  • Automate crawling and get live stats via the Opensolr Web Crawler UI, or via the Automation REST APIs.
  • Supports resume, without losing any data. Crawl parts of your website, every day, or based on your own cron jobs, by taking advantage of the Automation REST API.
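
The geo-location radius search mentioned in the list above can be sketched as a Solr filter query. This is a minimal illustration, assuming a location field named gps_location (a hypothetical name; check schema.xml in the Config Files Editor for the field your index actually uses). It relies on Solr's standard geofilt query parser and only builds the request parameters, without contacting any server.

```python
from urllib.parse import urlencode


def geo_radius_params(field, lat, lon, radius_km, query="*:*"):
    """Build Solr parameters that keep only documents within radius_km of a point.

    `field` must be a spatial (location) field in the index schema.
    """
    return {
        "q": query,
        # geofilt: sfield = spatial field, pt = "lat,lon", d = distance in km
        "fq": f"{{!geofilt sfield={field} pt={lat},{lon} d={radius_km}}}",
    }


# "gps_location" is an assumed field name used only for illustration.
params = geo_radius_params("gps_location", 44.4268, 26.1025, 10)
query_string = urlencode(params)  # append to your index's /select URL
```

The resulting query string can be appended to the /select endpoint of your own index; the endpoint URL is not shown here because it depends on your Opensolr environment.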

To make sure crawling works correctly, only use our Web Crawler Enabled environments, and make sure to apply the below Solr configuration archive, corresponding to the Solr version you are using:
Solr 9 Config Zip Archive

To learn more about what fields are indexed, simply create a new opensolr index, go to Config Files Editor, and select schema.xml.
In order to preserve your Web Crawler's functionality, please do not edit your schema.xml fields, or any other configuration files.


Opensolr Web Crawler Standards

1. A page must respond in less than 5 seconds (this is the page/website response time, not the page download time); otherwise, the page will be omitted from indexing.

2. Our web crawler will follow, but will never index, dynamic pages (pages with a ? query in the URL), such as: https://website.com?query=value

3. In order to be indexed, pages must not include a meta tag of the form:

<meta name="robots" content="noindex" />

4. In order for their links to be followed, pages must not include a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As with #3 and #4, all pages that should appear in search results must not include "noindex", "nofollow", or "none" in a robots meta tag.

6. Pages that should be crawled, indexed, and shown in the search results must not be restricted in the website's robots.txt file (website.tld/robots.txt).

7. Pages should have a clear, concise title, and duplicate titles should be avoided where possible. Pages without any title will always be omitted from indexing.

8. Article pages should present a creation date through either one of the following meta tags:

article:published_time

or

og:updated_time

9. As a best practice, #8 also applies to all other pages, so that fresh content can be correctly and consistently presented at the top of the search results for any given query.

10. Including an author, og:author, or article:creator meta tag is a best practice, even if its value is something generic such as "Admin", because it provides a better data structure for search in the future.

11. Presence of a category or og:category tag will also help with faceting and more consistent data structure.

12. If two or more pages that reside at different URLs present the same actual content, each of them should have a canonical meta tag that indicates which one of the URLs should be indexed. Otherwise, the search API will return duplicates in the results.
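
Several of the standards above (#2 through #8) can be pre-checked on your own pages before you start a crawl. The following is a rough sketch using only the Python standard library. The function names are hypothetical, and it is not the crawler's actual implementation: it only approximates the rules as they are described on this page. The robots.txt check uses the generic "*" user agent, which is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class _HeadScanner(HTMLParser):
    """Collects the <title> text and all <meta> name/property values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key.lower()] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(url, html):
    """Return a list of problems found; an empty list means none were found."""
    scanner = _HeadScanner()
    scanner.feed(html)
    problems = []

    # Standard 2: URLs with a ? query are followed but not indexed.
    if urlsplit(url).query:
        problems.append("dynamic URL (has a ? query)")

    # Standards 3-5: robots meta directives.
    robots = {d.strip() for d in scanner.meta.get("robots", "").lower().split(",")}
    if robots & {"noindex", "none"}:
        problems.append("robots meta tag blocks indexing")
    if robots & {"nofollow", "none"}:
        problems.append("robots meta tag blocks link following")

    # Standard 7: pages without a title are omitted.
    if not scanner.title.strip():
        problems.append("missing title")

    # Standard 8: a creation date meta tag should be present.
    if not any(k in scanner.meta for k in ("article:published_time", "og:updated_time")):
        problems.append("no article:published_time or og:updated_time meta tag")

    return problems


def allowed_by_robots_txt(robots_txt, url, user_agent="*"):
    """Standard 6: check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, check_page("https://website.com?query=value", html) reports the dynamic URL in addition to any meta tag problems found in html. Response time (#1) and duplicate content (#12) are not covered, since they require live requests and a comparison across pages.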