FAQ & API Documentation


Opensolr Site Search - Google Site Search Replacement

Our web crawler follows almost any file type found on, or referenced from, your starting root URL, and indexes any metadata found in those files.
From HTML to PDF, DOC, PPT, MP3, video files, and other file types, the crawler can create your site search engine in just a few minutes.
 
Here are some demo search engines powered by the Opensolr Web Crawler:
- BBC
- Ziar
 

If the web crawler doesn't work, try one of the following:

- Remove your index and create a new one in a different instance or region

- Look at your schema.xml and make the following modifications:

  1. Add these fields to your schema.xml, next to the other <field ... /> definitions:
      <field name="og_image" type="string" indexed="true" stored="true" />
      <field name="headings1" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />
      <field name="headings2" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />
      <field name="headings3" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />
      <field name="headings4" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />
      <field name="em" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />
      <field name="strong" type="text_general" indexed="true" stored="true" multiValued="true" required="false" default="" />

  2. Find the file_size field and change its type to ignored (the field causes problems with some EXIF metadata):
      <field name="file_size" type="ignored" indexed="false" stored="false" multiValued="true" required="false" default="" />

  3. Find the catchall field definition and change it to this:
      <field name="catchall" type="ignored" indexed="false" stored="false" multiValued="true" required="false" default="" />
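
The schema changes above can be verified before re-running the crawler. The following is a minimal sketch, not an official Opensolr tool: the function name check_schema and the wording of its messages are hypothetical, and it only assumes that the fields are declared as <field name="..." type="..."> elements, as in the snippets above.

```python
import xml.etree.ElementTree as ET

# Field names copied from step 1 above.
REQUIRED_FIELDS = ["og_image", "headings1", "headings2", "headings3",
                   "headings4", "em", "strong"]
# Fields that steps 2 and 3 above switch to the "ignored" type.
IGNORED_FIELDS = ["file_size", "catchall"]


def check_schema(schema_xml):
    """Return a list of human-readable problems found in a schema.xml string."""
    root = ET.fromstring(schema_xml)
    # Map every <field name="..."> to its type, wherever it is nested.
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}

    problems = []
    for name in REQUIRED_FIELDS:
        if name not in fields:
            problems.append(f"missing field: {name}")
    for name in IGNORED_FIELDS:
        if name not in fields:
            problems.append(f"field not found: {name}")
        elif fields[name] != "ignored":
            problems.append(
                f"field {name} has type {fields[name]!r}, expected 'ignored'"
            )
    return problems
```

Pass it the contents of the schema.xml downloaded from the Config Files Editor; an empty list means all three steps appear to be applied.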

Email us at support@opensolr.com in case you need further assistance.

Upload a new file for crawling and indexing

  1. POST https://opensolr.com/solr_manager/api/index_crawler_file
  2. Parameters:
    1. email - your opensolr registration email address
    2. api_key - your opensolr api_key
    3. core_name - the name of the core you wish to upload the document to
    4. url - optional; a URL to crawl and index alongside the uploaded document. Both will be indexed as separate documents in your opensolr index.
    5. userfile - your local document file to POST and upload to the server (pdf, doc, html, xls, mp3, mp4, etc.)
  3. Example here: https://opensolr.com/solr_manager/index_crawler_file
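
As an illustration, the request described above could be assembled in Python with the standard library only. This is a sketch under assumptions: the endpoint and parameter names come from the list above, but the helper name build_upload_request is hypothetical, sending userfile as a multipart/form-data file part is assumed from the description "file to POST and upload", and the response format is not documented here.

```python
import mimetypes
import urllib.request
import uuid

API_URL = "https://opensolr.com/solr_manager/api/index_crawler_file"


def build_upload_request(email, api_key, core_name, filename, file_bytes, url=None):
    """Build (but do not send) a multipart/form-data POST for the upload endpoint."""
    fields = {"email": email, "api_key": api_key, "core_name": core_name}
    if url:  # optional extra URL to crawl alongside the uploaded file
        fields["url"] = url

    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part, sent under the "userfile" parameter name.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="userfile"; filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'
    ).encode("utf-8")
    parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read the raw response body.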

The opensolr web crawler now follows and indexes any file type in your web root.
To learn which fields are indexed, create a new opensolr index, open the Config Files Editor, and select schema.xml.

All the fields defined in schema.xml are indexed.

Opensolr Web Crawler Standards

1. The page has to respond in less than 5 seconds (this is the page / website response time, not the page download time); otherwise the page will be omitted from indexing.

2. The page URL should never contain dynamic query parameters (?var_a=a&var_b=b). Instead of https://www.site.com/browse/channels?sort=-total_video&page=5, URLs should use a path-based format such as https://www.site.com/browse/channels/page/5/sort/sort_field/sort_order, or something similar.

3. In order to be indexed, pages should never contain a meta tag of the form:

<meta name="robots" content="noindex" />

4. In order for their links to be followed, pages should never contain a meta tag of the form:

<meta name="robots" content="nofollow" />

5. As with #3 and #4, pages that should appear in search results must never include "noindex", "nofollow", or "none" in a robots meta tag.

6. Pages that should be crawled, indexed, and shown in the search results must not be restricted in the site's website.tld/robots.txt file.

7. Pages should have a clear, concise title, and duplicate titles should be avoided if at all possible. Pages without any title will always be omitted from indexing.

8. Article pages should present a creation date, using one of the following meta tags:

article:published_time

or

og:updated_time

9. Rule #8 also applies, as a best practice, to all other pages, so that fresh content can be presented correctly and consistently at the top of the search results for any given query.

10. Including an author, og:author, or article:creator meta tag is a best practice, even if its value is something generic such as "Admin", because it provides better data structure for search in the future.

11. Including a category or og:category tag also helps with faceting and a more consistent data structure.

12. If two or more pages that reside at different URLs present the same actual content, they should all have a canonical meta tag indicating which one of the URLs should be indexed. Otherwise, the search API will return duplicates in the results.
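
Several of the standards above can be checked locally before pointing the crawler at a site. The sketch below is illustrative only and is not how the Opensolr crawler itself is implemented: the names audit_page and _HeadScanner and the warning texts are hypothetical. It covers rules 2, 3 to 5, 7, 8, 10, and 11 on HTML that has already been fetched; response time (rule 1), robots.txt (rule 6), and duplicate detection (rule 12) are left out because they need network access or more than one page.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _HeadScanner(HTMLParser):
    """Collect the <title> text and the meta tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}  # lower-cased name/property -> content
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attributes.get("name") or attributes.get("property") or "").lower()
            if key:
                self.meta[key] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_page(url, html):
    """Return a list of warnings for a page, based on the standards above."""
    scan = _HeadScanner()
    scan.feed(html)
    warnings = []

    # Rule 2: no dynamic query parameters in the URL.
    if urlsplit(url).query:
        warnings.append("URL contains a query string")
    # Rule 7: pages without a title are omitted.
    if not scan.title.strip():
        warnings.append("page has no title")
    # Rules 3-5: robots meta tag must not block indexing or following.
    robots = {t.strip() for t in scan.meta.get("robots", "").lower().split(",")}
    for directive in ("noindex", "nofollow", "none"):
        if directive in robots:
            warnings.append(f"robots meta tag contains {directive}")
    # Rules 8-9: a creation date meta tag.
    if not any(k in scan.meta for k in ("article:published_time", "og:updated_time")):
        warnings.append("no article:published_time or og:updated_time meta tag")
    # Rule 10: an author meta tag.
    if not any(k in scan.meta for k in ("author", "og:author", "article:creator")):
        warnings.append("no author meta tag")
    # Rule 11: a category meta tag.
    if not any(k in scan.meta for k in ("category", "og:category")):
        warnings.append("no category meta tag")
    return warnings
```

Calling audit_page with a page URL and its HTML source returns an empty list when none of the checked rules is violated.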