API - Web Crawler


API - Get LIVE Web Crawler Stats


  1. GET https://opensolr.com/solr_manager/api/get_crawl_stats
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index/cluster you wish to get the live web crawler stats for
  3. Example: https://opensolr.com/solr_manager/api/get_crawl_stats?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core
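As a quick illustration, the request above can be assembled with Python's standard library. The helper name and the example credentials are ours, not part of the API; fetch the resulting URL with any HTTP client.

```python
from urllib.parse import urlencode

BASE = "https://opensolr.com/solr_manager/api"

def crawl_stats_url(email: str, api_key: str, core_name: str) -> str:
    """Build the get_crawl_stats request URL from the three required parameters."""
    query = urlencode({"email": email, "api_key": api_key, "core_name": core_name})
    return f"{BASE}/get_crawl_stats?{query}"

# To actually fetch the stats, e.g.:
#   import json, urllib.request
#   stats = json.load(urllib.request.urlopen(crawl_stats_url(...)))
print(crawl_stats_url("you@example.com", "YOUR_API_KEY", "my_solr_core"))
```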

API - Start the Web Crawler

  1. GET https://opensolr.com/solr_manager/api/start_crawl
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to start the web crawler process for
    4. follow_docs - (optional) follow documents and images? (yes/no). Default: no
    5. clean - (optional) set to yes to start fresh, or no to resume from where you left off. Default: no.
    6. auth_username - (optional) if your starting URLs are using Basic HTTP Auth, you can enter the username here.
    7. auth_password - (optional) if your starting URLs are using Basic HTTP Auth, you can enter the password here.
    8. mode - (optional) the crawl mode that controls how far the crawler follows links. Possible values: 1 through 6. Default: 1.
      1 - Follow Domain Links (full depth): Follows all links across the entire domain, including subdomains. Starting from www.site.com, it will also crawl shop.site.com, blog.site.com, etc.
      2 - Follow Host Links (full depth): Stays on the exact hostname only. Starting from www.site.com, links to shop.site.com are ignored.
      3 - Follow Path Links (full depth): Stays within the URL path prefix on the same host. Starting from site.com/blog/, it will crawl site.com/blog/2024/post but skip site.com/about.
      4 - Shallow Domain Crawl (depth 1): Same domain scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links to the queue.
      5 - Shallow Host Crawl (depth 1): Same host scope as Mode 2, but only discovers links from the start page and its direct children. Useful for a quick, shallow index of a single subdomain.
      6 - Shallow Path Crawl (depth 1): Same path scope as Mode 3, but only discovers links from the start page and its direct children. Useful for quickly indexing a product catalog or documentation section without going deep.

      Full depth = the crawler keeps discovering and following new links at every level.
      Depth 1 = the crawler reads the start page, follows the links it finds there, but stops discovering new links after that.

    9. max_threads - (optional) number of concurrent crawler threads. Default: 10. Higher values speed up crawling but use more bandwidth.
    10. relax - (optional) delay between requests in microseconds. Controls crawl politeness. Default: 100000 (0.1s). Higher values are more polite to the target server.
    11. max_traffic - (optional) maximum allowed traffic in GB. Default: your plan limit.
    12. max_pages - (optional) maximum number of pages to crawl. Default: your plan limit.
    13. max_filesize - (optional) maximum file size to download per page, in KB. Default: your plan limit.
    14. renderer - (optional) the rendering engine to use for fetching pages. Possible values: curl or chrome. Default: curl.
      curl - Curl (Fast): Fast HTTP fetch with no browser overhead. Best for most websites (server-rendered HTML, static sites, WordPress, Drupal, etc.). Pages are fetched in ~0.2s each.
      chrome - Chrome (JS Rendering): Every page is rendered through a headless Chromium browser (Playwright). Use for JavaScript-heavy SPAs built with React, Vue, Angular, Next.js, etc. Adds ~0.5-1s per page.
  3. Example: https://opensolr.com/solr_manager/api/start_crawl?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core

Note: All parameters are saved and preserved across scheduled cron restarts. If you pause and resume the crawler, or if the cron schedule restarts it, your original settings (threads, relax delay, traffic limits, etc.) are retained.

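The optional parameters above can be combined into a single start_crawl URL. A minimal sketch that only sends the options you actually set; the helper name and the chosen values (mode 5, Chrome rendering, 4 threads, 0.5s delay) are illustrative, not defaults.

```python
from urllib.parse import urlencode

BASE = "https://opensolr.com/solr_manager/api"

def start_crawl_url(email, api_key, core_name, **options):
    """Build a start_crawl URL; optional parameters (mode, renderer,
    max_threads, relax, clean, follow_docs, ...) are included only when set."""
    params = {"email": email, "api_key": api_key, "core_name": core_name}
    params.update({k: v for k, v in options.items() if v is not None})
    return f"{BASE}/start_crawl?{urlencode(params)}"

# Shallow crawl of a single host, rendered through headless Chrome,
# with fewer threads and a longer politeness delay (500000 us = 0.5s):
url = start_crawl_url("you@example.com", "YOUR_API_KEY", "my_solr_core",
                      mode=5, renderer="chrome", max_threads=4, relax=500000)
print(url)
```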

API - Stop the Web Crawler

Permanently stops the web crawler by killing all running processes and removing the cron schedule. The crawler will not run again until manually restarted. If you only want to temporarily pause crawling, use the Pause API instead.

  1. GET https://opensolr.com/solr_manager/api/stop_crawl
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to stop the web crawler for
  3. Example: https://opensolr.com/solr_manager/api/stop_crawl?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core
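Since stop_crawl shares the same email/api_key/core_name trio with the other crawler endpoints, one URL builder can serve them all. A sketch; the helper name is ours.

```python
from urllib.parse import urlencode

BASE = "https://opensolr.com/solr_manager/api"

def crawler_call_url(action, email, api_key, core_name, **extra):
    """Build a URL for any crawler endpoint (stop_crawl, pause_crawl,
    start_crawl, get_crawl_stats, ...) that takes the credential trio,
    plus any endpoint-specific extras."""
    params = {"email": email, "api_key": api_key, "core_name": core_name, **extra}
    return f"{BASE}/{action}?{urlencode(params)}"

print(crawler_call_url("stop_crawl", "you@example.com", "YOUR_API_KEY", "my_solr_core"))
```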

API - Pause the Web Crawler

Pauses a running web crawler by stopping its processes, but keeps the cron schedule intact. The crawler will automatically resume on the next scheduled cron tick.

  1. GET https://opensolr.com/solr_manager/api/pause_crawl
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to pause the web crawler for
  3. Example: https://opensolr.com/solr_manager/api/pause_crawl?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core

Note: Unlike Stop, Pause does not remove the cron schedule. The crawler will restart automatically on the next scheduled cron tick. Use this when you want to temporarily halt crawling without losing your schedule.

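The choice between Stop and Pause comes down to whether the cron schedule should survive. That decision can be captured in a one-line helper (illustrative, not part of the API):

```python
def halt_action(keep_schedule: bool) -> str:
    """Pick the right endpoint: pause_crawl keeps the cron schedule
    (the crawler resumes on the next tick); stop_crawl removes it,
    so the crawler stays down until manually restarted."""
    return "pause_crawl" if keep_schedule else "stop_crawl"

print(halt_action(keep_schedule=True))   # temporary halt
print(halt_action(keep_schedule=False))  # permanent stop
```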

API - Resume the Web Crawler

Resumes a paused web crawler immediately, without waiting for the next cron tick. This uses the same start_crawl endpoint with clean=no to continue from where it left off.

  1. GET https://opensolr.com/solr_manager/api/start_crawl
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to resume crawling for
    4. clean - set to no (to resume from where the crawler left off, not start fresh)
  3. Example: https://opensolr.com/solr_manager/api/start_crawl?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core&clean=no

Note: If the crawler has finished processing all URLs in its queue (todo is empty), resuming will have no effect. Check the crawler stats first using the Get LIVE Web Crawler Stats API to see if there are remaining pages to crawl.

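Per the note above, resuming is only useful when the queue still has URLs. A sketch of that check; the `todo` field name is an assumption about the get_crawl_stats response shape, so verify it against your actual stats output.

```python
def should_resume(stats: dict) -> bool:
    """Resume only if the crawler still has queued URLs.
    ASSUMPTION: the stats response exposes the remaining queue size
    under a 'todo' key; check your real get_crawl_stats output."""
    return int(stats.get("todo", 0)) > 0

print(should_resume({"todo": 42}))  # queue not empty, worth resuming
print(should_resume({"todo": 0}))   # nothing left, resuming is a no-op
```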

API - Check if Web Crawler is Running

Check whether the web crawler processes are currently active for a given index. Returns the current running state of the crawler.

  1. GET https://opensolr.com/solr_manager/api/crawler_active
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to check the crawler status for
  3. Example: https://opensolr.com/solr_manager/api/crawler_active?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core

Response: The response message will contain CRAWLER IS READY when the crawler is not running (idle/paused), or a different status message when crawler processes are actively running.

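A common use of this endpoint is polling until the crawler goes idle, for example before flushing the buffer. A sketch, assuming you supply your own fetch function that returns the endpoint's status message; the helper names are ours.

```python
import time

def is_idle(message: str) -> bool:
    """Per the docs, the crawler is idle/paused when the response
    message contains CRAWLER IS READY."""
    return "CRAWLER IS READY" in message

def wait_until_idle(fetch_status, poll_seconds=30, timeout=3600):
    """Poll crawler_active via the supplied fetch_status() callable
    until the crawler goes idle or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if is_idle(fetch_status()):
            return True
        time.sleep(poll_seconds)
    return False

print(is_idle("CRAWLER IS READY"))  # True
```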

API - Flush Crawl Buffer to Solr

Forces any documents remaining in the crawler's internal batch buffer to be flushed directly into your Solr index. Use this when the crawler is paused or has finished, and you suspect some documents may not have been committed to Solr yet.

The web crawler batches documents internally for performance. If crawling is interrupted (paused, stopped, or an error occurred), some documents may still be sitting in the buffer. This endpoint flushes them immediately.

  1. GET https://opensolr.com/solr_manager/api/flush_crawl_buffer
  2. Parameters:
    1. email - (required) your opensolr registration email address
    2. api_key - (required) your opensolr api_key
    3. core_name - (required) the name of the index you wish to flush the crawl buffer for
  3. Example: https://opensolr.com/solr_manager/api/flush_crawl_buffer?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core

Important: Only use this when the crawler is not actively running. If the crawler is running, it manages its own buffer flushes automatically. This endpoint is designed for cases where crawling was paused or stopped and you want to ensure all discovered content is searchable.

Response: Returns a JSON object with status (true/false) and msg indicating how many documents were flushed, or that the buffer was already empty.

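The documented status/msg response can be handled like this; the example response body is hypothetical, and the exact msg wording will differ.

```python
import json

def summarize_flush(raw: str) -> str:
    """Interpret the flush_crawl_buffer JSON response, which carries a
    boolean 'status' and a human-readable 'msg' (per the docs above)."""
    reply = json.loads(raw)
    return ("flushed: " if reply.get("status") else "failed: ") + reply.get("msg", "")

# Hypothetical response body for illustration:
print(summarize_flush('{"status": true, "msg": "128 documents flushed"}'))
```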