Web Crawler

API - Get LIVE Web Crawler Stats

Web Crawler API

Get LIVE Web Crawler Stats

Get real-time statistics for your web crawler: pages crawled, pages queued, errors, traffic used, and more.

Endpoint

GET https://opensolr.com/solr_manager/api/get_crawl_stats

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index you wish to get live crawler stats for

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/get_crawl_stats?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/get_crawl_stats?{$params}");
$stats = json_decode($response, true);
print_r($stats);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/get_crawl_stats", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
})
stats = response.json()
print(stats)

Use this endpoint to monitor crawler progress in real time. It's useful for building dashboards or deciding when to pause/resume the crawler.
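For instance, a dashboard backend can poll the endpoint on a fixed interval. The following is a minimal sketch (credentials, core name, and interval are placeholders) that fetches and prints the raw stats payload each tick:

```python
import time
import requests

BASE = "https://opensolr.com/solr_manager/api"
CREDS = {"email": "YOUR_EMAIL", "api_key": "YOUR_API_KEY", "core_name": "my_solr_core"}

def api_params(extra=None):
    """Merge the shared credentials with endpoint-specific parameters."""
    params = dict(CREDS)
    if extra:
        params.update(extra)
    return params

def poll_stats(interval=30, ticks=10):
    """Fetch and print the raw stats payload every `interval` seconds."""
    for _ in range(ticks):
        stats = requests.get(f"{BASE}/get_crawl_stats", params=api_params()).json()
        print(stats)
        time.sleep(interval)

# To start polling (requires valid credentials):
# poll_stats(interval=10, ticks=6)
```

Printing the raw payload first is deliberate: inspect it once to learn the field names before building any logic on top of them.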

Related Documentation

Need help with the Opensolr Web Crawler? We are here to help.

Contact Support

API - Start the Web Crawler

Web Crawler API

Start the Web Crawler

Launch the Opensolr web crawler for your index. Configure crawl mode, scope, threads, rendering engine, traffic limits, and more — all via API parameters.

Endpoint

GET https://opensolr.com/solr_manager/api/start_crawl

Parameters

Parameter       Status     Description
email           Required   Your Opensolr registration email address
api_key         Required   Your Opensolr API key
core_name       Required   The name of the index to start crawling for
follow_docs     Optional   Follow documents and images? yes / no (default: no)
clean           Optional   Start fresh or resume? yes = start fresh, no = resume (default: no)
auth_username   Optional   HTTP Basic Auth username if your starting URLs require authentication
auth_password   Optional   HTTP Basic Auth password
mode            Optional   Crawl mode 1-6 (default: 1). See the mode table below
max_threads     Optional   Number of concurrent crawler threads (default: 10)
relax           Optional   Delay between requests in microseconds (default: 100000 = 0.1s)
max_traffic     Optional   Maximum traffic in GB (default: plan limit)
max_pages       Optional   Maximum pages to crawl (default: plan limit)
max_filesize    Optional   Maximum file size per page in KB (default: plan limit)
renderer        Optional   Rendering engine: curl or chrome (default: curl). See the table below

Crawl Modes

Mode   Name                   Scope        Description
1      Follow Domain Links    Full depth   Follows all links across the entire domain, including subdomains. Starting from www.site.com, it will also crawl shop.site.com, blog.site.com, etc.
2      Follow Host Links      Full depth   Stays on the exact hostname only. Starting from www.site.com, links to shop.site.com are ignored.
3      Follow Path Links      Full depth   Stays within the URL path prefix on the same host. Starting from site.com/blog/, it will crawl site.com/blog/2024/post but skip site.com/about.
4      Shallow Domain Crawl   Depth 1      Same domain scope as Mode 1, but only discovers links from the start page and its direct children.
5      Shallow Host Crawl     Depth 1      Same host scope as Mode 2, but only discovers links from the start page and its direct children.
6      Shallow Path Crawl     Depth 1      Same path scope as Mode 3, but only discovers links from the start page and its direct children. Useful for quickly indexing a product catalog or documentation section.

Full depth = the crawler keeps discovering and following new links at every level. Depth 1 = the crawler reads the start page, follows links it finds there, but stops discovering new links after that.
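For scripting, the mode numbers can be derived from (scope, depth) pairs rather than memorized. A sketch in Python against the start_crawl endpoint documented above (the helper name and mapping are just one way to organize this):

```python
import requests

BASE = "https://opensolr.com/solr_manager/api"

# Mode numbers from the table above: scope (domain/host/path) x depth (full/shallow).
CRAWL_MODES = {
    ("domain", "full"): 1, ("host", "full"): 2, ("path", "full"): 3,
    ("domain", "shallow"): 4, ("host", "shallow"): 5, ("path", "shallow"): 6,
}

def start_crawl(email, api_key, core_name, scope="domain", depth="full", **extra):
    """Start a crawl, translating (scope, depth) into the numeric mode parameter."""
    params = {"email": email, "api_key": api_key, "core_name": core_name,
              "mode": CRAWL_MODES[(scope, depth)], **extra}
    return requests.get(f"{BASE}/start_crawl", params=params).json()

# Example (requires valid credentials): index one documentation section, depth 1:
# start_crawl("YOUR_EMAIL", "YOUR_API_KEY", "my_solr_core", scope="path", depth="shallow")
```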

Rendering Engines

Value    Name                    Description
curl     Curl (Fast)             Fast HTTP fetch with no browser overhead. Best for most websites — server-rendered HTML, static sites, WordPress, Drupal, etc. Pages are fetched in ~0.2s each.
chrome   Chrome (JS Rendering)   Every page is rendered through a headless Chromium browser (Playwright). Use for JavaScript-heavy SPAs built with React, Vue, Angular, Next.js, etc. Adds ~0.5–1s per page.

All parameters are saved and preserved across scheduled cron restarts. If you pause and resume the crawler, or if the cron schedule restarts it, your original settings (threads, relax delay, traffic limits, etc.) are retained.

Code Examples

cURL — Start a basic crawl

curl -s "https://opensolr.com/solr_manager/api/start_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

cURL — Full crawl with all options

curl -s "https://opensolr.com/solr_manager/api/start_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core&clean=yes&mode=2&max_threads=5&relax=200000&renderer=chrome&follow_docs=yes&max_pages=10000"

PHP

$params = http_build_query([
    'email'       => 'YOUR_EMAIL',
    'api_key'     => 'YOUR_API_KEY',
    'core_name'   => 'my_solr_core',
    'clean'       => 'yes',
    'mode'        => 2,
    'max_threads' => 5,
    'renderer'    => 'chrome',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/start_crawl?{$params}");
$result = json_decode($response, true);
print_r($result);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/start_crawl", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
    "clean": "yes",
    "mode": 2,
    "max_threads": 5,
    "renderer": "chrome",
})
print(response.json())


API - Stop the Web Crawler

Web Crawler API

Stop the Web Crawler

Permanently stops the web crawler by killing all running processes and removing the cron schedule. The crawler will not run again until manually restarted. If you only want to temporarily pause crawling, use the Pause API instead.

Endpoint

GET https://opensolr.com/solr_manager/api/stop_crawl

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index you wish to stop the web crawler for

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/stop_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"
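For parity with the other endpoints, the same call can be sketched in Python (the URL builder is kept as a pure helper so it can be tested without hitting the network):

```python
from urllib.parse import urlencode

import requests

BASE = "https://opensolr.com/solr_manager/api"

def stop_crawl_url(email, api_key, core_name):
    """Build the stop_crawl request URL; kept pure for easy testing."""
    query = urlencode({"email": email, "api_key": api_key, "core_name": core_name})
    return f"{BASE}/stop_crawl?{query}"

# Example (requires valid credentials):
# print(requests.get(stop_crawl_url("YOUR_EMAIL", "YOUR_API_KEY", "my_solr_core")).json())
```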

API - Pause the Web Crawler

Web Crawler API

Pause the Web Crawler

Temporarily halt crawler processes while keeping the cron schedule intact. The crawler will automatically resume on the next scheduled cron tick.

Endpoint

GET https://opensolr.com/solr_manager/api/pause_crawl

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index to pause the crawler for

Unlike Stop, Pause does not remove the cron schedule. The crawler will restart automatically on the next scheduled cron tick. Use this when you want to temporarily halt crawling without losing your schedule.
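Because Pause leaves the schedule in place, it is worth confirming the processes actually stopped. The following sketch combines pause_crawl with the crawler_active check described later in this document (the "CRAWLER IS READY" message is quoted from that endpoint's docs):

```python
import requests

BASE = "https://opensolr.com/solr_manager/api"

def crawler_idle(msg):
    """crawler_active reports 'CRAWLER IS READY' when nothing is running."""
    return "CRAWLER IS READY" in (msg or "")

def pause_and_verify(email, api_key, core_name):
    """Pause the crawler, then report whether its processes are now idle."""
    creds = {"email": email, "api_key": api_key, "core_name": core_name}
    requests.get(f"{BASE}/pause_crawl", params=creds)
    status = requests.get(f"{BASE}/crawler_active", params=creds).json()
    return crawler_idle(status.get("msg"))

# Example (requires valid credentials):
# print(pause_and_verify("YOUR_EMAIL", "YOUR_API_KEY", "my_solr_core"))
```

The crawler may take a moment to wind down its threads, so a short delay or a retry loop before the status check may be needed in practice.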

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/pause_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/pause_crawl?{$params}");
$result = json_decode($response, true);
print_r($result);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/pause_crawl", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
})
print(response.json())


API - Resume the Web Crawler

Web Crawler API

Resume the Web Crawler

Resume a paused web crawler immediately, without waiting for the next cron tick. Continues from where the crawler left off.

Endpoint

GET https://opensolr.com/solr_manager/api/start_crawl

This uses the same start_crawl endpoint with clean=no to continue from where the crawler left off, rather than starting a fresh crawl.

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index to resume crawling for
clean       Required   Set to no to resume from where the crawler left off

If the crawler has finished processing all URLs in its queue (todo is empty), resuming will have no effect. Check stats first using the Get LIVE Web Crawler Stats API to see if there are remaining pages to crawl.
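That check can be automated. A sketch that inspects the stats payload before resuming; note that the queued-pages key name used here is illustrative, so verify it against a real get_crawl_stats response:

```python
import requests

BASE = "https://opensolr.com/solr_manager/api"

def has_pending(stats, queued_field="pages_queued"):
    """True if the stats payload reports queued pages.
    NOTE: 'pages_queued' is an assumed key name; check a real response."""
    return int(stats.get(queued_field, 0) or 0) > 0

def resume_if_pending(email, api_key, core_name):
    """Resume (clean=no) only when the crawler still has URLs queued."""
    creds = {"email": email, "api_key": api_key, "core_name": core_name}
    stats = requests.get(f"{BASE}/get_crawl_stats", params=creds).json()
    if has_pending(stats):
        return requests.get(f"{BASE}/start_crawl", params={**creds, "clean": "no"}).json()
    return None
```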

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/start_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core&clean=no"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
    'clean'     => 'no',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/start_crawl?{$params}");
$result = json_decode($response, true);
print_r($result);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/start_crawl", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
    "clean": "no",
})
print(response.json())


API - Check if Web Crawler is Running

Web Crawler API

Check if Web Crawler is Running

Check whether the web crawler processes are currently active for a given index. Returns the current running state of the crawler.

Endpoint

GET https://opensolr.com/solr_manager/api/crawler_active

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index to check the crawler status for

Response

The response message contains CRAWLER IS READY when the crawler is not running (idle/paused), or a different status message when crawler processes are actively running.

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/crawler_active?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/crawler_active?{$params}");
$result = json_decode($response, true);
if (strpos($result['msg'], 'CRAWLER IS READY') !== false) {
    echo "Crawler is idle/paused\n";
} else {
    echo "Crawler is running\n";
}

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/crawler_active", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
})
data = response.json()
if "CRAWLER IS READY" in data.get("msg", ""):
    print("Crawler is idle/paused")
else:
    print("Crawler is running")


API - Flush Crawl Buffer to Solr

Web Crawler API

Flush Crawl Buffer to Solr

Force any documents remaining in the crawler's internal batch buffer to be flushed directly into your Solr index. Use when the crawler is paused or finished and you suspect uncommitted documents.

Endpoint

GET https://opensolr.com/solr_manager/api/flush_crawl_buffer

Parameters

Parameter   Status     Description
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index to flush the crawl buffer for

Only use this when the crawler is not actively running. If the crawler is running, it manages its own buffer flushes automatically. This endpoint is for cases where crawling was paused or stopped and you want to ensure all discovered content is searchable.
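Since flushing mid-crawl should be avoided, guarding the flush call behind a status check is a reasonable pattern. A sketch using the crawler_active endpoint's "CRAWLER IS READY" message (documented above) as the idle signal:

```python
import requests

BASE = "https://opensolr.com/solr_manager/api"

def crawler_idle(msg):
    """crawler_active reports 'CRAWLER IS READY' when nothing is running."""
    return "CRAWLER IS READY" in (msg or "")

def safe_flush(email, api_key, core_name):
    """Flush the crawl buffer only when the crawler is idle."""
    creds = {"email": email, "api_key": api_key, "core_name": core_name}
    status = requests.get(f"{BASE}/crawler_active", params=creds).json()
    if not crawler_idle(status.get("msg")):
        return {"status": False, "msg": "crawler is running; refusing to flush"}
    return requests.get(f"{BASE}/flush_crawl_buffer", params=creds).json()

# Example (requires valid credentials):
# print(safe_flush("YOUR_EMAIL", "YOUR_API_KEY", "my_solr_core"))
```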

Response

Returns a JSON object with status (true/false) and msg indicating how many documents were flushed, or that the buffer was already empty.

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/flush_crawl_buffer?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/flush_crawl_buffer?{$params}");
$result = json_decode($response, true);
echo $result['msg'] . "\n";

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/flush_crawl_buffer", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
})
data = response.json()
print(data["msg"])
