Web Crawler


API - Get LIVE Web Crawler Stats

Get real-time statistics for your web crawler: pages crawled, pages queued, errors, traffic used, and more.

Endpoint

GET https://opensolr.com/solr_manager/api/get_crawl_stats

Parameters

Parameter   Status     Description
---------   ------     -----------
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index you wish to get live crawler stats for

Code Examples

cURL

curl -s "https://opensolr.com/solr_manager/api/get_crawl_stats?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

PHP

$params = http_build_query([
    'email'     => 'YOUR_EMAIL',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_solr_core',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/get_crawl_stats?{$params}");
$stats = json_decode($response, true);
print_r($stats);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/get_crawl_stats", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
})
stats = response.json()
print(stats)

Use this endpoint to monitor crawler progress in real time. It's useful for building dashboards or deciding when to pause or resume the crawler.
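As a sketch of that kind of monitoring loop, the stdlib-only Python below polls the stats endpoint until the crawl queue drains. The `pages_queued` field name and the "empty queue means finished" heuristic are assumptions, since the exact JSON response shape is not documented above.

```python
import json
import time
import urllib.parse
import urllib.request

STATS_URL = "https://opensolr.com/solr_manager/api/get_crawl_stats"

def fetch_crawl_stats(email, api_key, core_name):
    """Fetch the live crawler stats as a dict (stdlib-only variant of the
    requests example above)."""
    query = urllib.parse.urlencode({
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
    })
    with urllib.request.urlopen(f"{STATS_URL}?{query}") as resp:
        return json.load(resp)

def is_crawl_finished(stats):
    """Assumed heuristic: the crawl is done when nothing is queued.
    The 'pages_queued' key is an assumption about the JSON shape."""
    return int(stats.get("pages_queued", 0)) == 0

def wait_for_crawl(email, api_key, core_name, interval=30):
    """Poll until the queue drains, printing each stats snapshot."""
    while True:
        stats = fetch_crawl_stats(email, api_key, core_name)
        print(stats)
        if is_crawl_finished(stats):
            return stats
        time.sleep(interval)
```

A dashboard would typically call `fetch_crawl_stats` on a timer instead of blocking in `wait_for_crawl`.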

Need help with the Opensolr Web Crawler? We are here to help.

Contact Support

API - Start the Web Crawler

Launch the Opensolr web crawler for your index. Configure crawl mode, scope, threads, rendering engine, traffic limits, and more — all via API parameters.

Endpoint

GET https://opensolr.com/solr_manager/api/start_crawl

Parameters

Parameter       Status     Description
---------       ------     -----------
email           Required   Your Opensolr registration email address
api_key         Required   Your Opensolr API key
core_name       Required   The name of the index to start crawling for
follow_docs     Optional   Follow documents and images? yes / no (default: no)
clean           Optional   Start fresh or resume? yes = start fresh, no = resume (default: no)
auth_username   Optional   HTTP Basic Auth username if your starting URLs require authentication
auth_password   Optional   HTTP Basic Auth password
mode            Optional   Crawl mode 1-6 (default: 1). See the mode table below
max_threads     Optional   Number of concurrent crawler threads (default: 10)
relax           Optional   Delay between requests in microseconds (default: 100000 = 0.1s)
max_traffic     Optional   Maximum traffic in GB (default: plan limit)
max_pages       Optional   Maximum pages to crawl (default: plan limit)
max_filesize    Optional   Maximum file size per page in KB (default: plan limit)
renderer        Optional   Rendering engine: curl or chrome (default: curl). See the table below
schedule        Optional   Create a recurring cron schedule? yes = crawler re-runs automatically to pick up new content, no = one-time crawl only (default: yes). Use no for on-demand reindex operations.

Crawl Modes

Mode   Name                   Scope        Description
----   ----                   -----        -----------
1      Follow Domain Links    Full depth   Follows all links across the entire domain, including subdomains. Starting from www.site.com, it will also crawl shop.site.com, blog.site.com, etc.
2      Follow Host Links      Full depth   Stays on the exact hostname only. Starting from www.site.com, links to shop.site.com are ignored.
3      Follow Path Links      Full depth   Stays within the URL path prefix on the same host. Starting from site.com/blog/, it will crawl site.com/blog/2024/post but skip site.com/about.
4      Shallow Domain Crawl   Depth 1      Same domain scope as Mode 1, but only discovers links from the start page and its direct children.
5      Shallow Host Crawl     Depth 1      Same host scope as Mode 2, but only discovers links from the start page and its direct children.
6      Shallow Path Crawl     Depth 1      Same path scope as Mode 3, but only discovers links from the start page and its direct children. Useful for quickly indexing a product catalog or documentation section.

Full depth = the crawler keeps discovering and following new links at every level. Depth 1 = the crawler reads the start page, follows links it finds there, but stops discovering new links after that.
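The mode numbers are easy to mix up, so a small helper that maps the scope and depth combinations from the table above onto the numeric `mode` values can keep call sites readable. This is a hypothetical local convenience, not part of the Opensolr API; the string keys are a naming choice for this sketch.

```python
# Map (scope, depth) combinations from the crawl-mode table onto the
# numeric `mode` values accepted by start_crawl. The string keys are a
# local naming choice, not Opensolr parameters.
CRAWL_MODES = {
    ("domain", "full"): 1,     # follow links across all subdomains
    ("host", "full"): 2,       # stay on the exact hostname
    ("path", "full"): 3,       # stay under the starting URL path
    ("domain", "shallow"): 4,  # domain scope, depth 1
    ("host", "shallow"): 5,    # host scope, depth 1
    ("path", "shallow"): 6,    # path scope, depth 1
}

def crawl_mode(scope: str, depth: str = "full") -> int:
    """Return the start_crawl `mode` number for a scope/depth pair."""
    try:
        return CRAWL_MODES[(scope, depth)]
    except KeyError:
        raise ValueError(f"unknown scope/depth combination: {scope!r}/{depth!r}")
```

For example, `crawl_mode("path", "shallow")` returns 6, the mode suggested above for quickly indexing a documentation section.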

Rendering Engines

Value    Name                    Description
-----    ----                    -----------
curl     Curl (Fast)             Fast HTTP fetch with no browser overhead. Best for most websites: server-rendered HTML, static sites, WordPress, Drupal, etc. Pages are fetched in ~0.2s each.
chrome   Chrome (JS Rendering)   Every page is rendered through a headless Chromium browser (Playwright). Use for JavaScript-heavy SPAs built with React, Vue, Angular, Next.js, etc. Adds ~0.5–1s per page.

All parameters are saved and preserved across scheduled cron restarts. Use schedule=no for one-time crawl runs (e.g. Reindex From Scratch) that should not create or modify any cron schedule. If you pause and resume the crawler, or if the cron schedule restarts it, your original settings (threads, relax delay, traffic limits, etc.) are retained.

Code Examples

cURL — Start a basic crawl

curl -s "https://opensolr.com/solr_manager/api/start_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core"

cURL — Full crawl with all options

curl -s "https://opensolr.com/solr_manager/api/start_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core&clean=yes&mode=2&max_threads=5&relax=200000&renderer=chrome&follow_docs=yes&max_pages=10000&schedule=no"

PHP

$params = http_build_query([
    'email'       => 'YOUR_EMAIL',
    'api_key'     => 'YOUR_API_KEY',
    'core_name'   => 'my_solr_core',
    'clean'       => 'yes',
    'mode'        => 2,
    'max_threads' => 5,
    'renderer'    => 'chrome',
]);
$response = file_get_contents("https://opensolr.com/solr_manager/api/start_crawl?{$params}");
$result = json_decode($response, true);
print_r($result);

Python

import requests

response = requests.get("https://opensolr.com/solr_manager/api/start_crawl", params={
    "email": "YOUR_EMAIL",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_solr_core",
    "clean": "yes",
    "mode": 2,
    "max_threads": 5,
    "renderer": "chrome",
})
print(response.json())


API - Stop the Web Crawler

Permanently stops the web crawler by killing all running processes and removing the cron schedule. The crawler will not run again until manually restarted. If you only want to temporarily pause crawling, use the Pause API instead.

Endpoint

GET https://opensolr.com/solr_manager/api/stop_crawl

Parameters

Parameter   Status     Description
---------   ------     -----------
email       Required   Your Opensolr registration email address
api_key     Required   Your Opensolr API key
core_name   Required   The name of the index you wish to stop the web crawler for

Example

https://opensolr.com/solr_manager/api/stop_crawl?email=YOUR_EMAIL&api_key=YOUR_API_KEY&core_name=my_solr_core
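Following the pattern of the other endpoints, a stop call from Python might look like the stdlib-only sketch below. The endpoint and parameters are exactly those documented for stop_crawl above; the helper names are a local choice.

```python
import json
import urllib.parse
import urllib.request

STOP_URL = "https://opensolr.com/solr_manager/api/stop_crawl"

def build_stop_url(email: str, api_key: str, core_name: str) -> str:
    """Assemble the stop_crawl request URL from the documented parameters."""
    query = urllib.parse.urlencode({
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
    })
    return f"{STOP_URL}?{query}"

def stop_crawl(email: str, api_key: str, core_name: str):
    """Kill the crawler processes and remove the cron schedule for the index,
    assuming the endpoint returns a JSON body like the other API calls."""
    with urllib.request.urlopen(build_stop_url(email, api_key, core_name)) as resp:
        return json.load(resp)
```

Remember that this is permanent: after `stop_crawl` the crawler stays down until you call start_crawl again.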