API - Start the Web Crawler
- GET https://opensolr.com/solr_manager/api/start_crawl
- Parameters:
- email - (required) your opensolr registration email address
- api_key - (required) your opensolr api_key
- core_name - (required) the name of the index you wish to start the web crawler process for
- follow_docs - (optional) follow documents and images? (yes/no). Default: no
- clean - (optional) start fresh, or resume from where you left off? (yes/no). Default: no.
- auth_username - (optional) if your starting URLs are using Basic HTTP Auth, you can enter the username here.
- auth_password - (optional) if your starting URLs are using Basic HTTP Auth, you can enter the password here.
- mode - (optional) the crawl mode that controls how far the crawler follows links. Possible values: 1 through 6. Default: 1.
| Mode | Name | Scope | Description |
| --- | --- | --- | --- |
| 1 | Follow Domain Links | Full depth | Follows all links across the entire domain, including subdomains. Starting from www.site.com, it will also crawl shop.site.com, blog.site.com, etc. |
| 2 | Follow Host Links | Full depth | Stays on the exact hostname only. Starting from www.site.com, links to shop.site.com are ignored. |
| 3 | Follow Path Links | Full depth | Stays within the URL path prefix on the same host. Starting from site.com/blog/, it will crawl site.com/blog/2024/post but skip site.com/about. |
| 4 | Shallow Domain Crawl | Depth 1 | Same domain scope as Mode 1, but only discovers links from the start page and its direct children. Pages found deeper are crawled but don't contribute new links to the queue. |
| 5 | Shallow Host Crawl | Depth 1 | Same host scope as Mode 2, but only discovers links from the start page and its direct children. Useful for a quick, shallow index of a single subdomain. |
| 6 | Shallow Path Crawl | Depth 1 | Same path scope as Mode 3, but only discovers links from the start page and its direct children. Useful for quickly indexing a product catalog or documentation section without going deep. |

Full depth = the crawler keeps discovering and following new links at every level.
Depth 1 = the crawler reads the start page and follows the links it finds there, but stops discovering new links after that.
- max_threads - (optional) number of concurrent crawler threads. Default: 10. Higher values speed up crawling but use more bandwidth.
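The scoping rules for the three full-depth modes can be sketched as URL predicates. This is an illustrative approximation only, not OpenSolr's actual implementation; the helper names are hypothetical, and the domain check is a naive two-label heuristic.

```python
from urllib.parse import urlparse

def registered_domain(host: str) -> str:
    # Naive two-label heuristic for illustration only
    # (does not handle multi-part TLDs like .co.uk).
    return ".".join(host.split(".")[-2:])

def in_scope(start_url: str, candidate_url: str, mode: int) -> bool:
    """Approximate the link-scoping rules for crawl modes 1-3."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if mode == 1:  # Follow Domain Links: any subdomain of the same domain
        return registered_domain(cand.netloc) == registered_domain(start.netloc)
    if mode == 2:  # Follow Host Links: exact hostname match only
        return cand.netloc == start.netloc
    if mode == 3:  # Follow Path Links: same host AND same path prefix
        return cand.netloc == start.netloc and cand.path.startswith(start.path)
    raise ValueError("modes 4-6 apply the same scopes with depth-1 link discovery")
```

Modes 4 to 6 reuse these same predicates; they differ only in that new links are discovered no deeper than the start page's direct children.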
- relax - (optional) delay between requests in microseconds. Controls crawl politeness. Default: 100000 (0.1s). Higher values are more polite to the target server.
- max_traffic - (optional) maximum allowed traffic in GB. Default: your plan limit.
- max_pages - (optional) maximum number of pages to crawl. Default: your plan limit.
- max_filesize - (optional) maximum file size to download per page, in KB. Default: your plan limit.
- renderer - (optional) the rendering engine to use for fetching pages. Possible values: curl or chrome. Default: curl.
| Value | Name | Description |
| --- | --- | --- |
| curl | Curl (Fast) | Fast HTTP fetch with no browser overhead. Best for most websites: server-rendered HTML, static sites, WordPress, Drupal, etc. Pages are fetched in ~0.2s each. |
| chrome | Chrome (JS Rendering) | Every page is rendered through a headless Chromium browser (Playwright). Use for JavaScript-heavy SPAs built with React, Vue, Angular, Next.js, etc. Adds ~0.5–1s per page. |
Example: https://opensolr.com/solr_manager/api/start_crawl?email=PLEASE_LOG_IN&api_key=PLEASE_LOG_IN&core_name=my_solr_core
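Since all parameters travel in the query string of a GET request, the call can be assembled with any HTTP client. A minimal Python sketch follows; the helper name and the example credential values are illustrative, not part of the API.

```python
from urllib.parse import urlencode

BASE = "https://opensolr.com/solr_manager/api/start_crawl"

def build_start_crawl_url(email: str, api_key: str, core_name: str, **options) -> str:
    """Assemble the start_crawl request URL.

    `options` may carry any of the optional parameters documented above,
    e.g. mode=3, renderer="chrome", max_threads=5, relax=500000.
    """
    params = {"email": email, "api_key": api_key, "core_name": core_name}
    params.update(options)
    return BASE + "?" + urlencode(params)

# Example: a polite, path-scoped crawl rendered with headless Chrome.
url = build_start_crawl_url(
    "you@example.com", "YOUR_API_KEY", "my_solr_core",
    mode=3, renderer="chrome", max_threads=5, relax=500000,
)
```

The resulting URL can then be fetched with any client (e.g. `urllib.request.urlopen(url)` or `curl`) to start the crawl.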
Note: All parameters are saved and preserved across scheduled cron restarts. If you pause and resume the crawler, or if the cron schedule restarts it, your original settings (threads, relax delay, traffic limits, etc.) are retained.