Data Ingestion API — Push Documents to Your Opensolr Index Programmatically
Push documents directly into any Opensolr Web Crawler index from your CMS, application, or data pipeline. Available exclusively for indexes on Web Crawler servers. Each document is automatically enriched with vector embeddings, sentiment analysis, language detection, and all the derived search fields — identical to what the Web Crawler produces. Works alongside the crawler or as a standalone ingestion method.
Data Ingestion is only available for indexes hosted on Web Crawler servers; requests against any other index are rejected with an ERROR_CORE_NOT_WEBCRAWLER_ENABLED error. To create a Web Crawler index, see Getting Started.

If you restrict access to your index's /update request handler, allow the following IP addresses — otherwise ingestion writes will be blocked with a 403 error:

- 5.161.242.87 — api.opensolr.com (processes and writes documents)
- 148.251.180.234 — opensolr.com (proxies requests)

Your index must use the Web Crawler configuration set (schema.xml, solrconfig.xml, search.xml + supporting text files). Upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

How It Works
1. Your App — POST JSON docs
2. API Gateway — auth, rate limit, quota check, validate
3. Job Queue — queued for async processing
4. Enrichment — embeddings, sentiment, language, derived fields
5. Solr Index — documents searchable
You submit a batch of up to 50 documents per request. The API validates your payload, checks your disk quota, and queues the job. A background processor then enriches each document and pushes it into your Solr index. You get a job_id immediately so you can poll for progress.
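Since a single request carries at most 50 documents, larger exports need to be split into batches and submitted one request at a time. A minimal Python sketch of that loop, using only the standard library (`chunk` and `submit_all` are illustrative helper names, not part of the API; the endpoint and payload shape are the ones documented on this page):

```python
import json
import urllib.request

INGEST_URL = "https://api.opensolr.com/solr_manager/api/ingest"

def chunk(docs, batch_size=50):
    """Split a document list into batches of at most `batch_size` (50 is the API maximum)."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

def submit_all(docs, email, api_key, core_name):
    """Submit every batch and collect one job_id per queued request."""
    job_ids = []
    for batch in chunk(docs):
        payload = json.dumps({
            "email": email,
            "api_key": api_key,
            "core_name": core_name,
            "documents": batch,
        }).encode("utf-8")
        req = urllib.request.Equest if False else urllib.request.Request(
            INGEST_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if not result.get("status"):
            raise RuntimeError(f"Batch rejected: {result.get('msg')}")
        job_ids.append(result["job_id"])  # one job_id per queued batch
    return job_ids
```

Each returned job_id can then be polled independently via ingest_status.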
Endpoint Reference
Submit Documents
```
POST https://api.opensolr.com/solr_manager/api/ingest

Parameters (form-encoded or JSON body):
  email     = your@email.com
  api_key   = your_api_key
  core_name = your_index_name
  documents = [JSON array of document objects]
```
Response:
```json
{
  "status": true,
  "msg": "QUEUED",
  "job_id": "a1b2c3d4e5f6...",
  "total_docs": 25,
  "doc_ids": ["e8392e28...", "a3f1c9d0..."]
}
```
Check Job Status
```
GET https://api.opensolr.com/solr_manager/api/ingest_status

Parameters:
  email   = your@email.com
  api_key = your_api_key
  job_id  = a1b2c3d4e5f6...
```
Response:
```json
{
  "status": true,
  "job": {
    "state_label": "completed",
    "total_docs": 25,
    "processed_docs": 25,
    "success_docs": 24,
    "failed_docs": 1,
    "result": [...per-doc details...]
  }
}
```
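Since ingestion is asynchronous, client code typically polls ingest_status until the job reaches a terminal state. A hedged sketch (`is_done` and `wait_for_job` are illustrative helpers; "completed" is the only terminal state_label shown in the response above, and also treating processed_docs >= total_docs as done is a defensive assumption, not documented API behavior):

```python
import json
import time
import urllib.parse
import urllib.request

STATUS_URL = "https://api.opensolr.com/solr_manager/api/ingest_status"

def is_done(job):
    """True when the job has finished processing all of its documents."""
    return (job["state_label"] == "completed"
            or job["processed_docs"] >= job["total_docs"])

def wait_for_job(email, api_key, job_id, interval=5, timeout=300):
    """Poll ingest_status every `interval` seconds until done or timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        params = urllib.parse.urlencode(
            {"email": email, "api_key": api_key, "job_id": job_id})
        with urllib.request.urlopen(f"{STATUS_URL}?{params}") as resp:
            job = json.load(resp)["job"]
        if is_done(job):
            return job  # contains success_docs, failed_docs, per-doc result
        time.sleep(interval)
    raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")
```

The returned job dict carries success_docs, failed_docs, and the per-document result details for post-run reporting.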
Document Fields
| Field | Type | Status | Description |
|---|---|---|---|
| `uri` | URL | required | Canonical URL that uniquely identifies this document. Must be a valid http or https URL. The document id is always generated as md5(uri). Submitting a document with the same URI will update the existing document. Also used as the deduplication key — duplicate URIs in pending jobs are rejected. |
| `title` | string | required | Document title. Always required. |
| `description` | string | required | Short description or summary. |
| `text` | string | required* | The main body text. Do not send HTML. Required for standard documents. When using rtf:true, this field is optional and will be auto-populated from the extracted document content. |
| `id` | string | auto-generated | Always generated as md5(uri). You do not set this — it is derived from the uri you provide. The same URI always produces the same ID, which is how deduplication and updates work. Returned in the API response as doc_ids. |
| `timestamp` | int or date string | recommended | Content publication date. Unix epoch (1709913600) or parseable date string (2024-03-08 14:00:00). Used to derive creation_date and meta_creation_date for freshness boost in search. |
| `og_image` | URL | optional | Thumbnail image URL. Shown in search results. |
| `meta_icon` | URL | optional | Favicon URL for your site. |
| `meta_og_locale` | string | optional | Locale code (e.g. en_US, de_DE). Used in search UI locale filtering. |
| `meta_detected_language` | string | optional | Language code (e.g. en, fr). If not provided, auto-detected from title + description. |
| `category` | string | optional | Content category for faceted filtering. |
| `content_type` | string | optional | MIME type of the document (e.g. text/html, application/pdf). Defaults to text/html if not provided. This field controls how the document appears in the Search UI — text/html shows as a web result, while other types (PDF, DOCX, etc.) show in the Media/Docs tab. When using rtf:true, the MIME type is auto-detected from the actual file content, so you don't need to set it manually. |
| `rtf` | boolean | optional | Set to true to enable plain text extraction from a remote document. When enabled, the system fetches the file at uri and extracts its plain text content, which is then used to populate only the text field. All other fields (title, description, etc.) must still be provided by you. Supported formats: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, CSV, and plain text files. When rtf:true is set, the text field becomes optional (it will be filled automatically from the extracted content). If the extraction fails, the job will report a detailed error. |
| `meta_domain` | string | optional | Source domain. Auto-derived from uri if not provided. |
| `price_f` | float | optional | Product price as a decimal number (e.g. 29.99). For e-commerce indexes only. Enables price filtering, sorting, and range facets in search results. If not provided, the document is simply excluded from price filters. |
| `currency_s` | string | optional | Currency code (e.g. USD, EUR). Used alongside price_f for price display in search results. Only needed if price_f is set. |
* text is required for standard documents but optional when rtf:true is set (text is extracted automatically from the document URL). The four absolutely mandatory fields are: uri, title, description, and text (unless rtf:true).
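Because the id is deterministically md5(uri), you can also compute it client-side, e.g. to correlate your own CMS records with the doc_ids returned by the API. A one-function sketch (assuming, consistent with the hex doc_ids shown in the response example above, that the id is the lowercase hex digest):

```python
import hashlib

def opensolr_doc_id(uri: str) -> str:
    """Compute the document id as the md5 hex digest of the canonical URI."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

# The same URI always yields the same id, which is how updates
# and deduplication work.
```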
Auto-Generated Fields
The following fields are generated automatically for every document — you never need to send them:
- `tags` & `title_tags` — Edge n-gram fields for autocomplete and fuzzy matching. Built from title + description + text.
- `embeddings` — 1024-dimensional BGE-m3 vector embeddings of title + description for semantic / hybrid search.
- `sentiment` — VADER sentiment scores: positive, negative, neutral, and compound. Computed from title + description.
- `language` — Auto-detected from title + description using langid if you don't provide meta_detected_language.
- `spell` — Spellcheck field for "did you mean?" suggestions. Built from tags.
- `phonetic_*` — Phonetic title, description, and text for sounds-like matching across languages.
Using It with the Web Crawler
Update Existing Documents
Sending a document with the same uri as one already in the index will completely replace it, since the ID is always md5(uri). This is how you keep your index in sync with your CMS.
When you submit a document with the same uri as an existing one, the entire previous document is deleted and replaced with the new one. Any fields you do not include in the updated document will be lost. Always send the complete document with all fields, even if you only need to change one or two of them.

```json
// The crawler indexed https://example.com/products/widget-a
// Submit the same uri via API to update it (id = md5(uri) is generated automatically):
{
  "title": "Widget A — Updated Price",
  "description": "Now only $29.99",
  "text": "Full updated product description...",
  "uri": "https://example.com/products/widget-a",
  "content_type": "text/html",
  "timestamp": 1741392000
}
```
The document ID is always generated as md5(uri) — which is the same ID the Web Crawler generates. So submitting a document with the same uri that the crawler indexed will naturally update it. You never need to know or manage document IDs manually.
Submission Methods
You can submit documents in two ways: as a JSON body in the POST request, or as a JSON file upload. Both methods support the same payload format. File upload is ideal for large batches or when integrating from systems that generate JSON export files.
Method 1: cURL — JSON Body
```shell
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
      {
        "uri": "https://example.com/page-1",
        "title": "Page One",
        "description": "First page description",
        "text": "Full text content of page one.",
        "content_type": "text/html"
      },
      {
        "uri": "https://example.com/page-2",
        "title": "Page Two",
        "description": "Second page description",
        "text": "Full text content of page two.",
        "content_type": "text/html",
        "category": "Docs",
        "timestamp": 1741392000
      }
    ]
  }'
```
Method 2: cURL — JSON File Upload
Save your payload as a .json file and upload it via the payload_file form field. The file can contain the full payload (with email, api_key, core_name, and documents), or just the documents array (with auth fields as separate form fields).
```shell
# Full payload in the file (recommended for large batches):
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "payload_file=@my_documents.json"

# Or auth as form fields, documents in the file:
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "email=you@example.com" \
  -F "api_key=YOUR_API_KEY" \
  -F "core_name=my_index" \
  -F "payload_file=@my_documents.json"
```
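If your documents live in memory rather than in an export file, you can generate my_documents.json yourself before uploading it. A small sketch (`write_payload_file` is an illustrative helper name; the payload shape matches the full-payload file format described above):

```python
import json

def write_payload_file(path, email, api_key, core_name, docs):
    """Write a full payload file suitable for the payload_file upload."""
    payload = {
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
        "documents": docs,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return payload
```

The resulting file can then be submitted with the curl -F "payload_file=@..." form shown above.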
PHP Example
```php
// Method 1: JSON body
$payload = [
    'email'     => 'you@example.com',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_index',
    'documents' => [
        [
            'uri'          => 'https://example.com/page-1',
            'title'        => 'Page One',
            'description'  => 'First page description',
            'text'         => 'Full text content of page one.',
            'content_type' => 'text/html',
        ],
        [
            'uri'          => 'https://example.com/page-2',
            'title'        => 'Page Two',
            'description'  => 'Second page description',
            'text'         => 'Full text content of page two.',
            'content_type' => 'text/html',
            'category'     => 'Docs',
            'timestamp'    => 1741392000,
        ],
    ],
];

$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($payload),
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

if ($response['status']) {
    echo "Queued! Job ID: " . $response['job_id'] . "\n";
    echo "Doc IDs: " . implode(', ', $response['doc_ids']) . "\n";
} else {
    echo "Error: " . $response['msg'] . "\n";
    if (!empty($response['errors'])) {
        foreach ($response['errors'] as $err) {
            echo " - $err\n";
        }
    }
}

// Method 2: File upload
$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => [
        'payload_file' => new CURLFile('/path/to/my_documents.json', 'application/json'),
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);
```
Python Example
```python
import requests

# Method 1: JSON body
payload = {
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
        {
            "uri": "https://example.com/page-1",
            "title": "Page One",
            "description": "First page description",
            "text": "Full text content of page one.",
            "content_type": "text/html",
        },
        {
            "uri": "https://example.com/page-2",
            "title": "Page Two",
            "description": "Second page description",
            "text": "Full text content of page two.",
            "content_type": "text/html",
            "category": "Docs",
            "timestamp": 1741392000,
        },
    ],
}

resp = requests.post(
    "https://api.opensolr.com/solr_manager/api/ingest",
    json=payload,
)
data = resp.json()

if data["status"]:
    print(f"Queued! Job ID: {data['job_id']}")
    print(f"Doc IDs: {data['doc_ids']}")
else:
    print(f"Error: {data['msg']}")
    for err in data.get("errors", []):
        print(f" - {err}")

# Method 2: File upload
with open("my_documents.json", "rb") as f:
    resp = requests.post(
        "https://api.opensolr.com/solr_manager/api/ingest",
        files={"payload_file": ("payload.json", f, "application/json")},
    )
print(resp.json())

# Check job status
status = requests.get(
    "https://api.opensolr.com/solr_manager/api/ingest_status",
    params={
        "email": "you@example.com",
        "api_key": "YOUR_API_KEY",
        "job_id": data["job_id"],
    },
).json()
print(f"State: {status['job']['state_label']}, "
      f"Progress: {status['job']['processed_docs']}/{status['job']['total_docs']}")
```
Limits & Quotas
- Up to 50 documents per request.
- An API rate limit applies across all endpoints.
Your index disk quota is also enforced — if your index is at or near its size limit, the ingestion request will be rejected with a clear error showing your current usage vs. maximum allowed.
Error Responses
| Error Code | Meaning |
|---|---|
| `ERROR_AUTHENTICATION_FAILED` | Invalid email or api_key. |
| `ERROR_NOT_CORE_OWNER` | You don't own this index. |
| `ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX` | Reduce your batch to 50 or fewer documents. |
| `ERROR_DISK_QUOTA_EXCEEDED` | Your index is at its size limit. Upgrade your plan or delete old documents. |
| `VALIDATION_ERRORS` | One or more documents failed validation. Check the errors array in the response. |
| `DUPLICATE_DOCUMENTS` | One or more documents have a URI that is already queued for processing in this index. Wait for the pending job to complete or cancel it before resubmitting. |
| `ERROR_PAYLOAD_TOO_LARGE` | Request body exceeds server limits. |
| `ERROR_CORE_NOT_WEBCRAWLER_ENABLED` | This index is not on a Web Crawler server. Data Ingestion is only available for Web Crawler indexes; see Getting Started to create one. |
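Client code usually wants to map these codes to a recovery action. A purely advisory sketch (the mapping is a suggestion drawn from the table above, not part of the API; `next_action` is an illustrative helper name):

```python
def next_action(error_code: str) -> str:
    """Suggest a client-side recovery action for an ingest error code."""
    actions = {
        "ERROR_AUTHENTICATION_FAILED": "fix email/api_key credentials",
        "ERROR_NOT_CORE_OWNER": "check core_name against your account",
        "ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX": "split the batch and retry",
        "ERROR_DISK_QUOTA_EXCEEDED": "free space or upgrade, then retry",
        "VALIDATION_ERRORS": "inspect the errors array and fix the documents",
        "DUPLICATE_DOCUMENTS": "wait for the pending job, then resubmit",
        "ERROR_PAYLOAD_TOO_LARGE": "split the batch and retry",
        "ERROR_CORE_NOT_WEBCRAWLER_ENABLED": "use a Web Crawler index",
    }
    return actions.get(error_code, "unrecognized error code")
```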
The text field should contain clean plain text, not HTML. If you are exporting from a CMS like Drupal or WordPress, strip all HTML tags before sending. The title and description should also be plain text.
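For simple cases, the Python standard library is enough to strip HTML before sending. A sketch using html.parser (production pipelines with messy markup may prefer a dedicated HTML-cleaning library):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    """Return the plain-text content of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())
```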
Related Documentation
- Web Crawler Overview — Complete guide to the Opensolr Web Crawler: crawl modes, features, live demos, analytics, and more.
- Getting Started — Step-by-step setup: create an account, pick a Web Crawler server, start crawling your site.
- Index Field Reference — Complete reference for every field in a Web Crawler index: what each field stores and how it is used in search.
- Crawler Control API — Start, stop, pause, resume the crawler, check stats, and flush the crawl buffer, all via REST API.
- Querying the Solr API — Search parameters, filtering, facets, pagination, and sorting: everything you need to query your index.
Manage Your Ingestion Queue
View all your ingestion jobs, monitor progress, pause, resume, retry, or delete queued jobs. Click any Job ID to see full details including payload and errors. Use Run Now per index or per job to trigger immediate processing without waiting for the cron cycle.
Need higher rate limits or larger batch sizes for your integration? We can set custom thresholds.
Contact Us