Data Ingestion API — Push Documents to Your Opensolr Index Programmatically

API Endpoint

Data Ingestion API

Push documents directly into any Opensolr Web Crawler index from your CMS, application, or data pipeline. Available exclusively for indexes on Web Crawler servers. Each document is automatically enriched with vector embeddings, sentiment analysis, language detection, and all the derived search fields — identical to what the Web Crawler produces. Works alongside the crawler or as a standalone ingestion method.

Web Crawler indexes only. The Data Ingestion API is available exclusively for indexes created on Web Crawler servers. If your index is not on a Web Crawler server, you will receive an ERROR_CORE_NOT_WEBCRAWLER_ENABLED error. To create a Web Crawler index, see Getting Started.
IP Restriction: If your index has IP restrictions enabled (under Security in the Index Control Panel), you must add the following IPs to the allowed list for the /update request handler — otherwise ingestion writes will be blocked with a 403 error:
  • 5.161.242.87 — api.opensolr.com (processes and writes documents)
  • 148.251.180.234 — opensolr.com (proxies requests)
Required Config Set: Data Ingestion requires the Web Crawler config set on your index. For Solr 9, download and apply the mandatory config set for Solr 9 (schema.xml, solrconfig.xml, search.xml + supporting text files). Upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

How It Works

Your App → API Gateway → Job Queue → Enrichment → Solr Index

1. Your App — POST JSON docs.
2. API Gateway — auth, rate limiting, quota check, validation.
3. Job Queue — queued for async processing.
4. Enrichment — embeddings, sentiment, language, derived fields.
5. Solr Index — documents become searchable.

You submit a batch of up to 50 documents per request. The API validates your payload, checks your disk quota, and queues the job. A background processor then enriches each document and pushes it into your Solr index. You get a job_id immediately so you can poll for progress.
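The submit-then-poll workflow above can be sketched as a small helper. This is a sketch, not official client code: `get_status` stands in for whatever function you write to call the `ingest_status` endpoint and return its `job` object, and the `"failed"` terminal label is an assumption (the docs only show `"completed"`).

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=300.0):
    """Poll an ingestion job until it reaches a terminal state.

    get_status() should return the "job" object from the
    ingest_status endpoint, e.g. {"state_label": "completed", ...}.
    The "failed" label is assumed here; adjust to the labels
    your jobs actually report.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status()
        if job["state_label"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("ingestion job did not finish in time")
```

In practice `get_status` would wrap a `GET` to `/solr_manager/api/ingest_status` with your `email`, `api_key`, and the `job_id` returned by the submit call.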

Endpoint Reference

Submit Documents

POST https://api.opensolr.com/solr_manager/api/ingest

// Parameters (form-encoded or JSON body):
email       = your@email.com
api_key     = your_api_key
core_name   = your_index_name
documents   = [JSON array of document objects]

Response:

{
  "status": true,
  "msg": "QUEUED",
  "job_id": "a1b2c3d4e5f6...",
  "total_docs": 25,
  "doc_ids": ["e8392e28...", "a3f1c9d0..."]
}

Check Job Status

GET https://api.opensolr.com/solr_manager/api/ingest_status

email       = your@email.com
api_key     = your_api_key
job_id      = a1b2c3d4e5f6...

Response:

{
  "status": true,
  "job": {
    "state_label": "completed",
    "total_docs": 25,
    "processed_docs": 25,
    "success_docs": 24,
    "failed_docs": 1,
    "result": [...per-doc details...]
  }
}

Document Fields

uri (URL, required): Canonical URL that uniquely identifies this document. Must be a valid http or https URL. The document id is always generated as md5(uri). Submitting a document with the same URI will update the existing document. Also used as the deduplication key — duplicate URIs in pending jobs are rejected.

title (string, required): Document title. Always required.

description (string, required): Short description or summary.

text (string, required*): The main body text. Do not send HTML. Required for standard documents. When using rtf:true, this field is optional and will be auto-populated from the extracted document content.

id (string, auto-generated): Always generated as md5(uri). You do not set this — it is derived from the uri you provide. The same URI always produces the same ID, which is how deduplication and updates work. Returned in the API response as doc_ids.

timestamp (int or date string, recommended): Content publication date. Unix epoch (1709913600) or parseable date string (2024-03-08 14:00:00). Used to derive creation_date and meta_creation_date for freshness boost in search.

og_image (URL, optional): Thumbnail image URL. Shown in search results.

meta_icon (URL, optional): Favicon URL for your site.

meta_og_locale (string, optional): Locale code (e.g. en_US, de_DE). Used in search UI locale filtering.

meta_detected_language (string, optional): Language code (e.g. en, fr). If not provided, auto-detected from title + description.

category (string, optional): Content category for faceted filtering.

content_type (string, optional): MIME type of the document (e.g. text/html, application/pdf). Defaults to text/html if not provided. This field controls how the document appears in the Search UI — text/html shows as a web result, while other types (PDF, DOCX, etc.) show in the Media/Docs tab. When using rtf:true, the MIME type is auto-detected from the actual file content, so you don't need to set it manually.

rtf (boolean, optional): Set to true to enable plain text extraction from a remote document. When enabled, the system fetches the file at uri and extracts its plain text content, which is then used to populate only the text field. All other fields (title, description, etc.) must still be provided by you. Supported formats: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, CSV, and plain text files. When rtf:true is set, the text field becomes optional (it will be filled automatically from the extracted content). If the extraction fails, the job will report a detailed error.

meta_domain (string, optional): Source domain. Auto-derived from uri if not provided.

price_f (float, optional): Product price as a decimal number (e.g. 29.99). For e-commerce indexes only. Enables price filtering, sorting, and range facets in search results. If not provided, the document is simply excluded from price filters.

currency_s (string, optional): Currency code (e.g. USD, EUR). Used alongside price_f for price display in search results. Only needed if price_f is set.

* text is required for standard documents but optional when rtf:true is set (text is extracted automatically from the document URL). The four absolutely mandatory fields are: uri, title, description, and text (unless rtf:true).
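Because the id is always md5(uri), you can compute it client-side to correlate your own records with the doc_ids the API returns. A minimal sketch, assuming the hash is the hex digest of the URI string exactly as submitted (no server-side normalization):

```python
import hashlib


def opensolr_doc_id(uri: str) -> str:
    # md5 hex digest of the URI string; assumes no extra
    # normalization is applied before hashing.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()
```

The same URI always yields the same 32-character hex id, which is how updates and deduplication work.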

Auto-Generated Fields

The following fields are generated automatically for every document — you never need to send them:

tags & title_tags

Edge n-gram fields for autocomplete and fuzzy matching. Built from title + description + text.

embeddings

1024-dimensional BGE-m3 vector embeddings of title + description for semantic / hybrid search.

sentiment

VADER sentiment scores: positive, negative, neutral, and compound. Computed from title + description.

language

Auto-detected from title + description using langid if you don't provide meta_detected_language.

spell

Spellcheck field for "did you mean?" suggestions. Built from tags.

phonetic_*

Phonetic title, description, and text for sounds-like matching across languages.

Using It with the Web Crawler

Works in tandem. The Data Ingestion API and the Web Crawler share the same index and the same document schema. You can use the crawler to index your public website, and the API to push content that the crawler can't reach — like gated pages, internal databases, CMS drafts, or product feeds. Documents from both sources coexist seamlessly in the same index.

Update Existing Documents

Sending a document with the same uri as one already in the index will completely replace it, since the ID is always md5(uri). This is how you keep your index in sync with your CMS.

⚠ Important: Updates are full document replacements, not partial. When you send a document with the same uri as an existing one, the entire previous document is deleted and replaced with the new one. Any fields you do not include in the updated document will be lost. Always send the complete document with all fields, even if you only need to change one or two of them.
// The crawler indexed https://example.com/products/widget-a
// Submit the same uri via API to update it (id = md5(uri) is generated automatically):
{
  "title": "Widget A — Updated Price",
  "description": "Now only $29.99",
  "text": "Full updated product description...",
  "uri": "https://example.com/products/widget-a",
  "content_type": "text/html",
  "timestamp": 1741392000
}

The document ID is always generated as md5(uri) — which is the same ID the Web Crawler generates. So submitting a document with the same uri that the crawler indexed will naturally update it. You never need to know or manage document IDs manually.
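Since updates are full replacements, a safe pattern is to keep (or fetch) the complete current document, merge your partial changes into it, and resubmit the whole thing. A sketch of that merge step, with a hypothetical `merged_update` helper (how you fetch the existing document, e.g. via a Solr query, is up to you):

```python
def merged_update(existing: dict, changes: dict) -> dict:
    """Build a complete replacement document.

    'existing' is the full document currently in the index;
    'changes' contains only the fields you want to modify.
    Keep the uri unchanged so the id (md5(uri)) matches the
    existing document and the update replaces it.
    """
    doc = {**existing, **changes}
    # Guard against accidentally dropping mandatory fields
    # (text may be omitted only when rtf:true is used).
    for field in ("uri", "title", "description"):
        if field not in doc:
            raise ValueError(f"replacement doc is missing required field: {field}")
    return doc
```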

Submission Methods

You can submit documents in two ways: as a JSON body in the POST request, or as a JSON file upload. Both methods support the same payload format. File upload is ideal for large batches or when integrating from systems that generate JSON export files.

Method 1: cURL — JSON Body

curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"email":"you@example.com","api_key":"YOUR_API_KEY","core_name":"my_index","documents":[{"uri":"https://example.com/page-1","title":"Page One","description":"First page description","text":"Full text content of page one.","content_type":"text/html"},{"uri":"https://example.com/page-2","title":"Page Two","description":"Second page description","text":"Full text content of page two.","content_type":"text/html","category":"Docs","timestamp":1741392000}]}'

Method 2: cURL — JSON File Upload

Save your payload as a .json file and upload it via the payload_file form field. The file can contain the full payload (with email, api_key, core_name, and documents), or just the documents array (with auth fields as separate form fields).

# Full payload in the file (recommended for large batches):
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "payload_file=@my_documents.json"

# Or auth as form fields, documents in the file:
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "email=you@example.com" \
  -F "api_key=YOUR_API_KEY" \
  -F "core_name=my_index" \
  -F "payload_file=@my_documents.json"

PHP Example

// Method 1: JSON body
$payload = [
    'email'     => 'you@example.com',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_index',
    'documents' => [
        [
            'uri'         => 'https://example.com/page-1',
            'title'       => 'Page One',
            'description' => 'First page description',
            'text'        => 'Full text content of page one.',
            'content_type'=> 'text/html',
        ],
        [
            'uri'         => 'https://example.com/page-2',
            'title'       => 'Page Two',
            'description' => 'Second page description',
            'text'        => 'Full text content of page two.',
            'content_type'=> 'text/html',
            'category'    => 'Docs',
            'timestamp'   => 1741392000,
        ],
    ],
];

$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($payload),
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

if ($response['status']) {
    echo "Queued! Job ID: " . $response['job_id'] . "\n";
    echo "Doc IDs: " . implode(', ', $response['doc_ids']) . "\n";
} else {
    echo "Error: " . $response['msg'] . "\n";
    if (!empty($response['errors'])) {
        foreach ($response['errors'] as $err) echo "  - $err\n";
    }
}

// Method 2: File upload
$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => [
        'payload_file' => new CURLFile('/path/to/my_documents.json', 'application/json'),
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

Python Example

import requests, json

# Method 1: JSON body
payload = {
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
        {
            "uri": "https://example.com/page-1",
            "title": "Page One",
            "description": "First page description",
            "text": "Full text content of page one.",
            "content_type": "text/html",
        },
        {
            "uri": "https://example.com/page-2",
            "title": "Page Two",
            "description": "Second page description",
            "text": "Full text content of page two.",
            "content_type": "text/html",
            "category": "Docs",
            "timestamp": 1741392000,
        },
    ],
}

resp = requests.post(
    "https://api.opensolr.com/solr_manager/api/ingest",
    json=payload,
)
data = resp.json()

if data["status"]:
    print(f"Queued! Job ID: {data['job_id']}")
    print(f"Doc IDs: {data['doc_ids']}")
else:
    print(f"Error: {data['msg']}")
    for err in data.get("errors", []):
        print(f"  - {err}")

# Method 2: File upload
with open("my_documents.json", "rb") as f:
    resp = requests.post(
        "https://api.opensolr.com/solr_manager/api/ingest",
        files={"payload_file": ("payload.json", f, "application/json")},
    )
    print(resp.json())

# Check job status
status = requests.get("https://api.opensolr.com/solr_manager/api/ingest_status", params={
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "job_id": data["job_id"],
}).json()
print(f"State: {status['job']['state_label']}, Progress: {status['job']['processed_docs']}/{status['job']['total_docs']}")

Limits & Quotas

  • 50 documents per request
  • 30 requests/minute — API rate limit (all endpoints)
  • 500 requests/hour — API rate limit (all endpoints)

Your index disk quota is also enforced — if your index is at or near its size limit, the ingestion request will be rejected with a clear error showing your current usage vs. maximum allowed.
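With a hard cap of 50 documents per request, large exports need to be split into batches before submission. A minimal sketch of that chunking step (the helper name is ours, not part of the API):

```python
from typing import Iterator

MAX_BATCH = 50  # documents per /ingest request


def batches(docs: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Yield successive batches of at most `size` documents,
    so no single request exceeds the per-request limit."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]
```

Remember the rate limits as well: at 30 requests/minute you may want to pause between batches when pushing very large exports.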

Error Responses

ERROR_AUTHENTICATION_FAILED: Invalid email or api_key.

ERROR_NOT_CORE_OWNER: You don't own this index.

ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX: Reduce your batch to 50 or fewer documents.

ERROR_DISK_QUOTA_EXCEEDED: Your index is at its size limit. Upgrade your plan or delete old documents.

VALIDATION_ERRORS: One or more documents failed validation. Check the errors array in the response.

DUPLICATE_DOCUMENTS: One or more documents have a URI that is already queued for processing in this index. Wait for the pending job to complete or cancel it before resubmitting.

ERROR_PAYLOAD_TOO_LARGE: Request body exceeds server limits.

ERROR_CORE_NOT_WEBCRAWLER_ENABLED: This index is not on a Web Crawler server. Data Ingestion is only available for Web Crawler indexes. See Getting Started to create one.


Plain text only. The text field should contain clean plain text, not HTML. If you are exporting from a CMS like Drupal or WordPress, strip all HTML tags before sending. The title and description should also be plain text.
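If you need to strip HTML before sending, a stdlib-only sketch using `html.parser` is shown below; for production CMS exports a dedicated HTML-cleaning library may be more robust (this extractor is an illustration, not part of the Opensolr API):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the plain-text content of an HTML fragment,
    with whitespace collapsed to single spaces."""
    p = _TextExtractor()
    p.feed(html)
    p.close()
    return " ".join(" ".join(p.parts).split())
```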

Related Documentation

Web Crawler Overview

Complete guide to the Opensolr Web Crawler: crawl modes, features, live demos, analytics, and more.

Getting Started

Step-by-step setup: create an account, pick a Web Crawler server, start crawling your site.

Index Field Reference

Complete reference for every field in a Web Crawler index — what each field stores and how it is used in search.

Crawler Control API

Start, stop, pause, resume the crawler, check stats, and flush the crawl buffer — all via REST API.

Querying the Solr API

Search parameters, filtering, facets, pagination, and sorting — everything you need to query your index.

Manage Your Ingestion Queue

View all your ingestion jobs, monitor progress, pause, resume, retry, or delete queued jobs. Click any Job ID to see full details including payload and errors. Use Run Now per index or per job to trigger immediate processing without waiting for the cron cycle.

Need higher rate limits or larger batch sizes for your integration? We can set custom thresholds.

Contact Us