Data Ingestion API — Push Documents to Your Opensolr Index Programmatically

API Endpoint

Data Ingestion API

Push documents directly into any Opensolr Web Crawler index from your CMS, application, or data pipeline. Available exclusively for indexes on Web Crawler servers. Each document is automatically enriched with vector embeddings, sentiment analysis, language detection, and all the derived search fields — identical to what the Web Crawler produces. Works alongside the crawler or as a standalone ingestion method.

Web Crawler indexes only. The Data Ingestion API is available exclusively for indexes created on Web Crawler servers. If your index is not on a Web Crawler server, you will receive an ERROR_CORE_NOT_WEBCRAWLER_ENABLED error. To create a Web Crawler index, see Getting Started.
IP Restriction: If your index has IP restrictions enabled (under Security in the Index Control Panel), you must add the following IPs to the allowed list for the /update request handler — otherwise ingestion writes will be blocked with a 403 error:
  • 5.161.242.87 — api.opensolr.com (processes and writes documents)
  • 148.251.180.234 — opensolr.com (proxies requests)
Required Config Set: Data Ingestion requires the Web Crawler config set on your index. For Solr 9, download and apply the mandatory config set for Solr 9 (schema.xml, solrconfig.xml, search.xml + supporting text files). Upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

How It Works

Your App → API Gateway → Job Queue → Enrichment → Solr Index

1. Your App — POST JSON docs.
2. API Gateway — auth, rate limiting, quota check, validation.
3. Job Queue — queued for async processing.
4. Enrichment — embeddings, sentiment, language, derived fields.
5. Solr Index — documents become searchable.

You submit a batch of up to 50 documents per request. The API validates your payload, checks your disk quota, and queues the job. A background processor then enriches each document and pushes it into your Solr index. You get a job_id immediately so you can poll for progress.
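The submit-then-poll workflow above can be sketched as a small helper. This is a sketch, not official client code: `get_status` stands in for whatever function you write to call the `ingest_status` endpoint and return its `job` object, and the `"failed"` terminal label is an assumption (the docs only show `"completed"`).

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=300.0):
    """Poll an ingestion job until it reaches a terminal state.

    get_status() should return the "job" object from the
    ingest_status endpoint, e.g. {"state_label": "completed", ...}.
    The "failed" label is assumed here; adjust to the labels
    your jobs actually report.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status()
        if job["state_label"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("ingestion job did not finish in time")
```

In practice `get_status` would wrap a `GET` to `/solr_manager/api/ingest_status` with your `email`, `api_key`, and the `job_id` returned by the submit call.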

Endpoint Reference

Submit Documents

POST https://api.opensolr.com/solr_manager/api/ingest

// Parameters (form-encoded or JSON body):
email       = your@email.com
api_key     = your_api_key
core_name   = your_index_name
documents   = [JSON array of document objects]

Response:

{
  "status": true,
  "msg": "QUEUED",
  "job_id": "a1b2c3d4e5f6...",
  "total_docs": 25,
  "doc_ids": ["e8392e28...", "a3f1c9d0..."]
}

Check Job Status

GET https://api.opensolr.com/solr_manager/api/ingest_status

email       = your@email.com
api_key     = your_api_key
job_id      = a1b2c3d4e5f6...

Response:

{
  "status": true,
  "job": {
    "state_label": "completed",
    "total_docs": 25,
    "processed_docs": 25,
    "success_docs": 24,
    "failed_docs": 1,
    "result": [...per-doc details...]
  }
}

Document Fields

uri (URL, required): Canonical URL that uniquely identifies this document. Must be a valid http or https URL. The document id is always generated as md5(uri). Submitting a document with the same URI will update the existing document. Also used as the deduplication key — duplicate URIs in pending jobs are rejected.

title (string, required): Document title. Always required.

description (string, required): Short description or summary.

text (string, required*): The main body text. Do not send HTML. Required for standard documents. When using rtf:true, this field is optional and will be auto-populated from the extracted document content.

id (string, auto-generated): Always generated as md5(uri). You do not set this — it is derived from the uri you provide. The same URI always produces the same ID, which is how deduplication and updates work. Returned in the API response as doc_ids.

timestamp (int or date string, recommended): Content publication date. Unix epoch (1709913600) or parseable date string (2024-03-08 14:00:00). Used to derive creation_date and meta_creation_date for freshness boost in search.

og_image (URL, optional): Thumbnail image URL. Shown in search results.

meta_icon (URL, optional): Favicon URL for your site.

meta_og_locale (string, optional): Locale code (e.g. en_US, de_DE). Used in search UI locale filtering.

meta_detected_language (string, optional): Language code (e.g. en, fr). If not provided, auto-detected from title + description.

category (string, optional): Content category for faceted filtering.

content_type (string, optional): MIME type of the document (e.g. text/html, application/pdf). Defaults to text/html if not provided. This field controls how the document appears in the Search UI — text/html shows as a web result, while other types (PDF, DOCX, etc.) show in the Media/Docs tab. When using rtf:true, the MIME type is auto-detected from the actual file content, so you don't need to set it manually.

rtf (boolean, optional): Set to true to enable plain text extraction from a remote document. When enabled, the system fetches the file at uri and extracts its plain text content, which is then used to populate only the text field. All other fields (title, description, etc.) must still be provided by you. Supported formats: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, CSV, and plain text files. When rtf:true is set, the text field becomes optional (it will be filled automatically from the extracted content). If the extraction fails, the job will report a detailed error.

meta_domain (string, optional): Source domain. Auto-derived from uri if not provided.

price_f (float, optional): Product price as a decimal number (e.g. 29.99). For e-commerce indexes only. Enables price filtering, sorting, and range facets in search results. If not provided, the document is simply excluded from price filters.

currency_s (string, optional): Currency code (e.g. USD, EUR). Used alongside price_f for price display in search results. Only needed if price_f is set.

* text is required for standard documents but optional when rtf:true is set (text is extracted automatically from the document URL). The four absolutely mandatory fields are: uri, title, description, and text (unless rtf:true).
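Because the id is always md5(uri), you can compute it client-side to correlate your own records with the doc_ids the API returns. A minimal sketch, assuming the hash is the hex digest of the URI string exactly as submitted (no server-side normalization):

```python
import hashlib


def opensolr_doc_id(uri: str) -> str:
    # md5 hex digest of the URI string; assumes no extra
    # normalization is applied before hashing.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()
```

The same URI always yields the same 32-character hex id, which is how updates and deduplication work.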

Auto-Generated Fields

The following fields are generated automatically for every document — you never need to send them:

tags & title_tags

Edge n-gram fields for autocomplete and fuzzy matching. Built from title + description + text.

embeddings

1024-dimensional BGE-m3 vector embeddings of title + description for semantic / hybrid search.

sentiment

VADER sentiment scores: positive, negative, neutral, and compound. Computed from title + description.

language

Auto-detected from title + description using langid if you don't provide meta_detected_language.

spell

Spellcheck field for "did you mean?" suggestions. Built from tags.

phonetic_*

Phonetic title, description, and text for sounds-like matching across languages.

Using It with the Web Crawler

Works in tandem. The Data Ingestion API and the Web Crawler share the same index and the same document schema. You can use the crawler to index your public website, and the API to push content that the crawler can't reach — like gated pages, internal databases, CMS drafts, or product feeds. Documents from both sources coexist seamlessly in the same index.

Update Existing Documents

Sending a document with the same uri as one already in the index will completely replace it, since the ID is always md5(uri). This is how you keep your index in sync with your CMS.

⚠ Important: Updates are full document replacements, not partial. When you send a document with the same uri as an existing one, the entire previous document is deleted and replaced with the new one. Any fields you do not include in the updated document will be lost. Always send the complete document with all fields, even if you only need to change one or two of them.
// The crawler indexed https://example.com/products/widget-a
// Submit the same uri via API to update it (id = md5(uri) is generated automatically):
{
  "title": "Widget A — Updated Price",
  "description": "Now only $29.99",
  "text": "Full updated product description...",
  "uri": "https://example.com/products/widget-a",
  "content_type": "text/html",
  "timestamp": 1741392000
}

The document ID is always generated as md5(uri) — which is the same ID the Web Crawler generates. So submitting a document with the same uri that the crawler indexed will naturally update it. You never need to know or manage document IDs manually.
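Since updates are full replacements, a safe pattern is to keep (or fetch) the complete current document, merge your partial changes into it, and resubmit the whole thing. A sketch of that merge step, with a hypothetical `merged_update` helper (how you fetch the existing document, e.g. via a Solr query, is up to you):

```python
def merged_update(existing: dict, changes: dict) -> dict:
    """Build a complete replacement document.

    'existing' is the full document currently in the index;
    'changes' contains only the fields you want to modify.
    Keep the uri unchanged so the id (md5(uri)) matches the
    existing document and the update replaces it.
    """
    doc = {**existing, **changes}
    # Guard against accidentally dropping mandatory fields
    # (text may be omitted only when rtf:true is used).
    for field in ("uri", "title", "description"):
        if field not in doc:
            raise ValueError(f"replacement doc is missing required field: {field}")
    return doc
```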

Submission Methods

You can submit documents in two ways: as a JSON body in the POST request, or as a JSON file upload. Both methods support the same payload format. File upload is ideal for large batches or when integrating from systems that generate JSON export files.

Method 1: cURL — JSON Body

curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"email":"you@example.com","api_key":"YOUR_API_KEY","core_name":"my_index","documents":[{"uri":"https://example.com/page-1","title":"Page One","description":"First page description","text":"Full text content of page one.","content_type":"text/html"},{"uri":"https://example.com/page-2","title":"Page Two","description":"Second page description","text":"Full text content of page two.","content_type":"text/html","category":"Docs","timestamp":1741392000}]}'

Method 2: cURL — JSON File Upload

Save your payload as a .json file and upload it via the payload_file form field. The file can contain the full payload (with email, api_key, core_name, and documents), or just the documents array (with auth fields as separate form fields).

# Full payload in the file (recommended for large batches):
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "payload_file=@my_documents.json"

# Or auth as form fields, documents in the file:
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "email=you@example.com" \
  -F "api_key=YOUR_API_KEY" \
  -F "core_name=my_index" \
  -F "payload_file=@my_documents.json"

PHP Example

// Method 1: JSON body
$payload = [
    'email'     => 'you@example.com',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_index',
    'documents' => [
        [
            'uri'         => 'https://example.com/page-1',
            'title'       => 'Page One',
            'description' => 'First page description',
            'text'        => 'Full text content of page one.',
            'content_type'=> 'text/html',
        ],
        [
            'uri'         => 'https://example.com/page-2',
            'title'       => 'Page Two',
            'description' => 'Second page description',
            'text'        => 'Full text content of page two.',
            'content_type'=> 'text/html',
            'category'    => 'Docs',
            'timestamp'   => 1741392000,
        ],
    ],
];

$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($payload),
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

if ($response['status']) {
    echo "Queued! Job ID: " . $response['job_id'] . "\n";
    echo "Doc IDs: " . implode(', ', $response['doc_ids']) . "\n";
} else {
    echo "Error: " . $response['msg'] . "\n";
    if (!empty($response['errors'])) {
        foreach ($response['errors'] as $err) echo "  - $err\n";
    }
}

// Method 2: File upload
$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => [
        'payload_file' => new CURLFile('/path/to/my_documents.json', 'application/json'),
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

Python Example

import requests, json

# Method 1: JSON body
payload = {
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
        {
            "uri": "https://example.com/page-1",
            "title": "Page One",
            "description": "First page description",
            "text": "Full text content of page one.",
            "content_type": "text/html",
        },
        {
            "uri": "https://example.com/page-2",
            "title": "Page Two",
            "description": "Second page description",
            "text": "Full text content of page two.",
            "content_type": "text/html",
            "category": "Docs",
            "timestamp": 1741392000,
        },
    ],
}

resp = requests.post(
    "https://api.opensolr.com/solr_manager/api/ingest",
    json=payload,
)
data = resp.json()

if data["status"]:
    print(f"Queued! Job ID: {data['job_id']}")
    print(f"Doc IDs: {data['doc_ids']}")
else:
    print(f"Error: {data['msg']}")
    for err in data.get("errors", []):
        print(f"  - {err}")

# Method 2: File upload
with open("my_documents.json", "rb") as f:
    resp = requests.post(
        "https://api.opensolr.com/solr_manager/api/ingest",
        files={"payload_file": ("payload.json", f, "application/json")},
    )
    print(resp.json())

# Check job status
status = requests.get("https://api.opensolr.com/solr_manager/api/ingest_status", params={
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "job_id": data["job_id"],
}).json()
print(f"State: {status['job']['state_label']}, Progress: {status['job']['processed_docs']}/{status['job']['total_docs']}")

Limits & Quotas

  • 50 documents per request
  • 30 requests/minute — API rate limit (all endpoints)
  • 500 requests/hour — API rate limit (all endpoints)

Your index disk quota is also enforced — if your index is at or near its size limit, the ingestion request will be rejected with a clear error showing your current usage vs. maximum allowed.
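With a hard cap of 50 documents per request, large exports need to be split into batches before submission. A minimal sketch of that chunking step (the helper name is ours, not part of the API):

```python
from typing import Iterator

MAX_BATCH = 50  # documents per /ingest request


def batches(docs: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Yield successive batches of at most `size` documents,
    so no single request exceeds the per-request limit."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]
```

Remember the rate limits as well: at 30 requests/minute you may want to pause between batches when pushing very large exports.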

Error Responses

ERROR_AUTHENTICATION_FAILED: Invalid email or api_key.

ERROR_NOT_CORE_OWNER: You don't own this index.

ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX: Reduce your batch to 50 or fewer documents.

ERROR_DISK_QUOTA_EXCEEDED: Your index is at its size limit. Upgrade your plan or delete old documents.

VALIDATION_ERRORS: One or more documents failed validation. Check the errors array in the response.

DUPLICATE_DOCUMENTS: One or more documents have a URI that is already queued for processing in this index. Wait for the pending job to complete or cancel it before resubmitting.

ERROR_PAYLOAD_TOO_LARGE: Request body exceeds server limits.

ERROR_CORE_NOT_WEBCRAWLER_ENABLED: This index is not on a Web Crawler server. Data Ingestion is only available for Web Crawler indexes. See Getting Started to create one.


Plain text only. The text field should contain clean plain text, not HTML. If you are exporting from a CMS like Drupal or WordPress, strip all HTML tags before sending. The title and description should also be plain text.
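If you need to strip HTML before sending, a stdlib-only sketch using `html.parser` is shown below; for production CMS exports a dedicated HTML-cleaning library may be more robust (this extractor is an illustration, not part of the Opensolr API):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the plain-text content of an HTML fragment,
    with whitespace collapsed to single spaces."""
    p = _TextExtractor()
    p.feed(html)
    p.close()
    return " ".join(" ".join(p.parts).split())
```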

Related Documentation

Web Crawler Overview

Complete guide to the Opensolr Web Crawler: crawl modes, features, live demos, analytics, and more.

Getting Started

Step-by-step setup: create an account, pick a Web Crawler server, start crawling your site.

Index Field Reference

Complete reference for every field in a Web Crawler index — what each field stores and how it is used in search.

Crawler Control API

Start, stop, pause, resume the crawler, check stats, and flush the crawl buffer — all via REST API.

Querying the Solr API

Search parameters, filtering, facets, pagination, and sorting — everything you need to query your index.

Manage Your Ingestion Queue

View all your ingestion jobs, monitor progress, pause, resume, retry, or delete queued jobs. Click any Job ID to see full details including payload and errors. Use Run Now per index or per job to trigger immediate processing without waiting for the cron cycle.

Need higher rate limits or larger batch sizes for your integration? We can set custom thresholds.

Contact Us