API: Data Ingestion

Data Ingestion API — Push Documents to Your Opensolr Index...

API Endpoint

Data Ingestion API

Push documents directly into any Opensolr Web Crawler index from your CMS, application, or data pipeline. Available exclusively for indexes on Web Crawler servers. Each document is automatically enriched with vector embeddings, sentiment analysis, language detection, and all the derived search fields — identical to what the Web Crawler produces. Works alongside the crawler or as a standalone ingestion method.

Web Crawler indexes only. The Data Ingestion API is available exclusively for indexes created on Web Crawler servers. If your index is not on a Web Crawler server, you will receive an ERROR_CORE_NOT_WEBCRAWLER_ENABLED error. To create a Web Crawler index, see Getting Started.
IP Restriction: If your index has IP restrictions enabled (under Security in the Index Control Panel), you must add the following IPs to the allowed list for the /update request handler — otherwise ingestion writes will be blocked with a 403 error:
  • 5.161.242.87 — api.opensolr.com (processes and writes documents)
  • 148.251.180.234 — opensolr.com (proxies requests)
Required Config Set: Data Ingestion requires the Web Crawler config set on your index. For Solr 9, download and apply the mandatory config set for Solr 9 (schema.xml, solrconfig.xml, search.xml + supporting text files). Upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

How It Works

Your App (POST JSON docs) → API Gateway (auth, rate limit, quota check, validate) → Job Queue (queued for async processing) → Enrichment (embeddings, sentiment, language, derived fields) → Solr Index (documents searchable)

You submit a batch of up to 50 documents per request. The API validates your payload, checks your disk quota, and queues the job. A background processor then enriches each document and pushes it into your Solr index. You get a job_id immediately so you can poll for progress.

Endpoint Reference

Submit Documents

POST https://api.opensolr.com/solr_manager/api/ingest

// Parameters (form-encoded or JSON body):
email       = your@email.com
api_key     = your_api_key
core_name   = your_index_name
documents   = [JSON array of document objects]

Response:

{
  "status": true,
  "msg": "QUEUED",
  "job_id": "a1b2c3d4e5f6...",
  "total_docs": 25,
  "doc_ids": ["e8392e28...", "a3f1c9d0..."]
}

Check Job Status

GET https://api.opensolr.com/solr_manager/api/ingest_status

email       = your@email.com
api_key     = your_api_key
job_id      = a1b2c3d4e5f6...

Response:

{
  "status": true,
  "job": {
    "state_label": "completed",
    "total_docs": 25,
    "processed_docs": 25,
    "success_docs": 24,
    "failed_docs": 1,
    "result": [...per-doc details...]
  }
}

Document Fields

Field reference (field, type, status, description):

  • uri (URL, required): Canonical URL that uniquely identifies this document. Must be a valid http or https URL. The document id is always generated as md5(uri), so submitting a document with the same URI updates the existing document. Also used as the deduplication key: duplicate URIs in pending jobs are rejected.
  • title (string, required): Document title. Always required.
  • description (string, required): Short description or summary.
  • text (string, required*): The main body text. Do not send HTML. Required for standard documents. When using rtf:true, this field is optional and is auto-populated from the extracted document content.
  • id (string, auto-generated): Always generated as md5(uri). You do not set this; it is derived from the uri you provide. The same URI always produces the same ID, which is how deduplication and updates work. Returned in the API response as doc_ids.
  • timestamp (int or date string, recommended): Content publication date. Unix epoch (1709913600) or a parseable date string (2024-03-08 14:00:00). Used to derive creation_date and meta_creation_date for freshness boost in search.
  • og_image (URL, optional): Thumbnail image URL. Shown in search results.
  • meta_icon (URL, optional): Favicon URL for your site.
  • meta_og_locale (string, optional): Locale code (e.g. en_US, de_DE). Used in search UI locale filtering.
  • meta_detected_language (string, optional): Language code (e.g. en, fr). If not provided, auto-detected from title + description.
  • category (string, optional): Content category for faceted filtering.
  • content_type (string, optional): MIME type of the document (e.g. text/html, application/pdf). Defaults to text/html if not provided. Controls how the document appears in the Search UI: text/html shows as a web result, while other types (PDF, DOCX, etc.) show in the Media/Docs tab. When using rtf:true, the MIME type is auto-detected from the actual file content, so you don't need to set it manually.
  • rtf (boolean, optional): Set to true to enable plain-text extraction from a remote document. When enabled, the system fetches the file at uri and extracts its plain-text content, which populates only the text field; all other fields (title, description, etc.) must still be provided by you. Supported formats: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, CSV, and plain text files. With rtf:true the text field becomes optional (it is filled automatically from the extracted content). If extraction fails, the job reports a detailed error.
  • meta_domain (string, optional): Source domain. Auto-derived from uri if not provided.
  • price_f (float, optional): Product price as a decimal number (e.g. 29.99). For e-commerce indexes only. Enables price filtering, sorting, and range facets in search results. If not provided, the document is simply excluded from price filters.
  • currency_s (string, optional): Currency code (e.g. USD, EUR). Used alongside price_f for price display in search results. Only needed if price_f is set.

* text is required for standard documents but optional when rtf:true is set (text is extracted automatically from the document URL). The four absolutely mandatory fields are: uri, title, description, and text (unless rtf:true).
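Because the id is always md5(uri), you can compute document IDs locally before submitting, for example to correlate the returned doc_ids with your own records. A minimal Python sketch, assuming the URI is hashed as its UTF-8 bytes:

```python
import hashlib

def opensolr_doc_id(uri: str) -> str:
    # Mirror the documented id rule: id = md5(uri), hex-encoded.
    # Encoding as UTF-8 is an assumption for non-ASCII URIs.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

# The same URI always yields the same id, which is how updates
# and deduplication work.
doc_id = opensolr_doc_id("https://example.com/products/widget-a")
# doc_id is a 32-character lowercase hex string
```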

Auto-Generated Fields

The following fields are generated automatically for every document — you never need to send them:

  • tags & title_tags: Edge n-gram fields for autocomplete and fuzzy matching. Built from title + description + text.
  • embeddings: 1024-dimensional BGE-m3 vector embeddings of title + description for semantic / hybrid search.
  • sentiment: VADER sentiment scores (positive, negative, neutral, and compound). Computed from title + description.
  • language: Auto-detected from title + description using langid if you don't provide meta_detected_language.
  • spell: Spellcheck field for "did you mean?" suggestions. Built from tags.
  • phonetic_*: Phonetic title, description, and text for sounds-like matching across languages.

Using It with the Web Crawler

Works in tandem. The Data Ingestion API and the Web Crawler share the same index and the same document schema. You can use the crawler to index your public website, and the API to push content that the crawler can't reach — like gated pages, internal databases, CMS drafts, or product feeds. Documents from both sources coexist seamlessly in the same index.

Update Existing Documents

Sending a document with the same uri as one already in the index will completely replace it, since the ID is always md5(uri). This is how you keep your index in sync with your CMS.

⚠ Important: Updates are full document replacements, not partial. When you send a document with the same uri as an existing one, the entire previous document is deleted and replaced with the new one. Any fields you do not include in the updated document will be lost. Always send the complete document with all fields, even if you only need to change one or two of them.
// The crawler indexed https://example.com/products/widget-a
// Submit the same uri via API to update it (id = md5(uri) is generated automatically):
{
  "title": "Widget A — Updated Price",
  "description": "Now only $29.99",
  "text": "Full updated product description...",
  "uri": "https://example.com/products/widget-a",
  "content_type": "text/html",
  "timestamp": 1741392000
}

The document ID is always generated as md5(uri) — which is the same ID the Web Crawler generates. So submitting a document with the same uri that the crawler indexed will naturally update it. You never need to know or manage document IDs manually.
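Because updates are full replacements, a safe pattern is to keep the complete document in your own store and merge any partial change into it before resubmitting. A minimal sketch (the plain-dict "store" is purely illustrative):

```python
def prepare_update(stored_doc: dict, changes: dict) -> dict:
    # Start from the complete stored document and apply only the
    # changed fields, so the resubmitted payload never drops fields
    # (Opensolr replaces the whole document on update).
    updated = dict(stored_doc)
    updated.update(changes)
    return updated

stored = {
    "uri": "https://example.com/products/widget-a",
    "title": "Widget A",
    "description": "A fine widget",
    "text": "Full product description...",
    "content_type": "text/html",
}
doc = prepare_update(stored, {"title": "Widget A — Updated Price",
                              "description": "Now only $29.99"})
# doc still contains text and content_type, so nothing is lost on update.
```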

Submission Methods

You can submit documents in two ways: as a JSON body in the POST request, or as a JSON file upload. Both methods support the same payload format. File upload is ideal for large batches or when integrating from systems that generate JSON export files.

Method 1: cURL — JSON Body

curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
      {
        "uri": "https://example.com/page-1",
        "title": "Page One",
        "description": "First page description",
        "text": "Full text content of page one.",
        "content_type": "text/html"
      },
      {
        "uri": "https://example.com/page-2",
        "title": "Page Two",
        "description": "Second page description",
        "text": "Full text content of page two.",
        "content_type": "text/html",
        "category": "Docs",
        "timestamp": 1741392000
      }
    ]
  }'

Method 2: cURL — JSON File Upload

Save your payload as a .json file and upload it via the payload_file form field. The file can contain the full payload (with email, api_key, core_name, and documents), or just the documents array (with auth fields as separate form fields).

# Full payload in the file (recommended for large batches):
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "payload_file=@my_documents.json"

# Or auth as form fields, documents in the file:
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "email=you@example.com" \
  -F "api_key=YOUR_API_KEY" \
  -F "core_name=my_index" \
  -F "payload_file=@my_documents.json"

PHP Example

// Method 1: JSON body
$payload = [
    'email'     => 'you@example.com',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_index',
    'documents' => [
        [
            'uri'         => 'https://example.com/page-1',
            'title'       => 'Page One',
            'description' => 'First page description',
            'text'        => 'Full text content of page one.',
            'content_type'=> 'text/html',
        ],
        [
            'uri'         => 'https://example.com/page-2',
            'title'       => 'Page Two',
            'description' => 'Second page description',
            'text'        => 'Full text content of page two.',
            'content_type'=> 'text/html',
            'category'    => 'Docs',
            'timestamp'   => 1741392000,
        ],
    ],
];

$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($payload),
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

if ($response['status']) {
    echo "Queued! Job ID: " . $response['job_id'] . "\n";
    echo "Doc IDs: " . implode(', ', $response['doc_ids']) . "\n";
} else {
    echo "Error: " . $response['msg'] . "\n";
    if (!empty($response['errors'])) {
        foreach ($response['errors'] as $err) echo "  - $err\n";
    }
}

// Method 2: File upload
$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => [
        'payload_file' => new CURLFile('/path/to/my_documents.json', 'application/json'),
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

Python Example

import requests, json

# Method 1: JSON body
payload = {
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
        {
            "uri": "https://example.com/page-1",
            "title": "Page One",
            "description": "First page description",
            "text": "Full text content of page one.",
            "content_type": "text/html",
        },
        {
            "uri": "https://example.com/page-2",
            "title": "Page Two",
            "description": "Second page description",
            "text": "Full text content of page two.",
            "content_type": "text/html",
            "category": "Docs",
            "timestamp": 1741392000,
        },
    ],
}

resp = requests.post(
    "https://api.opensolr.com/solr_manager/api/ingest",
    json=payload,
)
data = resp.json()

if data["status"]:
    print(f"Queued! Job ID: {data['job_id']}")
    print(f"Doc IDs: {data['doc_ids']}")
else:
    print(f"Error: {data['msg']}")
    for err in data.get("errors", []):
        print(f"  - {err}")

# Method 2: File upload
with open("my_documents.json", "rb") as f:
    resp = requests.post(
        "https://api.opensolr.com/solr_manager/api/ingest",
        files={"payload_file": ("payload.json", f, "application/json")},
    )
    print(resp.json())

# Check job status
status = requests.get("https://api.opensolr.com/solr_manager/api/ingest_status", params={
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "job_id": data["job_id"],
}).json()
print(f"State: {status['job']['state_label']}, Progress: {status['job']['processed_docs']}/{status['job']['total_docs']}")
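Since jobs are processed asynchronously, a small polling helper is often useful. The sketch below takes a fetch_status callable (in real use, a function wrapping the GET ingest_status request shown above) so it can run without the network; the terminal states are taken from the queue documentation (completed, failed, stopped):

```python
import time

TERMINAL_STATES = {"completed", "failed", "stopped"}

def wait_for_job(fetch_status, interval=5.0, timeout=600.0):
    # Poll until the job reaches a terminal state or the timeout expires.
    # fetch_status() must return the parsed ingest_status JSON.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()["job"]
        if job["state_label"] in TERMINAL_STATES:
            return job
        time.sleep(interval)
    raise TimeoutError("ingestion job did not finish in time")
```

In real use, fetch_status would be something like `lambda: requests.get(STATUS_URL, params={"email": ..., "api_key": ..., "job_id": ...}).json()`.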

Limits & Quotas

  • 50 documents per request (maximum batch size)
  • 30 requests/minute (API rate limit, all endpoints)
  • 500 requests/hour (API rate limit, all endpoints)

Your index disk quota is also enforced — if your index is at or near its size limit, the ingestion request will be rejected with a clear error showing your current usage vs. maximum allowed.
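With a 50-document batch cap and a 30 requests/minute rate limit, large exports need to be chunked and paced. A minimal sketch (the 2-second spacing is simply 60 s / 30 requests; submit_batch is a placeholder for your actual POST to the ingest endpoint):

```python
import time

MAX_BATCH = 50           # documents per request (documented limit)
MIN_SPACING = 60 / 30    # seconds between requests to stay under 30/min

def chunked(docs, size=MAX_BATCH):
    # Split a document list into batches of at most `size`.
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def ingest_all(docs, submit_batch):
    # Submit every batch, sleeping between requests to respect the rate limit.
    job_ids = []
    for i, batch in enumerate(chunked(docs)):
        if i:
            time.sleep(MIN_SPACING)
        job_ids.append(submit_batch(batch)["job_id"])
    return job_ids
```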

Error Responses

  • ERROR_AUTHENTICATION_FAILED: Invalid email or api_key.
  • ERROR_NOT_CORE_OWNER: You don't own this index.
  • ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX: Reduce your batch to 50 or fewer documents.
  • ERROR_DISK_QUOTA_EXCEEDED: Your index is at its size limit. Upgrade your plan or delete old documents.
  • VALIDATION_ERRORS: One or more documents failed validation. Check the errors array in the response.
  • DUPLICATE_DOCUMENTS: One or more documents have a URI that is already queued for processing in this index. Wait for the pending job to complete or cancel it before resubmitting.
  • ERROR_PAYLOAD_TOO_LARGE: Request body exceeds server limits.
  • ERROR_CORE_NOT_WEBCRAWLER_ENABLED: This index is not on a Web Crawler server. Data Ingestion is only available for Web Crawler indexes. Create a Web Crawler index →

Plain text only. The text field should contain clean plain text, not HTML. If you are exporting from a CMS like Drupal or WordPress, strip all HTML tags before sending. The title and description should also be plain text.
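If your source content is HTML, strip the markup before filling text. A minimal sketch using only the Python standard library (a messy real-world CMS export may need a fuller parser):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Collect text nodes, skipping <script> and <style> contents.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    # Extract text and collapse runs of whitespace.
    p = _TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

strip_html("<p>Hello <b>world</b><script>var x;</script></p>")
# → "Hello world"
```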

Related Documentation

Web Crawler Overview

Complete guide to the Opensolr Web Crawler: crawl modes, features, live demos, analytics, and more.

Getting Started

Step-by-step setup: create an account, pick a Web Crawler server, start crawling your site.

Index Field Reference

Complete reference for every field in a Web Crawler index — what each field stores and how it is used in search.

Crawler Control API

Start, stop, pause, resume the crawler, check stats, and flush the crawl buffer — all via REST API.

Querying the Solr API

Search parameters, filtering, facets, pagination, and sorting — everything you need to query your index.

Manage Your Ingestion Queue

View all your ingestion jobs, monitor progress, pause, resume, retry, or delete queued jobs. Click any Job ID to see full details including payload and errors. Use Run Now per index or per job to trigger immediate processing without waiting for the cron cycle.

Need higher rate limits or larger batch sizes for your integration? We can set custom thresholds.

Contact Us

Data Ingestion Queue — Monitor, Pause, Resume, Retry and M...

Account Feature

Ingestion Queue Management

Every document you submit through the Data Ingestion API goes into a processing queue. The Ingestion Queue page in your account gives you full control: monitor progress in real time, pause jobs mid-processing, resume them later, retry failed or completed jobs, edit payloads to fix errors, or delete jobs you no longer need. Only indexes with active or recent jobs appear — no clutter.

How to Access

Log in to your Opensolr account at opensolr.com
Click Account in the top navigation bar
Select Data Ingestion from the dropdown

Or navigate directly to /admin/solr_manager/my_ingestion_queue

What You See

Grouped by Index

Jobs are grouped by index name. Only indexes that have jobs in the queue are shown — if an index has no pending or recent jobs, it won’t appear.

Live Progress

Each job shows a progress bar with the count of processed, successful, and failed documents. The page auto-refreshes every 10 seconds while jobs are processing.

Pause & Resume

Pause a job mid-processing. It remembers where it left off and resumes from the exact same document when you continue.

Retry Jobs

Re-run completed, failed, or stopped jobs from scratch. Retry resets progress to zero and re-processes every document in the payload.

Edit Payload

Fix errors directly in the queue. Click any failed, stopped, or paused job to open its detail view, edit the JSON payload in place, save it, then retry. No need to re-submit via the API.

Error Details

Failed jobs show the error message. Completed jobs with partial failures show per-document success and error counts.

Queue Interface

my_products_index: 1 processing, 1 failed, 2 completed

Job ID        Status      Progress  Docs   Actions
a1b2c3d4...   processing  60%       30/50  Pause
f9e8d7c6...   failed      40%       20/50  Retry, Delete
e5f6a7b8...   completed   100%      50/50  Retry, Delete

Job States

  • Pending: waiting in queue
  • Processing: enriching & indexing
  • Completed: all docs processed
  • Paused: user paused
  • Stopped: user cancelled
  • Failed: error occurred

Available Actions

  • Pause (available when Pending or Processing): Pauses the job. Processing stops at the current document. Progress is preserved.
  • Resume (available when Paused or Stopped): Re-queues the job. Processing picks up from where it left off.
  • Stop / Cancel (available when Pending or Processing): Stops the job permanently. Documents already indexed remain in the index.
  • Retry (available when Completed, Failed, or Stopped): Resets the job back to Pending and clears all progress counters. The entire payload is re-processed from scratch. Useful after fixing errors in the payload or when you want to re-index all documents.
  • Edit Payload (available when Failed, Stopped, or Paused): Opens the job detail view where you can directly edit the JSON payload in a full-size editor. Fix field names, correct values, add or remove documents, then save and retry.
  • Delete (any state): Removes the job from the queue. Does not remove already-indexed documents from Solr.

Edit Payload & Retry Workflow

When an ingestion job fails — bad field names, malformed data, missing required fields — you don’t have to re-submit the entire request through the API. You can fix the problem directly in the queue:

Fix & Retry in 3 Steps

  1. Click the failed job row
  2. Edit the JSON payload & save
  3. Click Retry; the job re-processes with the fixed data
Open the job detail — click any row in the queue to open the detail modal. For failed, stopped, or paused jobs, the payload section shows an editable text area with a “— editable” label.
Edit the JSON payload — the full JSON array of documents is displayed in a monospace editor. Fix field names, correct values, remove bad documents, or add new ones. Click Save Payload when done. The system validates that the payload is valid JSON and automatically recalculates document IDs from their uri fields and updates the total document count.
Retry the job — click the Retry button. The job is reset to Pending, all progress counters go back to zero, and processing starts fresh with your corrected payload.
Note: Editing the payload is only available for jobs in the Failed, Stopped, or Paused states. Jobs that are currently processing or pending cannot be edited — pause or stop them first. Retry is available for Completed, Failed, and Stopped jobs.
Automatic cleanup. Completed, failed, and stopped jobs are automatically removed from the queue after 7 days. You can delete them manually at any time.

API Queue Management

You can also manage the queue programmatically:

  • GET /api/ingest_status?job_id=... : Check progress of a specific job
  • GET /api/ingest_queue?core_name=... : List all jobs for a core
  • POST /api/ingest_queue_action : Send job_id + queue_action. Available actions: pause, resume, stop, delete, retry, save_payload

For the save_payload action, include a payload parameter with the full JSON array of documents. The system validates it, regenerates document IDs from uri fields, and updates the total document count automatically.
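The queue-action endpoint can be scripted the same way as ingestion. A minimal sketch that builds the form fields for a save_payload call; the field layout follows the parameter descriptions above, and serializing payload as a JSON string in the form body is an assumption:

```python
import json

def queue_action_fields(email, api_key, job_id, queue_action, payload_docs=None):
    # Build the form fields for POST /api/ingest_queue_action.
    # payload_docs (a list of document dicts) is only needed for save_payload.
    fields = {
        "email": email,
        "api_key": api_key,
        "job_id": job_id,
        "queue_action": queue_action,
    }
    if queue_action == "save_payload":
        fields["payload"] = json.dumps(payload_docs)
    return fields

fields = queue_action_fields("you@example.com", "YOUR_API_KEY", "a1b2c3d4",
                             "save_payload",
                             [{"uri": "https://example.com/p1", "title": "Fixed"}])
# then, assuming the same base path as the other endpoints:
# requests.post("https://api.opensolr.com/solr_manager/api/ingest_queue_action",
#               data=fields)
```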

All API endpoints require email and api_key authentication. You can only see and manage your own jobs.

Ready to push documents into your index? Check out the full API reference.

API Documentation

Document Extraction (rtf:true) — Index PDFs, Word, Excel, ...

API Feature

Document Extraction (rtf:true)

Index PDFs, Word documents, spreadsheets, presentations, and other rich document formats through the Data Ingestion API. Just add rtf:true to any document in your payload and point uri at the file. Opensolr fetches it, extracts the text, and indexes it with all the same enrichment as any other document.

How It Works

Your App (POST with rtf:true + document URI) → Validate (check URI, MIME type, file size) → Fetch (download file from URI) → Extract (detect format, extract plain text) → Enrich (embeddings, sentiment, language, derived fields) → Index (searchable in Solr)

The extracted text fills the text field automatically. You provide the title, description, and any other metadata you have. The full enrichment pipeline (embeddings, sentiment, language detection, derived fields) runs on the extracted text just like any other document.

Supported Formats

  • PDF (.pdf)
  • Word (.docx / .doc)
  • Excel (.xlsx / .xls)
  • PowerPoint (.pptx)
  • OpenDocument (.odt / .ods / .odp)
  • Plain Text (.txt)

Example Payload

// Mix regular docs and RTF docs in the same batch:
{
  "email": "you@example.com",
  "api_key": "your_api_key",
  "core_name": "my_index",
  "documents": [
    {
      // Regular document — you provide the text
      "title": "Product Announcement",
      "description": "New features for Q1 2026",
      "text": "We are excited to announce...",
      "uri": "https://example.com/blog/announcement"
    },
    {
      // RTF document — text extracted from the PDF automatically
      "rtf": true,
      "title": "2025 Annual Report",
      "description": "Company financials and key metrics",
      "uri": "https://example.com/docs/annual-report-2025.pdf",
      "timestamp": 1735689600,
      "category": "Reports"
    },
    {
      // RTF document — Word file from an internal server
      "rtf": true,
      "title": "Employee Handbook",
      "description": "Company policies and procedures",
      "uri": "https://intranet.example.com/hr/handbook.docx",
      "og_image": "https://example.com/img/handbook-cover.png"
    }
  ]
}

What Happens

  1. The rtf:true flag is detected on the document
  2. The file at uri is fetched over HTTP/HTTPS
  3. The file type is detected from the actual content (not just the extension)
  4. Text is extracted using format-specific parsers (pdfminer for PDF, python-docx for Word, openpyxl for Excel, etc.)
  5. The extracted text populates the text field
  6. All other fields you provided (title, description, timestamp, etc.) are kept as-is
  7. The full enrichment pipeline runs: embeddings, sentiment, language detection, derived fields
  8. The document is pushed to your Solr index
Mix and match. You can combine regular documents and rtf:true documents in the same batch. Regular docs use the text you provide. RTF docs extract text from the file. Both go through the same enrichment pipeline.

Requirements for RTF Documents

  • rtf must be set to true (boolean, not a string)
  • uri must be a valid http:// or https:// URL pointing to the document file
  • The file must be publicly accessible (or accessible from the Opensolr server)
  • Maximum file size: 50 MB
  • title and description must still be provided, as for any document; only the text field may be omitted, since it is extracted from the file automatically
Security. Only supported document formats are processed. Executable files, scripts, and unknown file types are rejected. The file type is verified from the actual file content, not from the URL extension.
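Building an rtf batch is just a matter of setting rtf: true and pointing uri at each file. A small illustrative helper (the function name and tuple layout are hypothetical) that turns a list of (uri, title, description) entries into Data Ingestion documents:

```python
def rtf_documents(files):
    # files: iterable of (uri, title, description) tuples.
    # Returns documents with rtf extraction enabled; `text` is omitted
    # because it is extracted from the file by the API.
    return [
        {"rtf": True, "uri": uri, "title": title, "description": description}
        for uri, title, description in files
    ]

docs = rtf_documents([
    ("https://example.com/docs/annual-report-2025.pdf",
     "2025 Annual Report", "Company financials and key metrics"),
])
```

These documents can then go into the `documents` array of a normal ingest request, alongside regular documents.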

If Extraction Fails

If the file cannot be fetched or the text cannot be extracted (corrupted file, unsupported format, empty document), that specific document is marked as failed in the job results with a descriptive error message. Other documents in the same batch continue processing normally.

Check the job status via the API or the Ingestion Queue page to see per-document results.

Need to index a large document library? Combine rtf:true with batch uploads for maximum efficiency.

Full API Docs