API: Data Ingestion

Data Ingestion API — Push Documents to Your Opensolr Index...

API Endpoint

Data Ingestion API

Push documents directly into any Opensolr Web Crawler index from your CMS, application, or data pipeline. Available exclusively for indexes on Web Crawler servers. Each document is automatically enriched with vector embeddings, sentiment analysis, language detection, and all the derived search fields — identical to what the Web Crawler produces. Works alongside the crawler or as a standalone ingestion method.

Web Crawler indexes only. The Data Ingestion API is available exclusively for indexes created on Web Crawler servers. If your index is not on a Web Crawler server, you will receive an ERROR_CORE_NOT_WEBCRAWLER_ENABLED error. To create a Web Crawler index, see Getting Started.
IP Restriction: If your index has IP restrictions enabled (under Security in the Index Control Panel), you must add the following IPs to the allowed list for the /update request handler — otherwise ingestion writes will be blocked with a 403 error:
  • 5.161.242.87 — api.opensolr.com (processes and writes documents)
  • 148.251.180.234 — opensolr.com (proxies requests)
Required Config Set: Data Ingestion requires the Web Crawler config set on your index. For Solr 9, download and apply the mandatory config set for Solr 9 (schema.xml, solrconfig.xml, search.xml + supporting text files). Upload it via Config in your Index Control Panel. A Solr 10 config set will be provided when available.

How It Works

Your App (POST JSON docs) → API Gateway (auth, rate limit, quota check, validate) → Job Queue (queued for async processing) → Enrichment (embeddings, sentiment, language, derived fields) → Solr Index (documents searchable)

You submit a batch of up to 50 documents per request. The API validates your payload, checks your disk quota, and queues the job. A background processor then enriches each document and pushes it into your Solr index. You get a job_id immediately so you can poll for progress.

Endpoint Reference

Submit Documents

POST https://api.opensolr.com/solr_manager/api/ingest

// Parameters (form-encoded or JSON body):
email       = your@email.com
api_key     = your_api_key
core_name   = your_index_name
documents   = [JSON array of document objects]

Response:

{
  "status": true,
  "msg": "QUEUED",
  "job_id": "a1b2c3d4e5f6...",
  "total_docs": 25,
  "doc_ids": ["e8392e28...", "a3f1c9d0..."]
}

Check Job Status

GET https://api.opensolr.com/solr_manager/api/ingest_status

email       = your@email.com
api_key     = your_api_key
job_id      = a1b2c3d4e5f6...

Response:

{
  "status": true,
  "job": {
    "state_label": "completed",
    "total_docs": 25,
    "processed_docs": 25,
    "success_docs": 24,
    "failed_docs": 1,
    "result": [...per-doc details...]
  }
}

Document Fields

Field reference (field, type, status, description):

  • uri (URL, required): Canonical URL that uniquely identifies this document. Must be a valid http or https URL. The document id is always generated as md5(uri), so submitting a document with the same URI updates the existing document. Also used as the deduplication key: duplicate URIs in pending jobs are rejected.
  • title (string, required): Document title. Always required.
  • description (string, required): Short description or summary.
  • text (string, required*): The main body text. Do not send HTML. Required for standard documents. When using rtf:true, this field is optional and is auto-populated from the extracted document content.
  • id (string, auto-generated): Always generated as md5(uri). You do not set this; it is derived from the uri you provide. The same URI always produces the same ID, which is how deduplication and updates work. Returned in the API response as doc_ids.
  • timestamp (int or date string, recommended): Content publication date. Unix epoch (1709913600) or a parseable date string (2024-03-08 14:00:00). Used to derive creation_date and meta_creation_date for freshness boost in search.
  • og_image (URL, optional): Thumbnail image URL. Shown in search results.
  • meta_icon (URL, optional): Favicon URL for your site.
  • meta_og_locale (string, optional): Locale code (e.g. en_US, de_DE). Used in search UI locale filtering.
  • meta_detected_language (string, optional): Language code (e.g. en, fr). If not provided, auto-detected from title + description.
  • category (string, optional): Content category for faceted filtering.
  • content_type (string, optional): MIME type of the document (e.g. text/html, application/pdf). Defaults to text/html if not provided. Controls how the document appears in the Search UI: text/html shows as a web result, while other types (PDF, DOCX, etc.) show in the Media/Docs tab. When using rtf:true, the MIME type is auto-detected from the actual file content, so you don't need to set it manually.
  • rtf (boolean, optional): Set to true to enable plain-text extraction from a remote document. When enabled, the system fetches the file at uri and extracts its plain-text content, which populates only the text field; all other fields (title, description, etc.) must still be provided by you. Supported formats: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP, RTF, CSV, and plain text files. With rtf:true the text field becomes optional (it is filled automatically from the extracted content). If extraction fails, the job reports a detailed error.
  • meta_domain (string, optional): Source domain. Auto-derived from uri if not provided.
  • price_f (float, optional): Product price as a decimal number (e.g. 29.99). For e-commerce indexes only. Enables price filtering, sorting, and range facets in search results. If not provided, the document is simply excluded from price filters.
  • currency_s (string, optional): Currency code (e.g. USD, EUR). Used alongside price_f for price display in search results. Only needed if price_f is set.

* text is required for standard documents but optional when rtf:true is set (text is extracted automatically from the document URL). The four absolutely mandatory fields are: uri, title, description, and text (unless rtf:true).
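Because the id is always md5(uri), you can compute document IDs locally before submitting, for example to correlate the returned doc_ids with your own records. A minimal Python sketch, assuming the URI is hashed as its UTF-8 bytes:

```python
import hashlib

def opensolr_doc_id(uri: str) -> str:
    # Mirror the documented id rule: id = md5(uri), hex-encoded.
    # Encoding as UTF-8 is an assumption for non-ASCII URIs.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

# The same URI always yields the same id, which is how updates
# and deduplication work.
doc_id = opensolr_doc_id("https://example.com/products/widget-a")
# doc_id is a 32-character lowercase hex string
```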

Auto-Generated Fields

The following fields are generated automatically for every document — you never need to send them:

  • tags & title_tags: Edge n-gram fields for autocomplete and fuzzy matching. Built from title + description + text.
  • embeddings: 1024-dimensional BGE-m3 vector embeddings of title + description for semantic / hybrid search.
  • sentiment: VADER sentiment scores (positive, negative, neutral, and compound). Computed from title + description.
  • language: Auto-detected from title + description using langid if you don't provide meta_detected_language.
  • spell: Spellcheck field for "did you mean?" suggestions. Built from tags.
  • phonetic_*: Phonetic title, description, and text for sounds-like matching across languages.

Using It with the Web Crawler

Works in tandem. The Data Ingestion API and the Web Crawler share the same index and the same document schema. You can use the crawler to index your public website, and the API to push content that the crawler can't reach — like gated pages, internal databases, CMS drafts, or product feeds. Documents from both sources coexist seamlessly in the same index.

Update Existing Documents

Sending a document with the same uri as one already in the index will completely replace it, since the ID is always md5(uri). This is how you keep your index in sync with your CMS.

⚠ Important: Updates are full document replacements, not partial. When you send a document with the same uri as an existing one, the entire previous document is deleted and replaced with the new one. Any fields you do not include in the updated document will be lost. Always send the complete document with all fields, even if you only need to change one or two of them.
// The crawler indexed https://example.com/products/widget-a
// Submit the same uri via API to update it (id = md5(uri) is generated automatically):
{
  "title": "Widget A — Updated Price",
  "description": "Now only $29.99",
  "text": "Full updated product description...",
  "uri": "https://example.com/products/widget-a",
  "content_type": "text/html",
  "timestamp": 1741392000
}

The document ID is always generated as md5(uri) — which is the same ID the Web Crawler generates. So submitting a document with the same uri that the crawler indexed will naturally update it. You never need to know or manage document IDs manually.
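Because updates are full replacements, a safe pattern is to keep the complete document in your own store and merge any partial change into it before resubmitting. A minimal sketch (the plain-dict "store" is purely illustrative):

```python
def prepare_update(stored_doc: dict, changes: dict) -> dict:
    # Start from the complete stored document and apply only the
    # changed fields, so the resubmitted payload never drops fields
    # (Opensolr replaces the whole document on update).
    updated = dict(stored_doc)
    updated.update(changes)
    return updated

stored = {
    "uri": "https://example.com/products/widget-a",
    "title": "Widget A",
    "description": "A fine widget",
    "text": "Full product description...",
    "content_type": "text/html",
}
doc = prepare_update(stored, {"title": "Widget A — Updated Price",
                              "description": "Now only $29.99"})
# doc still contains text and content_type, so nothing is lost on update.
```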

Submission Methods

You can submit documents in two ways: as a JSON body in the POST request, or as a JSON file upload. Both methods support the same payload format. File upload is ideal for large batches or when integrating from systems that generate JSON export files.

Method 1: cURL — JSON Body

curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
      {
        "uri": "https://example.com/page-1",
        "title": "Page One",
        "description": "First page description",
        "text": "Full text content of page one.",
        "content_type": "text/html"
      },
      {
        "uri": "https://example.com/page-2",
        "title": "Page Two",
        "description": "Second page description",
        "text": "Full text content of page two.",
        "content_type": "text/html",
        "category": "Docs",
        "timestamp": 1741392000
      }
    ]
  }'

Method 2: cURL — JSON File Upload

Save your payload as a .json file and upload it via the payload_file form field. The file can contain the full payload (with email, api_key, core_name, and documents), or just the documents array (with auth fields as separate form fields).

# Full payload in the file (recommended for large batches):
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "payload_file=@my_documents.json"

# Or auth as form fields, documents in the file:
curl -X POST https://api.opensolr.com/solr_manager/api/ingest \
  -F "email=you@example.com" \
  -F "api_key=YOUR_API_KEY" \
  -F "core_name=my_index" \
  -F "payload_file=@my_documents.json"

PHP Example

// Method 1: JSON body
$payload = [
    'email'     => 'you@example.com',
    'api_key'   => 'YOUR_API_KEY',
    'core_name' => 'my_index',
    'documents' => [
        [
            'uri'         => 'https://example.com/page-1',
            'title'       => 'Page One',
            'description' => 'First page description',
            'text'        => 'Full text content of page one.',
            'content_type'=> 'text/html',
        ],
        [
            'uri'         => 'https://example.com/page-2',
            'title'       => 'Page Two',
            'description' => 'Second page description',
            'text'        => 'Full text content of page two.',
            'content_type'=> 'text/html',
            'category'    => 'Docs',
            'timestamp'   => 1741392000,
        ],
    ],
];

$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($payload),
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

if ($response['status']) {
    echo "Queued! Job ID: " . $response['job_id'] . "\n";
    echo "Doc IDs: " . implode(', ', $response['doc_ids']) . "\n";
} else {
    echo "Error: " . $response['msg'] . "\n";
    if (!empty($response['errors'])) {
        foreach ($response['errors'] as $err) echo "  - $err\n";
    }
}

// Method 2: File upload
$ch = curl_init('https://api.opensolr.com/solr_manager/api/ingest');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => [
        'payload_file' => new CURLFile('/path/to/my_documents.json', 'application/json'),
    ],
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

Python Example

import requests, json

# Method 1: JSON body
payload = {
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "core_name": "my_index",
    "documents": [
        {
            "uri": "https://example.com/page-1",
            "title": "Page One",
            "description": "First page description",
            "text": "Full text content of page one.",
            "content_type": "text/html",
        },
        {
            "uri": "https://example.com/page-2",
            "title": "Page Two",
            "description": "Second page description",
            "text": "Full text content of page two.",
            "content_type": "text/html",
            "category": "Docs",
            "timestamp": 1741392000,
        },
    ],
}

resp = requests.post(
    "https://api.opensolr.com/solr_manager/api/ingest",
    json=payload,
)
data = resp.json()

if data["status"]:
    print(f"Queued! Job ID: {data['job_id']}")
    print(f"Doc IDs: {data['doc_ids']}")
else:
    print(f"Error: {data['msg']}")
    for err in data.get("errors", []):
        print(f"  - {err}")

# Method 2: File upload
with open("my_documents.json", "rb") as f:
    resp = requests.post(
        "https://api.opensolr.com/solr_manager/api/ingest",
        files={"payload_file": ("payload.json", f, "application/json")},
    )
    print(resp.json())

# Check job status
status = requests.get("https://api.opensolr.com/solr_manager/api/ingest_status", params={
    "email": "you@example.com",
    "api_key": "YOUR_API_KEY",
    "job_id": data["job_id"],
}).json()
print(f"State: {status['job']['state_label']}, Progress: {status['job']['processed_docs']}/{status['job']['total_docs']}")
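Since jobs are processed asynchronously, a small polling helper is often useful. The sketch below takes a fetch_status callable (in real use, a function wrapping the GET ingest_status request shown above) so it can run without the network; the terminal states are taken from the queue documentation (completed, failed, stopped):

```python
import time

TERMINAL_STATES = {"completed", "failed", "stopped"}

def wait_for_job(fetch_status, interval=5.0, timeout=600.0):
    # Poll until the job reaches a terminal state or the timeout expires.
    # fetch_status() must return the parsed ingest_status JSON.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()["job"]
        if job["state_label"] in TERMINAL_STATES:
            return job
        time.sleep(interval)
    raise TimeoutError("ingestion job did not finish in time")
```

In real use, fetch_status would be something like `lambda: requests.get(STATUS_URL, params={"email": ..., "api_key": ..., "job_id": ...}).json()`.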

Limits & Quotas

  • 50 documents per request (maximum batch size)
  • 30 requests/minute (API rate limit, all endpoints)
  • 500 requests/hour (API rate limit, all endpoints)

Your index disk quota is also enforced — if your index is at or near its size limit, the ingestion request will be rejected with a clear error showing your current usage vs. maximum allowed.
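With a 50-document batch cap and a 30 requests/minute rate limit, large exports need to be chunked and paced. A minimal sketch (the 2-second spacing is simply 60 s / 30 requests; submit_batch is a placeholder for your actual POST to the ingest endpoint):

```python
import time

MAX_BATCH = 50           # documents per request (documented limit)
MIN_SPACING = 60 / 30    # seconds between requests to stay under 30/min

def chunked(docs, size=MAX_BATCH):
    # Split a document list into batches of at most `size`.
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def ingest_all(docs, submit_batch):
    # Submit every batch, sleeping between requests to respect the rate limit.
    job_ids = []
    for i, batch in enumerate(chunked(docs)):
        if i:
            time.sleep(MIN_SPACING)
        job_ids.append(submit_batch(batch)["job_id"])
    return job_ids
```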

Error Responses

  • ERROR_AUTHENTICATION_FAILED: Invalid email or api_key.
  • ERROR_NOT_CORE_OWNER: You don't own this index.
  • ERROR_BATCH_LIMIT_50_DOCUMENTS_MAX: Reduce your batch to 50 or fewer documents.
  • ERROR_DISK_QUOTA_EXCEEDED: Your index is at its size limit. Upgrade your plan or delete old documents.
  • VALIDATION_ERRORS: One or more documents failed validation. Check the errors array in the response.
  • DUPLICATE_DOCUMENTS: One or more documents have a URI that is already queued for processing in this index. Wait for the pending job to complete or cancel it before resubmitting.
  • ERROR_PAYLOAD_TOO_LARGE: Request body exceeds server limits.
  • ERROR_CORE_NOT_WEBCRAWLER_ENABLED: This index is not on a Web Crawler server. Data Ingestion is only available for Web Crawler indexes. Create a Web Crawler index →

Plain text only. The text field should contain clean plain text, not HTML. If you are exporting from a CMS like Drupal or WordPress, strip all HTML tags before sending. The title and description should also be plain text.
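If your source content is HTML, strip the markup before filling text. A minimal sketch using only the Python standard library (a messy real-world CMS export may need a fuller parser):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    # Collect text nodes, skipping <script> and <style> contents.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    # Extract text and collapse runs of whitespace.
    p = _TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

strip_html("<p>Hello <b>world</b><script>var x;</script></p>")
# → "Hello world"
```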

Related Documentation

Web Crawler Overview

Complete guide to the Opensolr Web Crawler: crawl modes, features, live demos, analytics, and more.

Getting Started

Step-by-step setup: create an account, pick a Web Crawler server, start crawling your site.

Index Field Reference

Complete reference for every field in a Web Crawler index — what each field stores and how it is used in search.

Crawler Control API

Start, stop, pause, resume the crawler, check stats, and flush the crawl buffer — all via REST API.

Querying the Solr API

Search parameters, filtering, facets, pagination, and sorting — everything you need to query your index.

Manage Your Ingestion Queue

View all your ingestion jobs, monitor progress, pause, resume, retry, or delete queued jobs. Click any Job ID to see full details including payload and errors. Use Run Now per index or per job to trigger immediate processing without waiting for the cron cycle.

Need higher rate limits or larger batch sizes for your integration? We can set custom thresholds.

Contact Us

Data Ingestion Queue — Monitor, Pause, Resume, Retry and M...

Account Feature

Ingestion Queue Management

Every document you submit through the Data Ingestion API goes into a processing queue. The Ingestion Queue page in your account gives you full control: monitor progress in real time, pause jobs mid-processing, resume them later, retry failed or completed jobs, edit payloads to fix errors, or delete jobs you no longer need. Only indexes with active or recent jobs appear — no clutter.

How to Access

Log in to your Opensolr account at opensolr.com
Click Account in the top navigation bar
Select Data Ingestion from the dropdown

Or navigate directly to /admin/solr_manager/my_ingestion_queue

What You See

Grouped by Index

Jobs are grouped by index name. Only indexes that have jobs in the queue are shown — if an index has no pending or recent jobs, it won’t appear.

Live Progress

Each job shows a progress bar with the count of processed, successful, and failed documents. The page auto-refreshes every 10 seconds while jobs are processing.

Pause & Resume

Pause a job mid-processing. It remembers where it left off and resumes from the exact same document when you continue.

Retry Jobs

Re-run completed, failed, or stopped jobs from scratch. Retry resets progress to zero and re-processes every document in the payload.

Edit Payload

Fix errors directly in the queue. Click any failed, stopped, or paused job to open its detail view, edit the JSON payload in place, save it, then retry. No need to re-submit via the API.

Error Details

Failed jobs show the error message. Completed jobs with partial failures show per-document success and error counts.

Queue Interface

my_products_index: 1 processing, 1 failed, 2 completed

Job ID        Status      Progress  Docs   Actions
a1b2c3d4...   processing  60%       30/50  Pause
f9e8d7c6...   failed      40%       20/50  Retry, Delete
e5f6a7b8...   completed   100%      50/50  Retry, Delete

Job States

  • Pending: waiting in queue
  • Processing: enriching & indexing
  • Completed: all docs processed
  • Paused: user paused
  • Stopped: user cancelled
  • Failed: error occurred

Available Actions

  • Pause (available when Pending or Processing): Pauses the job. Processing stops at the current document. Progress is preserved.
  • Resume (available when Paused or Stopped): Re-queues the job. Processing picks up from where it left off.
  • Stop / Cancel (available when Pending or Processing): Stops the job permanently. Documents already indexed remain in the index.
  • Retry (available when Completed, Failed, or Stopped): Resets the job back to Pending and clears all progress counters. The entire payload is re-processed from scratch. Useful after fixing errors in the payload or when you want to re-index all documents.
  • Edit Payload (available when Failed, Stopped, or Paused): Opens the job detail view where you can directly edit the JSON payload in a full-size editor. Fix field names, correct values, add or remove documents, then save and retry.
  • Delete (any state): Removes the job from the queue. Does not remove already-indexed documents from Solr.

Edit Payload & Retry Workflow

When an ingestion job fails — bad field names, malformed data, missing required fields — you don’t have to re-submit the entire request through the API. You can fix the problem directly in the queue:

Fix & Retry in 3 Steps

  1. Click the failed job row
  2. Edit the JSON payload & save
  3. Click Retry; the job re-processes with the fixed data
Open the job detail — click any row in the queue to open the detail modal. For failed, stopped, or paused jobs, the payload section shows an editable text area with a “— editable” label.
Edit the JSON payload — the full JSON array of documents is displayed in a monospace editor. Fix field names, correct values, remove bad documents, or add new ones. Click Save Payload when done. The system validates that the payload is valid JSON and automatically recalculates document IDs from their uri fields and updates the total document count.
Retry the job — click the Retry button. The job is reset to Pending, all progress counters go back to zero, and processing starts fresh with your corrected payload.
Note: Editing the payload is only available for jobs in the Failed, Stopped, or Paused states. Jobs that are currently processing or pending cannot be edited — pause or stop them first. Retry is available for Completed, Failed, and Stopped jobs.
Automatic cleanup. Completed, failed, and stopped jobs are automatically removed from the queue after 7 days. You can delete them manually at any time.

API Queue Management

You can also manage the queue programmatically:

  • GET /api/ingest_status?job_id=... : Check progress of a specific job
  • GET /api/ingest_queue?core_name=... : List all jobs for a core
  • POST /api/ingest_queue_action : Send job_id + queue_action. Available actions: pause, resume, stop, delete, retry, save_payload

For the save_payload action, include a payload parameter with the full JSON array of documents. The system validates it, regenerates document IDs from uri fields, and updates the total document count automatically.
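The queue-action endpoint can be scripted the same way as ingestion. A minimal sketch that builds the form fields for a save_payload call; the field layout follows the parameter descriptions above, and serializing payload as a JSON string in the form body is an assumption:

```python
import json

def queue_action_fields(email, api_key, job_id, queue_action, payload_docs=None):
    # Build the form fields for POST /api/ingest_queue_action.
    # payload_docs (a list of document dicts) is only needed for save_payload.
    fields = {
        "email": email,
        "api_key": api_key,
        "job_id": job_id,
        "queue_action": queue_action,
    }
    if queue_action == "save_payload":
        fields["payload"] = json.dumps(payload_docs)
    return fields

fields = queue_action_fields("you@example.com", "YOUR_API_KEY", "a1b2c3d4",
                             "save_payload",
                             [{"uri": "https://example.com/p1", "title": "Fixed"}])
# then, assuming the same base path as the other endpoints:
# requests.post("https://api.opensolr.com/solr_manager/api/ingest_queue_action",
#               data=fields)
```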

All API endpoints require email and api_key authentication. You can only see and manage your own jobs.

Ready to push documents into your index? Check out the full API reference.

API Documentation

Document Extraction (rtf:true) — Index PDFs, Word, Excel, ...

API Feature

Document Extraction (rtf:true)

Index PDFs, Word documents, spreadsheets, presentations, and other rich document formats through the Data Ingestion API. Just add rtf:true to any document in your payload and point uri at the file. Opensolr fetches it, extracts the text, and indexes it with all the same enrichment as any other document.

How It Works

Your App (POST with rtf:true + document URI) → Validate (check URI, MIME type, file size) → Fetch (download file from URI) → Extract (detect format, extract plain text) → Enrich (embeddings, sentiment, language, derived fields) → Index (searchable in Solr)

The extracted text fills the text field automatically. You provide the title, description, and any other metadata you have. The full enrichment pipeline (embeddings, sentiment, language detection, derived fields) runs on the extracted text just like any other document.

Supported Formats

  • PDF (.pdf)
  • Word (.docx / .doc)
  • Excel (.xlsx / .xls)
  • PowerPoint (.pptx)
  • OpenDocument (.odt / .ods / .odp)
  • Plain Text (.txt)

Example Payload

// Mix regular docs and RTF docs in the same batch:
{
  "email": "you@example.com",
  "api_key": "your_api_key",
  "core_name": "my_index",
  "documents": [
    {
      // Regular document — you provide the text
      "title": "Product Announcement",
      "description": "New features for Q1 2026",
      "text": "We are excited to announce...",
      "uri": "https://example.com/blog/announcement"
    },
    {
      // RTF document — text extracted from the PDF automatically
      "rtf": true,
      "title": "2025 Annual Report",
      "description": "Company financials and key metrics",
      "uri": "https://example.com/docs/annual-report-2025.pdf",
      "timestamp": 1735689600,
      "category": "Reports"
    },
    {
      // RTF document — Word file from an internal server
      "rtf": true,
      "title": "Employee Handbook",
      "description": "Company policies and procedures",
      "uri": "https://intranet.example.com/hr/handbook.docx",
      "og_image": "https://example.com/img/handbook-cover.png"
    }
  ]
}

What Happens

  1. The rtf:true flag is detected on the document
  2. The file at uri is fetched over HTTP/HTTPS
  3. The file type is detected from the actual content (not just the extension)
  4. Text is extracted using format-specific parsers (pdfminer for PDF, python-docx for Word, openpyxl for Excel, etc.)
  5. The extracted text populates the text field
  6. All other fields you provided (title, description, timestamp, etc.) are kept as-is
  7. The full enrichment pipeline runs: embeddings, sentiment, language detection, derived fields
  8. The document is pushed to your Solr index
Mix and match. You can combine regular documents and rtf:true documents in the same batch. Regular docs use the text you provide. RTF docs extract text from the file. Both go through the same enrichment pipeline.

Requirements for RTF Documents

  • rtf must be set to true (boolean, not a string)
  • uri must be a valid http:// or https:// URL pointing to the document file
  • The file must be publicly accessible (or accessible from the Opensolr server)
  • Maximum file size: 50 MB
  • title and description must still be provided, as for any document; only the text field may be omitted, since it is extracted from the file automatically
Security. Only supported document formats are processed. Executable files, scripts, and unknown file types are rejected. The file type is verified from the actual file content, not from the URL extension.
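Building an rtf batch is just a matter of setting rtf: true and pointing uri at each file. A small illustrative helper (the function name and tuple layout are hypothetical) that turns a list of (uri, title, description) entries into Data Ingestion documents:

```python
def rtf_documents(files):
    # files: iterable of (uri, title, description) tuples.
    # Returns documents with rtf extraction enabled; `text` is omitted
    # because it is extracted from the file by the API.
    return [
        {"rtf": True, "uri": uri, "title": title, "description": description}
        for uri, title, description in files
    ]

docs = rtf_documents([
    ("https://example.com/docs/annual-report-2025.pdf",
     "2025 Annual Report", "Company financials and key metrics"),
])
```

These documents can then go into the `documents` array of a normal ingest request, alongside regular documents.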

If Extraction Fails

If the file cannot be fetched or the text cannot be extracted (corrupted file, unsupported format, empty document), that specific document is marked as failed in the job results with a descriptive error message. Other documents in the same batch continue processing normally.

Check the job status via the API or the Ingestion Queue page to see per-document results.

Need to index a large document library? Combine rtf:true with batch uploads for maximum efficiency.

Full API Docs