Document Extraction (rtf:true) — Index PDFs, Word, Excel, and Other Files via the Ingestion API

Index PDFs, Word documents, spreadsheets, presentations, and other rich document formats through the Data Ingestion API. Just add rtf:true to any document in your payload and point uri at the file. Opensolr fetches it, extracts the text, and indexes it with all the same enrichment as any other document.

How It Works

Your App: POST with rtf:true + document URI
→ Validate: check URI, MIME type, file size
→ Fetch: download file from URI
→ Extract: detect format, extract plain text
→ Enrich: embeddings, sentiment, language, derived fields
→ Index: searchable in Solr

The extracted text fills the text field automatically. You provide the title, description, and any other metadata you have. The full enrichment pipeline (embeddings, sentiment, language detection, derived fields) runs on the extracted text just like any other document.

Supported Formats

PDF: .pdf
Word: .docx / .doc
Excel: .xlsx / .xls
PowerPoint: .pptx
OpenDocument: .odt / .ods / .odp
Plain Text: .txt

Example Payload

// Mix regular docs and RTF docs in the same batch:
{
  "email": "you@example.com",
  "api_key": "your_api_key",
  "core_name": "my_index",
  "documents": [
    {
      // Regular document — you provide the text
      "title": "Product Announcement",
      "description": "New features for Q1 2026",
      "text": "We are excited to announce...",
      "uri": "https://example.com/blog/announcement"
    },
    {
      // RTF document — text extracted from the PDF automatically
      "rtf": true,
      "title": "2025 Annual Report",
      "description": "Company financials and key metrics",
      "uri": "https://example.com/docs/annual-report-2025.pdf",
      "timestamp": 1735689600,
      "category": "Reports"
    },
    {
      // RTF document — Word file from an internal server
      "rtf": true,
      "title": "Employee Handbook",
      "description": "Company policies and procedures",
      "uri": "https://intranet.example.com/hr/handbook.docx",
      "og_image": "https://example.com/img/handbook-cover.png"
    }
  ]
}
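As a rough sketch, a batch like the one above could be assembled and sent from Python as follows. The field names come from the example payload; the INGEST_URL value, the JSON content type, and the JSON response are assumptions made for illustration, so take the real endpoint and response format from the API docs. The // comments in the example are annotations only: strict JSON has no comment syntax, and the body produced here carries none.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the real endpoint from the API docs.
INGEST_URL = "https://example.com/ingest"


def rtf_document(uri, **metadata):
    """Return a document entry whose text is extracted from the file at `uri`."""
    return {"rtf": True, "uri": uri, **metadata}


def build_payload(email, api_key, core_name, documents):
    """Assemble the request body using the field names from the example payload."""
    return {
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
        "documents": list(documents),
    }


def send(payload, url=INGEST_URL):
    """POST the payload as JSON (assumed) and return the decoded JSON response (assumed)."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# One regular document and one rtf document in the same batch.
payload = build_payload(
    "you@example.com",
    "your_api_key",
    "my_index",
    [
        {
            "title": "Product Announcement",
            "text": "We are excited to announce...",
            "uri": "https://example.com/blog/announcement",
        },
        rtf_document(
            "https://example.com/docs/annual-report-2025.pdf",
            title="2025 Annual Report",
            category="Reports",
        ),
    ],
)
print(json.dumps(payload, indent=2))
```

Call send(payload) once INGEST_URL points at the real endpoint. Using json.dumps rather than hand-written strings guarantees that rtf is serialized as the boolean true.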

What Happens

1. The rtf:true flag is detected on the document
2. The file at uri is fetched over HTTP/HTTPS
3. The file type is detected from the actual content (not just the extension)
4. Text is extracted using format-specific parsers (pdfminer for PDF, python-docx for Word, openpyxl for Excel, etc.)
5. The extracted text populates the text field
6. All other fields you provided (title, description, timestamp, etc.) are kept as-is
7. The full enrichment pipeline runs: embeddings, sentiment, language detection, derived fields
8. The document is pushed to your Solr index

Mix and match. You can combine regular documents and rtf:true documents in the same batch. Regular docs use the text you provide. RTF docs extract text from the file. Both go through the same enrichment pipeline.

Requirements for RTF Documents

rtf must be set to true (boolean, not a string)
uri must be a valid http:// or https:// URL pointing to the document file
The file must be reachable from the Opensolr server, for example by being publicly accessible
Maximum file size: 50 MB
title and description are recommended but optional: the text field is filled from the extracted document, so the minimum is rtf:true and uri
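Some of these requirements can be checked on the client before a batch is submitted. The sketch below is one way to do that. The function name check_rtf_document is hypothetical, and this is not how the server validates. File size and reachability are not checked because they need a network request.

```python
from urllib.parse import urlparse


def check_rtf_document(doc):
    """Return a list of problems found in one rtf document entry (empty if none)."""
    problems = []

    # `is not True` rejects the string "true" and the number 1 as well as a missing flag.
    if doc.get("rtf") is not True:
        problems.append("rtf must be the boolean true, not a string or number")

    uri = doc.get("uri")
    parsed = urlparse(uri) if isinstance(uri, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("uri must be an http:// or https:// URL")

    return problems


good = {"rtf": True, "uri": "https://example.com/docs/annual-report-2025.pdf"}
bad = {"rtf": "true", "uri": "ftp://example.com/docs/annual-report-2025.pdf"}

print(check_rtf_document(good))
print(check_rtf_document(bad))
```

Returning a list of messages rather than raising on the first problem lets you report every issue in a document at once.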

Security. Only supported document formats are processed. Executable files, scripts, and unknown file types are rejected. The file type is verified from the actual file content, not from the URL extension.
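To illustrate what verifying a file type from its content can look like, here is a minimal sketch that inspects the leading bytes of a file. It is not Opensolr's implementation. The function name and labels are hypothetical; the byte signatures themselves are the standard ones for PDF, ZIP containers, and OLE2 compound files.

```python
# Leading byte signatures ("magic numbers") of common document containers.
SIGNATURES = [
    (b"%PDF-", "pdf"),                             # PDF header
    (b"PK\x03\x04", "zip"),                        # ZIP: docx, xlsx, pptx, odt, ods, odp
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # OLE2: legacy .doc and .xls
]


def sniff_format(head):
    """Return a container label for the first bytes of a file, or None if unrecognized."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


print(sniff_format(b"%PDF-1.7\n"))             # a PDF header
print(sniff_format(b"PK\x03\x04\x14\x00"))     # a ZIP-based office file
print(sniff_format(b"MZ\x90\x00"))             # a Windows executable header, not recognized
```

A file named report.pdf that actually starts with an executable header is not recognized here, regardless of its extension. A ZIP signature alone does not prove the file is an office document, so a real check would also inspect the archive contents. Plain text has no signature and needs a different test.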

If Extraction Fails

If the file cannot be fetched or the text cannot be extracted (corrupted file, unsupported format, empty document), that specific document is marked as failed in the job results with a descriptive error message. Other documents in the same batch continue processing normally.

Check the job status via the API or the Ingestion Queue page to see per-document results.
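Once you have per-document results, you may want to separate failures so they can be fixed and resubmitted. The sketch below assumes each result is a dictionary with uri, status, and error keys. That shape is hypothetical and not taken from the API docs, so adapt the key names to the actual job status response.

```python
def split_results(results):
    """Split per-document results into indexed URIs and a uri -> error map of failures.

    Assumes a hypothetical result shape: {"uri": ..., "status": ..., "error": ...}.
    """
    indexed = []
    failed = {}
    for item in results:
        if item.get("status") == "failed":
            failed[item["uri"]] = item.get("error", "unknown error")
        else:
            indexed.append(item["uri"])
    return indexed, failed


# Sample data in the assumed shape, for illustration only.
sample = [
    {"uri": "https://example.com/docs/annual-report-2025.pdf", "status": "indexed"},
    {"uri": "https://intranet.example.com/hr/handbook.docx", "status": "failed",
     "error": "file could not be fetched"},
]

indexed, failed = split_results(sample)
for uri, error in failed.items():
    print(f"retry candidate: {uri} ({error})")
```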

Need to index a large document library? Combine rtf:true with batch uploads for maximum efficiency.
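For a large library, one possible approach is to split the document list into fixed-size batches and submit one request per batch. The batch size below is an arbitrary illustrative value, since this page does not state a per-request limit.

```python
def batches(documents, size):
    """Yield consecutive slices of `documents`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


# Five rtf documents split into batches of two (illustrative size).
library = [
    {"rtf": True, "uri": f"https://example.com/docs/file-{n}.pdf"} for n in range(5)
]
for batch in batches(library, 2):
    print(len(batch), [doc["uri"] for doc in batch])
```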
