Document Extraction (rtf:true) — Index PDFs, Word, Excel, and Other Files via the Ingestion API

Document Extraction (rtf:true)

Index PDFs, Word documents, spreadsheets, presentations, and other rich document formats through the Data Ingestion API. Just add rtf:true to any document in your payload and point uri at the file. Opensolr fetches it, extracts the text, and indexes it with all the same enrichment as any other document.

How It Works

  1. Your App: POST with rtf:true and the document URI
  2. Validate: check URI, MIME type, file size
  3. Fetch: download the file from the URI
  4. Extract: detect the format, extract plain text
  5. Enrich: embeddings, sentiment, language, derived fields
  6. Index: searchable in Solr

The extracted text fills the text field automatically. You provide the title, description, and any other metadata you have. The full enrichment pipeline (embeddings, sentiment, language detection, derived fields) runs on the extracted text just like any other document.

Supported Formats

  • PDF: .pdf
  • Word: .docx / .doc
  • Excel: .xlsx / .xls
  • PowerPoint: .pptx
  • OpenDocument: .odt / .ods / .odp
  • Plain Text: .txt

Example Payload

// Mix regular docs and RTF docs in the same batch:
{
  "email": "you@example.com",
  "api_key": "your_api_key",
  "core_name": "my_index",
  "documents": [
    {
      // Regular document — you provide the text
      "title": "Product Announcement",
      "description": "New features for Q1 2026",
      "text": "We are excited to announce...",
      "uri": "https://example.com/blog/announcement"
    },
    {
      // RTF document — text extracted from the PDF automatically
      "rtf": true,
      "title": "2025 Annual Report",
      "description": "Company financials and key metrics",
      "uri": "https://example.com/docs/annual-report-2025.pdf",
      "timestamp": 1735689600,
      "category": "Reports"
    },
    {
      // RTF document — Word file from an internal server
      "rtf": true,
      "title": "Employee Handbook",
      "description": "Company policies and procedures",
      "uri": "https://intranet.example.com/hr/handbook.docx",
      "og_image": "https://example.com/img/handbook-cover.png"
    }
  ]
}
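
The payload above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: the endpoint URL is a placeholder (this page does not state the real one), and the helper names rtf_document and build_request are hypothetical. Standard JSON has no comment syntax, so the // annotations in the example are for the reader only; building the body with json.dumps avoids sending them.

```python
import json
import urllib.request

# Placeholder endpoint: the real ingestion URL is not given on this page.
INGEST_URL = "https://example.invalid/ingest"


def rtf_document(uri, **metadata):
    """Return a document dict that requests text extraction from `uri`."""
    doc = {"rtf": True, "uri": uri}  # rtf must be the boolean True, not "true"
    doc.update(metadata)             # title, description, timestamp, ...
    return doc


def build_request(email, api_key, core_name, documents, url=INGEST_URL):
    """Build (but do not send) a JSON POST request for one batch."""
    payload = {
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
        "documents": documents,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "you@example.com",
    "your_api_key",
    "my_index",
    [
        # Regular document: the text is supplied directly.
        {
            "title": "Product Announcement",
            "text": "We are excited to announce...",
            "uri": "https://example.com/blog/announcement",
        },
        # rtf:true document: the text is extracted from the PDF.
        rtf_document(
            "https://example.com/docs/annual-report-2025.pdf",
            title="2025 Annual Report",
            category="Reports",
        ),
    ],
)
# To send it once the real endpoint is filled in:
# urllib.request.urlopen(request)
```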

What Happens

  1. The rtf:true flag is detected on the document
  2. The file at uri is fetched over HTTP/HTTPS
  3. The file type is detected from the actual content (not just the extension)
  4. Text is extracted using format-specific parsers (pdfminer for PDF, python-docx for Word, openpyxl for Excel, etc.)
  5. The extracted text populates the text field
  6. All other fields you provided (title, description, timestamp, etc.) are kept as-is
  7. The full enrichment pipeline runs: embeddings, sentiment, language detection, derived fields
  8. The document is pushed to your Solr index
Mix and match. You can combine regular documents and rtf:true documents in the same batch. Regular documents use the text you provide; rtf:true documents have their text extracted from the file. Both go through the same enrichment pipeline.

Requirements for RTF Documents

  • rtf must be set to true (boolean, not a string)
  • uri must be a valid http:// or https:// URL pointing to the document file
  • The file must be publicly accessible (or accessible from the Opensolr server)
  • Maximum file size: 50 MB
  • title and description are recommended but optional: the text field is extracted from the document, so the minimum is rtf:true and uri
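
The requirements above can be checked on the client before a batch is submitted. This is a sketch only: check_rtf_document is a hypothetical helper, the size has to be supplied by the caller (for example from a Content-Length header), and treating 50 MB as 50 × 1024 × 1024 bytes is an assumption, since the page does not say which definition of MB applies.

```python
from urllib.parse import urlparse

# 50 MB limit from the requirements; the binary interpretation is an assumption.
MAX_FILE_SIZE = 50 * 1024 * 1024


def check_rtf_document(doc, size_bytes=None):
    """Return a list of problems found in an rtf:true document dict."""
    problems = []
    # Identity check rejects the string "true" and the integer 1.
    if doc.get("rtf") is not True:
        problems.append("rtf must be the boolean true, not a string or number")
    parsed = urlparse(str(doc.get("uri", "")))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("uri must be an http:// or https:// URL")
    # The size check runs only when the caller knows the size.
    if size_bytes is not None and size_bytes > MAX_FILE_SIZE:
        problems.append("file is larger than 50 MB")
    return problems


good = {"rtf": True, "uri": "https://example.com/docs/annual-report-2025.pdf"}
bad = {"rtf": "true", "uri": "ftp://example.com/docs/annual-report-2025.pdf"}
print(check_rtf_document(good))
print(check_rtf_document(bad))
```

A check like this cannot confirm that the file is reachable from the Opensolr server; it only catches malformed entries early.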
Security. Only supported document formats are processed. Executable files, scripts, and unknown file types are rejected. The file type is verified from the actual file content, not from the URL extension.
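
To illustrate what verifying a file type from its content can look like, here is a small sketch based on well-known file signatures. It is not Opensolr's implementation, which this page does not describe; sniff_container is a hypothetical name, and the check is deliberately coarse.

```python
PDF_MAGIC = b"%PDF-"
# ZIP container used by .docx, .xlsx, .pptx, .odt, .ods and .odp files.
ZIP_MAGIC = b"PK\x03\x04"
# OLE2 compound file container used by legacy .doc and .xls files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_container(head):
    """Classify a file by its first bytes; return a coarse family or None."""
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(ZIP_MAGIC):
        return "zip-based office document"
    if head.startswith(OLE2_MAGIC):
        return "ole2 office document"
    return None  # unknown signature, e.g. an executable or a script


print(sniff_container(b"%PDF-1.7\n"))
print(sniff_container(b"MZ\x90\x00"))  # Windows executable header
```

A signature check only identifies the container. Telling a .docx from a .xlsx requires looking inside the ZIP archive, and plain text has no signature at all, so a real detector needs further steps.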

If Extraction Fails

If the file cannot be fetched or the text cannot be extracted (corrupted file, unsupported format, empty document), that specific document is marked as failed in the job results with a descriptive error message. Other documents in the same batch continue processing normally.

Check the job status via the API or the Ingestion Queue page to see per-document results.
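
Because failures are reported per document, a client can separate failed entries from successful ones and resubmit only the failures. The sketch below assumes a result shape that this page does not specify: a list of dicts with uri, status, and error keys. Treat those field names, and the split_results helper, as hypothetical and adjust them to the actual job status response.

```python
def split_results(results):
    """Separate per-document results into (succeeded, failed) lists."""
    succeeded, failed = [], []
    for item in results:
        # Assumed convention: status == "failed" marks a failed document.
        if item.get("status") == "failed":
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed


# Hypothetical job results, for illustration only.
sample_results = [
    {"uri": "https://example.com/docs/annual-report-2025.pdf", "status": "ok"},
    {
        "uri": "https://intranet.example.com/hr/handbook.docx",
        "status": "failed",
        "error": "file could not be fetched",
    },
]

succeeded, failed = split_results(sample_results)
retry_uris = [item["uri"] for item in failed]
```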

Need to index a large document library? Combine rtf:true with batch uploads for maximum efficiency.
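
One way to prepare a large library for batch uploads is to split the document list into fixed-size chunks and submit one payload per chunk. The sketch below shows only the chunking; the batch size is a hypothetical value, since this page does not state a per-request limit.

```python
def batches(documents, batch_size):
    """Yield consecutive slices of `documents` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


# Five rtf:true documents split into batches of two (batch size is arbitrary).
library = [
    {"rtf": True, "uri": f"https://example.com/docs/file-{n}.pdf"}
    for n in range(5)
]
chunks = list(batches(library, 2))
```

Each chunk can then be placed in the documents array of its own payload.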
