Document Extraction (rtf:true) — Index PDFs, Word, Excel, and Other Files via the Ingestion API
Document Extraction (rtf:true)
Index PDFs, Word documents, spreadsheets, presentations, and other rich document formats through the Data Ingestion API. Just add rtf:true to any document in your payload and point uri at the file. Opensolr fetches it, extracts the text, and indexes it with all the same enrichment as any other document.
How It Works
- Your App: POST with rtf:true + document URI
- Validate: check URI, MIME type, file size
- Fetch: download file from URI
- Extract: detect format, extract plaintext
- Enrich: embeddings, sentiment, language, derived fields
- Index: searchable in Solr
The extracted text fills the text field automatically. You provide the title, description, and any other metadata you have. The full enrichment pipeline (embeddings, sentiment, language detection, derived fields) runs on the extracted text just like any other document.
Supported Formats
Example Payload
// Mix regular docs and RTF docs in the same batch:
{
  "email": "you@example.com",
  "api_key": "your_api_key",
  "core_name": "my_index",
  "documents": [
    {
      // Regular document: you provide the text
      "title": "Product Announcement",
      "description": "New features for Q1 2026",
      "text": "We are excited to announce...",
      "uri": "https://example.com/blog/announcement"
    },
    {
      // RTF document: text extracted from the PDF automatically
      "rtf": true,
      "title": "2025 Annual Report",
      "description": "Company financials and key metrics",
      "uri": "https://example.com/docs/annual-report-2025.pdf",
      "timestamp": 1735689600,
      "category": "Reports"
    },
    {
      // RTF document: Word file from an internal server
      "rtf": true,
      "title": "Employee Handbook",
      "description": "Company policies and procedures",
      "uri": "https://intranet.example.com/hr/handbook.docx",
      "og_image": "https://example.com/img/handbook-cover.png"
    }
  ]
}
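A payload like the one above could be assembled from Python roughly as follows. This is a minimal sketch under stated assumptions: `INGEST_URL` is a placeholder, not a documented endpoint, and the helpers `rtf_document` and `build_request` are hypothetical names. Only the field names are taken from the example payload.

```python
import json
import urllib.request

# Placeholder endpoint (assumption): substitute the ingestion URL from the API docs.
INGEST_URL = "https://example.invalid/ingest"


def rtf_document(uri, **metadata):
    """Return a document dict whose text will be extracted from the file at `uri`."""
    doc = {"rtf": True, "uri": uri}  # rtf must be the boolean True, not "true"
    doc.update(metadata)             # title, description, timestamp, category, ...
    return doc


def build_request(email, api_key, core_name, documents, url=INGEST_URL):
    """Build (but do not send) a JSON POST request carrying the batch."""
    payload = {
        "email": email,
        "api_key": api_key,
        "core_name": core_name,
        "documents": documents,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


documents = [
    # Regular document: the text is supplied directly.
    {
        "title": "Product Announcement",
        "text": "We are excited to announce...",
        "uri": "https://example.com/blog/announcement",
    },
    # RTF document: the text is extracted from the PDF.
    rtf_document(
        "https://example.com/docs/annual-report-2025.pdf",
        title="2025 Annual Report",
        category="Reports",
    ),
]
request = build_request("you@example.com", "your_api_key", "my_index", documents)
# To send it: urllib.request.urlopen(request)
```

Building the body with `json.dumps` emits `true` for Python's `True`. Standard JSON has no comment syntax, so the `//` lines in the example are best treated as annotations and left out of what is actually sent.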
What Happens
- The rtf:true flag is detected on the document
- The file at uri is fetched over HTTP/HTTPS
- The file type is detected from the actual content (not just the extension)
- Text is extracted using format-specific parsers (pdfminer for PDF, python-docx for Word, openpyxl for Excel, etc.)
- The extracted text populates the text field
- All other fields you provided (title, description, timestamp, etc.) are kept as-is
- The full enrichment pipeline runs: embeddings, sentiment, language detection, derived fields
- The document is pushed to your Solr index
You can mix regular and rtf:true documents in the same batch. Regular docs use the text you provide. RTF docs have their text extracted from the file. Both go through the same enrichment pipeline.
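Detecting the type from content rather than the extension usually means inspecting the file's leading bytes. The sketch below illustrates that general idea with a few well-known file signatures. It is not Opensolr's actual detector, and the function name `sniff_format` and its return labels are hypothetical.

```python
def sniff_format(head: bytes) -> str:
    """Guess a document family from the first bytes of a file."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"PK\x03\x04"):
        # ZIP container: .docx, .xlsx and .pptx all start this way;
        # telling them apart requires looking inside the archive.
        return "zip-based office document"
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file: legacy .doc, .xls, .ppt
        return "legacy office document"
    return "unknown"


# A PDF served under a misleading file name is still recognised by its header.
print(sniff_format(b"%PDF-1.7\n"))  # -> pdf
```

A check like this only narrows down the family of formats. A real pipeline would then hand the file to the matching parser.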
Requirements for RTF Documents
- rtf must be set to true (boolean, not a string)
- uri must be a valid http:// or https:// URL pointing to the document file
- The file must be publicly accessible (or accessible from the Opensolr server)
- Maximum file size: 50 MB
- title and description are recommended, but the text field will be extracted from the document, so at minimum you need rtf:true and uri
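These requirements can be checked on the client before a batch is submitted, which avoids a round trip for obviously invalid documents. The following is a minimal sketch: the function name `check_rtf_document` is hypothetical, and it assumes the 50 MB limit means 50 × 1024 × 1024 bytes. Accessibility of the file is not checked here.

```python
from urllib.parse import urlparse

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB limit; binary megabytes assumed


def check_rtf_document(doc, size_bytes=None):
    """Return a list of problems; an empty list means the document looks valid."""
    problems = []
    # Only the boolean True passes; the string "true" or the integer 1 do not.
    if doc.get("rtf") is not True:
        problems.append("rtf must be the boolean true, not a string")
    parsed = urlparse(str(doc.get("uri", "")))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("uri must be an http:// or https:// URL")
    # The size check runs only when the caller already knows the file size.
    if size_bytes is not None and size_bytes > MAX_FILE_SIZE:
        problems.append("file exceeds the 50 MB limit")
    return problems


print(check_rtf_document({"rtf": "true", "uri": "ftp://example.com/a.pdf"}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a document at once.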
If Extraction Fails
If the file cannot be fetched or the text cannot be extracted (corrupted file, unsupported format, empty document), that specific document is marked as failed in the job results with a descriptive error message. Other documents in the same batch continue processing normally.
Check the job status via the API or the Ingestion Queue page to see per-document results.
Need to index a large document library? Combine rtf:true with batch uploads for maximum efficiency.