Data Ingestion API
Push your Drupal content directly to the search index: no crawler needed, no firewall whitelisting, no IP restrictions. Documents are automatically enriched with AI embeddings, sentiment analysis, and language detection on the Opensolr server.
How It Works
Two Indexing Methods: Use Both
Web Crawler
Fetches pages from your sitemap, extracts content from HTML. Runs on a schedule every few minutes. Best for: comprehensive indexing, discovering attached files, periodic safety net.
Data Ingestion
Pushes structured data directly from Drupal. Instant on node save. No firewall needed. Best for: real-time updates, private content, precise field mapping, no server load.
Both methods set id = md5(uri), so a page indexed by the crawler and the same page pushed by ingestion resolve to one document and never conflict. Use both: ingestion for instant updates, crawler as the periodic sweep.
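A minimal sketch of how the id = md5(uri) rule prevents duplicates. The hex encoding of the digest, the example URL, and the in-memory `index` dict are assumptions for illustration; the text does not say whether the URI is normalized before hashing.

```python
import hashlib


def document_id(uri: str) -> str:
    """Return the MD5 hex digest of a URI (hex encoding is an assumption)."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


# Hypothetical documents for the same page, one from each indexing method.
uri = "https://example.com/blog/hello"
crawler_doc = {"id": document_id(uri), "uri": uri, "source": "crawler"}
ingest_doc = {"id": document_id(uri), "uri": uri, "source": "ingestion"}

# A dict keyed by id stands in for the search index: the second write
# replaces the first instead of creating a duplicate entry.
index = {}
for doc in (crawler_doc, ingest_doc):
    index[doc["id"]] = doc
```

Because the id depends only on the URI, whichever method writes last simply overwrites the earlier copy of that page.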
Real-Time Sync
When Enable real-time sync is checked (on by default), every node create, update, or delete is automatically pushed to the search index.
Works for nodes and Commerce products. On multilingual sites, each translation is a separate document with its own locale.
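The behaviour described above can be pictured as a mapping from a save or delete event to one operation per translation. This is only an illustrative sketch: the function name, the `op` values, and the dictionary shape are hypothetical and are not the module's actual internals.

```python
def sync_operations(event, translations):
    """Map one node event to one operation per translation.

    `translations` maps a language code to that translation's URL.
    The "index"/"delete" operation names are illustrative assumptions.
    """
    if event not in ("create", "update", "delete"):
        raise ValueError(f"unsupported event: {event}")
    op = "delete" if event == "delete" else "index"
    return [
        {"op": op, "uri": uri, "locale": langcode}
        for langcode, uri in translations.items()
    ]


# A node with two translations yields two separate documents.
ops = sync_operations(
    "update",
    {
        "en": "https://example.com/en/node/1",
        "de": "https://example.com/de/node/1",
    },
)
```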
Bulk Ingestion: "Ingest All Now"
For the initial load or a full re-index, click Ingest All Now on the Data Ingestion tab. This creates an async job that Drupal cron processes in batches of 50 documents per run:
- Click Ingest All Now: the module counts all published content of your selected types
- The module creates a job in the Drupal database and shows a progress bar
- Drupal cron picks up the job every minute, builds 50 documents, and POSTs them to the API
- The progress bar updates automatically (polls every 5 seconds)
- You can navigate away; processing continues in the background
- Monitor on Opensolr: the Manage Ingestion Queue link opens the queue dashboard
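The batch flow above can be simulated with a small sketch. The batch size of 50 comes from the text; the job dictionary, the `run_cron_once` function, and the `post_batch` callback are hypothetical stand-ins for the Drupal job record, the cron hook, and the API call.

```python
BATCH_SIZE = 50  # documents per cron run, as stated above


def run_cron_once(job, post_batch):
    """Process one batch of the job; return True while work remains."""
    start = job["processed"]
    batch = job["ids"][start:start + BATCH_SIZE]
    if batch:
        post_batch(batch)  # stand-in for the POST to the ingestion API
        job["processed"] += len(batch)
    return job["processed"] < len(job["ids"])


def progress_percent(job):
    """Value a progress bar could display for this job."""
    total = len(job["ids"])
    return 100 if total == 0 else int(100 * job["processed"] / total)


# Simulate a job of 120 documents; `sent` records each posted batch.
job = {"ids": list(range(1, 121)), "processed": 0}
sent = []
runs = 0
while True:
    runs += 1
    if not run_cron_once(job, sent.append):
        break
```

With one cron run per minute, a job of 120 documents would finish in three runs under these assumptions.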
Field Mapping: Automatic
The module builds each document using the exact same fields the Web Crawler extracts from your pages. No manual mapping needed:
| Solr Field | Drupal Source |
|---|---|
| uri | Node canonical URL (language-prefixed on multilingual sites) |
| title | Node title |
| description | Body summary, or trimmed body (first 320 chars) |
| text | Full body field (HTML stripped) |
| category | Content type label (e.g., "Blog Post", "Product") |
| author | Node author display name |
| og_image | First image field URL |
| meta_og_locale | Current language as og:locale (e.g., en_gb, de_de) |
| meta_detected_language | Base language code (e.g., en, de, fi) |
| price_f / currency_s | Commerce product price + currency |
| meta_* custom fields | From your Facet Mapping configuration |
The API enrichment pipeline automatically adds: embeddings (1024-dim BGE-m3), sentiment scores, language detection, spell fields, and autocomplete tags.
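As a rough illustration of the mapping in the table, the sketch below assembles a document from a plain dictionary standing in for a Drupal node. The input keys, the regex-based HTML stripping, and the conversion of a language code such as "en-gb" into "en_gb" are all assumptions; the module's real logic may differ.

```python
import re
from html import unescape


def strip_html(markup):
    """Remove tags, decode entities, and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    return re.sub(r"\s+", " ", unescape(without_tags)).strip()


def build_document(node):
    """Build a document using the field names listed in the table."""
    text = strip_html(node["body"])
    langcode = node["langcode"].lower()  # hypothetical input, e.g. "en-gb"
    return {
        "uri": node["url"],
        "title": node["title"],
        # Use the summary if present, otherwise the first 320 characters.
        "description": node.get("summary") or text[:320],
        "text": text,
        "category": node["type_label"],
        "author": node["author"],
        "meta_og_locale": langcode.replace("-", "_"),
        "meta_detected_language": langcode.split("-")[0],
    }


doc = build_document({
    "url": "https://example.com/en-gb/blog/hello",
    "title": "Hello",
    "body": "<p>Hello <b>world</b> &amp; friends</p>",
    "summary": "",
    "type_label": "Blog Post",
    "author": "Jane Doe",
    "langcode": "en-gb",
})
```

Enrichment fields such as embeddings and sentiment scores are not built here, since the text says the API adds them on the server.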
Attached Files (PDFs, Documents)
When Include attached files is checked in Settings, the module also sends file attachments with rtf:true. The Opensolr API fetches the file, extracts plain text (PDF, DOCX, XLSX, PPTX, ODT), and indexes it alongside your pages.
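A small sketch of what marking attachments with the rtf flag might look like. Only the `rtf: true` flag and the list of formats come from the text; the payload shape, the `uri` key on attachment entries, and filtering by extension on the client side are assumptions.

```python
import os
from urllib.parse import urlparse

# File types the text says the API can extract plain text from.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".odt"}


def attachment_documents(file_urls):
    """Return one entry per supported file, flagged with rtf=True."""
    docs = []
    for url in file_urls:
        extension = os.path.splitext(urlparse(url).path)[1].lower()
        if extension in SUPPORTED_EXTENSIONS:
            # rtf=True asks the API to fetch the file and extract its text.
            docs.append({"uri": url, "rtf": True})
    return docs


attachments = attachment_documents([
    "https://example.com/files/report.PDF",
    "https://example.com/files/slides.pptx?download=1",
    "https://example.com/files/photo.jpg",
])
```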