Data Ingestion — Push Content Directly to Search

Push content directly — no crawler needed

Indexing Method #2

Data Ingestion API

Push your Drupal content directly to the search index — no crawler needed, no firewall whitelisting, no IP restrictions. Documents are enriched with AI embeddings, sentiment analysis, and language detection automatically on the Opensolr server.

How It Works

Node Saved (Create / Edit / Delete) → API Push (JSON doc → Opensolr) → Enrichment (Embeddings + Sentiment) → Searchable (within ~1 minute)

Two Indexing Methods — Use Both

Web Crawler

Fetches pages from your sitemap, extracts content from HTML. Runs on a schedule every few minutes. Best for: comprehensive indexing, discovering attached files, periodic safety net.

Data Ingestion

Pushes structured data directly from Drupal. Instant on node save. No firewall needed. Best for: real-time updates, private content, precise field mapping, no server load.

Identical results. Both methods produce the same Solr documents with the same field structure. The same id = md5(uri) deduplication ensures they never conflict. Use both: ingestion for instant updates, crawler as the periodic sweep.
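The shared deduplication key can be sketched in a few lines. The helper name is illustrative, but the id = md5(uri) rule follows the behavior described above:

```python
import hashlib

def doc_id(uri: str) -> str:
    """Derive the Solr document id as md5(uri), so the crawler and the
    ingestion API always address the same document for the same URL."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

# The crawler and a real-time push for the same page produce the same id,
# so whichever writes later simply overwrites the earlier document.
page_id = doc_id("https://example.com/blog/hello-world")
```

Because both methods compute the id from the canonical URL alone, there is no coordination needed between them.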

Real-Time Sync

When Enable real-time sync is checked (on by default), every node create/update/delete automatically pushes to the search index:

✓ Publish / Create
Pushes the doc to the ingest API → appears in search within ~1 minute
✓ Edit / Update
Pushes the updated doc → overwrites the previous version in the index
✓ Unpublish / Delete
Removes the doc from the index → disappears from search immediately

Works for nodes and Commerce products. On multilingual sites, each translation is a separate document with its own locale.
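A minimal sketch of the document built on node save. The field names follow the mapping table documented below; the input `node` dict, the fallback logic, and the `documents` payload envelope are assumptions for illustration, not the module's exact internals:

```python
import json

def build_ingest_doc(node: dict, base_url: str) -> dict:
    """Build a crawler-compatible JSON document from a saved node.
    Hypothetical shape: real field extraction happens inside the module."""
    uri = f"{base_url}/{node['path'].lstrip('/')}"
    return {
        "uri": uri,
        "title": node["title"],
        # Body summary, else the body trimmed to 320 chars.
        "description": (node.get("summary") or node["body"])[:320],
        "text": node["body"],          # body with HTML assumed stripped
        "category": node["type_label"],
        "author": node["author"],
    }

doc = build_ingest_doc(
    {"path": "/blog/hello", "title": "Hello", "body": "Plain text body",
     "summary": "", "type_label": "Blog Post", "author": "admin"},
    "https://example.com",
)
payload = json.dumps({"documents": [doc]})  # hypothetical envelope
```

On a multilingual site, the same node would yield one such document per translation, each with its own language-prefixed `uri` and locale fields.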

Bulk Ingestion — "Ingest All Now"

For the initial load or a full re-index, click Ingest All Now on the Data Ingestion tab. This creates an async job that Drupal cron processes in batches of 50 documents per run:

  1. Click Ingest All Now — counts all published content of your selected types
  2. Creates a job in the Drupal database — shows a progress bar
  3. Drupal cron picks up the job every minute — builds 50 docs, POSTs them to the API
  4. The progress bar updates automatically (polls every 5 seconds)
  5. You can navigate away — processing continues in the background
  6. Monitor on Opensolr: the Manage Ingestion Queue → link opens the queue dashboard
Large sites (200K+ pages): The async cron approach handles any site size. Each cron run processes 50 docs — no PHP timeout, no memory issues. A 200K-page site completes in ~67 hours of cron time (4,000 batches at 1 batch/minute).
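The sizing claim above works out as follows; a small sketch where the constants mirror the documented batch size and cron cadence:

```python
import math

BATCH_SIZE = 50          # docs built per cron run (per the module docs)
CRON_INTERVAL_MIN = 1    # one batch per minute

def estimate_bulk_ingest(total_docs: int) -> tuple[int, float]:
    """Return (cron batches, wall-clock hours) for a full bulk ingest."""
    batches = math.ceil(total_docs / BATCH_SIZE)
    hours = batches * CRON_INTERVAL_MIN / 60
    return batches, hours

batches, hours = estimate_bulk_ingest(200_000)
# 200,000 docs / 50 per batch = 4,000 cron runs, i.e. roughly 67 hours
```

If your cron runs less often than once a minute, scale CRON_INTERVAL_MIN accordingly; the batch count stays the same.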

Field Mapping — Automatic

The module builds each document using the exact same fields the Web Crawler extracts from your pages. No manual mapping needed:

Solr Field               Drupal Source
uri                      Node canonical URL (language-prefixed on multilingual sites)
title                    Node title
description              Body summary → trimmed body (first 320 chars)
text                     Full body field (HTML stripped)
category                 Content type label (e.g., "Blog Post", "Product")
author                   Node author display name
og_image                 First image field URL
meta_og_locale           Current language as og:locale (e.g., en_gb, de_de)
meta_detected_language   Base language code (e.g., en, de, fi)
price_f / currency_s     Commerce product price + currency
meta_* custom fields     From your Facet Mapping configuration

The API enrichment pipeline automatically adds: embeddings (1024-dim BGE-m3), sentiment scores, language detection, spell fields, and autocomplete tags.
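The two language fields in the table can be derived from a Drupal langcode. The mapping rule here is inferred from the examples above (en_gb, de_de vs. en, de) and the helper is purely illustrative:

```python
def locale_fields(langcode: str) -> dict:
    """Derive meta_og_locale and meta_detected_language from a
    Drupal langcode such as 'en-GB' or 'de'. Inferred mapping, not
    the module's actual implementation."""
    og = langcode.lower().replace("-", "_")   # en-GB -> en_gb
    return {
        "meta_og_locale": og,
        "meta_detected_language": og.split("_")[0],  # en_gb -> en
    }

fields = locale_fields("en-GB")
```

Note that the server-side enrichment also runs its own language detection, so meta_detected_language may be corrected after ingestion.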

Attached Files (PDFs, Documents)

When Include attached files is checked in Settings, the module also sends file attachments with rtf:true. The Opensolr API fetches the file, extracts plain text (PDF, DOCX, XLSX, PPTX, ODT), and indexes it alongside your pages.
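A sketch of the extra document sent for an attachment. The rtf:true flag comes from the description above; the other field names, including the parent link, are assumptions for illustration:

```python
def build_file_doc(file_url: str, parent_uri: str) -> dict:
    """Attachment document sent when 'Include attached files' is on.
    rtf=True asks the Opensolr API to fetch the file and extract its
    text server-side (PDF, DOCX, XLSX, PPTX, ODT)."""
    return {
        "uri": file_url,
        "rtf": True,               # server fetches + extracts the file text
        "parent_uri": parent_uri,  # hypothetical link back to the host page
    }

file_doc = build_file_doc("https://example.com/files/report.pdf",
                          "https://example.com/blog/hello")
```

The module never extracts file text itself — it only flags the URL, which keeps the Drupal-side push lightweight.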

Manage Your Ingestion Queue
Monitor job progress, retry failed batches, pause/resume/cancel jobs.
Open Queue Dashboard →