Data Ingestion — Push Content Directly to Search

Push content directly — no crawler needed

Indexing Method #2

Data Ingestion API

Push your Drupal content directly to the search index — no crawler needed, no firewall whitelisting, no IP restrictions. Documents are enriched with AI embeddings, sentiment analysis, and language detection automatically on the Opensolr server.

How It Works

Node Saved (Create / Edit / Delete) → API Push (JSON doc → Opensolr) → Enrichment (Embeddings + Sentiment) → Searchable (within ~1 minute)

Two Indexing Methods — Use Both

Web Crawler

Fetches pages from your sitemap, extracts content from HTML. Runs on a schedule every few minutes. Best for: comprehensive indexing, discovering attached files, periodic safety net.

Data Ingestion

Pushes structured data directly from Drupal. Instant on node save. No firewall needed. Best for: real-time updates, private content, precise field mapping, no server load.

Identical results. Both methods produce the same Solr documents with the same field structure. The same id = md5(uri) deduplication ensures they never conflict. Use both: ingestion for instant updates, crawler as the periodic sweep.
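The shared deduplication key can be sketched in a few lines. The helper name is illustrative, but the id = md5(uri) rule follows the behavior described above:

```python
import hashlib

def doc_id(uri: str) -> str:
    """Derive the Solr document id as md5(uri), so the crawler and the
    ingestion API always address the same document for the same URL."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

# The crawler and a real-time push for the same page produce the same id,
# so whichever writes later simply overwrites the earlier document.
page_id = doc_id("https://example.com/blog/hello-world")
```

Because both methods compute the id from the canonical URL alone, there is no coordination needed between them.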

Real-Time Sync

When Enable real-time sync is checked (on by default), every node create/update/delete automatically pushes to the search index:

✓ Publish / Create
Pushes the doc to the ingest API → appears in search within ~1 minute
✓ Edit / Update
Pushes the updated doc → overwrites the previous version in the index
✓ Unpublish / Delete
Removes the doc from the index → disappears from search immediately

Works for nodes and Commerce products. On multilingual sites, each translation is a separate document with its own locale.
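A minimal sketch of the document built on node save. The field names follow the mapping table documented below; the input `node` dict, the fallback logic, and the `documents` payload envelope are assumptions for illustration, not the module's exact internals:

```python
import json

def build_ingest_doc(node: dict, base_url: str) -> dict:
    """Build a crawler-compatible JSON document from a saved node.
    Hypothetical shape: real field extraction happens inside the module."""
    uri = f"{base_url}/{node['path'].lstrip('/')}"
    return {
        "uri": uri,
        "title": node["title"],
        # Body summary, else the body trimmed to 320 chars.
        "description": (node.get("summary") or node["body"])[:320],
        "text": node["body"],          # body with HTML assumed stripped
        "category": node["type_label"],
        "author": node["author"],
    }

doc = build_ingest_doc(
    {"path": "/blog/hello", "title": "Hello", "body": "Plain text body",
     "summary": "", "type_label": "Blog Post", "author": "admin"},
    "https://example.com",
)
payload = json.dumps({"documents": [doc]})  # hypothetical envelope
```

On a multilingual site, the same node would yield one such document per translation, each with its own language-prefixed `uri` and locale fields.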

Bulk Ingestion — "Ingest All Now"

For the initial load or a full re-index, click Ingest All Now on the Data Ingestion tab. This creates an async job that Drupal cron processes in batches of 50 documents per run:

  1. Click Ingest All Now — counts all published content of your selected types
  2. Creates a job in the Drupal database — shows a progress bar
  3. Drupal cron picks up the job every minute — builds 50 docs, POSTs them to the API
  4. The progress bar updates automatically (polls every 5 seconds)
  5. You can navigate away — processing continues in the background
  6. Monitor on Opensolr: the Manage Ingestion Queue → link opens the queue dashboard
Large sites (200K+ pages): The async cron approach handles any site size. Each cron run processes 50 docs — no PHP timeout, no memory issues. A 200K-page site completes in ~67 hours of cron time (4,000 batches at 1 batch/minute).
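The sizing claim above works out as follows; a small sketch where the constants mirror the documented batch size and cron cadence:

```python
import math

BATCH_SIZE = 50          # docs built per cron run (per the module docs)
CRON_INTERVAL_MIN = 1    # one batch per minute

def estimate_bulk_ingest(total_docs: int) -> tuple[int, float]:
    """Return (cron batches, wall-clock hours) for a full bulk ingest."""
    batches = math.ceil(total_docs / BATCH_SIZE)
    hours = batches * CRON_INTERVAL_MIN / 60
    return batches, hours

batches, hours = estimate_bulk_ingest(200_000)
# 200,000 docs / 50 per batch = 4,000 cron runs, i.e. roughly 67 hours
```

If your cron runs less often than once a minute, scale CRON_INTERVAL_MIN accordingly; the batch count stays the same.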

Field Mapping — Automatic

The module builds each document using the exact same fields the Web Crawler extracts from your pages. No manual mapping needed:

Solr Field               Drupal Source
uri                      Node canonical URL (language-prefixed on multilingual sites)
title                    Node title
description              Body summary → trimmed body (first 320 chars)
text                     Full body field (HTML stripped)
category                 Content type label (e.g., "Blog Post", "Product")
author                   Node author display name
og_image                 First image field URL
meta_og_locale           Current language as og:locale (e.g., en_gb, de_de)
meta_detected_language   Base language code (e.g., en, de, fi)
price_f / currency_s     Commerce product price + currency
meta_* custom fields     From your Facet Mapping configuration

The API enrichment pipeline automatically adds: embeddings (1024-dim BGE-m3), sentiment scores, language detection, spell fields, and autocomplete tags.
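The two language fields in the table can be derived from a Drupal langcode. The mapping rule here is inferred from the examples above (en_gb, de_de vs. en, de) and the helper is purely illustrative:

```python
def locale_fields(langcode: str) -> dict:
    """Derive meta_og_locale and meta_detected_language from a
    Drupal langcode such as 'en-GB' or 'de'. Inferred mapping, not
    the module's actual implementation."""
    og = langcode.lower().replace("-", "_")   # en-GB -> en_gb
    return {
        "meta_og_locale": og,
        "meta_detected_language": og.split("_")[0],  # en_gb -> en
    }

fields = locale_fields("en-GB")
```

Note that the server-side enrichment also runs its own language detection, so meta_detected_language may be corrected after ingestion.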

Attached Files (PDFs, Documents)

When Include attached files is checked in Settings, the module also sends file attachments with rtf:true. The Opensolr API fetches the file, extracts plain text (PDF, DOCX, XLSX, PPTX, ODT), and indexes it alongside your pages.
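A sketch of the extra document sent for an attachment. The rtf:true flag comes from the description above; the other field names, including the parent link, are assumptions for illustration:

```python
def build_file_doc(file_url: str, parent_uri: str) -> dict:
    """Attachment document sent when 'Include attached files' is on.
    rtf=True asks the Opensolr API to fetch the file and extract its
    text server-side (PDF, DOCX, XLSX, PPTX, ODT)."""
    return {
        "uri": file_url,
        "rtf": True,               # server fetches + extracts the file text
        "parent_uri": parent_uri,  # hypothetical link back to the host page
    }

file_doc = build_file_doc("https://example.com/files/report.pdf",
                          "https://example.com/blog/hello")
```

The module never extracts file text itself — it only flags the URL, which keeps the Drupal-side push lightweight.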

Manage Your Ingestion Queue
Monitor job progress, retry failed batches, pause/resume/cancel jobs.
Open Queue Dashboard →