Data Ingestion
Data Ingestion pushes your WordPress content directly to Opensolr via the API, with no crawler needed. Posts are sent the moment you publish or update them, and bulk ingestion handles your entire library asynchronously in the background.
Two Ingestion Modes
Real-Time Sync
Every time you save, update, or delete a post, the plugin automatically pushes the change to your Opensolr index. The sync fires on the WordPress save_post and delete_post hooks and uses a 5-second API timeout: fast enough to keep your index current, short enough to never block your editing workflow.
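The hook-to-index mapping described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the plugin's PHP code: the function name on_post_event, the send callable, and the "update"/"delete" action names are hypothetical, and treating a timeout as a quiet failure is an assumption about how a non-blocking sync might behave.

```python
API_TIMEOUT_SECONDS = 5  # timeout stated in the text


def on_post_event(event, post, send):
    """Map a post event to an index operation.

    event: "save_post" or "delete_post" (the hook names from the text).
    send:  hypothetical transport callable send(action, payload, timeout);
           it is expected to raise TimeoutError or OSError on failure.
    Returns True if the push succeeded, False otherwise.
    """
    if event == "save_post":
        action, payload = "update", post
    elif event == "delete_post":
        # Only the URI is needed to identify the document to remove.
        action, payload = "delete", {"uri": post["uri"]}
    else:
        raise ValueError(f"unsupported event: {event}")
    try:
        send(action, payload, timeout=API_TIMEOUT_SECONDS)
        return True
    except (TimeoutError, OSError):
        # Assumption: a slow or unreachable API should not block the editor,
        # so the failure is reported to the caller instead of raised.
        return False
```

Passing the transport in as a parameter keeps the sketch independent of any particular HTTP client; a real implementation would pass the same timeout value to whatever client it uses.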
Async Bulk Ingestion
Processes your entire content library in batches of 50 documents via WordPress cron. Runs every minute, picking up where it left off until all selected content types are fully ingested. Designed for initial setup or re-ingesting after configuration changes.
Ingest All Now
Click Ingest All Now on the Data Ingestion tab to queue all content from your selected content types for async bulk processing. A live progress bar appears showing:
- Percentage: overall completion of the ingestion job
- Batch count: how many batches of 50 documents have been processed
- Error count: any documents that failed to ingest (network timeouts, malformed content, etc.)
The progress bar polls every 3 seconds. If you leave the page and come back, the progress bar picks up from where it left off. A completed job shows a green checkmark; a failed job shows the error message.
How Async Bulk Works
- You click Ingest All Now, and the plugin creates a job in the database with status "pending"
- WordPress cron picks up the job within 60 seconds and starts processing
- Each batch loads 50 posts from the database, converts them to Solr documents, and pushes them to the Opensolr API
- Progress is saved after each batch; if cron is interrupted, it resumes from the last completed batch
- When all documents are processed, the job status changes to "completed"
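The steps above amount to a resumable batch loop. The following is a minimal sketch in Python rather than the plugin's own code: the IngestJob record, its field names, the run_cron_tick function, the push_batch callable, and the intermediate "processing" status are all hypothetical. Only the batch size of 50 and the "pending" and "completed" statuses come from the text.

```python
from dataclasses import dataclass

BATCH_SIZE = 50  # batch size stated in the text


@dataclass
class IngestJob:
    """Hypothetical job record; the plugin's real schema is not documented here."""
    post_ids: list
    status: str = "pending"
    offset: int = 0        # index of the first unprocessed post
    batches_done: int = 0
    errors: int = 0


def run_cron_tick(job, push_batch, max_batches=1):
    """Process up to max_batches batches, recording progress after each one.

    push_batch stands in for the API call: it takes a list of post IDs and
    returns the number of documents that failed to ingest.
    """
    if job.status == "completed":
        return job
    job.status = "processing"  # assumed intermediate status, not from the text
    for _ in range(max_batches):
        batch = job.post_ids[job.offset:job.offset + BATCH_SIZE]
        if not batch:
            break
        job.errors += push_batch(batch)
        # The offset advances only after the batch finishes, so an
        # interrupted run resumes from the last completed batch.
        job.offset += len(batch)
        job.batches_done += 1
    if job.offset >= len(job.post_ids):
        job.status = "completed"
    return job


def percent_complete(job):
    """Overall completion as a whole-number percentage."""
    if not job.post_ids:
        return 100
    return job.offset * 100 // len(job.post_ids)


# Usage: 120 posts need three cron runs (50 + 50 + 20).
job = IngestJob(post_ids=list(range(120)))
while job.status != "completed":
    run_cron_tick(job, push_batch=lambda batch: 0)  # each call is one cron run
print(job.batches_done, percent_complete(job))  # prints: 3 100
```

Because the only state carried between runs is the offset and the counters, the same record can also feed a progress display like the one described earlier.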
System Cron Required
Bulk ingestion relies on WordPress cron events firing every minute. The built-in WordPress pseudo-cron (WP-Cron) only runs when someone visits your site; if no one visits, nothing gets processed. You must set up a real system cron for reliable ingestion.
Add this to your wp-config.php to disable WordPress pseudo-cron:
define('DISABLE_WP_CRON', true);
Then add a system cron job that fires every minute:
* * * * * cd /path/to/wordpress && wp cron event run --due-now --allow-root > /dev/null 2>&1
If your system cron runs as root, you must include --allow-root in the command. Without it, WP-CLI refuses to run as root, and because the cron line above discards all output, the failure is silent and your ingestion jobs never progress.
Crawler + Ingestion Together
The Web Crawler and Data Ingestion are designed to work together. Both produce identical Solr documents with the same document ID (md5(uri)), so there are no duplicates; whichever method writes last simply updates the document.
Crawler: Comprehensive Coverage
Crawls your entire sitemap on a schedule (daily, weekly, monthly). Best for: initial indexing, picking up structural changes, re-indexing everything after config changes.
Ingestion: Real-Time Freshness
Pushes changes the instant you save a post. Best for: keeping search results up-to-date between crawls, new posts appearing immediately, deleted posts removed instantly.
Enrichment Pipeline
Documents sent via ingestion go through the same enrichment pipeline as crawled pages. On vector-enabled plans, each document is automatically enriched with:
Embeddings
BGE-m3 1024-dimensional vectors computed from the first 10 sentences of your content. Powers hybrid vector + keyword search for dramatically better relevance.
Sentiment Analysis
VADER-based sentiment scoring: positive, negative, neutral, and compound scores. Useful for filtering and analytics.
Language Detection
Automatic language identification per document, which powers the locale filter for multilingual sites using WPML or Polylang.
Document Fields
Each ingested document includes these fields in the Solr index:
- uri: the canonical URL of the post (trailing slash normalized)
- title: post title
- description: meta description or post excerpt
- text: full body text, stripped of HTML tags
- text_t: structured text from JSON-LD and taxonomy data (preferred for AI features)
- og_image: featured image URL for result thumbnails
- meta_icon: site favicon URL
- creation_date: post publication date
- timestamp: last modified date
- meta_og_locale: language/locale code (e.g., en_US)
- meta_domain: your site domain (used for automatic domain filtering in search)
- author: post author name
- category: primary category
- quality_f: content quality score (based on content length and structure)
- price_f: product price (WooCommerce only)
- currency_s: price currency code (WooCommerce only)
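To make the field list concrete, here is a rough sketch of how a post might be turned into such a document. It is written in Python for illustration and is not the plugin's code: the input keys (url, content_html, published, and so on), the date format, and the build_document and strip_html helpers are assumptions. Only the output field names come from the list above, and only a subset of them is filled in.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def build_document(post):
    """Build a partial index document from a hypothetical post dict."""
    doc = {
        "uri": post["url"].rstrip("/"),          # trailing slash normalized
        "title": post["title"],
        "description": post.get("excerpt", ""),
        "text": strip_html(post["content_html"]),
        "creation_date": post["published"],      # date format is an assumption
        "timestamp": post["modified"],
        "meta_og_locale": post.get("locale", "en_US"),
        "author": post["author"],
        "category": post["category"],
    }
    # WooCommerce-only fields are added only when the post carries a price.
    if "price" in post:
        doc["price_f"] = float(post["price"])
        doc["currency_s"] = post["currency"]
    return doc


example_post = {
    "url": "https://example.com/hello-world/",
    "title": "Hello world",
    "excerpt": "A first post.",
    "content_html": "<p>Hello <b>world</b></p>",
    "published": "2024-01-15T09:00:00Z",
    "modified": "2024-02-01T12:30:00Z",
    "author": "Jane Doe",
    "category": "News",
}
example_doc = build_document(example_post)
print(example_doc["text"])  # prints: Hello world
```

The tag stripper here is deliberately simple; it keeps the contents of script and style elements, which a production pipeline would likely drop.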
Both the web crawler and data ingestion use md5(uri) as the Solr document ID. A page indexed by the crawler and later pushed by ingestion remains a single document. URIs are normalized (trailing slashes stripped) to guarantee consistent IDs regardless of which method writes first.
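The effect of a shared, normalized ID can be shown with a small sketch. It assumes the ID is the hex MD5 digest of the UTF-8 encoded URI with trailing slashes stripped; the exact encoding and any further normalization are not stated in the source. The index dict, the upsert helper, and the source field are hypothetical stand-ins for the Solr index.

```python
import hashlib


def normalize_uri(uri):
    """Strip trailing slashes, as the text describes.

    Other normalization (scheme, case, query strings) is not covered here.
    """
    return uri.rstrip("/")


def document_id(uri):
    """MD5 of the normalized URI as a 32-character hex string."""
    return hashlib.md5(normalize_uri(uri).encode("utf-8")).hexdigest()


index = {}  # stand-in for the Solr index, keyed by document ID


def upsert(uri, source, title):
    """Write a document; an existing document with the same ID is replaced."""
    index[document_id(uri)] = {
        "uri": normalize_uri(uri),
        "title": title,
        "source": source,  # hypothetical field, used only to show who wrote last
    }


# The crawler sees the URL with a trailing slash, ingestion without one.
upsert("https://example.com/hello-world/", "crawler", "Hello world")
upsert("https://example.com/hello-world", "ingestion", "Hello world (edited)")

print(len(index))  # prints: 1
```

Both writes map to the same key, so the second one updates the first rather than creating a duplicate.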
Your content is now being pushed to Opensolr in real time. Visit your site's /search/ page to see your indexed content in action, or explore Search Page configuration to customize how results appear.