Data Ingestion – Push Content to Search

Push content directly: real-time sync on save

Data Ingestion

Data Ingestion pushes your WordPress content directly to Opensolr via the API, with no crawler needed. Posts are sent the moment you publish or update them, and bulk ingestion handles your entire library asynchronously in the background.

Two Ingestion Modes

Real-Time Sync

Every time you save, update, or delete a post, the plugin automatically pushes the change to your Opensolr index. The push fires on the save_post and delete_post WordPress hooks with a 5-second API timeout: fast enough to keep your index current, short enough to never block your editing workflow.
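
The shape of those pushes can be sketched as follows. This is an illustrative Python sketch, not the plugin's actual PHP; the endpoint URL and the exact payload fields shown here are assumptions, with the document ID following the md5(uri) scheme described later on this page.

```python
import hashlib
import json

# Hypothetical endpoint; a real core has its own Opensolr update URL.
OPENSOLR_UPDATE_URL = "https://example.opensolr.com/solr/mycore/update"

def doc_id(uri: str) -> str:
    """Document ID is md5 of the trailing-slash-normalized URI."""
    return hashlib.md5(uri.rstrip("/").encode("utf-8")).hexdigest()

def add_payload(post: dict) -> str:
    """JSON body for a Solr add, as would fire on the save_post hook."""
    doc = {
        "id": doc_id(post["uri"]),
        "uri": post["uri"].rstrip("/"),
        "title": post["title"],
        "text": post["text"],
    }
    return json.dumps({"add": {"doc": doc}})

def delete_payload(uri: str) -> str:
    """JSON body for a Solr delete-by-id, as would fire on delete_post."""
    return json.dumps({"delete": {"id": doc_id(uri)}})

# A real handler would POST the body to the update endpoint with a
# 5-second timeout, e.g. requests.post(url, data=body, timeout=5).
```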

Async Bulk Ingestion

Processes your entire content library in batches of 50 documents via WordPress cron. Runs every minute, picking up where it left off until all selected content types are fully ingested. Designed for initial setup or re-ingesting after configuration changes.
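
As a quick sanity check on job size, the number of batches is just the ceiling of your library size divided by 50. A hypothetical helper:

```python
import math

def batches_needed(total_posts: int, batch_size: int = 50) -> int:
    """Number of 50-document batches a bulk ingestion job will process."""
    return math.ceil(total_posts / batch_size)

# e.g. 1,240 posts -> 25 batches
```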

Ingest All Now

Click Ingest All Now on the Data Ingestion tab to queue all content from your selected content types for async bulk processing. A live progress bar appears showing:

  • Percentage – overall completion of the ingestion job
  • Batch count – how many batches of 50 documents have been processed
  • Error count – any documents that failed to ingest (network timeouts, malformed content, etc.)

The progress bar polls every 3 seconds. If you leave the page and come back, the progress bar picks up where it left off: a completed job shows a green checkmark, a failed job shows the error message.
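
The numbers the progress bar shows can be derived as below. The field names are illustrative, not the plugin's actual API response:

```python
import math

def progress_summary(total_docs: int, processed_docs: int, errors: int,
                     batch_size: int = 50) -> dict:
    """Summarize an ingestion job the way the progress bar reports it."""
    percent = 0 if total_docs == 0 else round(100 * processed_docs / total_docs)
    return {
        "percent": percent,
        "batches_done": processed_docs // batch_size,
        "batches_total": math.ceil(total_docs / batch_size),
        "errors": errors,
    }

# e.g. 230 docs, 100 processed so far, 2 failures:
# progress_summary(230, 100, 2)
# -> {'percent': 43, 'batches_done': 2, 'batches_total': 5, 'errors': 2}
```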

How Async Bulk Works

  1. You click Ingest All Now, and the plugin creates a job in the database with status "pending"
  2. WordPress cron picks up the job within 60 seconds and starts processing
  3. Each batch loads 50 posts from the database, converts them to Solr documents, and pushes them to the Opensolr API
  4. Progress is saved after each batch, so if cron is interrupted, it resumes from the last completed batch
  5. When all documents are processed, the job status changes to "completed"
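
The resumable loop in steps 2 through 4 can be sketched as follows, in Python for illustration (the plugin itself is PHP). Here load_batch and push_batch are hypothetical stand-ins for the database read and the Opensolr API call:

```python
def run_bulk_job(job: dict, load_batch, push_batch, batch_size: int = 50) -> dict:
    """Resumable bulk ingestion: checkpoint after every batch of 50."""
    job["status"] = "processing"
    batch_no = job.get("last_batch", 0)  # resume point after an interruption
    while True:
        posts = load_batch(offset=batch_no * batch_size, limit=batch_size)
        if not posts:
            break                        # library exhausted
        push_batch(posts)                # convert to Solr docs, POST to Opensolr
        batch_no += 1
        job["last_batch"] = batch_no     # persisted checkpoint in the real plugin
    job["status"] = "completed"
    return job
```

Because the checkpoint is written after each batch, an interrupted cron run re-enters the loop at the last completed batch rather than starting over.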

System Cron Required

WordPress pseudo-cron is not reliable

Bulk ingestion relies on WordPress cron events firing every minute. The built-in WordPress pseudo-cron (WP-Cron) only runs when someone visits your site; if no one visits, nothing gets processed. You must set up a real system cron for reliable ingestion.

Add this to your wp-config.php to disable WordPress pseudo-cron:

define('DISABLE_WP_CRON', true);

Then add a system cron job that fires every minute:

* * * * * cd /path/to/wordpress && wp cron event run --due-now --allow-root > /dev/null 2>&1

Running as root?

If your system cron runs as root, you must include --allow-root in the command. Without it, WP-CLI refuses to run as root, and because the cron line discards all output, the failure is silent: your ingestion jobs will never progress.

Crawler + Ingestion Together

The Web Crawler and Data Ingestion are designed to work together. Both produce identical Solr documents with the same document ID (md5(uri)), so there are no duplicates: whichever method writes last simply updates the document.

Crawler – Comprehensive Coverage

Crawls your entire sitemap on a schedule (daily, weekly, monthly). Best for: initial indexing, picking up structural changes, re-indexing everything after config changes.

Ingestion – Real-Time Freshness

Pushes changes the instant you save a post. Best for: keeping search results up-to-date between crawls, new posts appearing immediately, deleted posts removed instantly.

Enrichment Pipeline

Documents sent via ingestion go through the same enrichment pipeline as crawled pages. On vector-enabled plans, each document is automatically enriched with:

Embeddings

BGE-m3 1024-dimensional vectors computed from the first 10 sentences of your content. Powers hybrid vector + keyword search for dramatically better relevance.
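
Selecting the first 10 sentences could look like the sketch below. The exact sentence splitter Opensolr uses is not documented, so this naive regex split is only an approximation:

```python
import re

def embedding_input(text: str, max_sentences: int = 10) -> str:
    """Take the first N sentences of the body text as the embedding input."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```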

Sentiment Analysis

VADER-based sentiment scoring: positive, negative, neutral, and compound scores. Useful for filtering and analytics.

Language Detection

Automatic language identification per document, powering the locale filter for multilingual sites using WPML or Polylang.

Document Fields

Each ingested document includes these fields in the Solr index:

  • uri – the canonical URL of the post (trailing slash normalized)
  • title – post title
  • description – meta description or post excerpt
  • text – full body text, stripped of HTML tags
  • text_t – structured text from JSON-LD and taxonomy data (preferred for AI features)
  • og_image – featured image URL for result thumbnails
  • meta_icon – site favicon URL
  • creation_date – post publication date
  • timestamp – last modified date
  • meta_og_locale – language/locale code (e.g., en_US)
  • meta_domain – your site domain (used for automatic domain filtering in search)
  • author – post author name
  • category – primary category
  • quality_f – content quality score (based on content length and structure)
  • price_f – product price (WooCommerce only)
  • currency_s – price currency code (WooCommerce only)

Same document ID – no duplicates

Both the web crawler and data ingestion use md5(uri) as the Solr document ID. A page indexed by the crawler and later pushed by ingestion remains a single document. URIs are normalized (trailing slashes stripped) to guarantee consistent IDs regardless of which method writes first.
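
In Python terms, the ID scheme amounts to the following (a sketch, not the plugin's actual PHP):

```python
import hashlib

def solr_doc_id(uri: str) -> str:
    """Strip the trailing slash, then md5 the normalized URI."""
    return hashlib.md5(uri.rstrip("/").encode("utf-8")).hexdigest()

# Crawler and ingestion resolve to the same single document:
assert solr_doc_id("https://example.com/hello/") == solr_doc_id("https://example.com/hello")
```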

Ingestion set up?

Your content is now being pushed to Opensolr in real time. Visit your site's /search/ page to see your indexed content in action, or explore Search Page configuration to customize how results appear.