True Parallel Architecture - Pre-fetch Partitioning

True Parallel Architecture (v2.1)

Fetch Items → Partition → Spawn Workers → Wait & Repeat

Version 2.1 replaces the previous parallelism model with pre-fetch partitioning: the master process fetches pending items and hands each worker its own chunk. Workers no longer race to claim items, and throughput improves by roughly 5-10x (see Performance Comparison).

🚀 What Changed

Previous versions called indexItems(), which locks internally, so only one worker could claim items at a time. Items are now pre-fetched and distributed, so workers never compete.

The Problem with the Traditional Approach

Search API's $index->indexItems() method queries the tracker for pending items, marks them as in-progress, and indexes them. When multiple workers call this simultaneously:

Only one worker successfully claims items (database locking)
Other workers get 0 items and spin/retry
Result: 10 workers running, but only 1 actually indexing
No speedup despite spawning multiple processes
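
The effect can be illustrated with a toy model. This is not Search API code: it only assumes, as described above, that a single lock guards item claiming, so one worker per tick gets a batch and the rest get nothing. The function name and parameters are hypothetical.

```python
def simulate_contended(total_items: int, workers: int, batch: int) -> dict:
    """Toy model: each tick every worker tries to claim, only the lock holder succeeds."""
    remaining = total_items
    ticks = 0
    empty_claims = 0
    while remaining > 0:
        ticks += 1
        # One worker wins the lock and claims a batch.
        remaining -= min(batch, remaining)
        # Every other worker gets 0 items on this tick.
        empty_claims += workers - 1
    return {"ticks": ticks, "empty_claims": empty_claims}


ten_workers = simulate_contended(total_items=2000, workers=10, batch=200)
one_worker = simulate_contended(total_items=2000, workers=1, batch=200)
print(ten_workers, one_worker)
```

In this model, ten workers need exactly as many ticks as one worker; the extra nine only add empty claims.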

The Solution: Pre-fetch Partitioning

The master process now handles item fetching and distribution:

Master Process Loop:
│
├── 1. Fetch N items directly from database
│      SELECT item_id FROM search_api_item
│      WHERE status=1 LIMIT (workers × batch)
│
├── 2. Split into chunks (one per worker)
│      Worker 1: items 0-199
│      Worker 2: items 200-399
│      Worker 3: items 400-599
│      ... and so on
│
├── 3. Write chunks to temp files
│      /tmp/opensolr_chunks_{server}/chunk_index_0.json
│      /tmp/opensolr_chunks_{server}/chunk_index_1.json
│      ...
│
├── 4. Spawn workers (each gets a chunk file)
│      pcntl_fork() + pcntl_exec(drush opensolr:worker)
│
├── 5. Wait for ALL workers to complete
│      pcntl_waitpid() for each PID
│
├── 6. Clean up chunk files
│
└── 7. Repeat until no items remaining
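
Steps 1 to 3 of the loop can be sketched as follows. This is a minimal Python model, not the module's PHP implementation: the tracker is a simplified in-memory SQLite table with only two columns, the item IDs are made up, and the helper names (fetch_pending, partition, write_chunks) are hypothetical. Spawning and waiting for workers (steps 4 to 6) are left out.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def fetch_pending(db, limit):
    """Step 1: fetch up to `limit` pending item IDs (status = 1) from the tracker."""
    rows = db.execute(
        "SELECT item_id FROM search_api_item "
        "WHERE status = 1 ORDER BY item_id LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def partition(item_ids, batch):
    """Step 2: split the IDs into consecutive chunks of at most `batch` items."""
    return [item_ids[i:i + batch] for i in range(0, len(item_ids), batch)]


def write_chunks(chunks, chunk_dir):
    """Step 3: write each chunk to its own JSON file and return the file paths."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(chunks):
        path = chunk_dir / f"chunk_index_{n}.json"
        path.write_text(json.dumps(chunk))
        paths.append(path)
    return paths


# Stand-in tracker with 1,000 pending items (hypothetical data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE search_api_item (item_id TEXT PRIMARY KEY, status INTEGER)")
db.executemany(
    "INSERT INTO search_api_item VALUES (?, 1)",
    [(f"item-{i:04d}",) for i in range(1000)],
)

workers, batch = 3, 200
pending = fetch_pending(db, workers * batch)
chunks = partition(pending, batch)
paths = write_chunks(chunks, Path(tempfile.mkdtemp()))
print([len(c) for c in chunks], [p.name for p in paths])
```

With 3 workers and a batch size of 200, one round fetches 600 IDs and produces three chunk files of 200 IDs each; the remaining 400 items wait for the next round.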

Worker Process

Each worker now has a simple, focused job:

Read assigned item IDs from chunk file
Load entities via $datasource->loadMultiple()
Index items via $index->indexSpecificItems()
Exit
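
The worker's job can be modeled in a few lines. The sketch below is Python, not the module's PHP worker: load_items and index_items are hypothetical stand-ins for $datasource->loadMultiple() and $index->indexSpecificItems(), and the chunk file contents are made up.

```python
import json
import tempfile
from pathlib import Path


def load_items(item_ids):
    """Stand-in for $datasource->loadMultiple(): one fake object per ID."""
    return {item_id: {"id": item_id} for item_id in item_ids}


def index_items(items):
    """Stand-in for $index->indexSpecificItems(): returns the IDs it 'indexed'."""
    return sorted(items)


def run_worker(chunk_file):
    """Read the assigned IDs, load them, index them, and report what was indexed."""
    item_ids = json.loads(Path(chunk_file).read_text())
    items = load_items(item_ids)
    return index_items(items)


# Hypothetical chunk file, as the master process would have written it.
chunk_file = Path(tempfile.mkdtemp()) / "chunk_index_0.json"
chunk_file.write_text(json.dumps(["item-0002", "item-0000", "item-0001"]))

indexed = run_worker(chunk_file)
print(indexed)
```

The worker never queries the tracker for work; everything it needs is in the chunk file it was given.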

💡 Why This Works

Each worker has a dedicated, non-overlapping set of items. No contention, no locking, no wasted cycles. True parallel processing.
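
The non-overlap property is easy to check for a slicing-based partition. The sketch below is a Python illustration with hypothetical helper names; it assumes the fetched item IDs are unique, which a tracker keyed by item ID would guarantee.

```python
def partition(item_ids, batch):
    """Split IDs into consecutive chunks of at most `batch` items."""
    return [item_ids[i:i + batch] for i in range(0, len(item_ids), batch)]


def is_disjoint_cover(chunks, item_ids):
    """True if every ID appears in exactly one chunk and none is left out."""
    seen = [item_id for chunk in chunks for item_id in chunk]
    return len(seen) == len(set(seen)) and set(seen) == set(item_ids)


item_ids = list(range(950))  # fewer than 5 x 200, so the last chunk is partial
chunks = partition(item_ids, 200)
print([len(c) for c in chunks], is_disjoint_cover(chunks, item_ids))
```

Because each chunk is a distinct slice of the same list, no two workers can receive the same ID, even when the final chunk is smaller than the batch size.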

Resume Support

A key benefit of this architecture is full resume support. You can stop and restart the indexer at any time without missing or duplicating items.

How Resume Works

Tracker is source of truth — The search_api_item table tracks each item's status
Items marked after success — Status changes from 1 (pending) to 0 (indexed) only after successful indexing
Safe to stop anytime — Incomplete items remain status=1 and are picked up on restart

What Happens on Stop

Item State	Status	On Restart
Already indexed	`status = 0`	Skipped (not re-fetched)
In-progress when stopped	`status = 1`	Will be retried
Not yet fetched	`status = 1`	Will be fetched
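
The three rows above can be reproduced with a small simulation. This is a Python sketch against a simplified in-memory SQLite tracker, not the module's code: the table has only two columns, the item IDs are made up, and index_batch with its fail_midway flag is a hypothetical helper that models a stop by skipping the status update.

```python
import sqlite3


def make_tracker(n):
    """Create a stand-in tracker with n pending items (status = 1)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE search_api_item (item_id TEXT PRIMARY KEY, status INTEGER)")
    db.executemany(
        "INSERT INTO search_api_item VALUES (?, 1)",
        [(f"item-{i:04d}",) for i in range(n)],
    )
    return db


def pending_count(db):
    return db.execute(
        "SELECT COUNT(*) FROM search_api_item WHERE status = 1"
    ).fetchone()[0]


def index_batch(db, batch, fail_midway=False):
    """Fetch one batch of pending IDs; mark them indexed only if the batch succeeds."""
    ids = [row[0] for row in db.execute(
        "SELECT item_id FROM search_api_item "
        "WHERE status = 1 ORDER BY item_id LIMIT ?",
        (batch,),
    )]
    if fail_midway:
        # Simulated stop: work was in progress, but nothing gets marked as indexed.
        return 0
    db.executemany(
        "UPDATE search_api_item SET status = 0 WHERE item_id = ?",
        [(item_id,) for item_id in ids],
    )
    return len(ids)


db = make_tracker(1000)
index_batch(db, 200)                    # completes: 200 items become status = 0
index_batch(db, 200, fail_midway=True)  # stopped mid-batch: items stay status = 1
after_stop = pending_count(db)

# Restart: keep indexing until no pending items are left.
while index_batch(db, 200):
    pass
after_resume = pending_count(db)
print(after_stop, after_resume)
```

The interrupted batch is fetched again after the restart, so any of its items that had already been sent to Solr are sent a second time.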

✅ Safe Guarantee

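No duplicates, no missed items. Items that were in progress when the indexer stopped are indexed again on restart, and re-indexing an item overwrites its existing Solr document, so the index holds exactly one copy of each item across any number of stop/restart cycles.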

Resume Example

# Start indexing (background by default)
drush ost --server=my_solr_server

# Check progress
drush oss
# Shows: 45% complete, 300,000 remaining

# Stop for maintenance
drush osstop --server=my_solr_server

# ... later ...

# Resume - picks up where it left off
drush ost --server=my_solr_server

# Check progress
drush oss
# Shows: 45% complete, 300,000 remaining (same as before)

Performance Comparison

Metric	v2.0 (Old)	v2.1 (New)
10 workers requested	~1 actually working	10 actually working
Throughput	~100-200 items/sec	~500-1500 items/sec
Worker contention	High (competing for items)	None (pre-partitioned)
Resume support	Yes	Yes

Temp Files

Chunk files are stored in /tmp/opensolr_chunks_{server}/ (or /tmp/opensolr_chunks_{server}__{index}/ for index-specific sessions) and are automatically cleaned up after each round. If the indexer crashes, you can safely delete them:

# Clean up all chunk directories
rm -rf /tmp/opensolr_chunks_*/