Content Types & Document Indexing

Choose what gets indexed and searchable

Content Types

Choose exactly which WordPress content types get indexed: posts, pages, WooCommerce products, custom post types, and more. Each indexing method has its own content type selection, so you can fine-tune what gets crawled versus what gets pushed via the API.

Two Separate Content Type Selections

The plugin maintains two independent sets of content type checkboxes: one for the Web Crawler and one for Data Ingestion. This gives you full control over what each method indexes.

Data Crawler Tab

Controls which content types appear in your XML sitemap. The web crawler fetches pages listed in the sitemap, so unchecking a type here removes it from the sitemap and the crawler skips it.

Data Ingestion Tab

Controls which content types get pushed directly to Solr via the API. Only checked types are included when you click Ingest All Now or when real-time sync fires on post save.

Available Content Types

The plugin automatically detects all public post types registered in your WordPress installation. Common types include:

Posts

Standard WordPress blog posts, the most common content type

Pages

Static WordPress pages: About, Contact, landing pages, etc.

Products

WooCommerce products, indexed with prices, categories, SKUs, and structured data

Custom Post Types

Any custom post type registered by your theme or plugins (e.g., Portfolios, Events, Testimonials)

Dynamic Document Counts

Each content type shows a live document count next to its checkbox: the number of published items of that type. As you check and uncheck types, the total counter at the bottom updates in real time to show exactly how many documents will be indexed.

  • Checked: shows the actual count of published items (e.g., "Posts (47)")
  • Unchecked: shows 0, since those items will not be indexed
  • Total: the sum of all checked types, updated live as you toggle checkboxes
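
The counting rule in the list above can be modeled in a few lines of Python. This is an illustrative sketch only, not the plugin's own code; the `displayed_counts` helper and the sample numbers are hypothetical.

```python
def displayed_counts(published, checked):
    """Return (per-type counts as shown in the UI, overall total).

    published: dict mapping content type -> number of published items
    checked:   set of content types whose checkbox is ticked
    """
    # Unchecked types display 0 because they will not be indexed.
    shown = {t: (n if t in checked else 0) for t, n in published.items()}
    # The total is the sum over checked types only.
    return shown, sum(shown.values())


# Hypothetical site: 47 posts, 12 pages, 130 products.
published = {"post": 47, "page": 12, "product": 130}
shown, total = displayed_counts(published, checked={"post", "product"})
print(shown, total)
```

With Pages unchecked in this made-up example, its count drops to 0 and the total covers only Posts and Products.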

Include Attached Files

Below the content type checkboxes, you will find an "Include attached files" option. When enabled, the crawler and ingestion system will also process files attached to your posts:

  • PDFs: full-text extraction from PDF documents
  • Word Documents: .docx and .doc files
  • Spreadsheets: .xlsx and .xls files
  • Presentations: .pptx files
  • Text files: .txt, .odt, and other plain-text formats

File contents are extracted and indexed alongside the post they are attached to, making them fully searchable.

Recommended Setup

Use both methods together

For best results, enable the same content types on both tabs. The web crawler handles comprehensive indexing on a schedule, while data ingestion provides real-time sync when you publish or update a post. Both methods produce identical Solr documents with the same document ID, so there are no duplicates: whichever method writes last simply updates the document.

How Content Types Affect Indexing

Sitemap Generation

Only checked content types in the Data Crawler tab are included in your auto-generated /opensolr-sitemap.xml. The crawler only indexes URLs it finds in the sitemap.
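
To check which URLs the crawler will actually see, you can parse the sitemap yourself. The sketch below assumes the file follows the standard sitemaps.org format (a `urlset` of `url`/`loc` entries) and is not a sitemap index; the `sitemap_urls` helper and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol (assumed here).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


# Hypothetical sitemap content with two entries.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/hello-world/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

urls = sitemap_urls(SAMPLE)
print(urls)
```

To run this against a live site, fetch /opensolr-sitemap.xml (for example with `urllib.request.urlopen`) and pass the response text to `sitemap_urls`. If a content type is unchecked in the Data Crawler tab, its URLs should no longer appear in the result.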

API Ingestion

Only checked content types in the Data Ingestion tab are queued for bulk ingestion and picked up by the real-time sync hooks. Posts of unchecked types are ignored even when they are saved.

No Duplicates

Both methods use the same document ID: md5(uri). A post indexed by the crawler and later updated via ingestion remains a single document in your index, with no cleanup needed.
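
The effect of a shared ID can be illustrated as follows. The text above states only that the ID is md5(uri); how the plugin normalizes the URI before hashing (trailing slash, scheme, encoding) is not specified, so this sketch simply hashes the UTF-8 bytes of the string as given. The in-memory dict standing in for the Solr index, and the `doc_id` and `upsert` helpers, are hypothetical.

```python
import hashlib


def doc_id(uri):
    """MD5 hex digest of the URI (assumes UTF-8, no normalization)."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


index = {}  # stand-in for the Solr index: document ID -> document


def upsert(uri, title, source):
    # Writing with an existing ID overwrites that document, so the
    # last writer wins and no duplicate is created.
    index[doc_id(uri)] = {"uri": uri, "title": title, "source": source}


uri = "https://example.com/hello-world/"
upsert(uri, "Hello world", source="crawler")
upsert(uri, "Hello world (edited)", source="ingestion")

print(len(index))
print(index[doc_id(uri)]["source"])
```

Because both writes hash the same URI, the second call replaces the first entry instead of adding a new one. Note that two URIs differing only by, say, a trailing slash would produce different IDs under this sketch.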

Content types selected?

Next, start the web crawler to index your content, or enable data ingestion for real-time sync.