Web Crawler Index Field Reference

This is the complete reference for every field available in your Opensolr Web Crawler index. When you build a custom search UI or query the Solr API directly, these are the fields you can search, filter, sort, and display.

📝 Core Content Fields

These are the main fields you will use in almost every query. They contain the actual content extracted from each crawled page.

Field	Type	Description
`id`	string	Unique document identifier (MD5 hash of the URL). This is the primary key.
`uri`	text	The full URL of the crawled page (e.g., `https://yoursite.com/about`). Searchable — you can search within URLs.
`title`	text	The page title, extracted from the HTML `<title>` tag. This is the most important search field — it receives the highest weight in search queries.
`description`	text	The meta description of the page, extracted from `<meta name="description">`. Second most important search field. Used for result snippets.
`text`	text	The full body text content of the page, extracted and cleaned from the HTML. This is the largest field — it contains everything visible on the page, stripped of HTML tags, navigation, and boilerplate.
`author`	text	The page author, if available (from `<meta name="author">` or other sources).

Tip: All text fields support full-text search with stemming, accent folding, synonym expansion, and highlighting. When searching, Solr automatically matches variations of your search terms (e.g., searching "running" also matches "run", "runs", "runner").
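If you query Solr directly, the relative importance of these fields can be expressed as per-field boosts. The sketch below builds such a query string with Solr's eDisMax parser. The boost numbers are illustrative assumptions, not the values Opensolr applies internally.

```python
from urllib.parse import urlencode

def build_keyword_query(user_query: str, rows: int = 10) -> str:
    """Return a URL-encoded query string for a weighted keyword search."""
    params = {
        "q": user_query,
        "defType": "edismax",  # Solr's extended DisMax query parser
        # Assumed boosts: title highest, then description, then body text.
        "qf": "title^5 description^3 text^1",
        "hl": "true",  # ask Solr for highlighted snippets
        "hl.fl": "title,description,text",
        "rows": rows,
    }
    return urlencode(params)

query_string = build_keyword_query("running shoes")
print(query_string)
```

Append the resulting string to your index's select URL. Tune the boosts against your own content rather than treating these numbers as recommendations.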

🏷️ Metadata Fields

These fields contain structured metadata about each page. They are stored as exact (unanalyzed) strings, which makes them well suited to filtering and faceting (grouping results by value).

Field	Type	Description
`content_type`	string	The MIME type of the page (e.g., `text/html`, `application/pdf`, `image/png`). Use this to filter by content type.
`content_status`	string	The HTTP status code returned by the page (e.g., `200`, `301`, `404`). Only pages with status `200` are typically useful for search.
`category`	string	A category label assigned to the content, if applicable. Useful for faceted navigation.
`signature`	string	An MD5 content hash used internally for deduplication. If two pages have the same content, they share the same signature.
`og_image`	string	The Open Graph image URL (`og:image` meta tag). This is the thumbnail you see in search results and social media shares. Not searchable — stored only for display.
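As a rough illustration of filtering and faceting on these fields, the following sketch restricts results to successfully fetched HTML pages and asks Solr for category counts. The choice of `category` as the facet field and `rows=0` are assumptions made for the example.

```python
from urllib.parse import urlencode

# A list of pairs is used because "fq" appears more than once.
params = [
    ("q", "*:*"),
    # Quoted so the slash is treated as part of the value, not as query syntax.
    ("fq", 'content_type:"text/html"'),
    ("fq", "content_status:200"),
    ("facet", "true"),
    ("facet.field", "category"),
    ("facet.mincount", "1"),  # hide categories with no matching pages
    ("rows", "0"),            # only the facet counts are needed here
]

query_string = urlencode(params)
print(query_string)
```

Each `fq` narrows the result set independently of the main query, so filters like these do not affect relevance scoring.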

📅 Date and Time Fields

Use these to sort by date or filter by time range (e.g., "show only articles from the last week").

Field	Type	Description
`creation_date`	date	When the content was published or created. Format: ISO 8601 (e.g., `2026-02-25T20:11:00Z`).
`timestamp`	long (integer)	Unix timestamp of the creation date (seconds since 1970-01-01). Useful for numeric sorting and range queries.

Example: Filter by Date Range

fq=creation_date:[2026-01-01T00:00:00Z TO 2026-12-31T23:59:59Z]

Example: Sort by Newest First

sort=creation_date desc
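For a rolling window such as "the last week", the range can be written with Solr date math, with explicit ISO 8601 bounds, or as a numeric range on `timestamp`. The helper below is a minimal sketch that produces all three forms. The function name and the returned dictionary keys are made up for this example.

```python
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def last_days_filters(days: int, now: datetime) -> dict:
    """Build three equivalent-in-intent filters covering the last `days` days."""
    start = now - timedelta(days=days)
    return {
        # Solr evaluates NOW on the server at query time.
        "date_math": f"creation_date:[NOW-{days}DAYS TO NOW]",
        # Explicit bounds computed on the client (UTC).
        "explicit": f"creation_date:[{start.strftime(ISO_FORMAT)} TO {now.strftime(ISO_FORMAT)}]",
        # Same window expressed in Unix seconds for the numeric field.
        "timestamp": f"timestamp:[{int(start.timestamp())} TO {int(now.timestamp())}]",
    }

filters = last_days_filters(7, datetime.now(timezone.utc))
for name, fq in filters.items():
    print(name, "->", "fq=" + fq)
```

The date-math form is usually the most convenient because it needs no client-side clock.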

Numeric Fields

Field	Type	Description
`rank`	integer	A page rank or custom scoring value assigned by the crawler. Higher rank = more important page.
`size`	integer	The size of the extracted content in bytes. Default: `0`.

🔀 Dynamic Metadata Fields (meta_*)

The crawler automatically extracts all meta tags from your HTML pages and stores them as dynamic fields with the meta_ prefix. Any metadata your site already publishes in its HTML therefore becomes available for filtering, faceting, and display.

Dynamic Field Pattern	Type	Common Examples
`meta_*`	string	`meta_domain`, `meta_og_url`, `meta_og_site_name`, `meta_og_locale`, `meta_twitter_card`, `meta_twitter_site`, `meta_twitter_description`, `meta_viewport`, `meta_icon`, `meta_detected_language`, `meta_md5`

Commonly Available meta_ Fields

meta_domain — The domain name of the crawled page (e.g., opensolr.com). Very useful for filtering results by domain if you crawled multiple sites.
meta_og_url — The canonical URL from Open Graph tags
meta_og_site_name — The site name from Open Graph tags
meta_twitter_description — The Twitter card description
meta_detected_language — The automatically detected language of the page content (e.g., en, de, fr, es). The crawler uses AI-powered language detection. Great for faceting — lets users filter results by language.
meta_icon — The favicon URL of the site
meta_md5 — MD5 hash of the content (same as id)

Note: The exact meta_* fields available depend on what meta tags your website uses. If your site has <meta property="og:image" content="...">, you will have meta_og_image in your index. The crawler captures them all automatically.
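A search UI that crawls several sites might combine a domain filter with a language facet. The sketch below shows one way to assemble those parameters. The helper names are hypothetical, and it assumes the `meta_domain` and `meta_detected_language` fields are present in your index.

```python
from urllib.parse import urlencode

def quote_value(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def meta_filter_params(domain=None, language=None):
    """Return (name, value) pairs: facets always, filters only when selected."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", "meta_domain"),
        ("facet.field", "meta_detected_language"),
    ]
    if domain:
        params.append(("fq", f"meta_domain:{quote_value(domain)}"))
    if language:
        params.append(("fq", f"meta_detected_language:{quote_value(language)}"))
    return params

params = meta_filter_params(domain="opensolr.com", language="en")
print(urlencode(params))
```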

💬 Sentiment Analysis Fields

The crawler runs sentiment analysis on every page using the VADER algorithm. These fields let you filter or sort results by the emotional tone of the content.

Field	Type	Range	Description
`sent_pos`	double	0.0 to 1.0	Positive sentiment score. Higher = more positive content.
`sent_neu`	double	0.0 to 1.0	Neutral sentiment score.
`sent_neg`	double	0.0 to 1.0	Negative sentiment score. Higher = more negative content.
`sent_com`	double	-1.0 to 1.0	Compound sentiment score. This is the overall sentiment: positive values mean positive content, negative values mean negative content.

Example: Show Only Positive Content

fq=sent_com:[0.5 TO 1.0]
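The 0.5 cutoff above selects strongly positive pages. If you would rather bucket every page as positive, neutral, or negative, a common VADER convention uses a compound score of 0.05 and -0.05 as the boundaries. The sketch below assumes that convention; the thresholds are not something the crawler enforces.

```python
# Assumed thresholds, following the convention commonly used with VADER.
POSITIVE_MIN = 0.05
NEGATIVE_MAX = -0.05

# Curly braces make a range bound exclusive in Solr range syntax.
SENTIMENT_FILTERS = {
    "positive": f"sent_com:[{POSITIVE_MIN} TO 1.0]",
    "neutral": f"sent_com:{{{NEGATIVE_MAX} TO {POSITIVE_MIN}}}",
    "negative": f"sent_com:[-1.0 TO {NEGATIVE_MAX}]",
}

def label_for(compound: float) -> str:
    """Classify a sent_com value returned in a search result."""
    if compound >= POSITIVE_MIN:
        return "positive"
    if compound <= NEGATIVE_MAX:
        return "negative"
    return "neutral"

for label, fq in SENTIMENT_FILTERS.items():
    print(label, "->", "fq=" + fq)
```

To rank rather than filter, `sort=sent_com desc` orders results from most positive to most negative.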

🌍 Geospatial Fields

If your pages contain geographic coordinates (e.g., from GPS data in images or structured data), these fields are populated:

Field	Type	Description
`coords`	location	Latitude/longitude pair for geo queries
`lat`	double	Latitude value
`lon`	double	Longitude value
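A typical geo query filters to a radius around a point and sorts by distance. The sketch below uses Solr's standard `geofilt` filter and `geodist()` function, and assumes `coords` is configured as a spatial field that supports them. The coordinates are arbitrary example values.

```python
from urllib.parse import urlencode

def geo_params(lat: float, lon: float, radius_km: float) -> dict:
    """Parameters for 'documents within radius_km of (lat, lon), nearest first'."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",       # circular distance filter
        "sfield": "coords",       # spatial field to filter and sort on
        "pt": f"{lat},{lon}",     # center point as "lat,lon"
        "d": str(radius_km),      # radius in kilometers
        "sort": "geodist() asc",  # nearest results first
    }

params = geo_params(48.8566, 2.3522, 10)
print(urlencode(params))
```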

🧠 Vector Embeddings Field

This is what powers the AI semantic search capability.

Field	Type	Description
`embeddings`	vector (1024 dimensions)	A dense vector representation of the page content, generated by a multilingual AI model. Used for semantic/similarity search via Solr's KNN (K-Nearest Neighbors) algorithm.

The embeddings are generated by a multilingual transformer model that captures meaning rather than just keywords. A search for "warm beverage for cold weather" can therefore match a page about "hot chocolate recipes" even though the two share no keywords.

Note: You do not need to worry about embeddings when building a basic keyword search UI. They are used automatically when you use the hybrid search parameters (covered in the Query Parameters article).
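For readers who do want to query the vector field directly, the sketch below formats a query for Solr's `knn` query parser. It assumes you already have a 1024-dimension query vector produced by the same model that generated the indexed embeddings; that model is not named here, so the zero vector below is only a placeholder.

```python
def knn_query(vector, top_k: int = 10, field: str = "embeddings",
              expected_dim: int = 1024) -> str:
    """Format a Solr knn query string for a dense vector field."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(vector)}"
        )
    values = ", ".join(str(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"

# Placeholder vector; a real one must come from the embedding model.
placeholder_vector = [0.0] * 1024
q = knn_query(placeholder_vector, top_k=5)
print(q[:60] + "...")
```

Because the query string carries all 1024 numbers, it is best sent in a POST body rather than in the URL.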

Autocomplete and Spellcheck Fields

These fields are used internally by the search engine and are not returned in results, but they power important features:

Field	Purpose
`tags` / `title_tags`	Edge-ngram indexed fields for typeahead autocomplete on full phrases
`tags_ws` / `title_tags_ws`	Edge-ngram indexed fields for word-level autocomplete
`spell`	Spellcheck dictionary field — powers "Did you mean...?" suggestions

String Copy Fields (uri_s, title_s)

Field	Type	Description
`uri_s`	string	An exact-match copy of the `uri` field. Used for precise URL filtering with wildcards (e.g., `-uri_s:*\/taxonomy*` to exclude pages whose URL contains `/taxonomy`).
`title_s`	string	An exact-match copy of the `title` field.

These _s fields are useful when you need exact string matching or wildcard filtering, as opposed to the full-text analyzed versions.
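A string field matches the whole stored value, so excluding a path fragment needs a wildcard on each side, and characters such as the slash have to be escaped. The helper below is a minimal sketch of that escaping; the function names are hypothetical.

```python
import re

# Characters with special meaning in Solr's standard query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')

def escape_term(text: str) -> str:
    """Backslash-escape query syntax characters in a literal value."""
    return _SPECIAL.sub(r"\\\1", text)

def exclude_url_fragment(fragment: str) -> str:
    """Build a negative filter for URLs containing `fragment`."""
    return f"-uri_s:*{escape_term(fragment)}*"

print("fq=" + exclude_url_fragment("/taxonomy"))
```

Leading wildcards can be slow on large indexes, so reserve this pattern for filters that are reused often.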

Dynamic Fields for Custom Data

Your index also supports dynamic fields, which let Solr accept additional data without schema changes:

Pattern	Type	Use Case
`*_s`	string (multi-valued)	Custom string metadata
`*_ss`	string (single-valued)	Custom single string
`*_i`	integer	Custom integer data
`*_l`	long	Custom long integer data
`*_f`	float	Custom float data (e.g., `price_f`)
`*_d`	double (multi-valued)	Custom double data
`*_dt`	date	Custom date fields
`*_b`	boolean	Custom boolean flags
`*_t`	text (full-text searchable)	Custom text content

Example: Price Data

If your pages contain product prices, the crawler extracts them into:

price_f — The price as a float number (e.g., 29.99)
currency_s — The currency code (e.g., USD, EUR)
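These two fields combine naturally into a price-range filter. The sketch below builds one, using `*` for an open-ended bound. The helper name is made up, and it assumes prices are stored in the currency named by `currency_s` with no conversion applied.

```python
def price_filters(currency: str, min_price=None, max_price=None) -> list:
    """Return fq values for a price range in a single currency."""
    low = "*" if min_price is None else f"{min_price:.2f}"
    high = "*" if max_price is None else f"{max_price:.2f}"
    return [
        f"price_f:[{low} TO {high}]",
        f"currency_s:{currency}",
    ]

for fq in price_filters("USD", min_price=10, max_price=50):
    print("fq=" + fq)
```

Adding `sort=price_f asc` lists the cheapest matches first.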

Fields Returned by Default

When you query the Solr API, the fl (field list) parameter controls which fields are returned. A typical query uses:

fl=id,uri,title,description,text,og_image,meta_icon,content_type,creation_date,timestamp,meta_domain,meta_*,score,price_f,currency_s

The score field is special — it is not stored in the index but is calculated at query time and represents how relevant each result is to your search query.
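To show how the field list and score appear on the client side, the sketch below builds a select URL and reads titles and scores from a response in Solr's standard JSON layout. The base URL is a placeholder and the sample response is invented for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; replace with your index's select URL.
BASE_URL = "https://example.invalid/solr/my_index/select"
FIELDS = ["id", "uri", "title", "description", "og_image", "creation_date", "score"]

def select_url(query: str) -> str:
    """Build a select URL that requests only the fields in FIELDS."""
    return BASE_URL + "?" + urlencode({"q": query, "fl": ",".join(FIELDS), "wt": "json"})

def summarize(response: dict) -> list:
    """Return (title, score) pairs from a parsed Solr JSON response."""
    return [(doc["title"], doc["score"]) for doc in response["response"]["docs"]]

# Invented sample shaped like a Solr JSON response.
sample_response = json.loads("""
{"response": {"numFound": 2, "start": 0, "docs": [
  {"id": "a1", "uri": "https://yoursite.com/about", "title": "About Us", "score": 4.2},
  {"id": "b2", "uri": "https://yoursite.com/contact", "title": "Contact", "score": 1.7}
]}}
""")

print(select_url("about"))
print(summarize(sample_response))
```

Leaving `text` out of `fl`, as this sketch does, keeps responses small when you only render titles and snippets.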