Web Crawler Index Field Reference
This is the complete reference for every field available in your Opensolr Web Crawler index. When you build a custom search UI or query the Solr API directly, these are the fields you can search, filter, sort, and display.
📝 Core Content Fields
These are the main fields you will use in almost every query. They contain the actual content extracted from each crawled page.
| Field | Type | Description |
|---|---|---|
| id | string | Unique document identifier (MD5 hash of the URL). This is the primary key. |
| uri | text | The full URL of the crawled page (e.g., https://yoursite.com/about). Searchable: you can search within URLs. |
| title | text | The page title, extracted from the HTML <title> tag. This is the most important search field and receives the highest weight in search queries. |
| description | text | The meta description of the page, extracted from <meta name="description">. Second most important search field. Used for result snippets. |
| text | text | The full body text content of the page, extracted and cleaned from the HTML. This is the largest field: it contains everything visible on the page, stripped of HTML tags, navigation, and boilerplate. |
| author | text | The page author, if available (from <meta name="author"> or other sources). |
Tip: All text fields support full-text search with stemming, accent folding, synonym expansion, and highlighting. When searching, Solr automatically matches variations of your search terms (e.g., searching "running" also matches "run", "runs", "runner").
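The field weighting described above can be approximated when you query Solr directly. The following is a minimal sketch using the edismax query parser; the boost values and the helper name build_keyword_query are illustrative assumptions, not the crawler's actual configuration.

```python
from urllib.parse import urlencode

def build_keyword_query(terms: str) -> str:
    """Build a URL-encoded Solr query string over the core text fields."""
    params = {
        "q": terms,
        "defType": "edismax",
        # Hypothetical boosts, chosen only to mirror the ordering described
        # above: title first, description second, body text last.
        "qf": "title^5 description^3 text^1",
        "hl": "true",  # ask Solr to highlight matching terms
        "hl.fl": "title,description,text",
    }
    return urlencode(params)

print(build_keyword_query("running shoes"))
```

The resulting string can be appended to your index's select URL after a question mark.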
🏷️ Metadata Fields
These fields contain structured metadata about each page. They are stored as exact strings, meaning they are perfect for filtering and faceting (grouping results by value).
| Field | Type | Description |
|---|---|---|
| content_type | string | The MIME type of the page (e.g., text/html, application/pdf, image/png). Use this to filter by content type. |
| content_status | string | The HTTP status code returned by the page (e.g., 200, 301, 404). Only pages with status 200 are typically useful for search. |
| category | string | A category label assigned to the content, if applicable. Useful for faceted navigation. |
| signature | string | An MD5 content hash used internally for deduplication. If two pages have the same content, they share the same signature. |
| og_image | string | The Open Graph image URL (og:image meta tag). This is the thumbnail you see in search results and social media shares. Not searchable; stored only for display. |
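As a sketch of how these string fields might be used for filtering and faceting, the snippet below restricts results to successfully fetched HTML pages and requests facet counts per category. The helper name is hypothetical, and the MIME type is quoted on the assumption that an unquoted slash could be misread by the query parser.

```python
from urllib.parse import urlencode, parse_qs

def build_html_only_query(terms: str) -> str:
    """Filter to HTML pages with status 200 and facet on category."""
    params = [
        ("q", terms),
        # Quoted so the "/" is treated as part of the value.
        ("fq", 'content_type:"text/html"'),
        ("fq", "content_status:200"),
        ("facet", "true"),
        ("facet.field", "category"),
    ]
    return urlencode(params)

query_string = build_html_only_query("pricing")
print(parse_qs(query_string)["fq"])
```

Passing the parameters as a list of pairs, rather than a dict, allows fq to appear more than once.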
📅 Date and Time Fields
Use these to sort by date or filter by time range (e.g., "show only articles from the last week").
| Field | Type | Description |
|---|---|---|
| creation_date | date | When the content was published or created. Format: ISO 8601 (e.g., 2026-02-25T20:11:00Z). |
| timestamp | long (integer) | Unix timestamp of the creation date (seconds since 1970-01-01). Useful for numeric sorting and range queries. |
Example: Filter by Date Range
fq=creation_date:[2026-01-01T00:00:00Z TO 2026-12-31T23:59:59Z]
Example: Sort by Newest First
sort=creation_date desc
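The same range can also be expressed against the numeric timestamp field. The sketch below converts ISO 8601 UTC strings to Unix seconds and builds the filter clause; the function names are illustrative. It also shows a rolling one-week window using Solr date math, which avoids hard-coding dates.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def to_unix_seconds(iso_utc: str) -> int:
    """Convert an ISO 8601 UTC string (ending in Z) to Unix seconds."""
    parsed = datetime.strptime(iso_utc, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def timestamp_range_filter(start_iso: str, end_iso: str) -> str:
    """Build an fq clause on the numeric timestamp field."""
    return f"timestamp:[{to_unix_seconds(start_iso)} TO {to_unix_seconds(end_iso)}]"

# Same calendar year as the creation_date example above.
print(timestamp_range_filter("2026-01-01T00:00:00Z", "2026-12-31T23:59:59Z"))

# Rolling window: Solr resolves NOW-7DAYS at query time.
LAST_WEEK_FILTER = "creation_date:[NOW-7DAYS TO NOW]"
```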
Numeric Fields
| Field | Type | Description |
|---|---|---|
| rank | integer | A page rank or custom scoring value assigned by the crawler. Higher rank = more important page. |
| size | integer | The size of the extracted content in bytes. Default: 0. |
🔀 Dynamic Metadata Fields (meta_*)
The crawler automatically extracts all meta tags from your HTML pages and stores them as dynamic fields with the meta_ prefix. They are useful because they expose the metadata your site already publishes in its meta tags, with no extra configuration.
| Dynamic Field Pattern | Type | Common Examples |
|---|---|---|
| meta_* | string | meta_domain, meta_og_url, meta_og_site_name, meta_og_locale, meta_twitter_card, meta_twitter_site, meta_twitter_description, meta_viewport, meta_icon, meta_detected_language, meta_md5 |
Commonly Available meta_ Fields
- meta_domain: The domain name of the crawled page (e.g., opensolr.com). Very useful for filtering results by domain if you crawled multiple sites.
- meta_og_url: The canonical URL from Open Graph tags
- meta_og_site_name: The site name from Open Graph tags
- meta_twitter_description: The Twitter card description
- meta_detected_language: The automatically detected language of the page content (e.g., en, de, fr, es). The crawler uses AI-powered language detection. Great for faceting, since it lets users filter results by language.
- meta_icon: The favicon URL of the site
- meta_md5: MD5 hash of the content (same as id)
Note: The exact meta_* fields available depend on what meta tags your website uses. If your site has <meta property="og:image" content="...">, you will have meta_og_image in your index. The crawler captures them all automatically.
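To illustrate faceting on a meta_ field, the sketch below requests language counts for a single domain and converts Solr's flat facet list into a dictionary. The response fragment is hypothetical sample data, and the helper name facet_counts is an assumption for illustration.

```python
def facet_counts(response: dict, field: str) -> dict:
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

# Request parameters: facet on language, restricted to one domain.
params = {
    "q": "*:*",
    "fq": "meta_domain:opensolr.com",
    "facet": "true",
    "facet.field": "meta_detected_language",
}

# Hypothetical response fragment, for illustration only.
sample_response = {
    "facet_counts": {
        "facet_fields": {"meta_detected_language": ["en", 120, "de", 34, "fr", 9]}
    }
}
print(facet_counts(sample_response, "meta_detected_language"))
```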
💬 Sentiment Analysis Fields
The crawler runs sentiment analysis on every page using the VADER algorithm. These fields let you filter or sort results by the emotional tone of the content.
| Field | Type | Range | Description |
|---|---|---|---|
| sent_pos | double | 0.0 to 1.0 | Positive sentiment score. Higher = more positive content. |
| sent_neu | double | 0.0 to 1.0 | Neutral sentiment score. |
| sent_neg | double | 0.0 to 1.0 | Negative sentiment score. Higher = more negative content. |
| sent_com | double | -1.0 to 1.0 | Compound sentiment score. This is the overall sentiment: positive values mean positive content, negative values mean negative content. |
Example: Show Only Positive Content
fq=sent_com:[0.5 TO 1.0]
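If you want to offer positive, neutral, and negative buckets in a UI, one option is to map labels to ranges on sent_com. The sketch below uses the cut-offs of plus or minus 0.05 that are conventional for VADER compound scores; treating those as the right thresholds for your content is an assumption, and the stricter 0.5 used above is equally valid.

```python
# Conventional VADER cut-offs; adjust to taste for your own content.
SENTIMENT_RANGES = {
    "positive": "[0.05 TO 1.0]",
    "neutral": "{-0.05 TO 0.05}",   # curly braces make both bounds exclusive
    "negative": "[-1.0 TO -0.05]",
}

def sentiment_filter(label: str) -> str:
    """Return an fq clause on sent_com for the given sentiment label."""
    return f"sent_com:{SENTIMENT_RANGES[label]}"

for label in SENTIMENT_RANGES:
    print(label, "->", sentiment_filter(label))
```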
🌍 Geospatial Fields
If your pages contain geographic coordinates (e.g., from GPS data in images or structured data), these fields are populated:
| Field | Type | Description |
|---|---|---|
| coords | location | Latitude/longitude pair for geo queries |
| lat | double | Latitude value |
| lon | double | Longitude value |
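A radius search on coords could look like the sketch below, which uses Solr's standard geofilt filter and geodist() sort. It assumes coords is configured as a spatial field that supports these functions; the coordinates and helper name are illustrative.

```python
from urllib.parse import urlencode, parse_qs

def build_geo_query(lat: float, lon: float, radius_km: float) -> str:
    """Find documents within radius_km of a point, nearest first."""
    params = [
        ("q", "*:*"),
        ("fq", "{!geofilt}"),       # circular distance filter
        ("sfield", "coords"),       # spatial field to filter and sort on
        ("pt", f"{lat},{lon}"),     # centre point as "lat,lon"
        ("d", str(radius_km)),      # distance in kilometres
        ("sort", "geodist() asc"),
    ]
    return urlencode(params)

# Example point chosen arbitrarily for illustration.
print(parse_qs(build_geo_query(52.52, 13.405, 10)))
```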
🧠 Vector Embeddings Field
This is what powers the AI semantic search capability.
| Field | Type | Description |
|---|---|---|
| embeddings | vector (1024 dimensions) | A dense vector representation of the page content, generated by a multilingual AI model. Used for semantic/similarity search via Solr's KNN (K-Nearest Neighbors) algorithm. |
The embeddings are generated using a state-of-the-art multilingual transformer model that understands meaning, not just keywords. This means a search for "warm beverage for cold weather" can match a page about "hot chocolate recipes" even though they share no keywords.
Note: You do not need to worry about embeddings when building a basic keyword search UI. They are used automatically when you use the hybrid search parameters (covered in the Query Parameters article).
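For readers curious what a direct vector query looks like, here is a minimal sketch using Solr's knn query parser. It assumes your index accepts raw knn queries and that you can produce a 1024-dimension query vector with the same model used at index time, which this reference does not name; the zero vector below is only a placeholder.

```python
EMBEDDING_DIMENSIONS = 1024

def build_knn_query(vector: list, top_k: int = 10) -> str:
    """Build the q parameter for a KNN search on the embeddings field."""
    if len(vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSIONS} values, got {len(vector)}"
        )
    values = ", ".join(f"{v:.6f}" for v in vector)
    return f"{{!knn f=embeddings topK={top_k}}}[{values}]"

# Placeholder vector; a real one would come from the embedding model.
query = build_knn_query([0.0] * EMBEDDING_DIMENSIONS)
print(query[:60] + "...")
```

Because the vector makes the query long, sending it in a POST body is usually more practical than a GET URL.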
Autocomplete and Spellcheck Fields
These fields are used internally by the search engine and are not returned in results, but they power important features:
| Field | Purpose |
|---|---|
| tags / title_tags | Edge-ngram indexed fields for typeahead autocomplete on full phrases |
| tags_ws / title_tags_ws | Edge-ngram indexed fields for word-level autocomplete |
| spell | Spellcheck dictionary field that powers "Did you mean...?" suggestions |
String Copy Fields (uri_s, title_s)
| Field | Type | Description |
|---|---|---|
| uri_s | string | An exact-match copy of the uri field. Used for precise URL filtering with wildcards (e.g., -uri_s:*/taxonomy* to exclude taxonomy pages). |
| title_s | string | An exact-match copy of the title field. |
These _s fields are useful when you need exact string matching or wildcard filtering, as opposed to the full-text analyzed versions.
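When several sections of a site should be hidden from results, the wildcard pattern shown in the table can be generated per path fragment. This is a small sketch; the helper name and the second fragment are hypothetical.

```python
def exclude_paths(fragments: list) -> list:
    """Return one negative fq clause per URL fragment to exclude."""
    return [f"-uri_s:*{fragment}*" for fragment in fragments]

# Each clause would be sent as its own fq parameter.
for clause in exclude_paths(["/taxonomy", "/search"]):
    print(clause)
```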
Dynamic Fields for Custom Data
Your index also supports dynamic fields, which let Solr accept additional data without schema changes:
| Pattern | Type | Use Case |
|---|---|---|
| *_s | string (multi-valued) | Custom string metadata |
| *_ss | string (single-valued) | Custom single string |
| *_i | integer | Custom integer data |
| *_l | long | Custom long integer data |
| *_f | float | Custom float data (e.g., price_f) |
| *_d | double (multi-valued) | Custom double data |
| *_dt | date | Custom date fields |
| *_b | boolean | Custom boolean flags |
| *_t | text (full-text searchable) | Custom text content |
Example: Price Data
If your pages contain product prices, the crawler extracts them into:
- price_f: The price as a float number (e.g., 29.99)
- currency_s: The currency code (e.g., USD, EUR)
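With those two fields, a price-range filter could be sketched as follows. The helper name is hypothetical, and sorting on price_f assumes the field holds a single value per document.

```python
from urllib.parse import urlencode, parse_qs

def build_price_query(terms: str, min_price, max_price, currency: str) -> str:
    """Filter results to a price range in one currency, cheapest first."""
    params = [
        ("q", terms),
        ("fq", f"price_f:[{min_price} TO {max_price}]"),
        ("fq", f"currency_s:{currency}"),
        ("sort", "price_f asc"),
    ]
    return urlencode(params)

print(parse_qs(build_price_query("headphones", 10, 50, "USD"))["fq"])
```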
Fields Returned by Default
When you query the Solr API, you control which fields are returned using the fl (field list) parameter. A typical query returns:
fl=id,uri,title,description,text,og_image,meta_icon,content_type,creation_date,timestamp,meta_domain,meta_*,score,price_f,currency_s
The score field is special — it is not stored in the index but is calculated at query time and represents how relevant each result is to your search query.
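Putting fl and score together, the sketch below builds a request URL that asks only for display fields and reads the score from a response. The endpoint URL is a placeholder and the response is hypothetical sample data; substitute the select URL of your own index.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with your index's select URL.
SOLR_SELECT_URL = "https://example.com/solr/my_index/select"

DISPLAY_FIELDS = ["id", "uri", "title", "description", "og_image", "score"]

def build_search_url(terms: str, rows: int = 10) -> str:
    """Build a select URL that returns only the fields needed for display."""
    params = {"q": terms, "fl": ",".join(DISPLAY_FIELDS), "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

def top_hit(response: dict) -> tuple:
    """Return (title, score) of the first document in a Solr JSON response."""
    doc = response["response"]["docs"][0]
    return doc["title"], doc["score"]

# Hypothetical response, for illustration only.
sample = {"response": {"docs": [{"title": "About us", "score": 3.2}]}}
print(build_search_url("about"))
print(top_hit(sample))
```

Leaving the large text field out of fl keeps responses small when you only need titles and snippets.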