Web Crawler Index Field Reference


This is the complete reference for every field available in your Opensolr Web Crawler index. When you build a custom search UI or query the Solr API directly, these are the fields you can search, filter, sort, and display.


📝 Core Content Fields

These are the main fields you will use in almost every query. They contain the actual content extracted from each crawled page.

| Field | Type | Description |
|---|---|---|
| id | string | Unique document identifier (MD5 hash of the URL). This is the primary key. |
| uri | text | The full URL of the crawled page (e.g., https://yoursite.com/about). Searchable, so you can search within URLs. |
| title | text | The page title, extracted from the HTML <title> tag. This is the most important search field and receives the highest weight in search queries. |
| description | text | The meta description of the page, extracted from <meta name="description">. Second most important search field. Used for result snippets. |
| text | text | The full body text content of the page, extracted and cleaned from the HTML. This is the largest field. It contains everything visible on the page, stripped of HTML tags, navigation, and boilerplate. |
| author | text | The page author, if available (from <meta name="author"> or other sources). |

Tip: All text fields support full-text search with stemming, accent folding, synonym expansion, and highlighting. When searching, Solr automatically matches variations of your search terms (e.g., searching "running" also matches "run", "runs", "runner").
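
As a rough illustration of how the core fields fit together in a query, here is a minimal Python sketch that builds the query string for a keyword search. The edismax parser, the boost values in qf, and the highlighting fields are assumptions chosen for the example. This reference only states that title is weighted highest and description second, not the actual weights.

```python
from urllib.parse import urlencode


def build_keyword_query(terms, rows=10):
    """Return a URL-encoded query string for a full-text search on the core fields."""
    params = {
        "q": terms,
        "defType": "edismax",                 # assumption: edismax is the parser in use
        "qf": "title^3 description^2 text",   # illustrative boosts, not the real weights
        "hl": "true",                         # ask Solr for highlighted snippets
        "hl.fl": "description,text",
        "rows": rows,
    }
    return urlencode(params)


query_string = build_keyword_query("running shoes")
print(query_string)
```

Append the resulting string to your index's /select endpoint to run the search.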


🏷️ Metadata Fields

These fields contain structured metadata about each page. They are stored as exact strings, meaning they are perfect for filtering and faceting (grouping results by value).

| Field | Type | Description |
|---|---|---|
| content_type | string | The MIME type of the page (e.g., text/html, application/pdf, image/png). Use this to filter by content type. |
| content_status | string | The HTTP status code returned by the page (e.g., 200, 301, 404). Only pages with status 200 are typically useful for search. |
| category | string | A category label assigned to the content, if applicable. Useful for faceted navigation. |
| signature | string | An MD5 content hash used internally for deduplication. If two pages have the same content, they share the same signature. |
| og_image | string | The Open Graph image URL (og:image meta tag). This is the thumbnail you see in search results and social media shares. Not searchable; stored only for display. |
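
Because these are exact-string fields, filter values containing characters such as "/" are safest when quoted. The sketch below shows one way to build filter and facet parameters in Python. The helper names are hypothetical, and faceting on category is only an example choice.

```python
from urllib.parse import urlencode


def quote_term(value):
    """Wrap a value in double quotes so Solr treats it as one exact term."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_metadata_params(content_type="text/html", status="200"):
    """Filter by MIME type and HTTP status, and facet on the category field."""
    return [
        ("q", "*:*"),
        ("fq", f"content_type:{quote_term(content_type)}"),
        ("fq", f"content_status:{quote_term(status)}"),
        ("facet", "true"),
        ("facet.field", "category"),
    ]


params = build_metadata_params()
print(urlencode(params))  # repeated fq keys are kept because params is a list of pairs
```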

📅 Date and Time Fields

Use these to sort by date or filter by time range (e.g., "show only articles from the last week").

| Field | Type | Description |
|---|---|---|
| creation_date | date | When the content was published or created. Format: ISO 8601 (e.g., 2026-02-25T20:11:00Z). |
| timestamp | long (integer) | Unix timestamp of the creation date (seconds since 1970-01-01). Useful for numeric sorting and range queries. |

Example: Filter by Date Range

fq=creation_date:[2026-01-01T00:00:00Z TO 2026-12-31T23:59:59Z]

Example: Sort by Newest First

sort=creation_date desc
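
If you generate date filters from application code, a small helper avoids formatting mistakes. This is a minimal Python sketch; the function names are hypothetical. It produces an equivalent range on both creation_date and timestamp for the last N days.

```python
from datetime import datetime, timedelta, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as the UTC ISO 8601 string Solr expects."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def last_days_filters(days, now=None):
    """Return (creation_date filter, timestamp filter) covering the last `days` days."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    date_fq = f"creation_date:[{solr_date(start)} TO {solr_date(now)}]"
    ts_fq = f"timestamp:[{int(start.timestamp())} TO {int(now.timestamp())}]"
    return date_fq, ts_fq


# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2026, 2, 25, 20, 11, 0, tzinfo=timezone.utc)
date_fq, ts_fq = last_days_filters(7, now=fixed_now)
print(date_fq)
print(ts_fq)
```

For simple relative ranges, Solr's built-in date math is shorter: fq=creation_date:[NOW-7DAYS TO NOW].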

Numeric Fields

| Field | Type | Description |
|---|---|---|
| rank | integer | A page rank or custom scoring value assigned by the crawler. Higher rank = more important page. |
| size | integer | The size of the extracted content in bytes. Default: 0. |

🔀 Dynamic Metadata Fields (meta_*)

The crawler automatically extracts all meta tags from your HTML pages and stores them as dynamic fields with the meta_ prefix. Any metadata your site already publishes in its HTML therefore becomes available for filtering, faceting, and display without extra configuration.

| Dynamic Field Pattern | Type | Common Examples |
|---|---|---|
| meta_* | string | meta_domain, meta_og_url, meta_og_site_name, meta_og_locale, meta_twitter_card, meta_twitter_site, meta_twitter_description, meta_viewport, meta_icon, meta_detected_language, meta_md5 |

Commonly Available meta_ Fields

  • meta_domain — The domain name of the crawled page (e.g., opensolr.com). Very useful for filtering results by domain if you crawled multiple sites.
  • meta_og_url — The canonical URL from Open Graph tags
  • meta_og_site_name — The site name from Open Graph tags
  • meta_twitter_description — The Twitter card description
  • meta_detected_language — The automatically detected language of the page content (e.g., en, de, fr, es). The crawler uses AI-powered language detection. Great for faceting — lets users filter results by language.
  • meta_icon — The favicon URL of the site
  • meta_md5 — MD5 hash of the content (same as id)

Note: The exact meta_* fields available depend on what meta tags your website uses. If your site has <meta property="og:image" content="...">, you will have meta_og_image in your index. The crawler captures them all automatically.
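
To show how meta_domain and meta_detected_language might be combined, here is a minimal Python sketch that restricts results to one domain, facets by language, and converts Solr's default flat facet list into a dictionary. The sample response is made up for illustration, and the helper names are hypothetical.

```python
def language_facet_params(domain):
    """Restrict results to one crawled domain and facet by detected language."""
    return [
        ("q", "*:*"),
        ("fq", f'meta_domain:"{domain}"'),
        ("facet", "true"),
        ("facet.field", "meta_detected_language"),
    ]


def facet_pairs(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Invented sample of the facet section of a Solr JSON response.
sample_response = {
    "facet_counts": {
        "facet_fields": {"meta_detected_language": ["en", 42, "de", 7, "fr", 3]}
    }
}

params = language_facet_params("opensolr.com")
language_counts = facet_pairs(
    sample_response["facet_counts"]["facet_fields"]["meta_detected_language"]
)
print(language_counts)
```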


💬 Sentiment Analysis Fields

The crawler runs sentiment analysis on every page using the VADER algorithm. These fields let you filter or sort results by the emotional tone of the content.

| Field | Type | Range | Description |
|---|---|---|---|
| sent_pos | double | 0.0 to 1.0 | Positive sentiment score. Higher = more positive content. |
| sent_neu | double | 0.0 to 1.0 | Neutral sentiment score. |
| sent_neg | double | 0.0 to 1.0 | Negative sentiment score. Higher = more negative content. |
| sent_com | double | -1.0 to 1.0 | Compound sentiment score. This is the overall sentiment: positive values mean positive content, negative values mean negative content. |

Example: Show Only Positive Content

fq=sent_com:[0.5 TO 1.0]
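
The following Python sketch shows one way to build sentiment filters and to label a returned sent_com value. The 0.05 cut-off is the threshold conventionally suggested for VADER compound scores, not something this reference specifies, and the function names are hypothetical.

```python
def sentiment_filter(minimum, maximum):
    """Build an fq range filter on the compound sentiment score."""
    if not -1.0 <= minimum <= maximum <= 1.0:
        raise ValueError("bounds must satisfy -1.0 <= minimum <= maximum <= 1.0")
    return f"sent_com:[{minimum} TO {maximum}]"


def sentiment_label(sent_com, threshold=0.05):
    """Label a compound score using the conventional VADER cut-offs (an assumption here)."""
    if sent_com >= threshold:
        return "positive"
    if sent_com <= -threshold:
        return "negative"
    return "neutral"


positive_fq = sentiment_filter(0.5, 1.0)
negative_fq = sentiment_filter(-1.0, -0.5)
print(positive_fq, negative_fq, sentiment_label(0.62))
```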

🌍 Geospatial Fields

If your pages contain geographic coordinates (e.g., from GPS data in images or structured data), these fields are populated:

| Field | Type | Description |
|---|---|---|
| coords | location | Latitude/longitude pair for geo queries |
| lat | double | Latitude value |
| lon | double | Longitude value |
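
A radius search on coords could look like the Python sketch below, which uses Solr's standard geofilt query parser and geodist() function. It assumes the coords field is configured as a spatial type that supports them; the function name and example coordinates are hypothetical.

```python
def geo_params(lat, lon, radius_km, field="coords"):
    """Build parameters for 'documents within radius_km of a point, nearest first'."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    point = f"{lat},{lon}"
    return {
        "q": "*:*",
        "fq": f"{{!geofilt sfield={field} pt={point} d={radius_km}}}",
        "sfield": field,          # read by geodist() in the sort clause
        "pt": point,
        "sort": "geodist() asc",  # nearest documents first
    }


params = geo_params(45.15, -93.85, 5)
print(params["fq"])
```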

🧠 Vector Embeddings Field

This is what powers the AI semantic search capability.

| Field | Type | Description |
|---|---|---|
| embeddings | vector (1024 dimensions) | A dense vector representation of the page content, generated by a multilingual AI model. Used for semantic/similarity search via Solr's KNN (K-Nearest Neighbors) algorithm. |

The embeddings are generated by a multilingual transformer model that captures meaning, not just keywords. This means a search for "warm beverage for cold weather" can match a page about "hot chocolate recipes" even though the two share no keywords.

Note: You do not need to worry about embeddings when building a basic keyword search UI. They are used automatically when you use the hybrid search parameters (covered in the Query Parameters article).
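
If you do want to query the embeddings field directly, Solr's knn query parser takes the query vector inline. The Python sketch below only formats such a query. It assumes you already have a 1024-dimension vector produced by the same model the crawler uses, which this reference does not name. The function name is hypothetical, and this is not the hybrid search mechanism mentioned above.

```python
def knn_query(vector, top_k=10, field="embeddings", expected_dims=1024):
    """Format a Solr {!knn} query string for a dense query vector."""
    if len(vector) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(vector)}")
    values = ", ".join(f"{v:.6f}" for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


# Placeholder vector; a real one must come from the embedding model.
placeholder_vector = [0.0] * 1024
query = knn_query(placeholder_vector)
print(query[:60] + "...")
```

Because the vector makes the query long, sending it in a POST body is usually more practical than putting it in the URL.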


Autocomplete and Spellcheck Fields

These fields are used internally by the search engine and are not returned in results, but they power important features:

| Field | Purpose |
|---|---|
| tags / title_tags | Edge-ngram indexed fields for typeahead autocomplete on full phrases |
| tags_ws / title_tags_ws | Edge-ngram indexed fields for word-level autocomplete |
| spell | Spellcheck dictionary field that powers "Did you mean...?" suggestions |

String Copy Fields (uri_s, title_s)

| Field | Type | Description |
|---|---|---|
| uri_s | string | An exact-match copy of the uri field. Used for precise URL filtering with wildcards (e.g., -uri_s:*/taxonomy* to exclude taxonomy pages). |
| title_s | string | An exact-match copy of the title field. |

These _s fields are useful when you need exact string matching or wildcard filtering, as opposed to the full-text analyzed versions.
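
When exclusion filters like the one above come from user or config input, it helps to escape query-syntax characters first. This Python sketch is one possible approach. The helper names are hypothetical, and it deliberately leaves "/" unescaped to match the -uri_s:*/taxonomy* example in this reference.

```python
import re

# Characters with special meaning in Solr query syntax; "/" is left alone on purpose.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\\s]|&&|\|\|)')


def escape_fragment(text):
    """Backslash-escape query-syntax characters in a URL fragment."""
    return _SPECIAL.sub(r"\\\1", text)


def exclude_paths(fragments):
    """Return one negative wildcard filter on uri_s per URL fragment."""
    return [f"-uri_s:*{escape_fragment(fragment)}*" for fragment in fragments]


filters = exclude_paths(["/taxonomy", "/user/login"])
print(filters)
```

Each returned string can be sent as its own fq parameter.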


Dynamic Fields for Custom Data

Your index also supports dynamic fields, which let Solr accept additional data without schema changes:

| Pattern | Type | Use Case |
|---|---|---|
| *_s | string (multi-valued) | Custom string metadata |
| *_ss | string (single-valued) | Custom single string |
| *_i | integer | Custom integer data |
| *_l | long | Custom long integer data |
| *_f | float | Custom float data (e.g., price_f) |
| *_d | double (multi-valued) | Custom double data |
| *_dt | date | Custom date fields |
| *_b | boolean | Custom boolean flags |
| *_t | text (full-text searchable) | Custom text content |

Example: Price Data

If your pages contain product prices, the crawler extracts them into:

  • price_f — The price as a float number (e.g., 29.99)
  • currency_s — The currency code (e.g., USD, EUR)
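
A price-range filter on these two fields might be built as in the Python sketch below. The function name is hypothetical, and the three-letter currency check is an assumption based on the USD and EUR examples above.

```python
def price_filters(min_price, max_price, currency):
    """Return fq filters for a price range in one currency; None means open-ended."""
    if len(currency) != 3 or not currency.isalpha():
        raise ValueError("currency should be a three-letter code such as USD")
    lower = "*" if min_price is None else f"{min_price:.2f}"
    upper = "*" if max_price is None else f"{max_price:.2f}"
    return [f"price_f:[{lower} TO {upper}]", f"currency_s:{currency.upper()}"]


filters = price_filters(10, 50, "usd")
open_ended = price_filters(None, 29.99, "EUR")
print(filters, open_ended)
```

Adding sort=price_f asc to the same request orders the matches from cheapest to most expensive.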

Fields Returned by Default

When you query the Solr API, you control which fields are returned using the fl (field list) parameter. A typical query requests:

fl=id,uri,title,description,text,og_image,meta_icon,content_type,creation_date,timestamp,meta_domain,meta_*,score,price_f,currency_s

The score field is special — it is not stored in the index but is calculated at query time and represents how relevant each result is to your search query.
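
Putting fl to work, the Python sketch below builds a request URL and pulls a few fields out of a JSON response. The base URL is a placeholder, not a real Opensolr endpoint, and the sample response is invented to show the standard Solr response shape. No network call is made.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint; substitute your own index URL.
BASE_URL = "https://example.invalid/solr/my_index/select"


def build_url(q, fields, rows=10):
    """Build a select URL that asks only for the listed fields."""
    query = urlencode({"q": q, "fl": ",".join(fields), "rows": rows, "wt": "json"})
    return f"{BASE_URL}?{query}"


def summarize(response_text):
    """Return (title, uri, score) for each document in a Solr JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [(d.get("title"), d.get("uri"), d.get("score")) for d in docs]


url = build_url("opensolr", ["id", "uri", "title", "score"])

# Invented sample response for illustration.
sample = json.dumps({
    "response": {
        "numFound": 1,
        "docs": [{"id": "abc", "uri": "https://yoursite.com/about",
                  "title": "About", "score": 1.5}],
    }
})
results = summarize(sample)
print(url)
print(results)
```

Fields that are missing from a document come back as None here, which is convenient for optional fields such as og_image or price_f.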