Document Contains Immense Term — Field Value Exceeds Maximum Length


The Error

You're trying to index a document and Solr throws this at you:

java.lang.IllegalArgumentException: Document contains at least one
immense term in field="your_field_name" (whose UTF8 encoding is
longer than the max length 32766), all of which were skipped.
Please correct the analyzer to not produce such terms.

This means a single term (token) in one of your fields exceeds 32,766 bytes — the hard limit imposed by Lucene's inverted index format. Lucene literally cannot store a term that large, so the document gets rejected.
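The key detail is that the limit is measured in UTF-8 bytes, not characters. A minimal, hypothetical pre-flight check in Python (the function name is illustrative, not part of any Solr API):

```python
# Lucene's hard cap on a single indexed term, in UTF-8 bytes.
MAX_TERM_BYTES = 32766

def would_exceed_limit(value: str) -> bool:
    """Return True if this value, indexed as ONE term
    (e.g. by a string/StrField type), would be rejected."""
    return len(value.encode("utf-8")) > MAX_TERM_BYTES

# A 40,000-character ASCII blob is 40,000 bytes: over the limit.
print(would_exceed_limit("x" * 40000))    # True
# Multi-byte text hits the limit sooner: 'é' is 2 bytes in UTF-8.
print(would_exceed_limit("é" * 20000))    # True (40,000 bytes)
print(would_exceed_limit("short value"))  # False
```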


What's Actually Happening

When Solr indexes text, it breaks it into tokens (terms) using your analyzer chain. Each token becomes an entry in the inverted index, and Lucene caps each entry at 32,766 bytes of UTF-8.

The problem is: if your field type does not tokenize the input (or barely tokenizes it), the entire field value is treated as a single giant term.

Without a tokenizer (string / keyword types), the entire field value becomes one giant term, which can exceed 32,766 bytes. With a tokenizer (text_general / text_en / ngram), the value is split into many small tokens, each well within the limit. 💡 The fix: ensure your field type has a tokenizer in its analyzer chain so text gets split into manageable pieces before indexing.
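The contrast can be sketched in a few lines of Python. This uses naive whitespace splitting, not Solr's actual analyzer, so treat it only as an illustration:

```python
MAX_TERM_BYTES = 32766
doc = ("lorem ipsum " * 5000).strip()  # ~60,000 bytes of text

# ❌ string / KeywordTokenizer behaviour: the whole value is one term.
one_term = [doc]
print(max(len(t.encode("utf-8")) for t in one_term))  # 59999 -> rejected

# ✅ tokenized behaviour: each token is tiny.
tokens = doc.split()
print(max(len(t.encode("utf-8")) for t in tokens))    # 5 -> fine
```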

Common Causes

1. Using a string Field Type for Long Text

This is the number one cause. The string field type (solr.StrField) stores the entire value as a single token — no analysis, no tokenization. It's designed for short, exact-match values like IDs, tags, or status codes.

If you accidentally assign a string type to a field that receives full HTML pages, article bodies, or concatenated text, you'll hit the limit fast.

<!-- ❌ This will break on large content -->
<field name="sm_aggregated_field" type="string" indexed="true" stored="true"/>

2. A KeywordTokenizer with No Further Processing

The KeywordTokenizer treats the entire input as one token — same problem as string, just wrapped in a field type definition.

<!-- ❌ Still one giant token -->
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

3. Raw HTML or Encoded Data in the Field

Even with a proper tokenizer, if your application is sending raw HTML, Base64-encoded blobs, or serialized objects into a text field, you can end up with enormous single tokens — especially from long URLs in src or href attributes, inline CSS/JS, or data URIs.

The full Lucene error also reports the prefix of the first immense term as a list of byte values, and that prefix is a clue. Decoding those bytes:

[60, 112, 62, 60, 105, 109, 103, ...] → "<p><img alt="" src="image/png;..."

That's HTML with an embedded image — a classic sign of raw HTML being pushed into a field that can't handle it.
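You can decode the reported byte prefix yourself; a minimal Python sketch, using the byte values shown above:

```python
# The first bytes reported in the error message, as a list of ints.
prefix = [60, 112, 62, 60, 105, 109, 103]

# Convert to raw bytes, then decode as UTF-8.
decoded = bytes(prefix).decode("utf-8")
print(decoded)  # <p><img
```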

4. Aggregated / Concatenated Fields

Some applications (like Drupal's Search API) create aggregated fields that combine multiple source fields into one. If the combined content is huge and the field type doesn't tokenize, you get the immense term error.


How to Fix It

Solution 1: Change the Field Type to a Tokenized Type (Recommended)

The most straightforward fix. Switch your field from string to a text-based type that has a proper tokenizer:

<!-- ✅ Standard tokenized text field -->
<field name="sm_aggregated_field" type="text_general" indexed="true" stored="true"/>

Or if you need n-gram partial matching:

<!-- ✅ N-gram tokenized field -->
<field name="sm_aggregated_field" type="text_ngram" indexed="true" stored="true"/>

Common tokenized field types available in most Solr schemas:

Field Type     Tokenizer                      Best For
text_general   StandardTokenizer              General full-text search
text_en        StandardTokenizer + stemming   English language content
text_ws        WhitespaceTokenizer            Whitespace-delimited text
text_ngram     NGramTokenizer                 Partial / substring matching

How text gets tokenized: raw input (e.g. "The quick brown fox") passes through CharFilters (such as HTMLStripCharFilter), then the Tokenizer (such as StandardTokenizer), then TokenFilters (LowerCase, StopWords, ...), producing the output tokens "quick", "brown", "fox". CharFilters clean the input and TokenFilters refine the tokens, but the Tokenizer does the actual splitting. It is the critical piece: without it, nothing gets split.

Solution 2: Strip HTML Before Indexing

If the field is receiving raw HTML, strip it at the application level before sending it to Solr:

// PHP example
$clean = strip_tags($rawHtml);
$doc["sm_aggregated_field"] = $clean;
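If your indexing pipeline is in Python rather than PHP, a rough stdlib-only equivalent can be built on html.parser (a real pipeline might prefer a proper HTML sanitizer library; this sketch only extracts text content):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text content, dropping tags and attributes
    (including huge src/href values and data: URIs)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Join the text chunks and collapse runs of whitespace.
        return " ".join(" ".join(self.chunks).split())

raw = '<p><img alt="" src="data:image/png;base64,AAAA"/>Hello <b>world</b></p>'
stripper = TagStripper()
stripper.feed(raw)
print(stripper.text())  # Hello world
```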

Or use Solr's built-in HTMLStripCharFilterFactory in your field type:

<!-- ✅ Strip HTML during analysis -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Solution 3: Truncate the Field Value

If you don't need the full content indexed (e.g., it's just for display), you can truncate at the application level or use Solr's LengthFilterFactory to drop oversized tokens:

<filter class="solr.LengthFilterFactory" min="1" max="32000"/>

This silently drops any token longer than 32,000 characters, a safety net, though fixing the root cause is always better. Note that LengthFilterFactory counts characters while Lucene's limit is in UTF-8 bytes, so for heavily multi-byte text you may want a lower max.
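If you truncate at the application level instead, cut on UTF-8 byte length rather than character length, so multi-byte text cannot slip past the limit. A hedged Python sketch (the helper name is ours, not a Solr API):

```python
MAX_TERM_BYTES = 32766

def truncate_utf8(value: str, limit: int = MAX_TERM_BYTES) -> str:
    """Truncate to at most `limit` UTF-8 bytes without splitting
    a multi-byte character in half."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops any partial character at the cut point.
    return encoded[:limit].decode("utf-8", errors="ignore")

big = "é" * 20000                  # 40,000 bytes of 2-byte characters
small = truncate_utf8(big)
print(len(small.encode("utf-8")))  # 32766
```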

Solution 4: Use docValues Instead of the Inverted Index

If you need the field for sorting or faceting (not full-text search), docValues doesn't have the 32,766-byte term limit:

<field name="sm_aggregated_field" type="string" indexed="false"
       stored="true" docValues="true"/>

Note: this only works if you don't need to search within the field. Also verify the behavior on your Solr version, as sorted docValues on string fields may still enforce a similar per-value byte limit.


Drupal / Search API Users

If you're using Drupal with Search API and the field is sm_aggregated_field, this is almost certainly an aggregated fulltext field combining multiple content fields. The fix:

  1. In your Opensolr schema.xml, find the field definition for sm_aggregated_field
  2. Change its type from string to text_general (or another tokenized type)
  3. Save and reload your Opensolr Index
  4. Re-index your content in Drupal

Quick Checklist

  • Check the field type in your schema.xml — is it string or text_*?
  • Check for raw HTML — are you stripping tags before indexing?
  • Check aggregated fields — are multiple fields concatenated into one?
  • Check for binary/encoded data — Base64, data URIs, serialized blobs?
  • After changing schema — always reload the Opensolr Index and re-index your data

Got a field type question? Reach out to us at support@opensolr.com — we're happy to help you pick the right analyzer chain for your data. 🙌