Document Contains Immense Term — Field Value Exceeds Maximum Length
The Error
You're trying to index a document and Solr throws this at you:
```
java.lang.IllegalArgumentException: Document contains at least one immense term in field="your_field_name" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.
```
This means a single term (token) in one of your fields exceeds 32,766 bytes — the hard limit imposed by Lucene's inverted index format. Lucene literally cannot store a term that large, so the document gets rejected.
What's Actually Happening
When Solr indexes text, it breaks it into tokens (terms) using your analyzer chain. Each token becomes an entry in the inverted index, and Lucene caps each entry at 32,766 bytes of UTF-8.
The problem is: if your field type does not tokenize the input (or barely tokenizes it), the entire field value is treated as a single giant term.
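A quick way to see whether a value is at risk: measure its UTF-8 byte length, since that is what Lucene compares against the cap. A minimal Python sketch — the constant mirrors Lucene's limit, and the sample values are hypothetical:

```python
# Pre-flight check for values headed to a non-tokenized ("string") field.
MAX_TERM_BYTES = 32766  # Lucene's per-term cap, measured in UTF-8 bytes

def fits_as_single_term(value: str) -> bool:
    """True if the whole value could be stored as one term in the inverted index."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES

print(fits_as_single_term("draft"))      # → True  (short status code)
print(fits_as_single_term("é" * 20000))  # → False (20,000 chars but 40,000 UTF-8 bytes)
```

Note that the limit is bytes, not characters — 20,000 accented characters already encode to 40,000 bytes.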
Common Causes
1. Using a string Field Type for Long Text
This is the number one cause. The string field type (solr.StrField) stores the entire value as a single token — no analysis, no tokenization. It's designed for short, exact-match values like IDs, tags, or status codes.
If you accidentally assign a string type to a field that receives full HTML pages, article bodies, or concatenated text, you'll hit the limit fast.
```xml
<!-- ❌ This will break on large content -->
<field name="sm_aggregated_field" type="string" indexed="true" stored="true"/>
```
2. A KeywordTokenizer with No Further Processing
The KeywordTokenizer treats the entire input as one token — same problem as string, just wrapped in a field type definition.
```xml
<!-- ❌ Still one giant token -->
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```
3. Raw HTML or Encoded Data in the Field
Even with a proper tokenizer, if your application is sending raw HTML, Base64-encoded blobs, or serialized objects into a text field, you can end up with enormous single tokens — especially from long URLs in src or href attributes, inline CSS/JS, or data URIs.
The byte-array prefix that the full error message includes ("The prefix of the first immense term is: [...]") is a clue. Decoding those bytes:
```
[60, 112, 62, 60, 105, 109, 103, ...] → "<p><img alt="" src="image/png;..."
```
That's HTML with an embedded image — a classic sign of raw HTML being pushed into a field that can't handle it.
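If you're not sure which field or token is the culprit, you can scan a document before sending it to Solr. A rough Python sketch — splitting on whitespace only approximates a real analyzer chain, but it's enough to surface data URIs, Base64 blobs, and giant URLs (the field names and values here are made up):

```python
MAX_TERM_BYTES = 32766  # Lucene's per-term cap, in UTF-8 bytes

def oversized_tokens(doc: dict, limit: int = MAX_TERM_BYTES):
    """Yield (field, byte_length, preview) for any whitespace-delimited token over the limit."""
    for field, value in doc.items():
        for token in value.split():
            n = len(token.encode("utf-8"))
            if n > limit:
                yield field, n, token[:40]

# Hypothetical document with an embedded data URI in the body
doc = {
    "title": "A normal title",
    "body": '<p><img src="data:image/png;base64,' + "A" * 50000 + '"></p>',
}
for field, size, preview in oversized_tokens(doc):
    print(field, size, preview)
```

Running this against your failing documents tells you which field to fix and what kind of data is inflating it.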
4. Aggregated / Concatenated Fields
Some applications (like Drupal's Search API) create aggregated fields that combine multiple source fields into one. If the combined content is huge and the field type doesn't tokenize, you get the immense term error.
How to Fix It
Solution 1: Change the Field Type to a Tokenized Type (Recommended)
The most straightforward fix. Switch your field from string to a text-based type that has a proper tokenizer:
```xml
<!-- ✅ Standard tokenized text field -->
<field name="sm_aggregated_field" type="text_general" indexed="true" stored="true"/>
```
Or if you need n-gram partial matching:
```xml
<!-- ✅ N-gram tokenized field -->
<field name="sm_aggregated_field" type="text_ngram" indexed="true" stored="true"/>
```
Common tokenized field types available in most Solr schemas:
| Field Type | Tokenizer | Best For |
|---|---|---|
| text_general | StandardTokenizer | General full-text search |
| text_en | StandardTokenizer + stemming | English language content |
| text_ws | WhitespaceTokenizer | Whitespace-delimited text |
| text_ngram | NGramTokenizer | Partial / substring matching |
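To see why a tokenized type fixes the problem, a rough sketch: a simple word-pattern split (a crude stand-in for StandardTokenizer, not its real rules) turns one huge value into thousands of tiny terms, each far below the cap:

```python
import re

def rough_tokens(value: str):
    """Crude word tokenizer — a stand-in for a real analyzer chain."""
    return re.findall(r"\w+", value)

# ~60 KB of text: fatal as one string-field term, harmless once tokenized
huge_value = "lorem ipsum " * 5000
tokens = rough_tokens(huge_value)
print(len(tokens))                                   # → 10000 small terms
print(max(len(t.encode("utf-8")) for t in tokens))   # → 5 bytes, nowhere near 32,766
```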
Solution 2: Strip HTML Before Indexing
If the field is receiving raw HTML, strip it at the application level before sending it to Solr:
```php
// PHP example
$clean = strip_tags($rawHtml);
$doc["sm_aggregated_field"] = $clean;
```
Or use Solr's built-in HTMLStripCharFilterFactory in your field type:
```xml
<!-- ✅ Strip HTML during analysis -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
Solution 3: Truncate the Field Value
If you don't need the full content indexed (e.g., it's just for display), you can truncate at the application level or use Solr's LengthFilterFactory to drop oversized tokens:
```xml
<filter class="solr.LengthFilterFactory" min="1" max="32000"/>
```
This silently drops any token longer than 32,000 characters — a useful safety net, though note that LengthFilterFactory counts characters while Lucene's limit is measured in bytes, so heavily multi-byte text may need a lower max. Fixing the root cause is always better.
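If you truncate at the application level instead, cut on UTF-8 byte boundaries rather than character counts, since Lucene's limit is in bytes. A small Python sketch — the 32,000-byte budget is an assumption chosen to leave headroom under the 32,766-byte cap:

```python
def truncate_utf8(value: str, max_bytes: int = 32000) -> str:
    """Cut value to at most max_bytes of UTF-8, never splitting a multi-byte character."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial character instead of raising
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For example, `truncate_utf8("é" * 20000)` (40,000 bytes) comes back as 16,000 characters / exactly 32,000 bytes, while short values pass through untouched.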
Solution 4: Use docValues Instead of the Inverted Index
If you need the field only for sorting or faceting (not full-text search), you can skip the inverted index entirely with docValues:
```xml
<field name="sm_aggregated_field" type="string" indexed="false" stored="true" docValues="true"/>
```
Note: this only works if you don't need to search within the field — and sorted docValues enforce a similar per-value byte limit of their own, so truly enormous values may still need truncation.
Drupal / Search API Users
If you're using Drupal with Search API and the field is sm_aggregated_field, this is almost certainly an aggregated fulltext field combining multiple content fields. The fix:
1. In your Opensolr `schema.xml`, find the field definition for `sm_aggregated_field`
2. Change its type from `string` to `text_general` (or another tokenized type)
3. Save and reload your Opensolr Index
4. Re-index your content in Drupal
Quick Checklist
- ☐ Check the field type in your `schema.xml` — is it `string` or a `text_*` type?
- ☐ Check for raw HTML — are you stripping tags before indexing?
- ☐ Check aggregated fields — are multiple fields concatenated into one?
- ☐ Check for binary/encoded data — Base64, data URIs, serialized blobs?
- ☐ After changing the schema — always reload the Opensolr Index and re-index your data
Got a field type question? Reach out to us at support@opensolr.com — we're happy to help you pick the right analyzer chain for your data. 🙌