How Text Analysis Works: Stemming, Synonyms & Accent Folding
How Text Analysis Works
Knowing how your Opensolr Web Crawler index analyzes text explains why certain searches match certain results, and shows how to take full advantage of it.
When a page is crawled and its text is indexed, and when a user types a search query, the text goes through an analysis pipeline — a series of processing steps that normalize and transform the text to maximize search quality.
The Analysis Pipeline
Every text field (title, description, text, uri, author) goes through these steps, in order:
1. HTML Stripping
Any remaining HTML tags are stripped out. You never accidentally search for <div> or <span>.
2. Accent Folding (ISO Latin)
Accented characters are normalized to their ASCII equivalents:
| Input | Becomes |
|---|---|
| café | cafe |
| naïve | naive |
| résumé | resume |
| München | Munchen |
| São Paulo | Sao Paulo |
This means a search for cafe will find pages containing café, and vice versa. You do not need to worry about typing accents.
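The folding in the table can be approximated outside Solr. The sketch below is an illustration only, not the filter the index actually runs: it uses Python's standard `unicodedata` module to decompose each character and drop the combining accent marks. The function name `fold_accents` is made up for this example.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["café", "naïve", "résumé", "München", "São Paulo"]:
    print(word, "->", fold_accents(word))
```

Characters that have no Unicode decomposition (for example ø) pass through this sketch unchanged, which is one reason a dedicated folding table is still useful.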
3. ICU Tokenization
Text is split into individual tokens (words) using the Unicode-aware ICU tokenizer. This handles:
- Standard word boundaries in Western languages
- CJK (Chinese, Japanese, Korean) character segmentation
- Hyphenated words, numbers, email addresses, URLs
4. CJK Width Normalization
Full-width Latin letters and digits are folded to their standard ASCII forms, and half-width Katakana is folded to the standard kana forms, ensuring consistent matching for CJK content.
5. English Possessive Removal
Removes the possessive 's from English words:
| Input | Becomes |
|---|---|
| John's | John |
| company's | company |
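A rough equivalent of this step, assuming tokens have already been split, is a single regular expression that drops a trailing 's. The helper name `strip_possessive` is hypothetical, and the sketch only handles the straight and curly apostrophe.

```python
import re

# Matches a trailing 's written with a straight (') or curly (\u2019) apostrophe.
_POSSESSIVE = re.compile(r"['\u2019]s$", re.IGNORECASE)

def strip_possessive(token: str) -> str:
    """Remove a trailing possessive 's from a single token."""
    return _POSSESSIVE.sub("", token)

for token in ["John's", "company's", "boss"]:
    print(token, "->", strip_possessive(token))
```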
6. Lowercasing
Everything is converted to lowercase. Searches are case-insensitive — OpenSolr, opensolr, and OPENSOLR all match the same results.
7. ASCII Folding
A second pass of character normalization that catches any remaining special characters:
| Input | Becomes |
|---|---|
| ü | u |
| ñ | n |
| ø | o |
8. Stop Word Removal
Common words that add no search value are removed:
a, an, the, is, are, was, were, be, been, being, have, has, had, do, does, did, will, would, could, should, may, might, shall, can, need, dare, ought, used, to, of, in, for, on, with, at, by, from, as, into, through, during, before, after, above, below, between, out, off, over, under, again, further, then, once, ...
This means searching for "the best restaurants in New York" effectively searches for "best restaurants New York": the filler words are ignored.
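As a minimal sketch of the idea, the snippet below filters a whitespace-split query against a small subset of the stop-word list. The set `STOP_WORDS` and the function `remove_stop_words` are illustrative names, and the real list used by the index is longer.

```python
# A small subset of the stop-word list shown above; the real list is longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "for", "on", "with"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not stop words (case-insensitive check)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

query = "the best restaurants in New York"
print(remove_stop_words(query.split()))
# ['best', 'restaurants', 'New', 'York']
```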
9. Word Delimiter Processing
Compound words, camelCase, and mixed formats are split intelligently:
| Input | Produces |
|---|---|
| Wi-Fi | Wi, Fi, WiFi (split + concatenated) |
| camelCase | camel, Case, camelCase |
| iPhone13 | iPhone, 13, iPhone13 |
| 3.5-inch | 3, 5, inch, 35, 3.5inch |
The original form is also preserved, so exact matches still work.
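The split-plus-concatenate behavior can be sketched for the two simplest rows of the table (Wi-Fi and camelCase). This is a deliberately simplified illustration, not the filter's actual logic: it handles ASCII only, splits on non-alphanumeric characters and lowercase-to-uppercase transitions, and does not attempt the letter/digit handling shown for iPhone13 or 3.5-inch. The function name `split_word` is made up for the example.

```python
import re

def split_word(token: str) -> list[str]:
    """Split a token into parts, then add the concatenated and original forms."""
    # Mark a break between a lowercase letter and a following uppercase letter.
    marked = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    # Split on any run of non-alphanumeric characters (hyphens, spaces, ...).
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", marked) if p]
    out = list(parts)
    # Add the concatenated form and the original token if not already present.
    for extra in ("".join(parts), token):
        if extra not in out:
            out.append(extra)
    return out

print(split_word("Wi-Fi"))      # ['Wi', 'Fi', 'WiFi', 'Wi-Fi']
print(split_word("camelCase"))  # ['camel', 'Case', 'camelCase']
```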
10. Snowball Stemming (English)
Words are reduced to their root form using the Snowball stemmer:
| Input | Stem |
|---|---|
| running | run |
| runs | run |
| runner | runner |
| swimming | swim |
| swam | swam |
| better | better |
| organization | organ |
| organizations | organ |
| configured | configur |
| configuring | configur |
This means a search for running will also match pages containing run, runs, etc. You get broader recall without having to think about word forms.
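To see why some word forms match and others do not, the sketch below compares stems using a lookup table copied from the rows above. A real Snowball stemmer computes stems algorithmically; the dictionary `STEMS` and the helpers `stem` and `same_stem` are stand-ins for illustration.

```python
# Stems copied from the table above; a real Snowball stemmer computes these
# algorithmically instead of looking them up.
STEMS = {
    "running": "run", "runs": "run", "runner": "runner",
    "swimming": "swim", "swam": "swam", "better": "better",
    "organization": "organ", "organizations": "organ",
    "configured": "configur", "configuring": "configur",
}

def stem(word: str) -> str:
    """Look up the stem; words missing from the table are only lowercased."""
    w = word.lower()
    return STEMS.get(w, w)

def same_stem(a: str, b: str) -> bool:
    """Two words can match each other in the index when their stems are equal."""
    return stem(a) == stem(b)

print(same_stem("running", "runs"))    # True
print(same_stem("running", "runner"))  # False
print(same_stem("swimming", "swam"))   # False
```

The last two lines show the limits of stemming: "runner" and irregular forms such as "swam" keep their own stems, so they do not match "run" or "swim".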
11. Synonym Expansion (Query Time Only)
When you search, your query terms are expanded using a synonym dictionary. For example:
- Searching laptop might also match notebook
- Searching car might also match automobile
Synonyms are only applied at query time (not during indexing), so the index stays clean while searches get broader.
Note: You can customize the synonyms dictionary for your index via the Config Files section in your Opensolr Control Panel.
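Query-time expansion can be pictured as a dictionary lookup applied to each search term while the indexed text stays untouched. The sketch below assumes a tiny one-directional dictionary built from the two examples above; `SYNONYMS` and `expand_query` are hypothetical names, not part of Solr or Opensolr.

```python
# Hypothetical synonym dictionary built from the two examples above.
SYNONYMS = {
    "laptop": ["notebook"],
    "car": ["automobile"],
}

def expand_query(terms: list[str]) -> list[list[str]]:
    """Return, for each query term, the term followed by its synonyms."""
    return [[t] + SYNONYMS.get(t.lower(), []) for t in terms]

print(expand_query(["cheap", "laptop"]))
# [['cheap'], ['laptop', 'notebook']]
```

Each inner list represents the alternatives accepted for one position in the query.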
What This Means for You
When building your search UI, you get all of this for free. You do not need to:
- Worry about uppercase vs. lowercase
- Handle accented characters specially
- Stem words yourself
- Remove stop words from queries
- Split compound words
Just send the raw search query to Solr and the analysis pipeline handles the rest.
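In practice this means the client only has to URL-encode the user's text. The sketch below builds a request URL for Solr's `/select` handler with the standard `q` and `rows` parameters. The host, index name, and the helper `build_select_url` are assumptions for illustration, so substitute the endpoint shown in your own Opensolr Control Panel.

```python
from urllib.parse import urlencode

def build_select_url(base_url: str, user_query: str, rows: int = 10) -> str:
    """Build a Solr /select URL, passing the user's text through untouched."""
    params = urlencode({"q": user_query, "rows": rows})
    return f"{base_url.rstrip('/')}/select?{params}"

# Hypothetical host and index name; substitute your own.
url = build_select_url("https://example.com/solr/my_index", "Café's Wi-Fi setup")
print(url)
```

No lowercasing, accent stripping, or stemming happens on the client. Depending on the query parser in use, characters that have special meaning in Solr query syntax may still need escaping.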
Practical Examples
| User Searches For | Also Matches |
|---|---|
| running shoes | "run shoe", "runs shoes", "Running Shoes" |
| café menu | "cafe menu", "Café Menu", "CAFE MENU" |
| Wi-Fi setup | "wifi setup", "WiFi Setup", "wi fi setup" |
| John's blog | "john blog", "Johns blog" |
| São Paulo restaurants | "sao paulo restaurants" |
The Spellcheck Field
Separately from the main analysis pipeline, the content is also indexed in a spellcheck field (spell) with minimal processing — just tokenization and lowercasing. This builds the dictionary that powers the "Did you mean...?" suggestions when a user misspells a word.
The Autocomplete Fields
The tags and title_tags fields use edge n-gram analysis, which creates prefix tokens:
| Input | Indexed As |
|---|---|
| opensolr | o, op, ope, open, opens, openso, opensol, opensolr |
This is what powers typeahead / autocomplete suggestions — as the user types each character, results are found instantly because the prefixes are pre-indexed.
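Edge n-gram generation is simple enough to sketch in a few lines. The function below returns every prefix of a token; the name `edge_ngrams` and the `min_len` parameter are assumptions for this example, and the minimum and maximum gram sizes configured in the actual index may differ.

```python
def edge_ngrams(token: str, min_len: int = 1) -> list[str]:
    """Return every prefix of token, from min_len characters up to the full token."""
    return [token[:n] for n in range(min_len, len(token) + 1)]

print(edge_ngrams("opensolr"))
# ['o', 'op', 'ope', 'open', 'opens', 'openso', 'opensol', 'opensolr']
```

Because these prefixes are stored at index time, a partially typed query such as "opens" is matched with an ordinary term lookup rather than a slower wildcard search.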