How Text Analysis Works: Stemming, Synonyms & Accent Folding

Knowing how your Opensolr Web Crawler index analyzes text explains why certain searches match certain results, and shows you how to take full advantage of it.

When a page is crawled and its text is indexed, and again when a user types a search query, the text goes through an analysis pipeline: a series of processing steps that normalize and transform it so that queries and indexed content end up in the same comparable form.

The Analysis Pipeline

Every text field (title, description, text, uri, author) goes through these steps, in order:

1. HTML Stripping

Any remaining HTML tags are stripped out. You never accidentally search for <div> or <span>.
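
As a rough illustration of what tag stripping does, here is a minimal Python sketch built on the standard library's html.parser. It is not the filter Solr runs; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


print(strip_html("<div>Hello <span>world</span></div>"))  # Hello world
```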

2. Accent Folding (ISO Latin)

Accented characters are normalized to their ASCII equivalents:

Input	Becomes
café	cafe
naïve	naive
résumé	resume
München	Munchen
São Paulo	Sao Paulo

This means a search for cafe will find pages containing café, and vice versa. You do not need to worry about typing accents.
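
The effect can be approximated in Python by decomposing each character and dropping the combining marks. This is only a sketch of the idea using the standard unicodedata module, not the filter the index uses; fold_accents is a name chosen for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["café", "naïve", "résumé", "München", "São Paulo"]:
    print(word, "->", fold_accents(word))
```

Because the same folding is applied to both the indexed text and the query, the two sides compare equal whether or not the accent was typed.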

3. ICU Tokenization

Text is split into individual tokens (words) using the Unicode-aware ICU tokenizer. This handles:

Standard word boundaries in Western languages
CJK (Chinese, Japanese, Korean) character segmentation
Hyphenated words, numbers, email addresses, URLs

4. CJK Width Normalization

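Width variants are folded to one standard form: full-width Latin letters and digits (such as Ｓｏｌｒ) become their ordinary ASCII equivalents, and half-width Katakana becomes standard full-width kana. This ensures consistent matching for CJK content.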
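
Width normalization can be approximated in Python with Unicode NFKC normalization. Note the assumption: NFKC folds more than just width variants, so this is a stand-in for illustration rather than an exact copy of the index's behavior.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold width variants (and other compatibility forms) with NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("Ｓｏｌｒ１２３"))  # full-width Latin and digits -> Solr123
print(normalize_width("ｶﾀｶﾅ"))          # half-width Katakana -> カタカナ
```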

5. English Possessive Removal

Removes the possessive 's from English words:

Input	Becomes
John's	John
company's	company
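
A minimal sketch of this step in Python, assuming it only needs to handle a trailing 's written with a straight or curly apostrophe (strip_possessive is a hypothetical helper name):

```python
import re

# Trailing 's or 'S, with a straight (') or curly (\u2019) apostrophe.
_POSSESSIVE = re.compile(r"['\u2019]s$", re.IGNORECASE)


def strip_possessive(token: str) -> str:
    """Remove a trailing possessive 's from a single token."""
    return _POSSESSIVE.sub("", token)


print(strip_possessive("John's"))     # John
print(strip_possessive("company's"))  # company
print(strip_possessive("dogs"))       # dogs (plain plural is left alone)
```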

6. Lowercasing

Everything is converted to lowercase. Searches are case-insensitive — OpenSolr, opensolr, and OPENSOLR all match the same results.

7. ASCII Folding

A second pass of character normalization that catches any remaining special characters:

Input	Becomes
ü	u
ñ	n
ø	o

8. Stop Word Removal

Common words that add no search value are removed:

a, an, the, is, are, was, were, be, been, being, have, has, had, do, does, did, will, would, could, should, may, might, shall, can, need, dare, ought, used, to, of, in, for, on, with, at, by, from, as, into, through, during, before, after, above, below, between, out, off, over, under, again, further, then, once, ...

This means searching for the best restaurants in New York effectively searches for best restaurants New York — the filler words are ignored.
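
The filtering itself is simple to picture. The sketch below uses a small subset of the list above as its stop set; the real list is longer, and the function name is made up for illustration.

```python
# A small subset of the stop word list shown above.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "for", "on", "with", "at", "by", "from"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


query = "the best restaurants in New York"
print(remove_stop_words(query.split()))  # ['best', 'restaurants', 'New', 'York']
```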

9. Word Delimiter Processing

Compound words, camelCase, and mixed formats are split intelligently:

Input	Produces
`Wi-Fi`	`Wi`, `Fi`, `WiFi` (split + concatenated)
`camelCase`	`camel`, `Case`, `camelCase`
`iPhone13`	`iPhone`, `13`, `iPhone13`
`3.5-inch`	`3`, `5`, `inch`, `35`, `3.5inch`

The original form is also preserved, so exact matches still work.
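
To make the splitting rules concrete, here is a naive Python sketch that breaks a token on punctuation, on lower-to-upper case changes, and on letter/digit boundaries, then adds the concatenated and original forms. It assumes ASCII input and does not reproduce every row of the table above (for instance, it would split iPhone into i and Phone); the actual output depends on the filter options configured for the index, which are not shown here.

```python
import re


def split_word_parts(token: str) -> list[str]:
    """Split on delimiters, then on case changes and letter/digit boundaries."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Capitalized or lowercase word | run of capitals | run of digits
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", chunk))
    return parts


def delimiter_tokens(token: str) -> list[str]:
    """Return the parts, the parts joined together, and the original token."""
    parts = split_word_parts(token)
    tokens = list(parts)
    joined = "".join(parts)
    if len(parts) > 1 and joined not in tokens:
        tokens.append(joined)
    if token not in tokens:
        tokens.append(token)
    return tokens


print(delimiter_tokens("Wi-Fi"))      # ['Wi', 'Fi', 'WiFi', 'Wi-Fi']
print(delimiter_tokens("camelCase"))  # ['camel', 'Case', 'camelCase']
```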

10. Snowball Stemming (English)

Words are reduced to their root form using the Snowball stemmer:

Input	Stem
running	run
runs	run
runner	runner
swimming	swim
swam	swam
better	better
organization	organ
organizations	organ
configured	configur
configuring	configur

This means a search for running will also match pages containing run, runs, etc. You get broader recall without having to think about word forms.
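
The matching logic behind this can be sketched in a few lines. The example below does not implement the Snowball algorithm; it uses a hardcoded lookup copied from the table above (plus the trivial entry for run) purely to show that two words match when their stems are equal.

```python
# Stems copied from the table above; a real index computes them with the stemmer.
STEMS = {
    "run": "run",
    "running": "run",
    "runs": "run",
    "runner": "runner",
    "organization": "organ",
    "organizations": "organ",
    "configured": "configur",
    "configuring": "configur",
}


def stem(word: str) -> str:
    """Look up the stem; unknown words fall back to their lowercase form."""
    lowered = word.lower()
    return STEMS.get(lowered, lowered)


def matches(query_word: str, document_word: str) -> bool:
    """Two words match when they reduce to the same stem."""
    return stem(query_word) == stem(document_word)


print(matches("running", "runs"))    # True
print(matches("running", "runner"))  # False: 'runner' keeps its own stem
```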

11. Synonym Expansion (Query Time Only)

When you search, your query terms are expanded using a synonym dictionary. For example:

Searching laptop might also match notebook
Searching car might also match automobile

Synonyms are only applied at query time (not during indexing), so the index stays clean while searches get broader.

Note: You can customize the synonyms dictionary for your index via the Config Files section in your Opensolr Control Panel.
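
Query-time expansion can be pictured as replacing each query term with a group of alternatives. The following is a minimal sketch with a hypothetical two-entry dictionary taken from the examples above; the real dictionary lives in the index configuration and may define synonyms in both directions.

```python
# Hypothetical synonym dictionary mirroring the two examples above.
SYNONYMS = {
    "laptop": ["notebook"],
    "car": ["automobile"],
}


def expand_query(terms):
    """Turn each query term into a list of alternatives (the term plus its synonyms)."""
    return [[term] + SYNONYMS.get(term, []) for term in terms]


print(expand_query(["cheap", "laptop"]))  # [['cheap'], ['laptop', 'notebook']]
```

Because only the query is expanded, changing the dictionary takes effect on the next search without reindexing any content.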

What This Means for You

When building your search UI, you get all of this for free. You do not need to:

Worry about uppercase vs. lowercase
Handle accented characters specially
Stem words yourself
Remove stop words from queries
Split compound words

Just send the raw search query to Solr and the analysis pipeline handles the rest.
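
In practice, that means your client code only has to URL-encode the user's text and pass it along. The sketch below builds a request URL for Solr's standard select handler; the host name and index name are placeholders, and the query parser and default search field that apply depend on your index configuration.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute your own host and index name.
SOLR_SELECT_URL = "https://example.com/solr/my_index/select"


def build_search_url(user_query: str, rows: int = 10) -> str:
    """Build a select URL that passes the user's text through unchanged."""
    params = {"q": user_query, "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_search_url("café menu"))
```

No lowercasing, accent handling, or stemming happens on the client; urlencode only makes the text safe to put in a URL.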

Practical Examples

User Searches For	Also Matches
`running shoes`	"run shoe", "runs shoes", "Running Shoes"
`café menu`	"cafe menu", "Café Menu", "CAFE MENU"
`Wi-Fi setup`	"wifi setup", "WiFi Setup", "wi fi setup"
`John's blog`	"john blog", "Johns blog"
`São Paulo restaurants`	"sao paulo restaurants"

The Spellcheck Field

Separately from the main analysis pipeline, the content is also indexed in a spellcheck field (spell) with minimal processing — just tokenization and lowercasing. This builds the dictionary that powers the "Did you mean...?" suggestions when a user misspells a word.
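
To illustrate the idea, the sketch below builds a dictionary from lightly processed text (tokenized and lowercased only) and suggests the closest entry for a misspelled word. It uses Python's difflib similarity matching as a stand-in; Solr's spellcheck component uses its own distance measures, and the function names here are invented for the example.

```python
import difflib
import re


def build_dictionary(text: str) -> list[str]:
    """Tokenize and lowercase only, mirroring the minimal spell field processing."""
    return sorted(set(re.findall(r"\w+", text.lower())))


def did_you_mean(word: str, dictionary: list[str]):
    """Return the closest dictionary word, or None if nothing is similar enough."""
    candidates = difflib.get_close_matches(word.lower(), dictionary, n=1)
    return candidates[0] if candidates else None


dictionary = build_dictionary("Opensolr makes search simple")
print(did_you_mean("serach", dictionary))  # search
```

Keeping this field unstemmed matters: suggestions should be real words such as "organization", not stems such as "organ".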

The Autocomplete Fields

The tags and title_tags fields use edge n-gram analysis, which creates prefix tokens:

Input	Indexed As
`opensolr`	`o`, `op`, `ope`, `open`, `opens`, `openso`, `opensol`, `opensolr`

This is what powers typeahead / autocomplete suggestions — as the user types each character, results are found instantly because the prefixes are pre-indexed.
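
Generating the prefix tokens is straightforward, as this short Python sketch shows. The minimum and maximum gram lengths are assumptions for illustration; the actual limits are set in the index schema.

```python
def edge_ngrams(token: str, min_len: int = 1, max_len=None) -> list[str]:
    """Return every prefix of `token` from min_len up to max_len characters."""
    upper = len(token) if max_len is None else min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]


prefixes = edge_ngrams("opensolr")
print(prefixes)
# ['o', 'op', 'ope', 'open', 'opens', 'openso', 'opensol', 'opensolr']

# A partially typed query is an exact match against one of the stored prefixes.
print("ope" in prefixes)  # True
```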