How Text Analysis Works: Stemming, Synonyms & Accent Folding

Knowing how your Opensolr Web Crawler index analyzes text explains why certain searches match certain results, and how you can take full advantage of it.

Both the text of a crawled page (at index time) and the query a user types (at search time) go through an analysis pipeline: a series of processing steps that normalize and transform the text so that the two sides can be matched reliably.


The Analysis Pipeline

Every text field (title, description, text, uri, author) goes through these steps, in order (step 11 applies to queries only):

1. HTML Stripping

Any remaining HTML tags are stripped out. You never accidentally search for <div> or <span>.
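
To see what tag stripping does to a fragment of markup, here is a minimal Python sketch built on the standard library's html.parser. It is only an illustration, not the filter Solr itself uses, and the names _TextExtractor and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_html("<div>Hello <span>world</span></div>"))  # Hello world
```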

2. Accent Folding (ISO Latin)

Accented characters are normalized to their ASCII equivalents:

Input        Becomes
café         cafe
naïve        naive
résumé       resume
München      Munchen
São Paulo    Sao Paulo

This means a search for cafe will find pages containing café, and vice versa. You do not need to worry about typing accents.
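
If you want to reproduce this folding outside Solr (for example, to normalize text in a test), the following Python sketch approximates it. It assumes that Unicode decomposition is an acceptable stand-in for Solr's own character mapping, and the function name fold_accents is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Strip accent marks by decomposing characters and dropping the marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers combining marks such as the acute accent or tilde.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for word in ["café", "naïve", "résumé", "München", "São Paulo"]:
    print(word, "->", fold_accents(word))
```

Each input from the table above comes out as its unaccented form, for example café -> cafe.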

3. ICU Tokenization

Text is split into individual tokens (words) using the Unicode-aware ICU tokenizer. This handles:

  • Standard word boundaries in Western languages
  • CJK (Chinese, Japanese, Korean) character segmentation
  • Hyphenated words, numbers, email addresses, URLs

4. CJK Width Normalization

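Width variants are normalized: full-width Latin letters and digits (such as Ｓｏｌｒ) are folded to their standard ASCII forms, and half-width katakana to the standard kana, so CJK content matches consistently regardless of which width variant was typed.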
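
A rough way to observe width normalization in Python is Unicode NFKC normalization. This is an approximation chosen for illustration: NFKC changes more than width variants (it also rewrites ligatures and other compatibility characters), so it is not an exact equivalent of the filter in the index. The function name normalize_width is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold width variants to their standard forms using NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("Ｓｏｌｒ１２３"))  # full-width Latin letters and digits
print(normalize_width("ｶﾀｶﾅ"))          # half-width katakana
```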

5. English Possessive Removal

The possessive 's is removed from English words:

Input        Becomes
John's       John
company's    company
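
The same transformation can be sketched in a few lines of Python. This is a simplified illustration with a hypothetical function name (strip_possessive); it handles a trailing 's written with either a straight or a curly apostrophe.

```python
import re

# Trailing 's with a straight (') or curly (\u2019) apostrophe.
_POSSESSIVE = re.compile(r"['\u2019]s$", re.IGNORECASE)


def strip_possessive(token):
    """Remove a trailing possessive 's from a single token."""
    return _POSSESSIVE.sub("", token)


print(strip_possessive("John's"))     # John
print(strip_possessive("company's"))  # company
print(strip_possessive("boss"))       # boss (unchanged, no apostrophe)
```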

6. Lowercasing

Everything is converted to lowercase. Searches are case-insensitive: OpenSolr, opensolr, and OPENSOLR all match the same results.

7. ASCII Folding

A second pass of character normalization catches any remaining non-ASCII characters:

Input    Becomes
ü        u
ñ        n
ø        o
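
One reason a second pass is useful: some letters, such as ø, have no Unicode decomposition, so stripping accent marks alone leaves them untouched. The Python sketch below illustrates the idea with a deliberately tiny, hypothetical mapping table (EXTRA_FOLDS); the real filter covers far more characters.

```python
import unicodedata

# Hypothetical, intentionally small table of letters that do not decompose.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "ß": "ss", "æ": "ae"})


def ascii_fold(text):
    """Strip accent marks, then map the remaining special letters to ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(EXTRA_FOLDS)


for ch in ["ü", "ñ", "ø"]:
    print(ch, "->", ascii_fold(ch))
```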

8. Stop Word Removal

Common words that add no search value are removed:

a, an, the, is, are, was, were, be, been, being, have, has, had, do, does, did, will, would, could, should, may, might, shall, can, need, dare, ought, used, to, of, in, for, on, with, at, by, from, as, into, through, during, before, after, above, below, between, out, off, over, under, again, further, then, once, ...

This means searching for the best restaurants in New York effectively searches for best restaurants New York — the filler words are ignored.
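
The effect on the example query can be sketched in Python as follows. STOP_WORDS here is only a small subset of the list above, and remove_stop_words is a hypothetical helper name.

```python
# A small subset of the stop word list shown above.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "for", "on", "with"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("the best restaurants in New York".split()))
# ['best', 'restaurants', 'New', 'York']
```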

9. Word Delimiter Processing

Compound words, camelCase, and mixed formats are split intelligently:

Input        Produces
Wi-Fi        Wi, Fi, WiFi (split + concatenated)
camelCase    camel, Case, camelCase
iPhone13     iPhone, 13, iPhone13
3.5-inch     3, 5, inch, 35, 3.5inch

The original form is also preserved, so exact matches still work.
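
The general idea (split into parts, add the concatenated form, keep the original) can be sketched in Python. This is a simplified approximation with a hypothetical function name (split_word); it is not expected to reproduce every row of the table, because the real filter's output depends on how it is configured.

```python
import re


def split_word(token):
    """Return the sub-words of `token`, their concatenation, and the original."""
    parts = []
    # 1. Break on anything that is not a letter or digit (hyphens, dots, ...).
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # 2. Break each chunk on case changes and letter/digit boundaries.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", chunk))
    result = []
    # 3. Emit the parts, the concatenated form, and the original, without duplicates.
    for candidate in parts + ["".join(parts), token]:
        if candidate and candidate not in result:
            result.append(candidate)
    return result


print(split_word("Wi-Fi"))      # ['Wi', 'Fi', 'WiFi', 'Wi-Fi']
print(split_word("camelCase"))  # ['camel', 'Case', 'camelCase']
```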

10. Snowball Stemming (English)

Words are reduced to their root form using the Snowball stemmer:

Input            Stem
running          run
runs             run
runner           runner
swimming         swim
swam             swam
better           better
organization     organ
organizations    organ
configured       configur
configuring      configur

This means a search for running will also match pages containing run, runs, etc. You get broader recall without having to think about word forms.
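
To make the matching logic concrete, the Python sketch below compares stems instead of surface forms. It does not implement the Snowball algorithm: the STEMS lookup simply reuses a few rows of the table above, and the helper names stem and matches are hypothetical.

```python
# Stems copied from the table above; a real stemmer computes them by rule.
STEMS = {
    "running": "run",
    "runs": "run",
    "runner": "runner",
    "swimming": "swim",
    "organization": "organ",
    "organizations": "organ",
}


def stem(token):
    """Look up the stem of a token, falling back to the token itself."""
    return STEMS.get(token, token)


def matches(query, document):
    """True if every stemmed query term also occurs, stemmed, in the document."""
    query_stems = {stem(t) for t in query.lower().split()}
    doc_stems = {stem(t) for t in document.lower().split()}
    return query_stems <= doc_stems


print(matches("running", "She runs daily"))  # True: both reduce to "run"
print(matches("runner", "She runs daily"))   # False: "runner" keeps its own stem
```

The second call shows the flip side of the table: because runner is not reduced to run, a search for runner does not match pages that only contain runs or running.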

11. Synonym Expansion (Query Time Only)

When you search, your query terms are expanded using a synonym dictionary. For example:

  • Searching laptop might also match notebook
  • Searching car might also match automobile

Synonyms are only applied at query time (not during indexing), so the index stays clean while searches get broader.
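
Query-time expansion can be pictured as a lookup applied to each query term before matching. The Python sketch below uses a hypothetical two-entry dictionary (SYNONYMS) based on the examples above; the actual dictionary and its file format live in your index configuration.

```python
# Hypothetical dictionary built from the two examples above.
SYNONYMS = {"laptop": ["notebook"], "car": ["automobile"]}


def expand_query(terms):
    """Return, for each query term, the list of alternatives to search for."""
    return [[term] + SYNONYMS.get(term, []) for term in terms]


print(expand_query(["cheap", "laptop"]))
# [['cheap'], ['laptop', 'notebook']]
```

Because the expansion happens on the query side only, changing the dictionary affects searches without altering the tokens already stored in the index.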

Note: You can customize the synonyms dictionary for your index via the Config Files section in your Opensolr Control Panel.


What This Means for You

When building your search UI, you get all of this for free. You do not need to:

  • Worry about uppercase vs. lowercase
  • Handle accented characters specially
  • Stem words yourself
  • Remove stop words from queries
  • Split compound words

Just send the raw search query to Solr and the analysis pipeline handles the rest.
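
As an illustration of how little the client has to do, the Python sketch below only URL-encodes the user's input and builds a request to a Solr select handler. The host and index name in SOLR_SELECT_URL are placeholders, and build_search_url is a hypothetical helper; substitute the connection details of your own index.

```python
from urllib.parse import urlencode

# Placeholder endpoint: replace the host and index name with your own.
SOLR_SELECT_URL = "https://example.com/solr/my_index/select"


def build_search_url(user_query, rows=10):
    """Build a select URL that passes the user's query through unchanged."""
    params = {"q": user_query, "rows": rows, "wt": "json"}
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_search_url("Café Wi-Fi setup"))
```

No lowercasing, accent stripping, or stemming is done on the client; the query text is only percent-encoded for transport.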

Practical Examples

User Searches For        Also Matches
running shoes            "run shoe", "runs shoes", "Running Shoes"
café menu                "cafe menu", "Café Menu", "CAFE MENU"
Wi-Fi setup              "wifi setup", "WiFi Setup", "wi fi setup"
John's blog              "john blog", "Johns blog"
São Paulo restaurants    "sao paulo restaurants"

The Spellcheck Field

Separately from the main analysis pipeline, the content is also indexed in a spellcheck field (spell) with minimal processing — just tokenization and lowercasing. This builds the dictionary that powers the "Did you mean...?" suggestions when a user misspells a word.
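
The idea of suggesting from a lightly processed dictionary can be sketched in Python. This example swaps in the standard library's difflib similarity matching as a stand-in; it is not the algorithm Solr's spellchecker uses, and the names build_dictionary and suggest are hypothetical.

```python
import difflib
import re


def build_dictionary(text):
    """Tokenize and lowercase only, mirroring the minimal processing described."""
    return sorted(set(re.findall(r"\w+", text.lower())))


def suggest(word, dictionary):
    """Return the closest dictionary word, or None if nothing is similar enough."""
    candidates = difflib.get_close_matches(word.lower(), dictionary, n=1)
    return candidates[0] if candidates else None


dictionary = build_dictionary("Best Restaurant menu in town")
print(suggest("restarant", dictionary))  # restaurant
```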


The Autocomplete Fields

The tags and title_tags fields use edge n-gram analysis, which creates prefix tokens:

Input       Indexed As
opensolr    o, op, ope, open, opens, openso, opensol, opensolr

This is what powers typeahead / autocomplete suggestions — as the user types each character, results are found instantly because the prefixes are pre-indexed.
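
The prefix tokens can be generated with a few lines of Python. In this sketch the function name edge_ngrams and its min_len and max_len parameters are assumptions for illustration; the actual minimum and maximum gram sizes are set in the index schema.

```python
def edge_ngrams(token, min_len=1, max_len=None):
    """Return every prefix of `token` between min_len and max_len characters."""
    upper = len(token) if max_len is None else min(max_len, len(token))
    return [token[:n] for n in range(min_len, upper + 1)]


print(edge_ngrams("opensolr"))
# ['o', 'op', 'ope', 'open', 'opens', 'openso', 'opensol', 'opensolr']
```

A partial input such as "ope" is then matched as an ordinary exact token lookup, which is why suggestions can be returned quickly.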