text_general and text_general_phonetic Fields
Fine-tuning your Solr search queries is essential for getting the most relevant results. By using the text_general and text_general_phonetic field types, you can significantly enhance the quality of your search output.
text_general & text_general_phonetic
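text_general Field
This field type is designed to handle general text content, such as titles, descriptions, and body text. Its analyzer chain strips HTML, folds accents, normalizes decimal separators, tokenizes with the ICU tokenizer, splits compound tokens, lowercases, removes English stopwords and possessives, applies English Snowball stemming, and drops duplicate tokens. Synonyms are expanded at query time only.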
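One step in the text_general chain rewrites a comma or dot between two digits as a dot. The Python sketch below mimics that step so its effect on sample strings is easy to see. It assumes the intended character class is [.,], and it uses Python's re module as a stand-in for the Java regex engine that Solr runs; the function name normalize_decimals is illustrative only.

```python
import re

# A digit, then a dot or comma, then another digit (assumed intent of the
# PatternReplaceCharFilterFactory pattern in the field definition).
DECIMAL_SEPARATOR = re.compile(r"([0-9])[.,]([0-9])")


def normalize_decimals(text: str) -> str:
    """Rewrite a comma or dot that sits between two digits as a dot."""
    return DECIMAL_SEPARATOR.sub(r"\1.\2", text)


if __name__ == "__main__":
    for sample in ("5,8", "weighs 5,8 kg", "1,000,000", "red, green"):
        print(repr(sample), "->", repr(normalize_decimals(sample)))
```

Note that thousands separators are rewritten as well, so a value such as 1,000,000 is changed too. Whether that matters depends on the kind of numbers in your content.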
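text_general_phonetic Field
The text_general_phonetic field type is intended for phonetic searches, where sound-based similarity matters. It is similar to text_general, but its analyzer chain ends with a Beider-Morse phonetic filter (BeiderMorseFilterFactory), so tokens are indexed and queried as phonetic codes rather than as literal spellings.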
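To make "sound-based similarity" concrete, here is a minimal Python sketch of classic American Soundex. It is a deliberately simpler stand-in, not the Beider-Morse algorithm that this field type uses, and the codes it produces differ from what Solr would index. The function name soundex is illustrative only.

```python
# Letter groups that share a Soundex digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the four-character American Soundex code for an ASCII word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    out = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]


if __name__ == "__main__":
    for a, b in (("Smith", "Smyth"), ("Robert", "Rupert")):
        print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Spellings that sound alike collapse to the same code, which is the same idea the phonetic field applies at index and query time, with a more elaborate rule set.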
Once you have your field types set up, the next step is to calibrate the search parameters. Solr's DisMax and Extended DisMax query parsers provide several parameters that adjust search behavior, including mm, qf, and bf.
mm (Minimum Should Match): Defines how many of the optional clauses in the query must match, given as an absolute number or as a percentage. Adjust it with query length in mind to balance precision and recall.
qf (Query Fields): Defines which fields to query and assigns a boost factor to each. More relevant fields get higher boosts for better precision.
bf (Boost Functions): Adds the value of a function query to the score, for example to favor more recent documents.
Here’s an example of how to calibrate the parameters:
params["qf"] = "title^10 description^7 text^5 phonetic_title^0.3 phonetic_description^0.2 phonetic_text^0.1";
params["mm"] = "75%";
params["bf"] = "recip(rord(timestamp),1,1500,1500)^29";
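In this example, the title field receives the largest boost (10), description and text follow, and the phonetic fields get weights well below 1, so phonetic matches contribute to the score without outranking exact-text matches. The mm value requires 75% of the query's optional clauses to match. The bf function adds a boost that is larger for documents with more recent timestamp values.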
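The numbers in the example are easier to judge once they are worked out. The Python sketch below computes how many clauses a 75% mm requires for a few query lengths (Solr rounds positive percentages down) and evaluates recip(x,m,a,b) = a / (m*x + b) for a few rord values, where rord is 1 for the newest timestamp. The function names are illustrative, and the sketch covers only plain positive percentages, not negative or conditional mm expressions.

```python
def min_should_match(optional_clauses: int, percent: int) -> int:
    """Clauses required by a positive percentage mm; the result is rounded down."""
    return (optional_clauses * percent) // 100


def recip(x: float, m: float, a: float, b: float) -> float:
    """Solr's recip function query: a / (m * x + b)."""
    return a / (m * x + b)


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 8):
        print(f"{n} clauses at 75% -> {min_should_match(n, 75)} must match")
    # rord(timestamp) is 1 for the most recent value and grows for older ones.
    for rank in (1, 1500, 15000):
        value = recip(rank, 1, 1500, 1500)
        print(f"rord={rank}: recip={value:.3f}, boosted by 29 -> {29 * value:.2f}")
```

With these constants the function value halves once a document is 1500 positions behind the newest one, so the recency contribution fades gradually rather than cutting off.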
Here are the actual field definitions for the text_general and text_general_phonetic field types:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<!-- ================= INDEX‐TIME ANALYZER (English) ================= -->
<analyzer type="index">
<!-- 1. Strip HTML and fold accented characters (e.g. “résumé”→“resume”) -->
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- 2. Normalize comma/dot decimals: “5,8” → “5.8” -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([0-9])[.,]([0-9])"
replacement="$1.$2"/>
<!-- 3. Break text into Unicode words & numbers -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<!-- 4. Split numbers/words but keep originals; protect tokens in protwords.txt -->
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateAll="0"
catenateNumbers="1"
catenateWords="0"
splitOnCaseChange="1"
preserveOriginal="1"
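protected="protwords.txt"/>
<!-- Flatten the token graph: the index cannot consume a graph directly -->
<filter class="solr.FlattenGraphFilterFactory"/>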
<!-- 5. Discard tokens that are too short/long -->
<filter class="solr.LengthFilterFactory" min="1" max="50" />
<!-- 6. Fold any remaining accents (keep original) -->
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<!-- 7. Lowercase everything -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- 8. Remove English stopwords (stopwords.txt should now contain English list) -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- 9. Remove English possessive ’s (“John’s”→“John”) -->
<filter class="solr.EnglishPossessiveFilterFactory"/>
<!-- 10. Apply English SnowballPorter stemming, protecting protwords.txt -->
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
<!-- 11. Remove any duplicate tokens -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<!-- ================= QUERY‐TIME ANALYZER (English) ================= -->
<analyzer type="query">
<!-- 1. Strip HTML and fold accented characters -->
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<!-- 2. Normalize comma/dot decimals at query time -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([0-9])[.,]([0-9])"
replacement="$1.$2"/>
<!-- 3. ICU tokenizer for Unicode words & numbers -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<!-- 4. Split numbers/words but keep originals; protect protwords.txt -->
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateAll="0"
catenateNumbers="1"
catenateWords="0"
splitOnCaseChange="1"
preserveOriginal="1"
protected="protwords.txt"/>
<!-- 5. Discard tokens that are too short/long -->
<filter class="solr.LengthFilterFactory" min="1" max="50"/>
<!-- 6. Fold any remaining accents (keep original) -->
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<!-- 7. Lowercase everything -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- 8. Expand synonyms before removing stopwords -->
<filter class="solr.SynonymGraphFilterFactory"
expand="true"
ignoreCase="true"
synonyms="synonyms.txt"/>
<!-- 9. Remove English stopwords (stopwords.txt) -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- 10. Remove English possessive ’s -->
<filter class="solr.EnglishPossessiveFilterFactory"/>
<!-- 11. Apply English SnowballPorter stemming, protecting protwords.txt -->
<filter class="solr.SnowballPorterFilterFactory"
language="English"
protected="protwords.txt"/>
<!-- 12. Remove any duplicate tokens -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<!--Phonetic Text Field-->
<fieldType name="text_general_phonetic" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- 1) Strip HTML -->
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<!-- 2) Tokenize on Unicode word boundaries rather than bare whitespace -->
<!-- WhitespaceTokenizer will treat “Co‐op” as one token, but you probably want “Co” + “op”. -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<!-- 3) Remove stopwords early on -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- 4) Break apart numbers/words but keep the original spelling for phonetic coding -->
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" splitOnNumerics="1" splitOnCaseChange="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
preserveOriginal="1"
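protected="protwords.txt" />
<!-- Flatten the token graph: the index cannot consume a graph directly -->
<filter class="solr.FlattenGraphFilterFactory"/>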
<!-- 5) Lowercase now so phonetic sees normalized input -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- 6) Stem with the Porter stemmer before phonetic coding -->
<filter class="solr.PorterStemFilterFactory"/>
<!-- 7) Fold accents (but keep originals so BeiderMorse sees both accented & un-accented) -->
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<!-- 8) Synonyms (optional—but note: synonyms + phonetics = explosion of tokens) -->
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<!-- 9) Phonetic coding: only keep **one** code per token if possible -->
<!-- nameType="GENERIC" ruleType="APPROX" is fine, but “concat=true” will glue codes together. -->
<!-- For better control, set concat="false" so each code is its own token. -->
<filter class="solr.BeiderMorseFilterFactory"
nameType="GENERIC"
ruleType="APPROX"
concat="false"
languageSet="auto"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" splitOnNumerics="1" splitOnCaseChange="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
preserveOriginal="1"
protected="protwords.txt" />
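<filter class="solr.LowerCaseFilterFactory"/>
<!-- Stem as in the index analyzer so both sides feed the same tokens to the phonetic filter -->
<filter class="solr.PorterStemFilterFactory"/>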
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.BeiderMorseFilterFactory"
nameType="GENERIC"
ruleType="APPROX"
concat="false"
languageSet="auto"/>
</analyzer>
</fieldType>
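The qf example refers to fields such as title and phonetic_title, but the schema entries that bind those names to the two field types are not shown. One plausible arrangement, sketched below in Python, is to index each source field as text_general and copy it into a phonetic counterpart with copyField. The field pairing, the stored settings, and the helper names are assumptions made for illustration; the script only prints candidate schema lines and does not talk to Solr.

```python
import xml.etree.ElementTree as ET

# Hypothetical source -> phonetic field pairs, named after the qf example.
PHONETIC_COPIES = {
    "title": "phonetic_title",
    "description": "phonetic_description",
    "text": "phonetic_text",
}


def schema_elements(copies):
    """Build <field> and <copyField> elements for each source/phonetic pair."""
    elements = []
    for source, dest in copies.items():
        elements.append(ET.Element("field", {
            "name": source, "type": "text_general",
            "indexed": "true", "stored": "true",
        }))
        elements.append(ET.Element("field", {
            "name": dest, "type": "text_general_phonetic",
            "indexed": "true", "stored": "false",  # assumed: search only
        }))
        elements.append(ET.Element("copyField", {"source": source, "dest": dest}))
    return elements


def render(elements):
    """Serialize the elements, one per line, for pasting into a schema."""
    return "\n".join(ET.tostring(e, encoding="unicode") for e in elements)


if __name__ == "__main__":
    print(render(schema_elements(PHONETIC_COPIES)))
```

Because copyField passes the original source text to the destination, each phonetic field runs its own analyzer chain on the raw input rather than on already-stemmed tokens.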
Conclusion
Calibrating Solr’s search parameters for specific field types like text_general and text_general_phonetic helps you get more relevant results from your searches. By adjusting key parameters like mm, qf, and bf, you can refine how queries are matched and ranked to suit your needs.