
Solr 4.2.x Spellcheck Component

2017-08-18 16:51

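In a search application you can generally use suggest while the user is typing the query, and spellcheck once the search has been submitted. Combining the two gives users a better experience.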

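suggest and spellcheck look similar, but they start from different goals and apply under different conditions. spellcheck is meant for searches that return no results: when the search term is spelled correctly it normally produces no suggestions, whereas suggest returns results in every case.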

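Solr 4.0 added a new spellchecker, org.apache.solr.spelling.DirectSolrSpellChecker. Before that there were only these two:
org.apache.solr.spelling.IndexBasedSpellChecker
org.apache.solr.spelling.FileBasedSpellChecker
IndexBasedSpellChecker is based on a field of the Solr or Lucene index. FileBasedSpellChecker is based on a dictionary file, which is useful when terms are ranked by search popularity.
solr.DirectSolrSpellChecker, introduced in Solr 4.0, is an experimental component. It provides spelling suggestions directly from the main index and does not need to be rebuilt on every commit.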

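Solr 4.0 also added org.apache.solr.spelling.WordBreakSolrSpellChecker, which offers suggestions by combining adjacent query terms or by breaking a term into several words.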

Configuration for 4.x:

schema.xml

----------------------Custom field type------------------------------------------


<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>

--------------------------Fields to search and store--------------------------------------

<field name="comName" type="text_mmseg4j_complex" indexed="true" stored="true" required="true" />
<field name="comTitle" type="text_mmseg4j_complex" indexed="true" stored="true" required="true" />
<field name="suggest" type="text_suggest" indexed="true" stored="false" multiValued="true" />

----------------------------Copy the raw values of the fields to be spellchecked into the new field--------------------------------

<copyField source="comName" dest="suggest"/>
<copyField source="comTitle" dest="suggest"/>

---------------------------Field types for the mmseg4j Chinese tokenizer-----------------------------------------------

<fieldType name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_maxword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="dic"/>
  </analyzer>
</fieldType>


--------------------------------------------------------------------------------

solrconfig.xml

------------------------Configure the spellcheck search component---------------------------------------------------

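<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- field type whose query analyzer is applied to the spellcheck query -->
  <str name="queryAnalyzerFieldType">text_suggest</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">suggest</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <!-- minimum accuracy for a suggestion to be considered valid -->
    <float name="accuracy">0.5</float>
    <!-- maximum edits allowed when enumerating terms: 1 or 2 -->
    <int name="maxEdits">2</int>
    <!-- minimum prefix a candidate must share with the query term -->
    <int name="minPrefix">1</int>
    <!-- maximum number of inspections per result -->
    <int name="maxInspections">5</int>
    <!-- minimum length of a query term to be considered for correction -->
    <int name="minQueryLength">4</int>
    <!-- maximum share of documents a query term may appear in and still be corrected -->
    <float name="maxQueryFrequency">0.01</float>
  </lst>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">suggest</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
    <int name="minBreakLength">5</int>
  </lst>
</searchComponent>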
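
To make `maxEdits` and `minPrefix` more concrete, here is a minimal sketch in plain Java of the kind of filtering these two settings imply. It is not Solr's implementation: DirectSolrSpellChecker delegates to Lucene's DirectSpellChecker and does not compare terms one by one like this. The class `EditDistanceSketch` and its methods are hypothetical names used only for illustration, and plain Levenshtein distance is assumed.

```java
public class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // The row just computed becomes the previous row of the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical filter: a candidate qualifies if it shares the first
    // minPrefix characters with the query term and is within maxEdits edits.
    static boolean isCandidate(String term, String candidate, int maxEdits, int minPrefix) {
        if (term.length() < minPrefix || candidate.length() < minPrefix) {
            return false;
        }
        if (!term.regionMatches(0, candidate, 0, minPrefix)) {
            return false;
        }
        return levenshtein(term, candidate) <= maxEdits;
    }

    public static void main(String[] args) {
        // Same values as the configuration above: maxEdits=2, minPrefix=1.
        System.out.println(isCandidate("serch", "search", 2, 1)); // true: one insertion
        System.out.println(isCandidate("serch", "perch", 2, 1));  // false: first letter differs
    }
}
```

With `minPrefix` set to 1, a candidate that differs in its first character is rejected even though it is only one edit away, which is why "perch" does not qualify for "serch".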

------------------------Add the spellcheck component to /select ------------------------------------------


<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="q">abcdefghik</str>
    <int name="rows">10</int>
    <str name="df">text</str>

    <str name="spellcheck">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.alternativeTermCount">2</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.maxCollations">3</str>
  </lst>

  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
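
The values under `defaults` can be overridden on each request. As a rough sketch, the plain-Java snippet below assembles a /select URL that sets a few spellcheck parameters explicitly. The base URL `http://localhost:8983/solr/collection1`, the sample query and the class name `SpellcheckUrlSketch` are assumptions for illustration and are not taken from the configuration above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SpellcheckUrlSketch {

    // URL-encodes a single name or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Appends the parameters to <baseUrl>/select in insertion order.
    static String buildUrl(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl).append("/select");
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "comName:serch");              // hypothetical misspelled query
        params.put("spellcheck", "true");
        params.put("spellcheck.q", "serch");           // bare term, without the field name
        params.put("spellcheck.dictionary", "default");
        params.put("spellcheck.count", "5");
        params.put("spellcheck.collate", "true");
        // Assumed host and core name; adjust to your installation.
        System.out.println(buildUrl("http://localhost:8983/solr/collection1", params));
    }
}
```

Parameters passed on the URL take precedence over the handler's `defaults`, so a single request can, for example, select the `wordbreak` dictionary instead of `default` without touching solrconfig.xml.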

--------------------------------------------------------------------------

The corresponding SolrJ code:

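// Imports needed by this fragment
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Suggestion;

// Request spelling suggestions
query.getSolrQuery().set("spellcheck", "true");
query.getSolrQuery().set("spellcheck.q", condition.getSearchWord());
query.getSolrQuery().set("spellcheck.count", 5);

....

// When the search returns no results, collect the suggested terms to display
List<String> wordList = new ArrayList<String>();
SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse();
if (spellCheckResponse != null) {
    if (!spellCheckResponse.isCorrectlySpelled()) {
        for (Suggestion s : spellCheckResponse.getSuggestions()) {
            wordList.addAll(s.getAlternatives());
        }
    }
}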

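In versions before Solr 4.0, the spellcheck module also had a buildOnCommit option, used with IndexBasedSpellChecker. With buildOnCommit=true the spellcheck dictionary is regenerated every time the index is built, which hurts indexing performance. The spellchecker has to build its own index, that index must be updated after every configuration change, and a spellchecker directory has to be generated, all of which is cumbersome. buildOnOptimize or a manual build of the dictionary can be used instead, but on a large index this still slows indexing considerably and can roughly double the time. From Solr 4.0 on, solr.DirectSolrSpellChecker lets you use spellcheck directly against the main index.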

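In addition, with the older setup you first had to send a request with spellcheck=true&spellcheck.build=true after configuring it before the spellcheck index was generated.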

Parameter reference:

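q OR spellcheck.q=keyword The query keywords to spellcheck. If spellcheck.q is set it is used; otherwise q is used.

q and spellcheck.q do behave differently. spellcheck.q is meant to carry the user's original terms without query markup such as field names or boosts. When a query combines several search terms, q yields several repeated results while spellcheck.q yields only one.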

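spellcheck=true Whether to enable the spellcheck component for this request.

spellcheck.collate=true Builds a corrected query by replacing each misspelled term with its best suggestion.

spellcheck.build=true Builds the spellcheck dictionary. It only needs to be initialized once before the spellcheck index is first used; after that, buildOnCommit=true is enough. If you use an existing dictionary file, no initialization is needed.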

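spellcheck.reload=true Reloads the spellchecker, which usually means reloading the dictionary.

spellcheck.dictionary=default Name of the spellchecker dictionary to use; the default is "default". The spellchecker can be chosen on each request.

spellcheck.count=10 Maximum number of suggestions returned.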

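spellcheck.alternativeTermCount= Number of suggestions to return for each query term that already exists in the index or dictionary.

spellcheck.onlyMorePopular=true Returns only suggestions that produce more hits than the original query. Such suggestions can be returned even when the query term exists in the index and is considered correct.

spellcheck.maxResultsForSuggest= Maximum number of hits the request may return for spelling suggestions to still be generated and correctlySpelled to be set to false.

spellcheck.extendedResults=true Shows extended results, including the frequency of each original term in the index: <int name="freq">2</int>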
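
Since spellcheck.q is meant to carry the user's terms without query markup such as field names or boosts, an application that only has the full q string may want to derive a cleaner value from it. The sketch below shows one possible way to do that in plain Java. The class `SpellcheckQSketch`, the method `toSpellcheckQ` and the sample query are hypothetical, and the regular expressions cover only simple cases (field prefixes, numeric boosts, parentheses, quotes and the AND/OR/NOT operators).

```java
public class SpellcheckQSketch {

    // Strips common Lucene query markup so that only the bare terms remain.
    static String toSpellcheckQ(String q) {
        String s = q.replaceAll("\\b\\w+:", " ");     // field prefixes such as comName:
        s = s.replaceAll("\\^\\d+(\\.\\d+)?", " ");    // boosts such as ^2 or ^1.5
        s = s.replaceAll("[()\"]", " ");               // grouping parentheses and quotes
        s = s.replaceAll("\\b(AND|OR|NOT)\\b", " ");   // upper-case boolean operators
        return s.trim().replaceAll("\\s+", " ");       // collapse the leftover whitespace
    }

    public static void main(String[] args) {
        String q = "comName:serch^2 AND comTitle:(solr spellchek)";
        System.out.println(toSpellcheckQ(q)); // prints "serch solr spellchek"
    }
}
```

The cleaned string can then be passed as spellcheck.q alongside the original q, so that suggestions are computed on the bare terms rather than on the query syntax.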
