
How to Embed Your Own Word Segmenter in SOLR

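2010-03-19 23:23
SOLR does provide a way to plug in a tokenizer, but it plainly did not work for me. I searched high and low and found nothing useful to go on; almost everyone just uses IK, Paoding, or a similar segmenter. Are we supposed to live in someone else's shadow forever? The answer is "NO!" Otherwise, custom product requirements such as blocked-word management, real-time dictionary updates, and real-time persistence would all have to be retrofitted onto those segmenters, and any veteran knows how costly that is. The approach SOLR recommends, which failed for me, is to put the following configuration into the schema.xml field configuration file: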

Write your own TokenFactory that extends SOLR's BaseTokenizerFactory, then find the configuration node below and replace the tokenizer's class attribute with your class.

<tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
<analyzer type="index">

<!-- <tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
-->

<tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="org.apache.lucene.analysis.cn.SolrTokenFactory"/>

<!-- <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory14"/> -->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
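I went through the IK source code from top to bottom and still could not see any difference between how my tokenizer and IK's are plugged in. Either tokenization failed at query time, or the written index showed no segmentation at all. In frustration I dug into the SOLR source code, and after a hard fight it finally paid off: problem solved!

Enough talk, here is the solution:

Note: because part of the SOLR code is modified, the tokenizer configuration in schema.xml no longer has any effect; all other field settings still come from schema.xml.

Before that, here is how SOLR loads schema.xml:

Call stack:

org.apache.solr.core.SolrCore, line 520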

schema = new IndexSchema(config, IndexSchema.DEFAULT_SCHEMA_FILE, null);

org.apache.solr.schema.IndexSchema, line 103

readSchema(lis);

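SOLR mainly uses the function private void readSchema(InputStream is) to initialize the analyzers and to instantiate the types defined in schema.xml, storing the results in protected final HashMap<String, Analyzer> analyzers for other code to use.

The function we operate on this time is readSchema() in org.apache.solr.schema.IndexSchema.

Because we want to slip our own analyzer in midway, insert the following statements at the point in this function where the fieldType plugin loader is created: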

try {
    AbstractPluginLoader<FieldType> fieldLoader = new AbstractPluginLoader<FieldType>(
            "[schema.xml] fieldType", true, true) {

        @Override
        protected FieldType create(ResourceLoader loader, String name,
                String className, Node node) throws Exception {
            FieldType ft = (FieldType) loader.newInstance(className);
            ft.setTypeName(name);

            // Analyzer declared with type="query"
            String expression = "./analyzer[@type='query']";
            Node anode = (Node) xpath.evaluate(expression, node,
                    XPathConstants.NODE);
            Analyzer queryAnalyzer = readAnalyzer(anode);

            // An analyzer without a type specified, or with type="index"
            expression = "./analyzer[not(@type)] | ./analyzer[@type='index']";
            anode = (Node) xpath.evaluate(expression, node,
                    XPathConstants.NODE);
            Analyzer analyzer = readAnalyzer(anode);

            // If only one of the two is declared, use it for both roles
            if (queryAnalyzer == null)
                queryAnalyzer = analyzer;
            if (analyzer == null)
                analyzer = queryAnalyzer;
            if (analyzer != null) {
                if (ft != null && className.equals("solr.TextField")) {
                    // Override: every solr.TextField type gets the custom analyzer
                    ft.setAnalyzer(AnalyzerManager.getAnalyzer());
                    ft.setQueryAnalyzer(AnalyzerManager.getAnalyzer());
                } else {
                    // All other types keep the analyzers read from schema.xml
                    ft.setAnalyzer(analyzer);
                    ft.setQueryAnalyzer(queryAnalyzer);
                }
            }
            return ft;
        }
        // (the loader's other overrides and the rest of readSchema() are not shown)
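To see what the two XPath lookups in create() select, here is a minimal standalone sketch that uses only the JDK's XML APIs. The class name AnalyzerXPathSketch, the helper tokenizerClass, and the fieldType fragment are made up for illustration (the fragment is modeled on the schema.xml excerpt earlier in this post); none of this is SOLR code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of SOLR.
final class AnalyzerXPathSketch {

    // Assumed fieldType fragment, modeled on the schema.xml excerpt in this post.
    static final String FIELD_TYPE_XML =
          "<fieldType name=\"text\" class=\"solr.TextField\">"
        + "<analyzer type=\"index\">"
        + "<tokenizer class=\"org.apache.lucene.analysis.cn.SolrTokenFactory\"/>"
        + "</analyzer>"
        + "<analyzer type=\"query\">"
        + "<tokenizer class=\"org.wltea.analyzer.solr.IKTokenizerFactory14\"/>"
        + "</analyzer>"
        + "</fieldType>";

    /**
     * Evaluates analyzerExpr relative to the root element of fieldTypeXml and
     * returns the tokenizer class of the first matching analyzer, or null if
     * no analyzer (or no tokenizer) matches.
     */
    static String tokenizerClass(String fieldTypeXml, String analyzerExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fieldTypeXml)));
            Node fieldType = doc.getDocumentElement();
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Same call shape as in create(): expression, context node, NODE result
            Node analyzer = (Node) xpath.evaluate(analyzerExpr, fieldType, XPathConstants.NODE);
            if (analyzer == null) {
                return null;
            }
            String cls = xpath.evaluate("./tokenizer/@class", analyzer);
            return cls.isEmpty() ? null : cls;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + analyzerExpr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizerClass(FIELD_TYPE_XML, "./analyzer[@type='query']"));
        System.out.println(tokenizerClass(FIELD_TYPE_XML,
                "./analyzer[not(@type)] | ./analyzer[@type='index']"));
    }
}
```

When a fieldType declares a single analyzer with no type attribute, the query expression matches nothing, which is why create() then falls back to the other analyzer.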
Inside the body of protected FieldType create(ResourceLoader loader, String name, String className, Node node) throws Exception, check the class name in className. We want to override the solr.TextField type, that is, replace the analyzer of the text types, so add the following check:

if (analyzer != null) {
    if (ft != null && className.equals("solr.TextField")) {
        // Override: every solr.TextField type gets the custom analyzer
        ft.setAnalyzer(AnalyzerManager.getAnalyzer());
        ft.setQueryAnalyzer(AnalyzerManager.getAnalyzer());
    } else {
        // All other types keep the analyzers read from schema.xml
        ft.setAnalyzer(analyzer);
        ft.setQueryAnalyzer(queryAnalyzer);
    }
}
OK, restart SOLR and give it a try. Does it work now?
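The AnalyzerManager class called in the code above is not shown in this post. The following is only a guess at one possible shape for it, written so that it compiles without Lucene: the nested Analyzer interface is a stand-in for org.apache.lucene.analysis.Analyzer, and the names AnalyzerManagerSketch, setAnalyzer, and the whitespace default are all assumptions. The idea it illustrates is a stable proxy whose delegate can be swapped at runtime, which is one way to support the real-time dictionary updates mentioned at the start.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; the real AnalyzerManager is not shown in the post.
final class AnalyzerManagerSketch {

    /** Stand-in for org.apache.lucene.analysis.Analyzer (assumption). */
    interface Analyzer {
        List<String> tokenize(String text);
    }

    // Assumed default: split on whitespace until a real segmenter is installed.
    private static final AtomicReference<Analyzer> CURRENT =
            new AtomicReference<Analyzer>(text -> Arrays.asList(text.trim().split("\\s+")));

    // Long-lived proxy that looks up the current delegate on every call.
    private static final Analyzer PROXY = text -> CURRENT.get().tokenize(text);

    private AnalyzerManagerSketch() {
    }

    /** Returns the proxy; callers may hold on to it for the life of the process. */
    static Analyzer getAnalyzer() {
        return PROXY;
    }

    /** Atomically installs a new delegate, e.g. after the dictionary has been reloaded. */
    static void setAnalyzer(Analyzer replacement) {
        if (replacement == null) {
            throw new IllegalArgumentException("analyzer must not be null");
        }
        CURRENT.set(replacement);
    }

    public static void main(String[] args) {
        Analyzer held = getAnalyzer();
        System.out.println(held.tokenize("solr custom analyzer"));
        // Swap the delegate; the reference obtained earlier picks up the change.
        setAnalyzer(text -> Collections.singletonList(text.toUpperCase()));
        System.out.println(held.tokenize("solr custom analyzer"));
    }
}
```

The proxy matters because the field type receives the result of getAnalyzer() once, when the schema is loaded. If that call returned the concrete analyzer directly, a later swap would not reach field types that were already initialized.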