A custom Lucene analyzer for filtering out non-alphabetic tokens
2016-02-27 11:26
    /**
     * @author YangXin
     * @info A custom Lucene analyzer that filters out non-alphabetic tokens.
     *       Uses the Lucene 2.9/3.0-era API (TermAttribute was removed in later releases).
     */
    package unitNine;

    import org.apache.lucene.analysis.Analyzer;

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class MyAnalyzer extends Analyzer {

        // Matches tokens made up only of lowercase ASCII letters.
        private final Pattern alphabets = Pattern.compile("[a-z]+");

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Standard chain: tokenize, normalize, lowercase, remove English stop words.
            TokenStream result = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(true, result, StandardAnalyzer.STOP_WORDS_SET);

            TermAttribute termAtt = (TermAttribute) result.addAttribute(TermAttribute.class);
            StringBuilder buf = new StringBuilder();
            try {
                while (result.incrementToken()) {
                    // Skip tokens shorter than three characters.
                    if (termAtt.termLength() < 3) {
                        continue;
                    }
                    String word = new String(termAtt.termBuffer(), 0, termAtt.termLength());
                    Matcher m = alphabets.matcher(word);
                    // Keep the token only if it is purely alphabetic.
                    if (m.matches()) {
                        buf.append(word).append(" ");
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            // Re-tokenize the surviving words, which are separated by single spaces.
            return new WhitespaceTokenizer(new StringReader(buf.toString()));
        }
    }
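The filtering rule that `MyAnalyzer` applies (lowercase, drop stop words, drop tokens shorter than three characters, keep only tokens matching `[a-z]+`) can be tried without a Lucene dependency. The following is a minimal sketch under stated assumptions, not the analyzer itself: the class name `TokenFilterSketch`, the `filter` method, and the short stop-word list are hypothetical, and splitting on whitespace and punctuation only roughly approximates what `StandardTokenizer` does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class TokenFilterSketch {
    // Assumption: a small illustrative stop-word list, not Lucene's actual set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is", "it"));

    // Same pattern as in MyAnalyzer: lowercase ASCII letters only.
    private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");

    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        // Crude tokenization on runs of whitespace and punctuation.
        for (String raw : text.split("[\\s\\p{Punct}]+")) {
            String word = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(word)) {
                continue; // stop word
            }
            if (word.length() < 3) {
                continue; // too short (also discards empty strings from split)
            }
            if (ALPHABETS.matcher(word).matches()) {
                kept.add(word); // purely alphabetic, keep it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox]
        System.out.println(filter("The Quick brown fox, id42 and an ox!"));
    }
}
```

In this input, `id42` is rejected because it contains digits, `ox` because it is shorter than three characters, and `the`, `and`, `an` because they are in the stop-word list.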