
Lucene (4): Lucene Analyzers

2016-05-18 09:47
1. The Analysis Pipeline



Reader: wraps the source text as a character stream to be read.

Tokenizer: receives the character stream from the Reader and splits it into individual tokens. (The original article showed a class diagram of Tokenizer implementations here; examples in Lucene 3.5 include StandardTokenizer, CharTokenizer, and KeywordTokenizer.)

TokenFilter: applies various filtering operations to the tokens. (The original article showed a class diagram of TokenFilter implementations here; examples include LowerCaseFilter, StopFilter, and PorterStemFilter.)

TokenStream: the stream produced once the analyzer has finished its work. It carries all of the token information, and the individual token attributes can be read back from it.
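The chain above can be sketched in plain Java. This is only a conceptual illustration of the Reader → Tokenizer → TokenFilter → TokenStream flow; the class and method names here are hypothetical, not Lucene's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    // "Tokenizer" stage: split the raw character data into tokens
    static List<String> tokenize(String input) {
        return new ArrayList<>(Arrays.asList(input.split("\\s+")));
    }

    // "TokenFilter" stage 1: lowercase every token
    static List<String> lowerCaseFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase());
        return out;
    }

    // "TokenFilter" stage 2: drop stop words
    static List<String> stopFilter(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stopWords.contains(t)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("are", "you"));
        // The chained result plays the role of the TokenStream
        List<String> stream =
                stopFilter(lowerCaseFilter(tokenize("How are you thank you")), stops);
        System.out.println(stream); // [how, thank]
    }
}
```

Each filter wraps the output of the previous stage, which is exactly how Lucene composes a Tokenizer with a stack of TokenFilters.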



2. Example

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class AnalyzerUtils {

    public static void displayAllTokenInfo(String str, Analyzer a) {
        try {
            TokenStream stream = a.tokenStream("content", new StringReader(str));
            // Position increment: the gap between this token and the previous one
            PositionIncrementAttribute pia =
                    stream.addAttribute(PositionIncrementAttribute.class);
            // Start/end character offsets of each token
            OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
            // The text of each token
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            // The token's type, which depends on the tokenizer used
            TypeAttribute ta = stream.addAttribute(TypeAttribute.class);
            while (stream.incrementToken()) {
                System.out.print(pia.getPositionIncrement() + ":");
                System.out.print(cta + "[" + oa.startOffset() + "-"
                        + oa.endOffset() + "]-->" + ta.type() + "\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
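To make the offset attribute concrete before running the real thing, here is a plain-Java approximation of the `term[start-end]` part of the output for a whitespace tokenization. This is only an illustration of what the offsets mean; Lucene's OffsetAttribute carries the same data internally:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetSketch {
    // Print each whitespace-delimited token with its start/end
    // character offsets, mimicking the "term[start-end]" format above.
    static String display(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile("\\S+").matcher(input);
        while (m.find()) {
            sb.append(m.group()).append('[').append(m.start())
              .append('-').append(m.end()).append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(display("how are you thank you"));
        // how[0-3]
        // are[4-7]
        // you[8-11]
        // thank[12-17]
        // you[18-21]
    }
}
```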


Test class:

package org.lucene.test;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.lucene.util.AnalyzerUtils;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class TestAnalyzer {

    @Test
    public void test() {
        // The four built-in analyzers, aimed mainly at English text
        Analyzer a1 = new StandardAnalyzer(Version.LUCENE_35);
        Analyzer a2 = new StopAnalyzer(Version.LUCENE_35);
        Analyzer a3 = new SimpleAnalyzer(Version.LUCENE_35);
        Analyzer a4 = new WhitespaceAnalyzer(Version.LUCENE_35);
        String txt = "how are you thank you";

        // The MMSeg Chinese analyzer; the argument is the path to its dictionary
        Analyzer a5 = new MMSegAnalyzer(new File("D:\\tools\\javaTools\\lucene\\mmseg4j-1.8.5\\data"));
        String txt1 = "我来自中国云南昭通昭阳区师专";

        AnalyzerUtils.displayAllTokenInfo(txt, a1);
        System.out.println("------------------------------");
        AnalyzerUtils.displayAllTokenInfo(txt, a2);
        System.out.println("------------------------------");
        AnalyzerUtils.displayAllTokenInfo(txt, a3);
        System.out.println("------------------------------");
        AnalyzerUtils.displayAllTokenInfo(txt, a4);
        System.out.println("------------------------------");
        AnalyzerUtils.displayAllTokenInfo(txt1, a5);
    }
}
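One detail worth noticing in the StopAnalyzer output: PositionIncrementAttribute records the gap between a token and the one before it, so when stop words are removed, the next surviving token's increment is greater than 1. A hedged plain-Java sketch of that bookkeeping (this is the concept only, not Lucene's internal code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionIncrementSketch {
    // For each surviving token, the increment is 1 plus the number of
    // stop words skipped since the previous surviving token.
    static List<String> incrementsOf(String input, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int increment = 1;
        for (String t : input.split("\\s+")) {
            if (stopWords.contains(t)) {
                increment++;              // a skipped token widens the gap
            } else {
                out.add(increment + ":" + t);
                increment = 1;            // reset for the next token
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("are", "you"));
        System.out.println(incrementsOf("how are you thank you", stops));
        // [1:how, 3:thank]
    }
}
```

This is why "thank" shows up with increment 3 when "are" and "you" are filtered out: positions are preserved even though the stop words themselves are gone, which matters for phrase queries.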

Output of the MMSeg Chinese analyzer in Lucene (shown as a screenshot in the original):
