
Appending to and updating the index of Lucene's auto-complete Suggest module

2015-08-14 10:54
The version used here is lucene-suggest-4.7.jar.

While building a Baidu-style auto-complete module I ran into two problems: appending to an existing suggest index, and updating the weight of an entry already in the index. This post addresses both. The basic usage of Lucene's Suggest package is easy to find online, so here is only a quick recap:
Building an index with the suggest package is quite different from building one with Lucene's IndexWriter. Roughly three classes are needed: an entity class, an InputIterator over the entities, and a class that drives the build. The entity class needs no explanation; the code is as follows:

import java.io.Serializable;

public class Suggester implements Serializable {
    private static final long serialVersionUID = 1L;
    String term;
    int times;

    /**
     * @param term   the term text
     * @param times  the term frequency
     */
    public Suggester(String term, int times) {
        this.term = term;
        this.times = times;
    }

    public Suggester() {
        super();
    }

    /**
     * @return the term
     */
    public String getTerm() {
        return term;
    }

    /**
     * @param term the term to set
     */
    public void setTerm(String term) {
        this.term = term;
    }

    /**
     * @return the times
     */
    public int getTimes() {
        return times;
    }

    /**
     * @param times the times to set
     */
    public void setTimes(int times) {
        this.times = times;
    }

    /* (non-Javadoc)
     * @see java.lang.Object#toString()
     */
    @Override
    public String toString() {
        return term + " " + times;
    }

    /* (non-Javadoc)
     * @see java.lang.Object#hashCode()
     */
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((term == null) ? 0 : term.hashCode());
        return result;
    }

    /*
     * Equality is based on the term only.
     * @see java.lang.Object#equals(java.lang.Object)
     */
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Suggester other = (Suggester) obj;
        if (term == null) {
            if (other.term != null)
                return false;
        } else if (!term.equals(other.term))
            return false;
        return true;
    }
}
The driver class simply calls the API, so there is nothing special to say about it. To see why the iterator class is needed, look at the source: the index-building method of AnalyzingInfixSuggester is build(InputIterator iter), i.e. it requires an InputIterator over the entities.
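For reference, the contract being implemented is roughly the following (a sketch of the Lucene 4.7 interface from memory, not the literal source; check the interface in your version):

public interface InputIterator extends BytesRefIterator {
    long weight();          // ranking weight of the current entry
    BytesRef payload();     // arbitrary per-entry bytes, e.g. a stored frequency
    boolean hasPayloads();  // whether payload() carries data
    // inherited from BytesRefIterator:
    // BytesRef next() throws IOException;
    // Comparator<BytesRef> getComparator();
}

With that contract in mind, here is the iterator class: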

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Comparator;
import java.util.Iterator;

import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.util.BytesRef;

public class SuggesterIterator implements InputIterator {
    /** Iterator over the entity collection. */
    private final Iterator<Suggester> suggesterIterator;
    /** The Suggester currently being visited. */
    private Suggester currentSuggester;

    /**
     * Constructor.
     * @param suggesterIterator iterator over the entities to index
     */
    public SuggesterIterator(Iterator<Suggester> suggesterIterator) {
        this.suggesterIterator = suggesterIterator;
    }

    /*
     * Advance to the next entry.
     * @see org.apache.lucene.util.BytesRefIterator#next()
     */
    @Override
    public BytesRef next() throws IOException {
        if (suggesterIterator.hasNext()) {
            currentSuggester = suggesterIterator.next();
            String term = currentSuggester.getTerm();
            try {
                return new BytesRef(term.getBytes("UTF-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        // Return null on error or when iteration is exhausted.
        return null;
    }

    /*
     * Whether entries carry payload data.
     * @see org.apache.lucene.search.suggest.InputIterator#hasPayloads()
     */
    @Override
    public boolean hasPayloads() {
        return true;
    }

    /*
     * Payload data: any bytes to be retrieved later; here, the term frequency.
     * @see org.apache.lucene.search.suggest.InputIterator#payload()
     */
    @Override
    public BytesRef payload() {
        // If hasPayloads() returned false, this method would never be consulted.
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(currentSuggester.getTimes());
            dos.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /*
     * The weight used for ranking.
     * @see org.apache.lucene.search.suggest.InputIterator#weight()
     */
    @Override
    public long weight() {
        // Use the term frequency as the weight.
        return currentSuggester.getTimes();
    }

    /*
     * @see org.apache.lucene.util.BytesRefIterator#getComparator()
     */
    @Override
    public Comparator<BytesRef> getComparator() {
        return null;
    }
}
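One detail worth keeping in mind: the payload is encoded with a DataOutputStream, so it must be decoded the same way at lookup time, and a BytesRef is only valid within its offset/length window. A minimal round-trip sketch:

// Encode the frequency into payload bytes, as payload() above does.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
dos.writeInt(42);
dos.close();
BytesRef payload = new BytesRef(bos.toByteArray());

// Decode it again, honoring the offset/length window of the BytesRef.
DataInputStream dis = new DataInputStream(
        new ByteArrayInputStream(payload.bytes, payload.offset, payload.length));
int times = dis.readInt(); // 42
dis.close();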

Once these are ready, you can call the suggest package's build method to create the index:
/**
 * Build the index.
 * @param list      the data set to index
 * @param indexPath the index directory
 * @return build time in seconds
 */
public double create(List<Suggester> list, String indexPath) {
    // Elapsed time
    long time = 0L;
    // The suggester that manages index creation
    AnalyzingInfixSuggester suggester = null;
    try {
        suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer);
        logger.debug("Start building the auto-complete index");
        long begin = System.currentTimeMillis();
        // Build the index
        suggester.build(new SuggesterIterator(list.iterator()));
        long end = System.currentTimeMillis();
        time = end - begin;
        logger.debug("Auto-complete index built. Took: " + time + " ms");
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the suggester
        if (suggester != null) {
            try {
                suggester.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return time / 1000.0;
}

The main test code:
List<Suggester> list = new ArrayList<Suggester>();
list.add(new Suggester("张三", 1));
list.add(new Suggester("李四", 2));
double time = suggestService.create(list, "file/autoComplete/project/template/index");
System.out.println(time + " s"); // create returns seconds

After running this code the index contains two Documents, 张三 and 李四.

[Screenshot omitted: the resulting index file structure.]

You can see how it differs from an IndexWriter-built index. Note that the index above was built with a whitespace analyzer; if the index file layout interests you, dig into it yourself.
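For completeness: the analyzer field used throughout these snippets is never shown in the original post; given the remark above, it is presumably a whitespace analyzer, e.g.:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.util.Version;

// Assumed construction of the analyzer used in this post.
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_47);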

The lookup part
The code mostly speaks for itself:
/**
 * Auto-complete lookup.
 * @param region    the query string
 * @param indexPath the index location
 * @return the result list
 */
public List<Suggester> lookup(String region, String indexPath) {
    // Result list to return
    List<Suggester> reList = new ArrayList<Suggester>();
    // Index directory
    File indexFile = new File(indexPath);
    // The suggester that manages the index
    AnalyzingInfixSuggester suggester = null;
    // Raw lookup results
    List<LookupResult> results = null;
    try {
        suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, indexFile, analyzer);
        /*
         * Run the lookup:
         *   region           - the query keyword
         *   TOPS             - maximum number of results to return
         *   allTermsRequired - MUST vs. SHOULD semantics for the terms
         *   doHighlight      - whether to highlight matches
         */
        results = suggester.lookup(region, TOPS, true, true);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (suggester != null) {
            try {
                suggester.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    /*
     * Walk the results.
     */
    System.out.println("Query: " + region);
    for (LookupResult result : results) {
        String str = (String) result.highlightKey;
        Integer time = null;
        try {
            // Decode the term frequency stored in the payload.
            BytesRef bytesRef = result.payload;
            DataInputStream dis = new DataInputStream(
                    new ByteArrayInputStream(bytesRef.bytes, bytesRef.offset, bytesRef.length));
            time = dis.readInt();
            dis.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        reList.add(new Suggester(str, time));
    }
    /*
     * Drop the query keyword itself from the results.
     */
    for (int i = 0; i < reList.size(); i++) {
        Suggester sug = reList.get(i);
        // Strip highlight tags before comparing.
        if (sug.getTerm().replaceAll("<[^>]*>", "").equals(region)) {
            reList.remove(sug);
            break;
        }
    }
    return reList;
}
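A quick usage sketch, mirroring the earlier test code (suggestService and the index path are the same assumed names):

List<Suggester> suggestions = suggestService.lookup("张", "file/autoComplete/project/template/index");
for (Suggester s : suggestions) {
    System.out.println(s); // prints "term times" via Suggester.toString()
}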

With the index built and lookups working, the questions arise: what if I want to append new entries to the index? What if I want to update (update) existing ones? After enough digging you will find that Suggest provides no such methods, so the rest of this post focuses on solving these two problems.

A look at the source shows that build also uses an IndexWriter internally, and that the writer is configured by a getIndexWriterConfig method. There you can see that the open mode is hard-coded to OpenMode.CREATE, so the index can only ever be rebuilt from scratch, never appended to.
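The relevant part of the 4.7 source reads roughly as follows (a paraphrase, pieced together from the override shown later, not the literal code):

// Paraphrase of AnalyzingInfixSuggester.getIndexWriterConfig in Lucene 4.7:
// the open mode is fixed to CREATE, which is why appending is impossible.
IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
iwc.setCodec(new Lucene46Codec());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
return iwc;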
The solution is to subclass AnalyzingInfixSuggester and override getIndexWriterConfig, giving us our own AnalyzingInfixSuggester. The code (excerpted; the full class appears at the end):

public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {

    /** How the index is opened (create or append). */
    private final OpenMode mode;

    ......

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion the Lucene version
     * @param indexPath    the index directory
     * @param analyzer     the analyzer
     * @param mode         how the index is opened (create or append)
     * @throws IOException
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Call the superclass constructor.
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        .....
    }

    /*
     * Override the method that supplies the IndexWriterConfig,
     * making the open mode configurable (create or append).
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // The wrapped gram analyzer means we are writing the .tmp
            // directory, which must always be opened in CREATE mode.
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }
    ......
}

This way, when constructing a MyAnalyzingInfixSuggester you can pass in the open mode of your choice, which makes appending possible. But this alone will not give you working appends, because Suggest has its own ranking scheme: documents are sorted by weight when the index is built, and lookup then simply returns hits in that stored document order. So if the index already holds 张三 and 李四 and you APPEND 王五, a search for 王 can come back with 李四. Frustrating, right? The fix is to drop the build-time sort and sort at query time instead.

First, override build in MyAnalyzingInfixSuggester and delete the build-time sort. [Screenshot omitted: in the original build(), the temporary reader is wrapped in a weight-sorting reader before addIndexes; the override below passes the reader through unsorted.]

Then override lookup: delete the code that relies on the pre-sorted order and add an explicit sort (the comments in the original source explain this as well). [Screenshot omitted: the original lookup() collects hits in index order; the override below calls searcher.search(query, num, sort) with a descending sort on the weight field.]
With those changes in place, the APPEND problem is fully solved.
The second problem, updating the index, is then straightforward: delete the matching Document with IndexWriter's delete method, wrap the objects to re-index in a list, and hand them to create for a fresh build. (One caveat, since the text field is analyzed: deleting by exact Term only matches entries the analyzer kept as a single token, which holds here because the whitespace analyzer leaves a term like 张三 intact.)
// sug is the Suggester entry being updated
Directory fsDir = FSDirectory.open(new File(indexPath));
IndexWriter indexWriter = new IndexWriter(fsDir, new IndexWriterConfig(ManageIndexService.LUCENE_VERSION, analyzer));
// Delete the matching entry
indexWriter.deleteDocuments(new Term(MyAnalyzingInfixSuggester.TEXT_FIELD_NAME, sug.getTerm()));
// Purge the deletes for good
indexWriter.forceMergeDeletes();
// Commit and close the IndexWriter
indexWriter.commit();
indexWriter.close();
logger.debug("Old index entry deleted: " + sug.getTerm());

List<Suggester> list = new ArrayList<Suggester>();
list.add(sug);
// Re-index the updated entry in APPEND mode
this.create(list, indexPath, OpenMode.APPEND);
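The three-argument create used here is not shown in the post; a plausible sketch, assuming it mirrors the two-argument version but builds through MyAnalyzingInfixSuggester with a caller-supplied open mode:

/**
 * Hypothetical create overload: same as the earlier create, but opens the
 * index with the given OpenMode via MyAnalyzingInfixSuggester.
 */
public double create(List<Suggester> list, String indexPath, OpenMode mode) {
    long time = 0L;
    MyAnalyzingInfixSuggester suggester = null;
    try {
        suggester = new MyAnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer, mode);
        long begin = System.currentTimeMillis();
        suggester.build(new SuggesterIterator(list.iterator()));
        time = System.currentTimeMillis() - begin;
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (suggester != null) {
            try {
                suggester.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return time / 1000.0;
}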

The complete MyAnalyzingInfixSuggester code follows.

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.Version;

public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {
    /** Logger */
    private final Logger logger = Logger.getLogger(MyAnalyzingInfixSuggester.class);

    /** Field name used for the indexed text. */
    public static final String TEXT_FIELD_NAME = "text";

    /** Default minimum number of leading characters before
     *  PrefixQuery is used (4). */
    public static final int DEFAULT_MIN_PREFIX_CHARS = 4;
    private final File indexPath;
    final int minPrefixChars;
    final Version matchVersion;
    private final Directory dir;
    /** How the index is opened (create or append). */
    private final OpenMode mode;

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion the Lucene version
     * @param indexPath    the index directory
     * @param analyzer     the analyzer
     * @param mode         how the index is opened (create or append)
     * @throws IOException
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Call the superclass constructor.
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        this.indexPath = indexPath;
        this.minPrefixChars = DEFAULT_MIN_PREFIX_CHARS;
        this.matchVersion = matchVersion;
        dir = getDirectory(indexPath);
    }

    /*
     * Override the method that supplies the IndexWriterConfig,
     * making the open mode configurable (create or append).
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // The wrapped gram analyzer means we are writing the .tmp
            // directory, which must always be opened in CREATE mode.
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }

    /*
     * Override build to drop the build-time sort by weight.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#build(org.apache.lucene.search.suggest.InputIterator)
     */
    @Override
    public void build(InputIterator iter) throws IOException {
        if (searcher != null) {
            searcher.getIndexReader().close();
            searcher = null;
        }
        Directory dirTmp = getDirectory(new File(indexPath.toString() + ".tmp"));
        IndexWriter w = null;
        IndexWriter w2 = null;
        AtomicReader r = null;
        boolean success = false;
        try {
            Analyzer gramAnalyzer = new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
                @Override
                protected Analyzer getWrappedAnalyzer(String fieldName) {
                    return indexAnalyzer;
                }

                @Override
                protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
                    if (fieldName.equals("textgrams") && minPrefixChars > 0) {
                        return new TokenStreamComponents(components.getTokenizer(), new EdgeNGramTokenFilter(matchVersion, components.getTokenStream(), 1, minPrefixChars));
                    } else {
                        return components;
                    }
                }
            };
            w = new IndexWriter(dirTmp, getIndexWriterConfig(matchVersion, gramAnalyzer));
            BytesRef text;
            Document doc = new Document();
            FieldType ft = getTextFieldType();
            Field textField = new Field(TEXT_FIELD_NAME, "", ft);
            doc.add(textField);

            Field textGramField = new Field("textgrams", "", ft);
            doc.add(textGramField);

            Field textDVField = new BinaryDocValuesField(TEXT_FIELD_NAME, new BytesRef());
            doc.add(textDVField);

            Field weightField = new NumericDocValuesField("weight", 0);
            doc.add(weightField);

            Field payloadField;
            if (iter.hasPayloads()) {
                payloadField = new BinaryDocValuesField("payloads", new BytesRef());
                doc.add(payloadField);
            } else {
                payloadField = null;
            }
            long t0 = System.nanoTime();
            while ((text = iter.next()) != null) {
                String textString = text.utf8ToString();
                textField.setStringValue(textString);
                textGramField.setStringValue(textString);
                textDVField.setBytesValue(text);
                weightField.setLongValue(iter.weight());
                if (iter.hasPayloads()) {
                    payloadField.setBytesValue(iter.payload());
                }
                w.addDocument(doc);
            }
            logger.debug("initial indexing time: " + ((System.nanoTime() - t0) / 1000000) + " msec");

            // Unlike the original build(), the reader is NOT wrapped in a
            // weight-sorting reader here; documents keep insertion order.
            r = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(w, false));
            w.rollback();

            w2 = new IndexWriter(dir, getIndexWriterConfig(matchVersion, indexAnalyzer));
            w2.addIndexes(new IndexReader[] { r });
            r.close();

            searcher = new IndexSearcher(DirectoryReader.open(w2, false));
            w2.close();

            payloadsDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), "payloads");
            weightsDV = MultiDocValues.getNumericValues(searcher.getIndexReader(), "weight");
            textDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), TEXT_FIELD_NAME);
            assert textDV != null;
            success = true;
        } finally {
            if (success) {
                IOUtils.close(w, w2, r, dirTmp);
            } else {
                IOUtils.closeWhileHandlingException(w, w2, r, dirTmp);
            }
        }
    }

    /*
     * Override lookup to sort the results by weight at query time.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#lookup(java.lang.CharSequence, int, boolean, boolean)
     */
    @Override
    public List<LookupResult> lookup(CharSequence key, int num, boolean allTermsRequired, boolean doHighlight) {

        if (searcher == null) {
            throw new IllegalStateException("suggester was not built");
        }

        final BooleanClause.Occur occur;
        if (allTermsRequired) {
            occur = BooleanClause.Occur.MUST;
        } else {
            occur = BooleanClause.Occur.SHOULD;
        }

        TokenStream ts = null;
        try {
            ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()));
            ts.reset();
            final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            String lastToken = null;
            BooleanQuery query = new BooleanQuery();
            int maxEndOffset = -1;
            final Set<String> matchedTokens = new HashSet<String>();
            while (ts.incrementToken()) {
                if (lastToken != null) {
                    matchedTokens.add(lastToken);
                    query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
                }
                lastToken = termAtt.toString();
                if (lastToken != null) {
                    maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
                }
            }
            ts.end();

            String prefixToken = null;
            if (lastToken != null) {
                Query lastQuery;
                if (maxEndOffset == offsetAtt.endOffset()) {
                    // Use PrefixQuery (or the ngram equivalent) when
                    // there was no trailing discarded chars in the
                    // string (e.g. whitespace), so that if query does
                    // not end with a space we show prefix matches for
                    // that token:
                    lastQuery = getLastTokenQuery(lastToken);
                    prefixToken = lastToken;
                } else {
                    // Use TermQuery for an exact match if there were
                    // trailing discarded chars (e.g. whitespace), so
                    // that if query ends with a space we only show
                    // exact matches for that term:
                    matchedTokens.add(lastToken);
                    lastQuery = new TermQuery(new Term(TEXT_FIELD_NAME, lastToken));
                }
                if (lastQuery != null) {
                    query.add(lastQuery, occur);
                }
            }
            ts.close();

            Query finalQuery = finishQuery(query, allTermsRequired);

            // Sort explicitly by weight, descending, since the index
            // is no longer pre-sorted at build time.
            Sort sort = new Sort(new SortField("weight", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(finalQuery, num, sort);

            List<LookupResult> results = createResults(hits, num, key, doHighlight, matchedTokens, prefixToken);
            return results;
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        } finally {
            IOUtils.closeWhileHandlingException(ts);
        }
    }

}
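Putting it all together, appending a new entry then looks like this (a sketch using the classes above; the analyzer and index path are the assumed names from earlier):

// Hypothetical append: add 王五 to the existing suggest index.
public void appendExample(Analyzer analyzer) throws IOException {
    List<Suggester> more = new ArrayList<Suggester>();
    more.add(new Suggester("王五", 3));
    MyAnalyzingInfixSuggester suggester = new MyAnalyzingInfixSuggester(
            Version.LUCENE_47, new File("file/autoComplete/project/template/index"),
            analyzer, OpenMode.APPEND);
    try {
        suggester.build(new SuggesterIterator(more.iterator()));
    } finally {
        suggester.close();
    }
}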

Next time, if I find the time, I may write about span queries, near-synonyms and the like; I have a complete demo written, but since such material is easy to find online there is no hurry. If you have questions about this article or Lucene in general, feel free to add QQ 692790242 to discuss. Welcome!

All rights reserved; please credit the source when reposting! By MRC
Tags: lucene suggest append update