Lucene Ranking: Applying Payloads
2011-10-19 13:29
For background on Lucene payloads, the following two articles are very detailed and well worth reading:
http://www.ibm.com/developerworks/cn/opensource/os-cn-lucene-pl/
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
Consider, for example, the following requirement.
We have two documents with very similar content:
Document 1: egg tomato potato bread
Document 2: egg book potato bread
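Suppose I want to search for foods, with egg as the query keyword. How can I tell which of the two documents is closer to what I want?
Every item in document 1 is a food, whereas book in document 2 is not. Given this requirement, document 1 should be more relevant than document 2 and should rank higher in the results. Using the approach described in http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/, we can attach weight information to an occurrence of a given term in a document (measured against foods, egg in document 1 is more relevant than egg in document 2). Before indexing, we preprocess the text and add a payload to egg, for example:
Document 1: egg|0.984 tomato potato bread
Document 2: egg|0.356 book potato bread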
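The documents are then indexed, and Lucene's PayloadTermQuery can tell the two occurrences of the term egg apart. Internally, Lucene multiplies the stored payload value (the number after the "|" delimiter above) into tf and then carries on with the weight calculation.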
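The effect of multiplying the payload into tf can be sketched in plain Java without Lucene. This is only an illustration: the class and method names are made up here, the term frequency of 1 is an assumed value, and tf is taken to be the square root of the frequency, as in DefaultSimilarity.

```java
class PayloadTfSketch {
    // DefaultSimilarity-style tf: the square root of the term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical helper: tf scaled by the payload stored with the term.
    static float payloadScaledTf(float freq, float payload) {
        return tf(freq) * payload;
    }

    public static void main(String[] args) {
        // Assumed frequency of 1 for "egg" in both documents.
        System.out.println("document 1: " + payloadScaledTf(1.0f, 0.984f));
        System.out.println("document 2: " + payloadScaledTf(1.0f, 0.356f));
    }
}
```

With equal frequencies, the only thing separating the two documents is the payload, which is exactly the lever this technique relies on.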
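Next, let's look at a variation: we add a separate Field to store the payload data, so the source text does not need to be modified. Alternatively, we can preprocess each document before indexing, for example by classifying it; the degree to which a document belongs to each category then yields a payload value.
To make use of the stored payload data in the example above, we follow these steps: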
Step 1: Prepare the data to be indexed
For example, add a category Field to hold the category information and a content Field to hold the text shown above:
Document 1:
    new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED)
    new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED)
Document 2:
    new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED)
    new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED)
Step 2: Implement an Analyzer that parses the payload data
The payload information is stored in the category Field. Categories are separated by spaces, and within each category the name and its payload are separated by "|", so our Analyzer must be able to parse that format. Lucene provides DelimitedPayloadTokenFilter, which handles delimited payloads. Our implementation is as follows:

```java
package org.shirdrn.lucene.query.payloadquery;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

public class PayloadAnalyzer extends Analyzer {

    private final PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    @Override
    @SuppressWarnings("deprecation")
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the whitespace-separated categories into tokens
        TokenStream result = new WhitespaceTokenizer(reader);
        // On top of that tokenization, parse the payload that follows '|'
        result = new DelimitedPayloadTokenFilter(result, '|', encoder);
        return result;
    }
}
```
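To make the two-stage parsing concrete, here is a rough stand-alone sketch of what the tokenizer and filter chain does to a category value. It does not use Lucene; the class DelimitedPayloadSketch and its TokenWithPayload holder are hypothetical names introduced only for this illustration, and the payload is kept as a Float rather than as encoded bytes.

```java
import java.util.ArrayList;
import java.util.List;

class DelimitedPayloadSketch {

    // Hypothetical holder for a term and its optional payload.
    static final class TokenWithPayload {
        final String term;
        final Float payload; // null when the token has no delimiter

        TokenWithPayload(String term, Float payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    // Stage 1: split on whitespace. Stage 2: split each token at the
    // first occurrence of the delimiter and parse the rest as a float.
    static List<TokenWithPayload> tokenize(String text, char delimiter) {
        List<TokenWithPayload> tokens = new ArrayList<TokenWithPayload>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty string
            }
            int pos = raw.indexOf(delimiter);
            if (pos < 0) {
                tokens.add(new TokenWithPayload(raw, null));
            } else {
                tokens.add(new TokenWithPayload(
                        raw.substring(0, pos),
                        Float.valueOf(raw.substring(pos + 1))));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (TokenWithPayload t : tokenize("foods|0.984 shopping|0.503", '|')) {
            System.out.println(t.term + " -> " + t.payload);
        }
    }
}
```

Tokens without a delimiter, such as the words in the content Field, simply carry no payload, which is why the same Analyzer can be applied to both Fields.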
Step 3: Implement a Similarity to compute the score
Lucene's Similarity class provides the scorePayload method, which turns a payload into a contribution to the document score. We override it as follows:

```java
package org.shirdrn.lucene.query.payloadquery;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {

    private static final long serialVersionUID = 1L;

    @Override
    public float scorePayload(int docId, String fieldName,
            int start, int end, byte[] payload, int offset, int length) {
        // Decode the float that FloatEncoder stored at index time
        return PayloadHelper.decodeFloat(payload, offset);
    }
}
```

The PayloadHelper utility class decodes the payload value, which then takes effect when the document score is computed.
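The encode and decode round trip can be sketched with the JDK alone. The sketch assumes the payload is the 4-byte big-endian IEEE 754 representation of the float, which is my understanding of what FloatEncoder and PayloadHelper use; FloatPayloadCodecSketch is a hypothetical stand-in, not Lucene code.

```java
import java.nio.ByteBuffer;

class FloatPayloadCodecSketch {

    // Encode a float as 4 bytes (ByteBuffer is big-endian by default).
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    // Decode a float from 4 bytes starting at the given offset.
    static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(0.984f);
        System.out.println(payload.length + " bytes, decoded: " + decodeFloat(payload, 0));
    }
}
```

Because the value is stored as raw float bits, decoding returns exactly the float that was indexed, with no additional rounding.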
Step 4: Create the index
When creating the index, we need the Analyzer and Similarity implemented above. The code is as follows:

```java
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class PayloadIndexing {

    private IndexWriter indexWriter = null;
    // Use PayloadAnalyzer and specify the encoder for the payload values
    private final Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder());
    // Instantiate the custom PayloadSimilarity
    private final Similarity similarity = new PayloadSimilarity();
    private IndexWriterConfig config = null;

    public PayloadIndexing(String indexPath)
            throws CorruptIndexException, LockObtainFailedException, IOException {
        File indexFile = new File(indexPath);
        config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        // Set the Similarity used for scoring
        config.setOpenMode(OpenMode.CREATE).setSimilarity(similarity);
        indexWriter = new IndexWriter(FSDirectory.open(indexFile), config);
    }

    public void index() throws CorruptIndexException, IOException {
        Document doc1 = new Document();
        doc1.add(new Field("category", "foods|0.984 shopping|0.503",
                Field.Store.YES, Field.Index.ANALYZED));
        doc1.add(new Field("content", "egg tomato potato bread",
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("category", "foods|0.356 shopping|0.791",
                Field.Store.YES, Field.Index.ANALYZED));
        doc2.add(new Field("content", "egg book potato bread",
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc2);

        indexWriter.close();
    }

    public static void main(String[] args) throws CorruptIndexException, IOException {
        new PayloadIndexing("E:\\index").index();
    }
}
```
Step 5: Query
At query time, we construct PayloadTermQuery instances to run the search. The code is as follows:

```java
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.NIOFSDirectory;

public class PayloadSearching {

    private IndexReader indexReader;
    private IndexSearcher searcher;

    public PayloadSearching(String indexPath) throws CorruptIndexException, IOException {
        indexReader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)), true);
        searcher = new IndexSearcher(indexReader);
        // Use the custom PayloadSimilarity at search time as well
        searcher.setSimilarity(new PayloadSimilarity());
    }

    public ScoreDoc[] search(String qsr) throws ParseException, IOException {
        int hitsPerPage = 10;
        // Every space-separated "field:token" item becomes a required clause
        BooleanQuery bq = new BooleanQuery();
        for (String q : qsr.split(" ")) {
            bq.add(createPayloadTermQuery(q), Occur.MUST);
        }
        TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, true);
        searcher.search(bq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc; // document ID
            Explanation explanation = searcher.explain(bq, docId);
            System.out.println(explanation.toString());
        }
        return hits;
    }

    public void display(ScoreDoc[] hits, int start, int end)
            throws CorruptIndexException, IOException {
        end = Math.min(hits.length, end);
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            int docId = hits[i].doc;     // document ID
            float score = hits[i].score; // document score
            System.out.println(docId + "\t" + score + "\t" + doc + "\t");
        }
    }

    public void close() throws IOException {
        searcher.close();
        indexReader.close();
    }

    // Parse "field:token" or "field:token^boost" into a PayloadTermQuery
    private PayloadTermQuery createPayloadTermQuery(String item) {
        PayloadTermQuery ptq = null;
        if (item.indexOf("^") != -1) {
            String[] a = item.split("\\^");
            String field = a[0].split(":")[0];
            String token = a[0].split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
            ptq.setBoost(Float.parseFloat(a[1].trim()));
        } else {
            String field = item.split(":")[0];
            String token = item.split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
        }
        return ptq;
    }

    public static void main(String[] args) throws ParseException, IOException {
        int start = 0, end = 10;
        // String queries = "category:foods^123.0 content:bread^987.0";
        String queries = "category:foods content:egg";
        PayloadSearching payloadSearcher = new PayloadSearching("E:\\index");
        payloadSearcher.display(payloadSearcher.search(queries), start, end);
        payloadSearcher.close();
    }
}
```
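Each PayloadTermQuery above is given an AveragePayloadFunction. As far as I understand it, that function averages the payload scores seen for the term within a document and falls back to 1 when the term carries no payload at all. A minimal stand-alone sketch of that behaviour, with a hypothetical class name and not taken from the Lucene source:

```java
class AveragePayloadSketch {

    // Average the per-occurrence payload scores of a term in one document.
    // When no payload was seen, return the neutral factor 1.0 (assumed).
    static float docScore(float[] payloadScores) {
        if (payloadScores.length == 0) {
            return 1.0f;
        }
        float sum = 0f;
        for (float score : payloadScores) {
            sum += score;
        }
        return sum / payloadScores.length;
    }

    public static void main(String[] args) {
        System.out.println(docScore(new float[] {0.984f})); // one "foods" token with a payload
        System.out.println(docScore(new float[0]));         // an "egg" token without a payload
    }
}
```

Under that assumption, a term in the content Field, which has no payload, leaves the score unchanged, while a term in the category Field scales it by its stored value.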
The query results show the relevance ordering of the two documents:
0 0.3314532 Document<stored,indexed,tokenized<category:foods|0.984 shopping|0.503> stored,indexed,tokenized<content:egg tomato potato bread>>
1 0.21477573 Document<stored,indexed,tokenized<category:foods|0.356 shopping|0.791> stored,indexed,tokenized<content:egg book potato bread>>
The score explanations printed for the two documents are as follows:
```
0.3314532 = (MATCH) sum of:
  0.18281947 = (MATCH) weight(category:foods in 0), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category: foods=2)
      1.1893445 = queryNorm
    0.2585458 = (MATCH) fieldWeight(category:foods in 0), product of:
      0.6957931 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.984 = scorePayload(...)
      0.5945349 = idf(category: foods=2)
      0.625 = fieldNorm(field=category, doc=0)
  0.14863372 = (MATCH) weight(content:egg in 0), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content: egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 0), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content: egg=2)
      0.5 = fieldNorm(field=content, doc=0)
0.21477571 = (MATCH) sum of:
  0.066142 = (MATCH) weight(category:foods in 1), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category: foods=2)
      1.1893445 = queryNorm
    0.09353892 = (MATCH) fieldWeight(category:foods in 1), product of:
      0.25173002 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.356 = scorePayload(...)
      0.5945349 = idf(category: foods=2)
      0.625 = fieldNorm(field=category, doc=1)
  0.14863372 = (MATCH) weight(content:egg in 1), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content: egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 1), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content: egg=2)
      0.5 = fieldNorm(field=content, doc=1)
```
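As the explanations show, everything is identical for the two documents except the payload value multiplied into tf. In other words, the payload contributed to the score of document ID=0 as intended, and that document ranks first. Without the payload, the two documents would receive the same score (you can check this by setting both payload values to the same number and rerunning the test).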
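The arithmetic in the explanation output can be replayed with a few lines of Java. The sketch below copies idf, queryNorm, tf and the fieldNorm values directly from the output above and varies only the category payload; the class and method names are hypothetical, and the results are single-precision approximations of the printed scores.

```java
class ExplainArithmeticSketch {

    // Constants copied from the explanation output above.
    static final float IDF = 0.5945349f;
    static final float QUERY_NORM = 1.1893445f;
    static final float TF = 0.70710677f; // tf(phraseFreq=0.5)

    // weight = queryWeight * fieldWeight, following the explanation tree.
    static float termWeight(float payload, float fieldNorm) {
        float queryWeight = IDF * QUERY_NORM;
        float btq = TF * payload;
        float fieldWeight = btq * IDF * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Total score: category:foods (with payload) plus content:egg (payload 1.0).
    static float score(float categoryPayload) {
        return termWeight(categoryPayload, 0.625f) + termWeight(1.0f, 0.5f);
    }

    public static void main(String[] args) {
        System.out.println("doc 0: " + score(0.984f)); // close to 0.3314532
        System.out.println("doc 1: " + score(0.356f)); // close to 0.21477571
    }
}
```

Calling score with the same payload for both documents returns the same value, which is the equal-score case described above.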