Getting Started with Lucene

2012-08-14 11:45

If you want to quickly search the files on your disk, or search e-mail, web pages, or even data stored in a database, Lucene can help you do it.

The latest jar can be downloaded from: http://mirror.bit.edu.cn/apache/lucene/java/

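The figure below gives a high-level view of how a search application relates to Lucene, and of the workflow for building a search application with Lucene:

[Figure: relationship between a search application and Lucene]
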
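Let's start with the Lucene API:

1. Lucene API (core classes)

IndexWriter: creates and maintains an index (adds new Documents to an existing index, sets the merge policy, optimizes the index, and so on)
FSDirectory: the most commonly used class for storing index files; it keeps the index in the file system
Document: the atomic unit of indexing and searching; a Document contains a set of Fields
IndexReader: an abstract class that provides the interface for accessing an index; the index can also be accessed through its subclasses
Analyzer: the tokenizing class; its subclasses all turn text into a TokenStream
Searcher: the core class for querying an index

2. Creating an index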

Java code:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Directory dir = FSDirectory.open(new File("lucene.blog"));
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED); // true = create a new index
Document doc = new Document();
doc.add(new Field("id", "101", Field.Store.YES, Field.Index.NOT_ANALYZED)); // indexed as one term so Term("id", "101") can match it
doc.add(new Field("name", "kobe bryant", Field.Store.YES, Field.Index.ANALYZED)); // tokenized so a query for "kobe" can match it
writer.addDocument(doc);
writer.optimize();
writer.close(); // close() commits pending changes

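The code above stores the index files in the lucene.blog folder under the working directory, creates a Document, and adds two Fields to it, id and name. IndexWriter's addDocument(Document) method then adds the Document to the index files in the index directory, optimize() optimizes the index files, and finally the IndexWriter is closed.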
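
To make the idea of "adding a Document to an index" more concrete, here is a minimal sketch in plain Java of a toy inverted index. It is only an illustration under simplifying assumptions (whitespace tokenization, lowercasing, in-memory maps); the class ToyInvertedIndex and its methods are hypothetical names invented for this example and are not part of Lucene, whose real index format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Conceptual sketch only: a toy inverted index, not Lucene's real data structure.
class ToyInvertedIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Lowercases the text, splits it on whitespace, and records the new document id for each token.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of the documents that contain the (lowercased) term, in ascending order.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("kobe bryant");
        index.addDocument("kylin soong");
        System.out.println(index.search("kobe"));
    }
}
```

The point of the sketch is that a search looks up a term in a map instead of scanning every document, which is roughly why an index makes queries fast.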

3. Deleting a Document from the index with IndexWriter

Java code:

import org.apache.lucene.index.Term; // new class used in this snippet

Directory dir = FSDirectory.open(new File("lucene.blog"));
// create = false: open the existing index (true would replace it with an empty one before the delete runs)
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29), false, IndexWriter.MaxFieldLength.UNLIMITED);
writer.deleteDocuments(new Term("id", "101")); // delete every Document that contains this term
writer.commit();
writer.close();

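The code above first opens the index location (the lucene.blog folder under the working directory) and then calls IndexWriter's deleteDocuments(Term) method directly to delete the Document created in section 2. The commit() call makes the deletion durable and visible to newly opened readers. Section 2 has no explicit commit() because close() commits any pending changes before it closes the writer. Note that deleteDocuments(Term) can only match Documents whose id field was indexed; a field added with Field.Index.NO cannot be matched by a Term.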

4. Updating a Document in the index with IndexWriter

Java code:

Directory dir = FSDirectory.open(new File("lucene.blog"));
// create = false: open the existing index instead of overwriting it
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29), false, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("id", "101", Field.Store.YES, Field.Index.NOT_ANALYZED)); // one term, matched by Term("id", "101")
doc.add(new Field("name", "kylin soong", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "101"), doc); // delete the matching Documents, then add doc
writer.commit();
writer.close();

The update is done through IndexWriter's updateDocument(Term, Document): it deletes the Documents that contain Term("id", "101") and then adds the supplied Document to the index.
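
The delete-then-add behaviour described above can be sketched in plain Java as follows. This is an assumption-laden illustration, not Lucene code: ToyDocumentStore, its methods, and the use of a list of maps as the "index" are all hypothetical and exist only to show the semantics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: models an index as a list of field-name -> value maps.
class ToyDocumentStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc)); // defensive copy
    }

    // Removes every document whose field equals the value; returns how many were removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // Update = delete all documents matching the term, then add the replacement.
    void updateDocument(String field, String value, Map<String, String> replacement) {
        deleteDocuments(field, value);
        addDocument(replacement);
    }

    int size() {
        return docs.size();
    }

    // Returns the value of `field` in the first document whose keyField equals keyValue, or null.
    String lookup(String keyField, String keyValue, String field) {
        for (Map<String, String> d : docs) {
            if (keyValue.equals(d.get(keyField))) {
                return d.get(field);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyDocumentStore store = new ToyDocumentStore();
        Map<String, String> original = new HashMap<>();
        original.put("id", "101");
        original.put("name", "kobe bryant");
        store.addDocument(original);

        Map<String, String> replacement = new HashMap<>();
        replacement.put("id", "101");
        replacement.put("name", "kylin soong");
        store.updateDocument("id", "101", replacement);

        System.out.println(store.size() + " " + store.lookup("id", "101", "name"));
    }
}
```

One consequence worth noticing: if no document matches the term, the update simply behaves like an add.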

5. Meaning of the Field options

Java code:

Field field = new Field(
    "name",                 // field name
    "kobe bryant",          // field value
    Field.Store.YES,        // store the original value
    Field.Index.ANALYZED,   // tokenize and index the value
    Field.TermVector.YES);  // record a term vector

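The code above shows how the Field options are set. Their meanings are summarized below.

Field.Store.* decides whether the complete value of the Field is stored in the index. Note: avoid storing large bodies of text, because that makes the index files too big.

Field.Store.YES: store the value; a stored value can be read back from a matching Document (for example, doc.get("id") returns "101")
Field.Store.NO: do not store the value

Field.Index.* controls whether the Field's value is added to the inverted index and can therefore be searched.

Field.Index.ANALYZED: use the Analyzer to break the Field's value into multiple Tokens
Field.Index.NOT_ANALYZED: do not tokenize the Field's value; treat the whole value as a single Token
Field.Index.ANALYZED_NO_NORMS: like ANALYZED, but norms (index-time boost and length normalization data) are not written to the index
Field.Index.NOT_ANALYZED_NO_NORMS: like NOT_ANALYZED, but norms are not written to the index
Field.Index.NO: do not index; the Field's value cannot be searched

Field.TermVector.* is needed when you want to retrieve the unique terms of a Document at search time, and it helps with operations such as highlighting search results.

Field.TermVector.YES: record each unique term and, when a term repeats, how many times it occurs; nothing else is recorded
Field.TermVector.WITH_POSITIONS: like TermVector.YES, and also record term positions
Field.TermVector.WITH_OFFSETS: like TermVector.YES, and also record character offsets
Field.TermVector.WITH_POSITIONS_OFFSETS: like TermVector.YES, and also record both positions and offsets
Field.TermVector.NO: do not record term vectors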
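
The difference between ANALYZED and NOT_ANALYZED is easier to see with a small example. The sketch below is plain Java and only approximates the idea: the "analyzer" here just lowercases and splits on whitespace, which is an assumption made for brevity and is much simpler than StandardAnalyzer. ToyFieldIndexing and its methods are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Conceptual sketch only: a whitespace/lowercase "analyzer", far simpler than StandardAnalyzer.
class ToyFieldIndexing {
    // Roughly what ANALYZED does: break the value into several tokens.
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.trim().toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Roughly what NOT_ANALYZED does: keep the whole value, unchanged, as one token.
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(analyzed("Kobe Bryant"));
        System.out.println(notAnalyzed("Kobe Bryant"));
    }
}
```

With the analyzed form a search for the single word "kobe" can match, while the non-analyzed form only matches the exact full value, which is why identifiers such as id are usually better left unanalyzed.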
6. Indexing numbers

Java code:

import org.apache.lucene.document.NumericField; // new class used in this snippet

Document doc = new Document();
NumericField field1 = new NumericField("id");
field1.setIntValue(101);
doc.add(field1);
NumericField field2 = new NumericField("price");
field2.setDoubleValue(123.50); // the price belongs on field2, not field1
doc.add(field2);

The code above shows how to index numeric values with NumericField.

7. Indexing dates and times

Java code:

Document doc = new Document();
doc.add(new NumericField("timestamp").setLongValue(new Date().getTime())); // milliseconds since the epoch
// getTime() returns milliseconds, so divide by 1000 * 3600 * 24 to get whole days
doc.add(new NumericField("day").setIntValue((int) (new Date().getTime() / 1000 / 3600 / 24)));
Calendar cal = Calendar.getInstance();
cal.setTime(new Date());
doc.add(new NumericField("dayOfMonth").setIntValue(cal.get(Calendar.DAY_OF_MONTH)));

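In essence, dates and times are handled by converting them to numbers. Note: dates, times, and the numbers from the previous section can also be indexed as strings, but that hurts querying, because string terms compare lexicographically rather than numerically (for example, "10" sorts before "9").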
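
The two points above, converting a timestamp to a day number and the ordering problem with numbers stored as strings, can be checked with a few lines of plain Java. This is a minimal sketch; ToyDateFields and its method names are hypothetical, and the day calculation assumes a non-negative timestamp interpreted in UTC.

```java
// Conceptual sketch only: plain-Java helpers, not part of Lucene.
class ToyDateFields {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Whole days since the Unix epoch for a non-negative millisecond timestamp.
    static int daysSinceEpoch(long epochMillis) {
        return (int) (epochMillis / MILLIS_PER_DAY);
    }

    // True if a sorts before b when both numbers are compared as strings.
    static boolean sortsBeforeAsString(int a, int b) {
        return String.valueOf(a).compareTo(String.valueOf(b)) < 0;
    }

    public static void main(String[] args) {
        long threeDaysAndABit = 3 * MILLIS_PER_DAY + 5;
        System.out.println(daysSinceEpoch(threeDaysAndABit));
        // As strings, "10" sorts before "9", which breaks numeric range queries.
        System.out.println(sortsBeforeAsString(10, 9));
    }
}
```

Storing the value in a numeric field avoids the second problem, because the comparison is then done on the number itself.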

8. Other IndexWriter methods

Java code:

Directory dir = FSDirectory.open(new File("lucene.blog"));
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29), true, IndexWriter.MaxFieldLength.LIMITED);
writer.setMaxFieldLength(1); // index only the first term of each Field
MergePolicy policy = new LogByteSizeMergePolicy(writer);
writer.setMergePolicy(policy); // choose the merge policy
writer.optimize(5); // merge the index down to at most 5 segments
writer.close();

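In the code above, IndexWriter.MaxFieldLength.LIMITED enables Field truncation: if a Field's value is very long and you only want to index a fixed number of terms from its start, truncation does that, and setMaxFieldLength(int) sets the limit, counted in terms. IndexWriter's setMergePolicy(policy) sets the merge policy, and optimize(int maxNumSegments) takes the maximum number of segments the index should be merged down to.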
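
As a rough illustration of what Field truncation means, the plain-Java sketch below keeps only the first few tokens of a value. It assumes whitespace tokenization for simplicity; ToyFieldTruncation and tokensToIndex are hypothetical names made up for this example and do not exist in Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only: shows the effect of a per-field term limit.
class ToyFieldTruncation {
    // Keeps at most maxFieldLength whitespace-separated tokens; later tokens are not indexed.
    static List<String> tokensToIndex(String value, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String t : value.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break;
            }
            if (!t.isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With a limit of 1, only the first token of the value would be searchable.
        System.out.println(tokensToIndex("kobe bryant", 1));
    }
}
```

The practical consequence is that text beyond the limit cannot be found by a search, even though a stored copy of the full value may still be retrievable.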

9. Searching by an exact term

Java code:

IndexReader reader = IndexReader.open(FSDirectory.open(new File("lucene.blog")), true); // true = read-only
IndexSearcher searcher = new IndexSearcher(reader);
Term term = new Term("id", "101");
Query query = new TermQuery(term);
TopDocs topDocs = searcher.search(query, 10); // return at most 10 hits
System.out.println(topDocs.totalHits);
ScoreDoc[] docs = topDocs.scoreDocs;
if (docs.length > 0) { // guard against an empty result
    System.out.println(docs[0].doc + " " + docs[0].score);
    Document doc = searcher.doc(docs[0].doc);
    System.out.println(doc.get("id"));
}
searcher.close();
reader.close();

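The example above shows the basic way to run a Lucene query. IndexSearcher is the core query class and IndexReader reads the index files. IndexSearcher has a set of overloaded search() methods that run different kinds of queries depending on the arguments, and the ScoreDoc array holds the matching document numbers together with their scores.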
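
The relationship between the total hit count and the ScoreDoc array, where only the best n hits are returned, can be sketched in plain Java. This is an illustration under assumptions, not how Lucene is implemented: ToyTopDocs and topN are hypothetical names, scores are supplied by hand, and ties are broken by the lower document id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: mimics "total hits plus the best n documents".
class ToyTopDocs {
    final int totalHits;           // number of all matching documents
    final List<Integer> topDocIds; // ids of the best n documents, best first

    ToyTopDocs(int totalHits, List<Integer> topDocIds) {
        this.totalHits = totalHits;
        this.topDocIds = topDocIds;
    }

    // Sorts matches by descending score (ties: ascending doc id) and keeps the first n.
    static ToyTopDocs topN(Map<Integer, Float> scoreByDocId, int n) {
        List<Map.Entry<Integer, Float>> entries = new ArrayList<>(scoreByDocId.entrySet());
        entries.sort((a, b) -> {
            int byScore = Float.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : Integer.compare(a.getKey(), b.getKey());
        });
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return new ToyTopDocs(entries.size(), top);
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new TreeMap<>();
        scores.put(0, 0.5f);
        scores.put(1, 2.0f);
        scores.put(2, 1.0f);
        ToyTopDocs result = topN(scores, 2);
        System.out.println(result.totalHits + " " + result.topDocIds);
    }
}
```

This is why totalHits can be larger than the length of the scoreDocs array: the count covers every match, while the array holds only the requested number of best hits.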

10. Searching with QueryParser and collecting the results

Java code:

IndexReader reader = IndexReader.open(FSDirectory.open(new File("lucene.blog")), true); // true = read-only
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
QueryParser parser = new QueryParser(Version.LUCENE_29, "name", analyzer); // "name" is the default field
String queryString = "kobe";
Query query = parser.parse(queryString);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, false); // keep the 10 best hits
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String name = doc.get("name");
    if (name != null) {
        System.out.println(name);
    }
}

The code above uses QueryParser to search for the keyword "kobe" and collects the results with a TopScoreDocCollector.

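11. Working with the index in Luke, Lucene's graphical tool

Luke is very easy to use:

Download: http://code.google.com/p/luke/ Download the latest version, then double-click the downloaded jar to open the graphical interface, and choose the index directory to work with the index graphically.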
