
Lucene 3.6.0 Study Notes

2016-05-31 11:33

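The mainstream full-text indexing tools today are Lucene, Sphinx, Solr, and Elasticsearch. Solr and Elasticsearch are both built on Lucene. Sphinx is not an Apache project; if you want to use it in a commercial product, you may have to buy a commercial license. (I have really only studied Lucene and have just a passing familiarity with Solr. A project needed it over the last couple of days, so I dug into it; this post is a personal memo of what I learned.)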

Chapter 1: Lucene Basics

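Every full-text indexing tool is made up of three parts: indexing, analysis (tokenization), and search.

Core classes of the indexing part

IndexWriter: creates the index and adds documents to it.

Directory: represents where the index is stored; it is an abstract class.

Analyzer: tokenizes document content and hands the resulting tokens to IndexWriter for indexing.

Document: made up of multiple Fields; comparable to a row in a database table.

Field: comparable to a single column value within a database row.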

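Core classes of the analysis part

Analyzer: the common implementations are SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, and StandardAnalyzer.

TokenStream: gives access to the tokens produced by analysis and to the information attached to each token.

Tokenizer: takes a character stream (a Reader) and splits it into tokens.

TokenFilter: applies filters of various kinds to the tokens produced by the Tokenizer.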
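To make this division of labour concrete, here is a minimal sketch of an analysis chain written without Lucene: a tokenizer step splits a character stream on whitespace, and two filter steps lowercase the tokens and drop stop words. The class `ToyAnalyzer`, its method names, and the stop-word list are hypothetical and exist only for this illustration; they are not Lucene APIs, and Lucene's real TokenStream produces tokens incrementally rather than as lists.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this example
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is"));

    // "Tokenizer" step: read a character stream and split it on whitespace
    public static List<String> tokenize(Reader input) throws IOException {
        List<String> tokens = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            for (String piece : line.trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
        }
        return tokens;
    }

    // "TokenFilter" step 1: lowercase every token
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            out.add(token.toLowerCase());
        }
        return out;
    }

    // "TokenFilter" step 2: drop stop words
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": wires the tokenizer and the filters into one chain
    public static List<String> analyze(String text) {
        try {
            return removeStopWords(lowerCase(tokenize(new StringReader(text))));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick fox is a FOX")); // prints [quick, fox, fox]
    }
}
```

Swapping the order of the two filters would change the result for capitalized stop words such as "The", which is why the order of filters in an analysis chain matters.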

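Core classes of the search part

IndexSearcher: runs searches against an index that has already been built.

Term: the basic unit of search.

Query: the query string entered by the user is wrapped in a Query object that Lucene can understand.

TermQuery: a subclass of the abstract class Query; its constructor takes a single argument, a Term object.

TopDocs: holds the results returned by a search.

ScoreDoc: holds the internal document ID and the score of a single hit. The Document itself is then fetched from the IndexSearcher by that ID.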
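How these classes relate can be sketched without Lucene: a single-term query looks the term up in an inverted index, scores every matching document, and returns only the top n hits as (document ID, score) pairs; the documents themselves are fetched by ID afterwards. Everything in the sketch below (`ToySearch`, `Hit`, and scoring by raw term frequency) is an assumption made for illustration; Lucene's actual scoring is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToySearch {
    /** Plays the role of ScoreDoc: a document ID and its score, not the document itself. */
    public static final class Hit {
        public final int doc;
        public final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Stored documents, addressed by document ID
    private final List<String> docs = new ArrayList<>();
    // Inverted index: term -> (document ID -> term frequency)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    /** Stores the text and indexes its lowercased, whitespace-separated terms. */
    public int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Plays the role of a TermQuery search: the best n hits for one (lowercase) term. */
    public List<Hit> search(String term, int n) {
        List<Hit> hits = new ArrayList<>();
        Map<Integer, Integer> postings = index.get(term);
        if (postings != null) {
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                // Assumed scoring rule: the score is simply the term frequency
                hits.add(new Hit(e.getKey(), e.getValue().floatValue()));
            }
        }
        // Highest score first; ties are broken by ascending document ID
        hits.sort((a, b) -> a.score != b.score
                ? Float.compare(b.score, a.score)
                : Integer.compare(a.doc, b.doc));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    /** Fetches a stored document by ID, as a searcher does for each hit. */
    public String doc(int docId) {
        return docs.get(docId);
    }

    public static void main(String[] args) {
        ToySearch searcher = new ToySearch();
        searcher.add("lucene is a search library");
        searcher.add("search search search");
        searcher.add("nothing here");
        for (Hit hit : searcher.search("search", 10)) {
            System.out.println(hit.doc + " (" + hit.score + "): " + searcher.doc(hit.doc));
        }
    }
}
```

Keeping only IDs and scores in the hit list means that the full text of a document is read only for the hits that are actually displayed.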

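Chapter 2: Building an Index

Building an index is the process of extracting information from structured and unstructured data and creating an index from it.

[Figure: the index-building process]

Example: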

package text;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestFileIndexer {
    public static void main(String[] args) throws Exception {
        /* Folder whose files will be indexed: here, the source folder on drive C */
        File fileDir = new File("c:\\source");
        /* Folder where the index files are written */
        File indexDir = new File("c:\\index");
        File[] textFiles = fileDir.listFiles(); // every entry in the source folder
        if (textFiles == null) {
            throw new IOException("Cannot list files in " + fileDir.getPath());
        }
        Directory dir = FSDirectory.open(indexDir); // keep the index on disk
        Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36); // analyzer
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
        // CREATE builds a new index and overwrites any existing one;
        // use OpenMode.CREATE_OR_APPEND to append to an existing index instead
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(dir, iwc); // writes documents into the index
        long startTime = System.currentTimeMillis();
        // Add one Document per file to the index
        for (int i = 0; i < textFiles.length; i++) {
            // Skip subdirectories. To index only .txt files, also test
            // textFiles[i].getName().endsWith(".txt")
            if (!textFiles[i].isFile()) {
                continue;
            }
            System.out.println("File " + textFiles[i].getCanonicalPath() + " is being indexed...");
            String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
            System.out.println(temp);
            Document document = new Document();
            // path: stored but not indexed
            Field fieldPath = new Field("path", textFiles[i].getPath(), Field.Store.YES, Field.Index.NO);
            // body: stored and analyzed, with term vectors (positions and offsets)
            Field fieldBody = new Field("body", temp, Field.Store.YES, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS);
            // modified: numeric field under the key "modified", stored so it can be read back
            NumericField modifiedField = new NumericField("modified", Field.Store.YES, true);
            modifiedField.setLongValue(textFiles[i].lastModified()); // timestamp of this file, not of the folder
            document.add(fieldPath);
            document.add(fieldBody);
            document.add(modifiedField);
            indexWriter.addDocument(document);
        }
        indexWriter.close();
        // Report how long indexing took
        long endTime = System.currentTimeMillis();
        System.out.println("Took " + (endTime - startTime) + " ms to add the documents in "
                + fileDir.getPath() + " to the index");
    }

    /* Reads a whole file into a String using the given charset */
    public static String FileReaderAll(String fileName, String charset) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset));
        StringBuilder temp = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep the line break so that words on adjacent lines do not run together
                temp.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return temp.toString();
    }
}

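Field.Store.YES: stored. The original value can be retrieved from the index.

Field.Store.NO: not stored. The original value cannot be retrieved, but the field can still be indexed.

Field.Index.ANALYZED: the value is run through the analyzer (tokenized) before indexing.

Field.Index.NOT_ANALYZED: the value is indexed as a single term, without tokenization.

Field.Index.NOT_ANALYZED_NO_NORMS: not tokenized and not weighted (that is, no norms are stored for the field).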

Querying the Index

package text;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestQuery {
    public static void main(String[] args) throws ParseException, IOException {
        String index = "c:\\index"; // path of the index to search
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)));
        IndexSearcher searcher = new IndexSearcher(reader); // search tool
        try {
            String queryString = "测试"; // keyword to search for ("test")
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            // Parses user input into a Query against the "body" field
            QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer);
            // A malformed query string throws ParseException to the caller
            Query query = qp.parse(queryString);
            TopDocs results = searcher.search(query, 10); // keep only the ten best hits
            ScoreDoc[] hits = results.scoreDocs;
            for (int i = 0; i < hits.length; i++) {
                Document document = searcher.doc(hits[i].doc);
                String body = document.get("body");
                String path = document.get("path");
                // null unless the "modified" field was stored at index time
                String modifiedTime = document.get("modified");
                System.out.println(body);
                System.out.println(path);
                System.out.println(modifiedTime);
            }
            if (hits.length > 0) {
                System.out.println("Found " + hits.length + " results");
            }
        } finally {
            searcher.close();
            reader.close();
        }
    }
}

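What the Index Files Do

Once the index has been built, Lucene writes files with several different extensions to the index directory, and all of them are needed.

[Figure: listing of the index directory]

Briefly, each extension serves the following purpose: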

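.fdt : stores field values (the fields indexed with Store.YES).

.fdx : the index into .fdt; for each document it records where that document's stored fields start in the .fdt file.

.fnm : records how many fields the segment contains, together with each field's name and how it is indexed.

.frq : the postings (inverted lists) with frequencies: which documents contain each term, and how many times.

.nrm : the norms, that is, the per-field normalization factors used when scoring.

.prx : position information: where each term occurs within every document that contains it.

.tii : the index into .tis; it allows terms in the dictionary to be located quickly.

.tis : the term dictionary: all terms in the segment, sorted in dictionary order.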
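As a rough mental model of what these files hold (not the on-disk format, which is binary), a segment can be pictured as a sorted term dictionary whose entries point to postings: for each document containing the term, how often it occurs (the .frq data) and at which positions (the .prx data). `ToySegment` and its methods are hypothetical names used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ToySegment {
    // term -> (document ID -> positions of the term in that document).
    // The outer TreeMap keeps terms sorted, like a term dictionary.
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes the lowercased, whitespace-separated terms of one document. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        String[] terms = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            if (terms[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
        return docId;
    }

    /** All terms in dictionary order (the role of the term dictionary). */
    public List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    /** Positions of a term inside one document (the role of the position data). */
    public List<Integer> positions(String term, int docId) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(docs.get(docId));
    }

    /** How many times a term occurs in one document (the role of the frequency data). */
    public int frequency(String term, int docId) {
        return positions(term, docId).size();
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.addDocument("to be or not to be");
        System.out.println(segment.terms());            // prints [be, not, or, to]
        System.out.println(segment.frequency("to", 0)); // prints 2
        System.out.println(segment.positions("be", 0)); // prints [1, 5]
    }
}
```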

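Notes:

① As the figure shows, files with the same prefix belong to the same segment; the figure contains two segments, "_0" and "_1".

② An index can contain multiple segments. Segments are independent of one another; adding new documents can create a new segment, and different segments can be merged.

③ These index files can be opened with lukeall-3.5.0.jar; how to use it is covered in detail in a later chapter.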
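The idea that independent segments can be merged can be sketched in a few lines. In this toy model each segment numbers its documents from 0, so merging two segments means concatenating their postings and shifting the document IDs of the second segment by the size of the first. `ToySegmentMerge`, `Segment`, `build`, and `merge` are hypothetical names for illustration; a real merge also has to handle deleted documents, stored fields, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToySegmentMerge {
    /** A self-contained segment: its own document count and its own postings. */
    public static final class Segment {
        public final int docCount;
        public final TreeMap<String, List<Integer>> postings; // term -> ascending document IDs

        Segment(int docCount, TreeMap<String, List<Integer>> postings) {
            this.docCount = docCount;
            this.postings = postings;
        }
    }

    /** Builds a segment from whole documents; IDs start at 0 within the segment. */
    public static Segment build(String... docs) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return new Segment(docs.length, postings);
    }

    /** Merges two segments into a new one, renumbering the second segment's documents. */
    public static Segment merge(Segment first, Segment second) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : first.postings.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (Map.Entry<String, List<Integer>> e : second.postings.entrySet()) {
            List<Integer> ids = merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>());
            for (int docId : e.getValue()) {
                ids.add(docId + first.docCount); // shift past the first segment's documents
            }
        }
        return new Segment(first.docCount + second.docCount, merged);
    }

    public static void main(String[] args) {
        Segment s0 = build("apple banana", "banana");
        Segment s1 = build("banana cherry");
        Segment merged = merge(s0, s1);
        System.out.println(merged.docCount); // prints 3
        System.out.println(merged.postings); // prints {apple=[0], banana=[0, 1, 2], cherry=[2]}
    }
}
```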

Create the Query from the query string

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer); // parses user input
Query query = qp.parse(queryString);

Get the TopDocs from the Query

TopDocs results = searcher.search(query, 10); // return at most 10 hits

Get the ScoreDocs from the TopDocs

ScoreDoc[] hits = results.scoreDocs;

Get each Document from its ScoreDoc

for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    String body = document.get("body");
    String path = document.get("path");
    String modifiedTime = document.get("modified"); // null unless the field was stored at index time
    System.out.println(body);
    System.out.println(path);
}
