
Learning Lucene Full-Text Search

2013-08-08 14:58

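What full-text retrieval is: finding the information you need, quickly and accurately, in a large body of text. It works on the text itself and does not interpret meaning. Completeness, speed, and accuracy are the key measures of a full-text retrieval system.

Typical uses of full-text retrieval: site search and vertical search.

Differences between full-text retrieval and a database search:

Matching Chinese names: ([\u4E00-\u9FA5]{2,4})</a>[ ]+(\u5148\u751F|\u5973\u58EB)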
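To see what the pattern above captures, here is a minimal Java sketch that is not part of the original notes. The class name ChineseNameRegexDemo and the sample HTML are made up for illustration; only the regex is copied from the text. The sample input is the name 张三丰 followed by </a>, a space, and 先生, written with \u escapes so the source stays ASCII.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChineseNameRegexDemo {
    // Pattern from the text: 2 to 4 CJK characters, a closing </a> tag,
    // one or more spaces, then one of the two honorifics (Mr. or Ms.).
    static final Pattern NAME = Pattern.compile(
            "([\\u4E00-\\u9FA5]{2,4})</a>[ ]+(\\u5148\\u751F|\\u5973\\u58EB)");

    /** Returns "name honorific" for the first match, or null if nothing matches. */
    static String firstMatch(String html) {
        Matcher m = NAME.matcher(html);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: a three-character name inside a link.
        String html = "<a href=\"#\">\u5F20\u4E09\u4E30</a> \u5148\u751F";
        System.out.println(firstMatch(html));
    }
}
```

Group 1 is the name and group 2 is the honorific. Because the name group is limited to four characters, a longer name would only be matched on its last four characters.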

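Lucene is a framework that implements full-text retrieval.

1. Directory.class describes the index store; it plays the role of the database.

2. Document describes the format of the data in the index; it is roughly analogous to a row (record) in a database table.

3. A Document holds a list of fields: Document(List<Field>)

4. A Field holds a key-value pair in string form.

5. Operations on the index are, in practice, operations on Documents.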

Required jars
To set up a Lucene development environment you need the Lucene jars. At minimum, add:

lucene-core-3.1.0.jar     (core)

lucene-analyzers-3.1.0.jar    (analyzers)

lucene-highlighter-3.1.0.jar    (highlighter)

lucene-memory-3.1.0.jar       (required by the highlighter)

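Indexing and search code example

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class ArticleIndex {

    /**
     * 1. Create an object and set its properties.
     * 2. Create an IndexWriter.
     * 3. Use the IndexWriter to add the object to the index.
     * 4. Close the IndexWriter.
     *
     * Running this test twice succeeds both times and adds the document twice:
     * the JavaBean's id is not a unique key for the index. The internal
     * document ID is generated by Lucene.
     *
     * @throws IOException
     */
    @Test
    public void testCreateIndex() throws IOException {
        Article article = new Article();
        article.setId(1L);
        article.setTitle("百度搜索是怎么做的呢？");
        article.setContent("百度一下，你就发现，百度还不错呦，信不信由你，反正我信了！");

        Directory directory = FSDirectory.open(new File("./newpath"));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, MaxFieldLength.UNLIMITED);

        // Convert the article into a Document
        Document document = new Document();

        // Store: whether the original value is kept in the index so it can be read back
        // Index: whether (and how) the value is indexed so it can be searched
        Field field1 = new Field("id", article.getId().toString(), Store.YES, Index.NOT_ANALYZED);
        Field field2 = new Field("title", article.getTitle(), Store.YES, Index.ANALYZED);
        Field field3 = new Field("content", article.getContent(), Store.YES, Index.ANALYZED);
        document.add(field1);
        document.add(field2);
        document.add(field3);
        indexWriter.addDocument(document);

        indexWriter.close();
    }

    /**
     * Search code.
     *
     * When searching, the analyzer lowercases the keywords that were typed in.
     *
     * @throws IOException
     * @throws ParseException
     */
    @Test
    public void testSearchIndex() throws IOException, ParseException {
        Directory directory = FSDirectory.open(new File("./newpath"));
        // Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(directory);

        // Create the Query object
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        QueryParser queryParser = new QueryParser(Version.LUCENE_30, "title", analyzer);

        Query query = queryParser.parse("百度");

        // Search: query is the search condition, 2 is the maximum number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 2);

        int totalRecords = topDocs.totalHits; // total number of matching documents
        System.out.println(totalRecords);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs; // internal document IDs of the top n hits
        List<Article> articles = new ArrayList<Article>();
        for (ScoreDoc scoreDoc : scoreDocs) {
            float score = scoreDoc.score; // relevance score
            int index = scoreDoc.doc;     // internal document ID
            Document document = indexSearcher.doc(index);
            Article article = new Article();
            article.setId(Long.parseLong(document.get("id")));
            article.setTitle(document.get("title"));
            article.setContent(document.get("content"));
            articles.add(article);
        }
        indexSearcher.close();

        for (Article article : articles) {
            System.out.println(article.getContent());
        }
    }
}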

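Operations on the index
1. Keep the database and the index in sync: whenever the database is modified, update the index at the same time.
Index: NO means the field is not indexed (not searchable); NOT_ANALYZED means it is indexed as a single term without tokenization; ANALYZED means it is tokenized and then indexed.
Store: YES means the original value is stored with the document; NO means it is not stored.
IndexWriter.addDocument(doc);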
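The difference between "indexed" and "stored" is easier to see in a toy model. The sketch below is plain Java and does not use Lucene at all: ToyIndex, IndexMode, and the whitespace-plus-lowercase "analyzer" are all invented for illustration and are far simpler than what Lucene really does. It only mirrors the idea that indexing controls what you can search on, while storing controls what you can read back.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyIndex {
    enum IndexMode { NO, NOT_ANALYZED, ANALYZED }

    // "field:term" -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> (field -> original value), only for stored fields
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String field, String value, boolean store, IndexMode mode) {
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
        if (mode == IndexMode.NOT_ANALYZED) {
            post(field, value, docId); // the whole value is one term
        } else if (mode == IndexMode.ANALYZED) {
            // Naive stand-in for an analyzer: lowercase, split on whitespace
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    post(field, token, docId);
                }
            }
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    String storedValue(int docId, String field) {
        Map<String, String> doc = stored.get(docId);
        return doc == null ? null : doc.get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addField(0, "id", "A-1", true, IndexMode.NOT_ANALYZED);
        index.addField(0, "title", "Lucene Search Engine", true, IndexMode.ANALYZED);
        index.addField(0, "content", "hidden body", false, IndexMode.ANALYZED);
        System.out.println(index.search("title", "lucene"));  // searchable
        System.out.println(index.storedValue(0, "content"));  // not stored
    }
}
```

In this model a field that is indexed but not stored can still be found by a search, but its original text cannot be shown in the result list.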

DocumentUtils.java
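When working with the index, adding, deleting, and updating all require wrapping a JavaBean in a Document, while querying requires turning a Document back into a JavaBean. Maintenance code does this over and over, so it is worth writing a utility class to reuse the conversion code.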
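The notes never show DocumentUtils itself, so here is a rough sketch of the round trip it performs. To keep it runnable without Lucene on the classpath, a Map<String, String> stands in for Lucene's Document; that substitution, the class name DocumentUtilsSketch, and the field layout of the nested Article class are assumptions. In the real utility, each put would be a document.add(new Field(...)) call like the ones in ArticleIndex, and each get would be document.get(name).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DocumentUtilsSketch {

    /** Minimal stand-in for the Article JavaBean used throughout the notes. */
    static final class Article {
        final Long id;
        final String title;
        final String content;

        Article(Long id, String title, String content) {
            this.id = id;
            this.title = title;
            this.content = content;
        }
    }

    /** JavaBean -> "document": every value is converted to a string. */
    static Map<String, String> article2Document(Article article) {
        Map<String, String> document = new LinkedHashMap<>();
        document.put("id", String.valueOf(article.id));
        document.put("title", article.title);
        document.put("content", article.content);
        return document;
    }

    /** "document" -> JavaBean: strings are parsed back into typed fields. */
    static Article document2Article(Map<String, String> document) {
        return new Article(
                Long.parseLong(document.get("id")),
                document.get("title"),
                document.get("content"));
    }

    public static void main(String[] args) {
        Article in = new Article(1L, "lucene", "aaaaa");
        Article out = document2Article(article2Document(in));
        System.out.println(out.id + " " + out.title + " " + out.content);
    }
}
```

Keeping both directions in one class means the field names ("id", "title", "content") are written in a single place, which is the main point of having the utility.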

Deleting from and updating the index:

/**
 * Delete.
 *    The original cfs file is not removed; a del file is added alongside it.
 */
@Test
public void testDelete() throws Exception {
    IndexWriter indexWriter = new IndexWriter(LuceneUtils.directory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
    /**
     * Term
     *  A keyword object: the field name and keyword wrapped in one object.
     */
    Term term = new Term("title", "lucene");
    indexWriter.deleteDocuments(term);
    indexWriter.close();
}

/**
 * Update.
 *  Delete first, then add.
 */
@Test
public void testUpdate() throws Exception {
    IndexWriter indexWriter = new IndexWriter(LuceneUtils.directory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
    Term term = new Term("title", "lucene");
    Article article = new Article();
    article.setId(1L);
    article.setTitle("lucene可以做搜索引擎");
    article.setContent("aaaaa");
    /**
     * The Term selects the documents to delete.
     * The Document is the one to add.
     */
    indexWriter.updateDocument(term, DocumentUtils.article2Document(article));
    indexWriter.close();
}

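While an IndexWriter is open on an index, Lucene holds a lock on that index so that no other IndexWriter can modify it and leave the data inconsistent. The lock is held until the IndexWriter is closed. Conclusion: only one IndexWriter can operate on a given index at a time.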

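/**
 * 1. As soon as an IndexWriter is created, the index it points to is locked.
 *    While it is open, a second IndexWriter cannot be opened on the same index
 *    (searching with an IndexSearcher is still possible).
 * 2. Closing the IndexWriter releases its IO resources and releases the lock.
 * 3. Most operations on an index are searches; back-end maintenance is comparatively rare.
 * @author Think
 *
 */
public class IndexWriterTest {
    @Test
    public void testIndexWriter() throws Exception {
        IndexWriter writer = new IndexWriter(LuceneUtils.directory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
        // Without this close(), opening writer2 below would fail because the index is still locked
        writer.close();
        IndexWriter writer2 = new IndexWriter(LuceneUtils.directory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
        writer2.close();
    }
}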

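Optimizing the index
indexWriter.optimize(); merges the files manually.
indexWriter.setMergeFactor(3); merges the files automatically once their number reaches 3. The default is 10.
Every indexing run adds a cfs file, and every delete adds a del file and a cfs file. After many adds and deletes the number of files grows large and searches slow down, so it is worth optimizing the index: merging the files changes its structure and improves efficiency.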

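In-memory index and file index

Combining an in-memory index with a file index improves efficiency.

// create == true: create a new index or overwrite the existing one; false: append to the existing index.
// The constructor without this flag creates the index if it does not exist and appends otherwise.
IndexWriter ramIndexWriter = new IndexWriter(ramDirectory,LuceneUtils.analyzer,MaxFieldLength.LIMITED);
IndexWriter fileIndexWriter = new IndexWriter(fileDirectory,LuceneUtils.analyzer,true,MaxFieldLength.LIMITED);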

/**
 * In-memory index
 *   1. Queries are fast.
 *   2. The data is not persistent.
 * File index
 *   1. Queries are slower.
 *   2. The data is persistent.
 * Combining the two
 *   With millions of records a single index is inefficient, so several indexes can be built.
 *   Lucene provides methods for working with several indexes in one project:
 *   you can search one particular index, or search the merged index.
 *   Method: fileIndexWriter.addIndexesNoOptimize(ramDirectory); // merge
 *
 * @author Think
 *
 */
public class DirectoryTest {

    @Test
    public void testRamDirectory() throws Exception {
        /**
         * Create an in-memory index
         */
        Directory ramDirectory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(ramDirectory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
        Article article = new Article();
        article.setId(1L);
        article.setTitle("lucene可以做搜索引擎");
        article.setContent("baidu,google是很好的搜索引擎");
        indexWriter.addDocument(DocumentUtils.article2Document(article));
        indexWriter.close();
        this.showData(ramDirectory);
    }

    private void showData(Directory directory) throws Exception {
        IndexSearcher indexSearcher = new IndexSearcher(directory);
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30, new String[]{"title", "content"}, LuceneUtils.analyzer);
        Query query = queryParser.parse("lucene");
        TopDocs topDocs = indexSearcher.search(query, 20);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        for (int i = 0; i < scoreDocs.length; i++) {
            Document document = indexSearcher.doc(scoreDocs[i].doc);
            Article article = DocumentUtils.document2Article(document);
            articles.add(article);
        }
        indexSearcher.close();
        for (Article article : articles) {
            System.out.println(article.getId());
            System.out.println(article.getTitle());
            System.out.println(article.getContent());
        }
    }

    /**
     * Merging the file index and the in-memory index
     */
    @Test
    public void testFileAndRam() throws Exception {
        /**
         * 1. Create two IndexWriters:
         *     one for the file index,
         *     one for the in-memory index.
         * 2. Copy the contents of the file index into the in-memory index.
         * 3. The application works against the in-memory index.
         * 4. Sync the contents of the in-memory index back to the file index.
         */
        Directory fileDirectory = FSDirectory.open(new File("./indexDir"));
        /**
         * Copy the contents of the file index into the in-memory index
         */
        Directory ramDirectory = new RAMDirectory(fileDirectory);
        IndexWriter ramIndexWriter = new IndexWriter(ramDirectory, LuceneUtils.analyzer, MaxFieldLength.LIMITED);
        // create == true overwrites the file index; its old contents already live in ramDirectory
        IndexWriter fileIndexWriter = new IndexWriter(fileDirectory, LuceneUtils.analyzer, true, MaxFieldLength.LIMITED);

        /**
         * The application works against the in-memory index
         */
        Article article = new Article();
        article.setId(1L);
        article.setTitle("lucene可以做搜索引擎");
        article.setContent("baidu,google是很好的搜索引擎");
        ramIndexWriter.addDocument(DocumentUtils.article2Document(article));

        ramIndexWriter.close();

        /**
         * Sync the contents of the in-memory index to the file index
         */
        fileIndexWriter.addIndexesNoOptimize(ramDirectory);
        fileIndexWriter.close();

        this.showData(fileDirectory);
    }
}

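Analyzer
The English analyzer converts keywords from upper case to lower case.
The analyzer is used whenever analyzed fields are written to the index.
Always use UTF-8 encoding.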

/**
 * Analyzers
 * @author Think
 *
 */
public class AnalyzerTest {
    @Test
    public void testAnalyzer_En() throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        String text = "Creates a searcher searching the index in the named directory";
        /**
         * Output of the English analyzer:
         *   creates
         *   searcher
         *   searching
         *   index
         *   named
         *   directory
         */
        /**
         * How the English analyzer works:
         * 1. Split the text into keywords.
         * 2. Drop stop words.
         * 3. Convert upper case to lower case.
         */
        this.testAnalyzer(analyzer, text);
    }

    /**
     * Lucene ships with two Chinese analyzers; neither works well.
     * This one splits the text into single characters.
     * @throws Exception
     */
    @Test
    public void testCH_1() throws Exception {
        Analyzer analyzer = new ChineseAnalyzer();
        String text = "这个论坛很不错";
        this.testAnalyzer(analyzer, text);
    }

    /**
     * Bigram tokenization: overlapping pairs of adjacent characters.
     * @throws Exception
     */
    @Test
    public void testCH_2() throws Exception {
        Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);
        String text = "这个论坛很不错";
        this.testAnalyzer(analyzer, text);
    }

    /**
     * IK analyzer: a Chinese analyzer that supports custom dictionaries.
     * @throws Exception
     */
    @Test
    public void testCh_3() throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        String text = "lucene可以做搜索引擎";
        this.testAnalyzer(analyzer, text);
    }

    /**
     * Runs an analyzer over the text and prints the resulting tokens.
     * @param analyzer the analyzer to test
     * @param text the text to analyze, as a string
     * @throws Exception
     */
    private void testAnalyzer(Analyzer analyzer, String text) throws Exception {
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        // Register the attribute once, then read it after each incrementToken()
        TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(termAttribute.term());
        }
        tokenStream.close();
    }
}
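To make the difference between single-character and bigram tokenization concrete, here is a small plain-Java sketch. It is not how ChineseAnalyzer or CJKAnalyzer are implemented; CjkTokenSketch and its two methods are invented names, and the code ignores stop words, non-CJK text, and characters outside the Basic Multilingual Plane. The sample string in main is the sentence used in testCH_1 and testCH_2 above, written as \u escapes so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;

final class CjkTokenSketch {

    /** Single-character tokenization: every character becomes its own token. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Bigram tokenization: overlapping pairs of adjacent characters. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as one token
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The seven-character sample sentence from the analyzer tests
        String text = "\u8FD9\u4E2A\u8BBA\u575B\u5F88\u4E0D\u9519";
        System.out.println(unigrams(text)); // seven one-character tokens
        System.out.println(bigrams(text));  // six overlapping two-character tokens
    }
}
```

Neither scheme knows anything about real words, which is why a dictionary-based analyzer such as IK is usually preferred for Chinese text.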

Highlighter
When testing, use the same analyzer for building the index and for querying.

/**
 * 1. Highlight the keywords.
 * 2. Control the size of the summary.
 *
 * @author Think
 *
 */
public class HighlighterTest {

    @Test
    public void testSearchIndex() throws Exception {
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30,
                new String[] { "title", "content" }, LuceneUtils.analyzer);
        Query query = queryParser.parse("百度");
        /**
         * Set up the highlighter.
         * The prefix and suffix wrap the highlighted text; this only suits web pages,
         * e.g. <font color='red'>方立勋</font>
         */
        Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
        Scorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);

        /**
         * Control the size of the summary (in characters)
         */
        Fragmenter fragmenter = new SimpleFragmenter(20);
        highlighter.setTextFragmenter(fragmenter);

        TopDocs topDocs = indexSearcher.search(query, 10);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        for (int i = 0; i < scoreDocs.length; i++) {
            Document document = indexSearcher.doc(scoreDocs[i].doc);
            /**
             * Using the highlighter; parameters:
             *   LuceneUtils.analyzer
             *      the analyzer that splits out the words to highlight
             *   field
             *      the field to highlight
             *   document.get("title")
             *      the text of the field to highlight
             */
            String titleText = highlighter.getBestFragment(LuceneUtils.analyzer, "title", document.get("title"));
            String contentText = highlighter.getBestFragment(LuceneUtils.analyzer, "content", document.get("content"));
            // getBestFragment returns null when the field contains no query term
            if (titleText != null) {
                document.getField("title").setValue(titleText);
            }
            if (contentText != null) {
                document.getField("content").setValue(contentText);
            }

            Article article = DocumentUtils.document2Article(document);
            articles.add(article);
        }
        indexSearcher.close();
        for (Article article : articles) {
            System.out.println(article.getId());
            System.out.println(article.getTitle());
            System.out.println(article.getContent());
        }
    }
}

Paging search results

public class DispageTest {

    public void testSearchIndex(int firstResult, int maxResult) throws Exception {
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30, new String[]{"title", "content"}, LuceneUtils.analyzer);
        Query query = queryParser.parse("lucene");
        // Fetch enough hits to cover the requested page
        TopDocs topDocs = indexSearcher.search(query, firstResult + maxResult);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        // Avoid an index-out-of-bounds error: never read past the hits actually returned
        int length = Math.min(scoreDocs.length, firstResult + maxResult);

        /**
         * Take one page
         */
        for (int i = firstResult; i < length; i++) {
            Document document = indexSearcher.doc(scoreDocs[i].doc);
            Article article = DocumentUtils.document2Article(document);
            articles.add(article);
        }
        indexSearcher.close();

        for (Article article : articles) {
            System.out.println(article.getId());
            System.out.println(article.getTitle());
            System.out.println(article.getContent());
        }
    }

    @Test
    public void testDispage() throws Exception {
        this.testSearchIndex(20, 10);
    }
}
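The bounds arithmetic is the part of paging that is easiest to get wrong, so it can help to isolate it. The helper below is a sketch in plain Java with no Lucene dependency; PageWindow and its fields are invented names, and "available" stands for however many hits the search actually returned (scoreDocs.length in the test above).

```java
final class PageWindow {
    final int from; // index of the first hit on the page (inclusive)
    final int to;   // index just past the last hit on the page (exclusive)

    private PageWindow(int from, int to) {
        this.from = from;
        this.to = to;
    }

    /**
     * Clamps the requested page to the hits that are actually available,
     * so a loop over [from, to) can never run off the end of the array.
     */
    static PageWindow of(int firstResult, int maxResult, int available) {
        if (firstResult < 0 || maxResult < 0 || available < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        int from = Math.min(firstResult, available);
        // Use long arithmetic so from + maxResult cannot overflow int
        int to = (int) Math.min((long) available, (long) from + maxResult);
        return new PageWindow(from, to);
    }

    int size() {
        return to - from;
    }

    public static void main(String[] args) {
        // Page starting at hit 20, ten per page, but only 25 hits came back
        PageWindow page = PageWindow.of(20, 10, 25);
        System.out.println(page.from + " to " + page.to + ", size " + page.size());
    }
}
```

A page that starts beyond the last hit comes back as an empty window rather than an exception, which is usually what a result page wants.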

Queries
Wildcard query: for example, matching on 百度 from the left (a prefix match).

/**
 * Query types
 *      term (keyword) query
 *      match-all-documents query
 *      range query
 *      wildcard query   (important)
 *      phrase query
 *      boolean query    (important)
 *
 * @author Think
 *
 */
public class QueryTest {
    /**
     * 1. A term query wraps one keyword in an object and searches for exactly that keyword.
     * 2. No analyzer is involved, so the match is case sensitive.
     * @throws Exception
     */
    @Test
    public void testTermQuery() throws Exception {
        Term term = new Term("title", "lucene");
        Query query = new TermQuery(term);
        this.testSearchIndex(query);
    }

    private void testSearchIndex(Query query) throws Exception {
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);
        TopDocs topDocs = indexSearcher.search(query, 28);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        for (int i = 0; i < scoreDocs.length; i++) {
            Document document = indexSearcher.doc(scoreDocs[i].doc);
            Article article = DocumentUtils.document2Article(document);
            articles.add(article);
        }
        indexSearcher.close();
        for (Article article : articles) {
            System.out.println(article.getId());
            System.out.println(article.getTitle());
            System.out.println(article.getContent());
        }
    }

    @Test
    public void testQueryAllDocs() throws Exception {
        Query query = new MatchAllDocsQuery();
        this.testSearchIndex(query);
    }

    /**
     * * matches any number of arbitrary characters
     * ? matches exactly one arbitrary character
     * @throws Exception
     */
    @Test
    public void testQueryWildCard() throws Exception {
        Term term = new Term("title", "l*?");
        Query query = new WildcardQuery(term);
        this.testSearchIndex(query);
    }

    /**
     * Phrase query
     *    1. All terms of a phrase query must target the same field.
     *    2. With two or more terms, give the position of each term after analysis.
     */
    @Test
    public void testQueryPhrase() throws Exception {
        Term term = new Term("title", "lucene");
        Term term2 = new Term("title", "搜索");
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(term, 0);
        phraseQuery.add(term2, 4);
        this.testSearchIndex(phraseQuery);
    }

    /**
     * Boolean query
     *  Occur.MUST      the clause must match
     *  Occur.MUST_NOT  the clause must not match
     *  Occur.SHOULD    the clause may or may not match (OR)
     */
    @Test
    public void testBooleanQuery() throws Exception {
        // These terms contain no * or ?, so each WildcardQuery matches the exact term only
        Term term = new Term("title", "北京");
        Query query = new WildcardQuery(term);

        Term term2 = new Term("title", "美女");
        Query query2 = new WildcardQuery(term2);

        Term term3 = new Term("title", "北京美女");
        Query query3 = new WildcardQuery(term3);

        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(query, Occur.SHOULD);
        booleanQuery.add(query2, Occur.SHOULD);
        booleanQuery.add(query3, Occur.SHOULD);
        this.testSearchIndex(booleanQuery);
    }

    /**
     * Range query
     * Only matches if "id" was indexed as a numeric field (for example with NumericField);
     * an id indexed as a plain string Field, as in ArticleIndex, will not match.
     */
    @Test
    public void testQueryRange() throws Exception {
        Query query = NumericRangeQuery.newLongRange("id", 5L, 15L, true, true);
        this.testSearchIndex(query);
    }
}

Sorting by relevance score

/**
 * 1. Same keyword, same document structure:
 *       the scores are the same.
 * 2. Same structure, different keywords:
 *       the scores differ (lucene and 搜索 score differently; in the author's
 *       experience Chinese terms usually score higher than English ones).
 * 3. Different structure, same keyword:
 *       the more often the keyword occurs, the higher the score.
 * 4. Paid ranking: when adding a document to the index, set its boost.
 *    The relevance score is multiplied by that value, which raises the score.
 * @author Think
 *
 */
public class SortTest {
    @Test
    public void testSearchIndex() throws Exception {
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.directory);
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_30, new String[]{"title", "content"}, LuceneUtils.analyzer);
        Query query = queryParser.parse("lucene");
        TopDocs topDocs = indexSearcher.search(query, 28);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articles = new ArrayList<Article>();
        for (int i = 0; i < scoreDocs.length; i++) {
            // Hits come back ordered by descending relevance score
            System.out.println(scoreDocs[i].score);
            Document document = indexSearcher.doc(scoreDocs[i].doc);
            Article article = DocumentUtils.document2Article(document);
            articles.add(article);
        }
        indexSearcher.close();
        for (Article article : articles) {
            System.out.println(article.getId());
            System.out.println(article.getTitle());
            System.out.println(article.getContent());
        }
    }
}

The bbs project threw exceptions even though the code checked out fine; setting the workspace encoding to UTF-8 solved it.
