
A Lucene Getting-Started Example

Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
    
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();


The Lucene API is divided into several packages:

org.apache.lucene.analysis
defines an abstract Analyzer API for converting text from a Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. analyzers-common provides a number of Analyzer implementations, including StopAnalyzer and the grammar-based StandardAnalyzer.
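
To make the tokenization step concrete, here is a minimal sketch (not from the original article) that runs text through an Analyzer by hand; it assumes the same Lucene 4.x-era API used in the example above, plus StringReader and CharTermAttribute:

    // Feed text through the analyzer and print each resulting token.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    TokenStream stream = analyzer.tokenStream("fieldname", new StringReader("This is the text to be indexed."));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString());   // stop words such as "this" and "is" are dropped by StandardAnalyzer
    }
    stream.end();
    stream.close();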

org.apache.lucene.codecs
provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.

org.apache.lucene.document
provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of Reader.
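
As a small illustrative sketch (the field names and file path are hypothetical, not from the article), a document usually mixes an exact, stored identifier field with an analyzed content field:

    Document doc = new Document();
    doc.add(new StringField("path", "/data/readme.txt", Field.Store.YES));    // stored as-is, not tokenized
    doc.add(new TextField("contents", new FileReader("/data/readme.txt")));   // tokenized for search, not stored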

org.apache.lucene.index
provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
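
A minimal sketch of reading data back through IndexReader, assuming the directory populated by the quick-start example above:

    DirectoryReader reader = DirectoryReader.open(directory);
    System.out.println(reader.numDocs());    // number of live (non-deleted) documents
    Document first = reader.document(0);     // stored fields of the first document
    System.out.println(first.get("fieldname"));
    reader.close();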

org.apache.lucene.search
provides data structures to represent queries (i.e. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
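
A hedged sketch (the searcher variable, field name and terms are hypothetical) of building the same kind of query programmatically rather than through QueryParser:

    Query word = new TermQuery(new Term("contents", "manhattan"));
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("contents", "clam"));
    phrase.add(new Term("contents", "chowder"));
    BooleanQuery combined = new BooleanQuery();
    combined.add(phrase, BooleanClause.Occur.MUST);     // +"clam chowder"
    combined.add(word, BooleanClause.Occur.MUST);       // +manhattan
    TopDocs topDocs = searcher.search(combined, 10);    // top 10 hits
    System.out.println(topDocs.totalHits);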

org.apache.lucene.store
defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, including FSDirectory, which uses a file system directory to store files, and RAMDirectory, which implements files as memory-resident data structures.
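
A small sketch of choosing the Directory implementation at startup; the useRamIndex flag and the path are hypothetical, and FSDirectory.open(File) is the 4.x-era signature assumed here:

    Directory directory;
    if (useRamIndex) {
      directory = new RAMDirectory();                             // index kept entirely in memory
    } else {
      directory = FSDirectory.open(new File("/tmp/testindex"));   // index stored on disk
    }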

org.apache.lucene.util
contains a few handy data structures and util classes, i.e. OpenBitSet and PriorityQueue.

To use Lucene, an application should:

1. Create Documents by adding Fields;
2. Create an IndexWriter and add documents to it with addDocument();
3. Call QueryParser.parse() to build a query from a string; and
4. Create an IndexSearcher and pass the query to its search() method.

Some simple examples of code which does this are:

IndexFiles.java creates an index for all the files contained in a directory.

SearchFiles.java prompts for queries and searches an index.

To demonstrate these, try something like:
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups

adding rec.food.recipes/soups/abalone-chowder

[ ... ]
> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles

Query: chowder

Searching for: chowder

34 total matching documents

1. rec.food.recipes/soups/spam-chowder

[ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan

Searching for: +"clam chowder" +manhattan

2 total matching documents

1. rec.food.recipes/soups/clam-chowder

[ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]

[ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
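
As a hedged illustration of how the demo's query strings become Query objects (assuming the demo's contents field and an analyzer like the one in the quick-start example):

    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
    Query query = parser.parse("+\"clam chowder\" +manhattan");    // equivalent to: "clam chowder" AND Manhattan
    System.out.println("Searching for: " + query.toString("contents"));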

public final class Document
extends Object
implements Iterable<IndexableField>

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.

public final void add(IndexableField field)

Adds a field to a document. Several fields may be added with the same name. In this case, if the fields are indexed, their text is treated as though appended for the purposes of search.

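
For example, in a minimal sketch with hypothetical field values, two fields added under the same name are searched as if their text had been appended:

    Document doc = new Document();
    doc.add(new TextField("tags", "lucene", Field.Store.YES));
    doc.add(new TextField("tags", "full-text search", Field.Store.YES));
    // With a StandardAnalyzer, a search for tags:lucene or tags:search matches this document.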

Note that add, like the removeField(s) methods, only makes sense prior to adding a document to an index. These methods cannot be used to change the content of an existing index! In order to achieve this, a document has to be deleted from an index and a new changed version of that document has to be added.


public class IndexWriter
extends Object
implements Closeable, TwoPhaseCommit

An IndexWriter creates and maintains an index.

The IndexWriterConfig.OpenMode option on IndexWriterConfig.setOpenMode(OpenMode) determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with IndexWriterConfig.OpenMode.CREATE even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. If IndexWriterConfig.OpenMode.CREATE_OR_APPEND is used, IndexWriter will create a new index if there is not already an index at the provided path and otherwise open the existing index.
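
A short sketch of selecting the open mode through IndexWriterConfig (reusing the analyzer and directory from the quick-start example):

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);   // append if an index exists, otherwise create one
    IndexWriter writer = new IndexWriter(directory, config);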

In either case, documents are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updateDocument (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, close should be called.
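
A minimal sketch of these calls; the id field and its values are hypothetical:

    writer.addDocument(doc);                              // add a new document
    writer.deleteDocuments(new Term("id", "42"));         // remove every document whose id field is "42"
    writer.updateDocument(new Term("id", "43"), doc);     // delete the old version, then add the new one
    writer.close();                                       // commit pending changes and release the write lock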

These changes are buffered in memory and periodically flushed to the Directory (during the above method calls). A flush is triggered when there are enough added documents since the last flush. Flushing is triggered either by RAM usage of the documents (see IndexWriterConfig.setRAMBufferSizeMB(double)) or the number of added documents (see IndexWriterConfig.setMaxBufferedDocs(int)). The default is to flush when RAM usage hits IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. Additionally, if IndexWriter reaches the configured number of buffered deletes (see IndexWriterConfig.setMaxBufferedDeleteTerms(int)), the deleted terms and queries are flushed and applied to existing segments. In contrast to the other flush options IndexWriterConfig.setRAMBufferSizeMB(double) and IndexWriterConfig.setMaxBufferedDocs(int), deleted terms won't trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see below for changing the MergeScheduler).
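
A sketch of flushing by RAM usage only; the 48 MB figure is an arbitrary illustrative value, not a recommendation:

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    config.setRAMBufferSizeMB(48.0);                                    // flush once buffered documents use ~48 MB
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);    // do not flush by document count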

Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same directory will lead to a LockObtainFailedException. The LockObtainFailedException is also thrown if an IndexReader on the same directory is used to delete documents from the index.

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.

Expert: IndexWriter allows you to separately change the MergePolicy and the MergeScheduler. The MergePolicy is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a MergePolicy.MergeSpecification describing the merges. The default is LogByteSizeMergePolicy. Then, the MergeScheduler is invoked with the requested merges and it decides when and how to run the merges. The default is ConcurrentMergeScheduler.
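
A sketch of overriding both components through IndexWriterConfig; the implementations chosen here are just the defaults named above, shown for illustration:

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
    config.setMergePolicy(new LogByteSizeMergePolicy());         // decides which merges to do
    config.setMergeScheduler(new ConcurrentMergeScheduler());    // decides when/how to run them (background threads)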

NOTE: if you hit an OutOfMemoryError then IndexWriter will quietly record this fact and block all future segment commits. This is a defensive measure in case any internal state (buffered documents and deletions) were corrupted. Any subsequent calls to commit() will throw an IllegalStateException. The only course of action is to call close(), which internally will call rollback(), to undo any changes to the index since the last commit. You can also just call rollback() directly.

NOTE: IndexWriter instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexWriter instance as this may cause deadlock; use your own (non-Lucene) objects instead.

NOTE: If you call Thread.interrupt() on a thread that's within IndexWriter, IndexWriter will try to catch this (e.g., if it's in a wait() or Thread.sleep()), and will then throw the unchecked exception ThreadInterruptedException and clear the interrupt status on the thread.

// Document.add(): the parameter type IndexableField is an abstract interface;
// the concrete argument is normally a Field subclass (e.g. TextField), which implements all of its methods.
 public final void add(IndexableField field) {
    fields.add(field);
  }
 // Private member of Document: the fields are kept in an ArrayList
 private final List<IndexableField> fields = new ArrayList<IndexableField>();
 
 // The IndexableField interface
 public interface IndexableField {

  /** Field name */
  public String name();

  /** {@link IndexableFieldType} describing the properties
   * of this field. */
  public IndexableFieldType fieldType();
  
  /** 
   * Returns the field's index-time boost.
   * <p>
   * Only fields can have an index-time boost, if you want to simulate
   * a "document boost", then you must pre-multiply it across all the
   * relevant fields yourself. 
   * <p>The boost is used to compute the norm factor for the field.  By
   * default, in the {@link Similarity#computeNorm(FieldInvertState)} method, 
   * the boost value is multiplied by the length normalization factor and then
   * rounded by {@link DefaultSimilarity#encodeNormValue(float)} before it is stored in the
   * index.  One should attempt to ensure that this product does not overflow
   * the range of that encoding.
   * <p>
   * It is illegal to return a boost other than 1.0f for a field that is not
   * indexed ({@link IndexableFieldType#indexed()} is false) or omits normalization values
   * ({@link IndexableFieldType#omitNorms()} returns true).
   *
   * @see Similarity#computeNorm(FieldInvertState)
   * @see DefaultSimilarity#encodeNormValue(float)
   */
  public float boost();

  /** Non-null if this field has a binary value */
  public BytesRef binaryValue();

  /** Non-null if this field has a string value */
  public String stringValue();

  /** Non-null if this field has a Reader value */
  public Reader readerValue();

  /** Non-null if this field has a numeric value */
  public Number numericValue();

  /**
   * Creates the TokenStream used for indexing this field.  If appropriate,
   * implementations should use the given Analyzer to create the TokenStreams.
   *
   * @param analyzer Analyzer that should be used to create the TokenStreams from
   * @return TokenStream value for indexing the document.  Should always return
   *         a non-null value if the field is to be indexed
   * @throws IOException Can be thrown while creating the TokenStream
   */
  public TokenStream tokenStream(Analyzer analyzer) throws IOException;
}

// IndexWriter: addDocument is implemented as an update with a null delete term
public void addDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer) throws IOException {
    updateDocument(null, doc, analyzer);
  } 

 public TextField(String name, Reader reader) {
    super(name, reader, TYPE_NOT_STORED);
  }


// DocumentsWriterPerThread constructor

public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent,
      FieldInfos.Builder fieldInfos, IndexingChain indexingChain) {
    this.directoryOrig = directory;
    this.directory = new TrackingDirectoryWrapper(directory);
    this.parent = parent;
    this.fieldInfos = fieldInfos;
    this.writer = parent.indexWriter;
    this.infoStream = parent.infoStream;
    this.codec = parent.codec;
    this.docState = new DocState(this, infoStream);
    this.docState.similarity = parent.indexWriter.getConfig().getSimilarity();
    bytesUsed = Counter.newCounter();
    byteBlockAllocator = new DirectTrackingAllocator(bytesUsed);
    pendingDeletes = new BufferedDeletes();
    intBlockAllocator = new IntBlockAllocator(bytesUsed);
    initialize();
    // this should be the last call in the ctor 
    // it really sucks that we need to pull this within the ctor and pass this ref to the chain!
    consumer = indexingChain.getChain(this);
  }


// Inside DocumentsWriterPerThread: the default indexing chain
static final IndexingChain defaultIndexingChain = new IndexingChain() {

    @Override
    DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) {
      /*
      This is the current indexing chain:

      DocConsumer / DocConsumerPerThread
        --> code: DocFieldProcessor
          --> DocFieldConsumer / DocFieldConsumerPerField
            --> code: DocFieldConsumers / DocFieldConsumersPerField
              --> code: DocInverter / DocInverterPerField
                --> InvertedDocConsumer / InvertedDocConsumerPerField
                  --> code: TermsHash / TermsHashPerField
                    --> TermsHashConsumer / TermsHashConsumerPerField
                      --> code: FreqProxTermsWriter / FreqProxTermsWriterPerField
                      --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerField
                --> InvertedDocEndConsumer / InvertedDocConsumerPerField
                  --> code: NormsConsumer / NormsConsumerPerField
          --> StoredFieldsConsumer
            --> TwoStoredFieldConsumers
              -> code: StoredFieldsProcessor
              -> code: DocValuesProcessor
    */

    // Build up indexing chain:

      final TermsHashConsumer termVectorsWriter = new TermVectorsConsumer(documentsWriterPerThread);
      final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter();

      final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true,
                                                          new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null));
      final NormsConsumer normsWriter = new NormsConsumer();
      final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter);
      final StoredFieldsConsumer storedFields = new TwoStoredFieldsConsumers(
                                                      new StoredFieldsProcessor(documentsWriterPerThread),
                                                      new DocValuesProcessor(documentsWriterPerThread.bytesUsed));
      return new DocFieldProcessor(documentsWriterPerThread, docInverter, storedFields);
    }
  };