
Introduction to Zettair (a full-text search engine: an efficient inverted-index implementation written in C)

2013-09-23 13:56
This article can be downloaded from: http://download.csdn.net/detail/yijiyong100/6307159

1 Introduction to Zettair

1.1 Zettair in brief

Zettair is an open-source full-text search engine built on an inverted index structure, developed at RMIT University (Royal Melbourne Institute of Technology). Search engines are usually based on a special structure, called an inverted index, which is used to answer queries quickly. There are two disadvantages resulting from this approach. First, the inverted index structure must be constructed prior to searching it, and secondly, the index structure takes up additional space on a computer's hard disk. However, both problems are negligible if an index is queried a few hundred times a day and can be used to find information that would otherwise be lost in the depths of a pile of documents.

An inverted index is a well-researched and well-understood structure. It is documented and discussed in a number of research papers and books, such as MG ("Managing Gigabytes").
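To make the structure concrete, here is a minimal sketch of an inverted index in Python. It is not Zettair's implementation (Zettair is written in C and stores compressed postings on disk); the function names and the three sample sentences are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical three-document collection.
docs = [
    "Call me Ishmael.",
    "The white whale swam away.",
    "A whale is not a fish.",
]
index = build_inverted_index(docs)
print(index["whale"])
```

Answering a query for a term is then a single dictionary lookup instead of a scan over every document, which is the trade-off described above: the index costs time to build and space to store, but each query becomes cheap.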

These pages begin with a tutorial overview of using Zettair. They then document in more detail how to use Zettair to build an index and how to query that index. There are also some pointers for those wishing to hack the Zettair source code.

Feel free to drop us (the Search Engine Group) a line at zettair@cs.rmit.edu.au if you have any questions or comments, or visit the Zettair home page for more information:
http://www.seg.rmit.edu.au/zettair/index.html

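1.2 Advantages of the Zettair engine

A comparison of the performance of various open-source search engines has been published online:
http://hi.baidu.com/shichunqi/item/9666691d060c434b6926bb1b

According to that comparison, Zettair, which is written in C, performs well in CPU usage, memory usage, index storage size, and query time. It also supports incremental indexing, result summaries, file type selection, stemming, result ranking, ranking strategies, and search types. The analysis measured overall search engine performance on a small dataset (TREC-4) and a large one (WT10g), and rated Zettair as one of the most complete open-source engines.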

1.3 Zettair quick-start tutorial

1.3.1 Prerequisites

This page contains a tutorial-style introduction to Zettair. It will show you how to index a document collection using Zettair, and how to run queries against that index. It only introduces the basic functionality for each use. For more details, see the sections on building and searching indexes with Zettair.

This tutorial assumes that you have downloaded, unpacked, compiled, and installed the Zettair distribution. If you have not done so, please grab Zettair from the Zettair home page, unpack it, and follow the installation instructions in the INSTALL file contained within the distribution. We also assume that the zet executable has been installed in your PATH. If not, use the full path for the executable (for instance, /usr/local/zettair/bin/zet).

1.3.2 Building an index

To help you get up and running quickly with Zettair, we have included the full text of Herman Melville's Moby Dick as a sample document collection for you to play around with. It can be found in the txt subdirectory of the Zettair distribution.

Let's begin by indexing Moby Dick. To do this, change your current directory to txt. (You can index it from anywhere, but this is simplest.) We'll assume that the zet executable is in your PATH; otherwise, substitute the full pathname to the executable wherever you see 'zet' below. So, let's build this index:

$ zet -i moby.txt

The '-i' argument tells zet that we're building a new index. The text of Moby Dick is less than 1.3 MB in length, so this won't take long to run. Zettair is more used to working with document collections of 10 GB or more, but it won't complain. When it has finished running, you should see four new files in the current directory, all prefixed with "index". These are Zettair's index files.

1.3.3 Querying the index

So now we're ready to run some queries. To do this, we run zet again, this time without any options:

$ zet

Zettair will load up the index (very quickly, in this case), and then prompt you for input. Let's test the rumour that Moby Dick has something to say about whales:

> whale

1. Chapter 32, Paragraph 46 (score 0.814503, docid 713)

2. Chapter 32, Paragraph 23 (score 0.687340, docid 690)

3. Chapter 32, Paragraph 25 (score 0.542362, docid 692)

4. Chapter 32, Paragraph 8 (score 0.489850, docid 675)

5. Chapter 32, Paragraph 22 (score 0.488983, docid 689)

6. Chapter 32, Paragraph 26 (score 0.484616, docid 693)

7. Chapter 75, Paragraph 10 (score 0.453542, docid 1552)

8. Chapter 32, Paragraph 21 (score 0.433975, docid 688)

9. Chapter 41, Paragraph 7 (score 0.403410, docid 875)

10. Chapter 81, Paragraph 47 (score 0.402218, docid 1646)

11. Chapter 41, Paragraph 3 (score 0.378583, docid 871)

12. Chapter 56, Paragraph 5 (score 0.367106, docid 1236)

13. Chapter 0, Paragraph 74 (score 0.340201, docid 74)

14. Chapter 45, Paragraph 17 (score 0.333519, docid 969)

15. Chapter 32, Paragraph 35 (score 0.332929, docid 702)

16. Chapter 45, Paragraph 5 (score 0.331750, docid 957)

17. Chapter 87, Paragraph 21 (score 0.330964, docid 1723)

18. Chapter 91, Paragraph 6 (score 0.327630, docid 1796)

19. Chapter 55, Paragraph 7 (score 0.326381, docid 1223)

20. Chapter 68, Paragraph 6 (score 0.324882, docid 1411)

20 results of 791 shown (took 0.000702 seconds)
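If you want to post-process a result list like the one above, each result line can be picked apart with a regular expression. The sketch below assumes the line layout is exactly what this transcript shows (rank, title, score, docid); the helper name `parse_result_line` is made up, and other Zettair versions or options (such as summaries) may print something different.

```python
import re

# Pattern for lines such as:
#   1. Chapter 32, Paragraph 46 (score 0.814503, docid 713)
RESULT_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<title>.*?)\s+"
    r"\(score (?P<score>[0-9.]+), docid (?P<docid>\d+)\)"
)


def parse_result_line(line):
    """Return a dict for one result line, or None if the line does not match."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "title": match.group("title"),
        "score": float(match.group("score")),
        "docid": int(match.group("docid")),
    }


first = parse_result_line("1. Chapter 32, Paragraph 46 (score 0.814503, docid 713)")
print(first["docid"])
```

The docid extracted this way is the value you would pass to the 'cache' directive to retrieve the document text.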

This tells us that the word "whale" occurs in 791 documents in the collection (which is to say, paragraphs in Moby Dick). Zettair thinks the most pertinent paragraph is paragraph 46 of chapter 32. We can ask Zettair to print out this document using the 'cache' directive and specifying the document's docid:

> [cache:713]
<DOC>
<DOCNO>Chapter 32, Paragraph 46</DOCNO>
Beyond the DUODECIMO, this system does not proceed, inasmuch as the Porpoise is the smallest of the whales. Above, you have all the Leviathans of note. But there are a rabble of uncertain, fugitive, half-fabulous whales, which, as an American whaleman, I know by reputation, but not personally. I shall enumerate them by their fore-castle appellations; for possibly such a list may be valuable to future investigators, who may complete what I have here but begun. If any of the following whales, shall hereafter be caught and marked, then he can readily be incorporated into this System, according to his Folio, Octavo, or Duodecimo magnitude:--The Bottle-Nose Whale; the Junk Whale; the Pudding-Headed Whale; the Cape Whale; the Leading Whale; the Cannon Whale; the Scragg Whale; the Coppered Whale; the Elephant Whale; the Iceberg Whale; the Quog Whale; the Blue Whale; etc. From Icelandic, Dutch, and old English authorities, there might be quoted other lists of uncertain whales, blessed with all manner of uncouth names. But I omit them as altogether obsolete; and can hardly help suspecting them for mere sounds, full of Leviathanism, but signifying nothing.
</DOC>

Don't worry about the <DOC> and <DOCNO> tags: they are just part of the TREC format we've used to mark up Moby Dick for indexing (TREC is the Text Retrieval Conference). You'll notice that the word 'whale' occurs often, which is why Zettair thinks this is probably the paragraph you're looking for.

1.3.4 Multi-word queries

You can, of course, query for more than one word at a time. Say we were looking for a particular kind of whale:

> white whale

1. Chapter 42, Paragraph 4 (score 1.429675, docid 897) [...]

20. Chapter 48, Paragraph 2 (score 0.752030, docid 1002)

20 results of 852 shown (took 0.000801 seconds)

Hmm, 852 paragraphs, but "whale" only occurs in 791! What Zettair is reporting here is all the documents with either "white" or "whale" in them. We can specify that we only want documents in which both words occur:

> white AND whale

1. Chapter 59, Paragraph 4 (score 1.255408, docid 1269) [...]

20. Chapter 54, Paragraph 88 (score 0.806199, docid 1191)

20 results of 130 shown (took 0.000330 seconds)

Or, probably more to the point, we can ask only for documents in which the exact phrase "white whale" occurs, by enclosing the phrase in double quotes:

> "white whale"

1. Chapter 36, Paragraph 41 (score 1.357970, docid 789) [...]

20. Chapter 52, Paragraph 4 (score 0.840175, docid 1088)

20 results of 91 shown (took 0.000307 seconds)
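The shrinking result counts (852 for the default OR, 130 for AND, 91 for the phrase) follow from how the three query types combine postings. The sketch below illustrates that logic on a made-up four-document collection using a positional index. It only computes which documents match; it does not rank them, and it is not how Zettair evaluates queries internally.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} for a list of document strings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_or(index, terms):
    """Documents containing at least one of the terms."""
    return set().union(*(index.get(t, {}).keys() for t in terms))


def match_and(index, terms):
    """Documents containing every term, in any order."""
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_phrase(index, terms):
    """Documents where the terms occur consecutively, in the given order."""
    hits = set()
    for doc_id in match_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Hypothetical collection, not taken from Moby Dick.
docs = ["the white whale", "a white sail near a whale", "white paint", "the whale"]
idx = build_positional_index(docs)
print(len(match_or(idx, ["white", "whale"])),
      len(match_and(idx, ["white", "whale"])),
      len(match_phrase(idx, ["white", "whale"])))
```

Every phrase match is also an AND match, and every AND match is also an OR match, so the counts can only decrease as the query becomes stricter.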

1.3.5 Document summaries

This is great so far (or at least, we hope you think so), but it gets tiresome having to individually request each document to see if it's what we're looking for, especially if the documents are longer than a single paragraph. What we really want is for the list of results to include a summary of each document, and we can ask Zettair to provide just that.

To do so, we'll have to restart Zettair. Hit CONTROL-D, or whatever key combination indicates end of input on your system, to end your current session. This time, we'll run the zet executable with the '--summary' option to indicate that we'd like to see document summaries, and what form we want these summaries to be in. We'll also restrict output to just the top 2 results:

$ zet --summary=capitalise -n 2

Zettair can highlight your search terms within the document summaries in a number of different ways, capitalise being one of them. So, let's try out some summaries:

> ship sea storm

1. Chapter 9, Paragraph 18 (score 2.973852, docid 261)
A dreadful STORM comes on, the SHIP is like to break... He sees no black sky and raging SEA, feels not the reeling timbers, and little hears he or heeds he the far rush of the mighty whale, which even now with open mouth is cleaving the SEAS after him.

2. Chapter 121, Paragraph 4 (score 2.431819, docid 2294)
What's the mighty difference between holding a mast's lightning-rod in the STORM, and standing close by a mast that hasn't got any lightning-rod at all in a STORM?

2 results of 650 shown (took 0.002139 seconds)

> "dark blue ocean"

1. Chapter 35, Paragraph 11 (score 4.953089, docid 745)
"Roll on, thou deep and DARK BLUE OCEAN, roll! Ten thousand blubber-hunters sweep over thee in vain."

1 results of 1 shown (took 0.002945 seconds)
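The highlighting styles named by the '--summary' option (plain, capitalise, tag) can be imitated in a few lines. This is only a sketch of the visible effect, with a made-up `highlight` function. It matches whole words exactly, whereas the transcript above shows Zettair also capitalising "SEAS" for the query term "sea", so Zettair's own matching is evidently more lenient.

```python
import re


def highlight(summary, terms, mode="capitalise"):
    """Highlight query terms in a summary, imitating the styles described above."""
    if mode == "plain" or not terms:
        return summary  # plain: show the summary without highlighting
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    if mode == "capitalise":
        return pattern.sub(lambda m: m.group(0).upper(), summary)
    if mode == "tag":
        return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", summary)
    raise ValueError("unknown highlight mode: " + mode)


text = "A dreadful storm comes on, the ship is like to break"
print(highlight(text, ["ship", "sea", "storm"]))
print(highlight(text, ["ship", "sea", "storm"], mode="tag"))
```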

And that concludes our tour.

1.4 Building indexes with Zettair

This section describes index construction with Zettair in detail.

1.4.1 Index construction

Zettair can build inverted indexes by parsing different types of source collections. Please read the format descriptions to understand fully how an index is constructed from the given data. Currently, the following index types are supported:

HTML

TREC

Multiple files can be given:

Usage: zet -i file1 ... fileN

Index construction options

-i,--index

Put Zettair into index construction mode (as opposed to searching mode).

file1 ... fileN

The given files (file1 ... fileN) are the files to index for searching. If no files are given, a list of filenames, separated by whitespace, is read from stdin. This allows you to pipe in a list of filenames to index from a file or shell command. The command:

find . -name "*.c" -or -name "*.h" | ./zet -i -f source_index

would find all files with .c and .h extensions and index them, placing the result into a set of files whose names start with source_index.

-f,--filename prefix

Give the name of the index to use. If no name is given, 'index' will be used as the default. The prefix can include directory path components.

-c,--config config_file

Use this configuration file for the parser. The configuration file determines which tags the parser attempts to extract text from. The format is a simple text file where the name of a tag (minus the angled brackets) is followed by a number that indicates whether parsing should be turned on or off after this tag. See config/psettings.xml for an example.

--big-and-fast

Causes Zettair to use around 500MB of memory during indexing (by default, around 20MB is used).

-a,--add

Allow Zettair to add new postings to an existing index. By default, this causes an error.

--stem{ none | eds | light | porters }

Use the given stemming algorithm during index construction. none means no stemming. eds removes 'e', 'ed', and 's'. light is a custom stemmer that is fast, but slightly less effective than Porter's stemming. Porter's stemming is a slow, complex, well-known stemming algorithm.

--anh-impact

Generate impact-ordered inverted lists during construction. This is required to use impact-ordered evaluation during querying.

-t { TREC | HTML }

Select the type of the index, TREC or HTML (default: autodetect).
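The stdin file-list mode described for 'file1 ... fileN' can also be driven from a script instead of find. The sketch below only prepares the whitespace-separated list of filenames; the helper names are made up, and it assumes no path contains whitespace, since the list is split on whitespace.

```python
import os


def source_files(root, extensions=(".c", ".h")):
    """Return sorted paths under root whose names end with one of the extensions."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def as_stdin_payload(paths):
    """Join paths with newlines, giving a whitespace-separated list of filenames."""
    return "\n".join(paths) + "\n"
```

The resulting text plays the role of the find output in the example above. It could, for instance, be passed as the `input` argument of Python's `subprocess.run` when launching `zet -i -f source_index`.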

Sample Command Line:

zet -i -f disk45 -c /research/zettair/config/parser_settings.trec -t TREC /research/TREC/disk45/fbis /research/TREC/disk45/fr /research/TREC/disk45/ft /research/TREC/disk45/latimes

This command will use the TREC parser to create an inverted index from the four listed files. You should then find the following index files:

disk45.map

disk45.v.0

disk45.param.0

disk45.vocab.0

1.4.2 Index document types

1. HTML Format

The HTML parser treats each file as one document in HTML format. Text is extracted from HTML documents according to the parser settings file, documented above.

2. TREC Format

It is often advantageous to combine several (thousand) documents in one file and be able to index and search a single file rather than a few thousand files. This can be done by writing the contents of several files into one file and formatting that file in such a way that the original document boundaries can be detected by the parser. The parser will extract words from the given file in much the same way as in HTML mode. Additionally, the TREC parser looks for the tags <DOC> and </DOC> to signal the beginning and end of a document, and identifies documents via their TREC document number, which is found between <DOCNO> and </DOCNO> tags. The TREC format is so named because it is the format used by the Text Retrieval Conference (TREC) for experimental data.

The following excerpt from the Bible represents, for instance, 8 documents (of which 4 documents contain only one word).

<DOC> And the sons of Noah, that went forth of the ark, were Shem, and Ham,

and Japheth: and Ham is the father of Canaan. </DOC>

<DOC> genesis </DOC>

<DOC> These are the three sons of Noah: and of them was the whole earth overspread.</DOC>

<DOC> genesis </DOC>

<DOC> And Noah began to be an husbandman, and he planted a vineyard:</DOC>

<DOC> genesis </DOC>

<DOC> And he drank of the wine, and was drunken; and he was uncovered within his tent.</DOC>

<DOC> genesis </DOC>
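Producing such a file from your own data is mostly string formatting. The sketch below wraps (document number, text) pairs in <DOC> and <DOCNO> tags, in the same shape as the Moby Dick document shown in the tutorial. The function name is made up, and the sketch assumes the text itself contains no tags that the parser might interpret.

```python
def to_trec(docs):
    """Wrap (docno, text) pairs in <DOC>/<DOCNO> markup, one block per document."""
    parts = []
    for docno, text in docs:
        parts.append(
            "<DOC>\n<DOCNO>{}</DOCNO>\n{}\n</DOC>\n".format(docno, text.strip())
        )
    return "".join(parts)


# Hypothetical two-paragraph collection.
collection = to_trec([
    ("Chapter 1, Paragraph 1", "Call me Ishmael."),
    ("Chapter 1, Paragraph 2", "There now is your insular city of the Manhattoes."),
])
print(collection)
```

Writing the returned string to a file gives a single input that could then be indexed with 'zet -i -t TREC', with each paragraph becoming a separately retrievable document.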

1.5 Querying indexes with Zettair

This section documents in detail how you can use Zettair to query an inverted index. There are two executables that can be used for querying indexes built by Zettair:

zet Used for general querying

zet_trec Used for TREC experiments. The input is a TREC topic file, and the output is in a format that can be used with the trec_eval program.

1.5.1 Similarity metric options

These can be used with either zet or zet_trec to change the similarity metric used by Zettair.

--okapi

Use the Okapi BM25 metric.

-1,--k1=floatnum

Set the k1 parameter for the Okapi BM25 metric to the specified floating point value.

-b,--b=floatnum

Set the b parameter for the Okapi BM25 metric to the specified floating point value.

-3,--k3=floatnum

Set the k3 parameter for the Okapi BM25 metric to the specified floating point value.

--pivoted-cosine=floatnum

Use the pivoted cosine metric, with the pivot provided as a floating point value.

--cosine

Use the cosine metric.

--hawkapi=floatnum

Use Dave Hawking's adaptation of the Okapi BM25 metric, with the alpha value provided as a floating point number.

--anh-impact

Use Anh and Moffat's impact-ordered evaluation, including separate metric. --anh-impact must have been used when building the index in order to employ impact-ordered query evaluation.

--dirichlet=uintnum

Use the Dirichlet-smoothed, query-likelihood language modelling metric with mu value given as an unsigned integer.
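To show what the k1, b, k3 and mu parameters control, here is a sketch of the per-term contribution under the commonly published forms of Okapi BM25 and Dirichlet-smoothed query likelihood. Zettair's exact formulation may differ in detail, and the default values below are common textbook choices, not confirmed Zettair defaults.

```python
import math


def bm25_term(tf, df, doc_len, avg_doc_len, num_docs,
              qtf=1, k1=1.2, b=0.75, k3=7.0):
    """Score contribution of one query term to one document (textbook BM25).

    tf: term frequency in the document; df: number of documents containing
    the term; qtf: term frequency in the query. Defaults are assumptions.
    """
    if tf == 0:
        return 0.0
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalised; k1 caps tf growth.
    length_norm = (1.0 - b) + b * doc_len / avg_doc_len
    tf_part = (k1 + 1.0) * tf / (k1 * length_norm + tf)
    # k3 plays the same saturating role for repeated query terms.
    query_part = (k3 + 1.0) * qtf / (k3 + qtf)
    return idf * tf_part * query_part


def dirichlet_term(tf, doc_len, collection_prob, mu):
    """Log-probability of one query term under Dirichlet smoothing.

    collection_prob is the term's relative frequency in the whole collection.
    """
    return math.log((tf + mu * collection_prob) / (doc_len + mu))


print(bm25_term(tf=3, df=10, doc_len=120, avg_doc_len=100, num_docs=1000))
```

A document's score for a query is the sum of these per-term values over the query terms. With b set to 0, document length has no effect on the BM25 score, and a larger mu makes the Dirichlet estimate lean more heavily on collection statistics.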

1.5.2 Querying an index

Usage: zet [query1 ... queryN]

Index querying options:

-f prefix

Give the name of the index to use. If no name is given, 'index' is used by default. The prefix may contain directory path elements.

-n results

Sets the maximum number of results returned in response to each query. The default is 20.

--query-list=filename

Instructs Zettair to read queries from the given file, instead of from stdin.

--query-stop=[filename]

Uses the words contained in the given file as stop words (not evaluated) during querying. If no filename is given, a default stop list is loaded.

--big-and-fast

Instructs Zettair to use approximately 500MB of memory while querying. The default memory usage should be around 20MB.

-b first_result

Sets the number of results to skip for each query. This can be useful for obtaining more results for a query without repeating those already obtained. The default is 0.

--summary={ plain | capitalise | tag | none }

Choose the type of document summarisation to perform. none means do not provide document summaries with the query results; this is the default. The other alternatives specify how to highlight the search terms in the summary. plain specifies not to highlight the search terms. capitalise highlights the search terms by capitalising them. tag highlights them by surrounding them with <b> tags.

query1 ... queryN

For searching, the given queries (query1 ... queryN) are Google-like queries that are used to search the index. Queries consist of keywords and phrases (represented "like this") optionally separated by the operators AND and OR (operators MUST be capitalised). The default operator is OR. Search is case-insensitive, except for recognition of AND and OR. Stopping and stemming are not performed. All results are ranked by relevance. Note that the Google operator '-' and modifiers are not currently supported.

If no queries are found on the command line, Zettair will start in interactive mode. In this mode queries are read from standard input and executed. Interactive mode exits once it can no longer read from standard input. You can cause it to exit by entering the end-of-file control character, typically control-d.

-v

Print version information.

-h

Print a help message.

Sample Command Line:

zet -n 10 -f disk45

Example queries:

mail configuration

searches for the word 'mail' and the word 'configuration'. Pages returned can have either word, or both (OR query) in upper, lower or mixed case.

mail AND configuration

searches for pages that have the words 'mail' and 'configuration' in them.

shakespeare "to be or not to be"

searches for the word 'shakespeare' and the phrase 'to be or not to be'.

Note that if you are entering queries at the command line, you will probably have to escape (using the backslash or other means) the double quotes around phrases, e.g.

> zet "this is a query \"with a phrase\""
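When zet is launched from a script rather than typed into a shell, the escaping problem can be sidestepped by passing each query as its own argument. The sketch below builds such an argument list using only the -f and -n options documented above. The helper name is made up, and the sketch does not run zet itself.

```python
import shlex


def zet_query_command(queries, prefix="index", max_results=20):
    """Build an argument list for zet from documented options and raw queries."""
    return ["zet", "-f", prefix, "-n", str(max_results)] + list(queries)


argv = zet_query_command(
    ['this is a query "with a phrase"'], prefix="disk45", max_results=10
)
# shlex.join shows how the same command would need to be quoted in a POSIX shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` bypasses the shell entirely, so the double quotes around the phrase reach zet unchanged. `shlex.join` requires Python 3.8 or later.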

1.5.3 TREC querying

Usage: zet_trec index

TREC querying options:

-f,--file=topic_file

Add TREC topic_file to the list of topic files to process.

-F,--file-list=file

Add the files listed in file to the list of topic files to process.

-r,--runid=run_id

Output run_id as the id for this evaluation (run_id is a text field in trec_eval output).

-n,--number-results=results

Number of results to output per query.

-t,--title

Use topic titles in queries (this is the default if none of -t, -a or -d are specified).

-d,--description

Use topic descriptions in queries.

-a,--narrative

Use topic narratives in queries.

--print-queries

Print queries to stderr as they are constructed from the topic file and resolved.

--timing

Print the total time taken in querying to stderr after all topics have been resolved. The time printed excludes index loading time.

--dummy

Insert dummy entries for topics that have no answers in the results set. This has been required for TREC terabyte submissions in the past.

--non-stop

Don't stop if a query cannot be constructed from a topic. Useful when running large, noisy query logs.

--query-stop=[filename]

Uses the words contained in the given file as stop words (not evaluated) during querying. If no filename is given, a default stop list is loaded.

--big-and-fast

Instructs Zettair to use approximately 500MB of memory while querying. The default memory usage should be around 20MB.

--qrels==[filename]

Instead of printing search results in TREC format, the results are evaluated against the given Qrels file, in TREC Qrel format, and trec_eval-like output is produced.

index

The name of the index that is queried using the TREC topic files.

-h,--help

Print a help message.

-v,--version

Print version information.

Sample Command Line:

./zet_trec -f /research/TREC-7/topics.351-400 -n 1000 disk45 > query.log

The file query.log can then be evaluated with trec_eval against pre-prepared relevance judgements.
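For a quick look at a run file before handing it to trec_eval, the lines can be parsed directly. The sketch below assumes query.log uses the standard six-column trec_eval results layout (topic, iteration, document number, rank, score, run id). This document does not show zet_trec's actual output, so treat that layout, the helper names, and the sample lines as assumptions.

```python
from collections import namedtuple

RunEntry = namedtuple("RunEntry", "topic docno rank score run_id")


def parse_run_line(line):
    """Parse one 'topic iter docno rank score run_id' line; None if malformed."""
    fields = line.split()
    if len(fields) != 6:
        return None
    topic, _iteration, docno, rank, score, run_id = fields
    return RunEntry(topic, docno, int(rank), float(score), run_id)


def top_docs(lines, topic, k=10):
    """Return the document numbers of the k best-ranked entries for a topic."""
    entries = [e for e in map(parse_run_line, lines) if e and e.topic == topic]
    return [e.docno for e in sorted(entries, key=lambda e: e.rank)[:k]]


# Hypothetical run lines for illustration only.
sample = [
    "351 Q0 FT934-4848 2 3.10 zettair",
    "351 Q0 LA010189-0001 1 3.50 zettair",
    "352 Q0 FR940104-0-00001 1 2.75 zettair",
]
print(top_docs(sample, "351"))
```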

1.6 A guide to the Zettair source code

Zettair is designed to be as clean, simple, flexible and fast as possible. While it is currently a work in progress, Zettair handles simple searching quite well, and has sufficient architecture to be extended in many different directions.

The core search code is separated from the different front-end access methods. The core searching methods are documented in include/index.h.

Zettair also has a number of compile-time configuration options and length limitations. These can all be found and changed in src/include/def.h, although the default settings should suffice for most people.

Apart from reading the source code, if you want to know more about any part of Zettair, feel free to contact us at zettair@cs.rmit.edu.au.

1.7 Appendix

If you have any comments or suggestions about the translated document above, you are welcome to email yijiyong100@163.com.