
Term weighting algorithms in IR

2016-08-12

1 TF-IDF
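TF-IDF weights a term by how often it occurs in a document, discounted by how common the term is across the whole collection. One common variant is:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of occurrences of t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. A minimal sketch (the toy corpus and helper function below are illustrative, not from the original post):

import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # tfidf(t, d) = tf(t, d) * log(N / df(t)), one common variant
    tf = Counter(doc_tokens)[term]
    N = len(corpus)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(N / df)

# Toy corpus of tokenized documents.
corpus = [
    ["go", "is", "a", "compiled", "language"],
    ["python", "is", "an", "interpreted", "language"],
    ["rust", "is", "a", "compiled", "language"],
]
print(tf_idf("go", corpus[0], corpus))        # rare term: 1 * log(3/1), about 1.10
print(tf_idf("language", corpus[0], corpus))  # appears in every document: weight 0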

2 BM25

In BM25, f is the TF from TF-IDF, |D| is the length of document D, and avgdl is the average document length over the whole collection. k1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.
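For reference, these quantities plug into the standard BM25 scoring function:

score(D, Q) = Σ_{qi ∈ Q} IDF(qi) * f(qi, D) * (k1 + 1) / ( f(qi, D) + k1 * (1 - b + b * |D| / avgdl) )

where the sum runs over the query terms qi and f(qi, D) is the frequency of qi in D.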





How b affects the relevance score

Let y = 1 - b + b*x, where x denotes |D|/avgdl. For a given b, y is linear in x, with slope b and intercept 1 - b.

The larger b is, the more document length influences the relevance score; the smaller b is, the less it does. With a large b, a document longer than average gets a lower relevance score, and one shorter than average gets a higher score.

Intuitively, a longer document has more chances to contain qi, so for the same term frequency fi, a long document should be considered less relevant to qi than a short one.
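A minimal numeric sketch of this length normalization (the function name bm25_term_score and the sample numbers are illustrative, not from the original post):

def bm25_term_score(tf, doc_len, avgdl, idf=1.0, k1=1.5, b=0.75):
    # Per-term BM25 contribution: idf * tf*(k1+1) / (tf + k1*(1 - b + b*doc_len/avgdl))
    norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Same tf, different document lengths: the longer document scores lower,
# and the gap widens as b grows.
for b in (0.0, 0.5, 1.0):
    short = bm25_term_score(tf=3, doc_len=50, avgdl=100, b=b)
    long_doc = bm25_term_score(tf=3, doc_len=200, avgdl=100, b=b)
    print(f"b={b}: short={short:.3f}, long={long_doc:.3f}")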



How k (i.e., k1) affects the relevance score

Let y = tf*(k+1)/(tf+k), and consider how y changes as k varies for a fixed tf.



For a fixed tf, y changes only gradually with k, so k has a relatively small effect on the similarity score.
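A quick numeric check of this saturation term (values chosen for illustration):

def saturation(tf, k):
    # TF saturation term in BM25: tf*(k+1)/(tf+k)
    return tf * (k + 1) / (tf + k)

# Vary k over the commonly used range for a fixed tf.
for k in (1.2, 1.5, 2.0):
    print(f"k={k}: y={saturation(tf=3, k=k):.3f}")
# y only moves from about 1.57 to 1.80 over this range, so the score
# is not very sensitive to k.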

3 DFR (Divergence From Randomness)

Basic Randomness Models

The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words, the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:

weight(t | d) ∝ −log Prob_M(t ∈ d | Collection)

where the subscript M stands for the type of model of randomness employed to compute the probability. The basic models are derived in the following table.

Basic DFR models:

D      Divergence approximation of the binomial
P      Approximation of the binomial
BE     Bose-Einstein distribution
G      Geometric approximation of the Bose-Einstein
I(n)   Inverse Document Frequency model
I(F)   Inverse Term Frequency model
I(ne)  Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

−log Prob_P(t ∈ d | Collection) = −log( (TF choose tf) * p^tf * q^(TF − tf) )

where:

TF is the term-frequency of the term t in the Collection

tf is the term-frequency of the term t in the document d

N is the number of documents in the Collection

p = 1/N and q = 1 − p
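A minimal sketch of this binomial (P) weight; the log of the binomial coefficient is computed with lgamma to stay numerically stable (the function name and the numbers are illustrative, not from the original):

import math

def dfr_p_weight(tf, TF, N):
    # -log ProbP = -log( C(TF, tf) * p^tf * q^(TF - tf) ), with p = 1/N, q = 1 - p
    p = 1.0 / N
    q = 1.0 - p
    log_binom = math.lgamma(TF + 1) - math.lgamma(tf + 1) - math.lgamma(TF - tf + 1)
    log_prob = log_binom + tf * math.log(p) + (TF - tf) * math.log(q)
    return -log_prob

# A term occurring 10 times in a 1000-document collection, 3 of them in this document.
print(dfr_p_weight(tf=3, TF=10, N=1000))  # roughly 15.9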

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

−log Prob_G(t ∈ d | Collection) = −log( (1/(1+λ)) * (λ/(1+λ))^tf )

where λ = F/N, F being the term-frequency of the term t in the Collection (the TF above).
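And the corresponding sketch for the geometric (G) model (again, the names and numbers are illustrative):

import math

def dfr_g_weight(tf, F, N):
    # -log ProbG = -log( (1/(1+lam)) * (lam/(1+lam))^tf ), with lam = F/N
    lam = F / N
    log_prob = -math.log(1 + lam) + tf * (math.log(lam) - math.log(1 + lam))
    return -log_prob

# Same toy numbers as the binomial example: F = 10, N = 1000, tf = 3.
print(dfr_g_weight(tf=3, F=10, N=1000))  # roughly 13.9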