
Term weighting algorithms in IR

2016-08-12

1 TF-IDF
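TF-IDF weights a term by how often it occurs in a document, discounted by how common the term is across the whole collection. One common variant is:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of occurrences of t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. A minimal sketch (the toy corpus and helper function below are illustrative, not from the original post):

import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # tfidf(t, d) = tf(t, d) * log(N / df(t)), one common variant
    tf = Counter(doc_tokens)[term]
    N = len(corpus)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(N / df)

# Toy corpus of tokenized documents.
corpus = [
    ["go", "is", "a", "compiled", "language"],
    ["python", "is", "an", "interpreted", "language"],
    ["rust", "is", "a", "compiled", "language"],
]
print(tf_idf("go", corpus[0], corpus))        # rare term: 1 * log(3/1), about 1.10
print(tf_idf("language", corpus[0], corpus))  # appears in every document: weight 0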

2 BM25

In BM25, f is the TF from TF-IDF, |D| is the length of document D, and avgdl is the average document length over the whole collection. k1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.
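For reference, these quantities plug into the standard BM25 scoring function:

score(D, Q) = Σ_{qi ∈ Q} IDF(qi) * f(qi, D) * (k1 + 1) / ( f(qi, D) + k1 * (1 - b + b * |D| / avgdl) )

where the sum runs over the query terms qi and f(qi, D) is the frequency of qi in D.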





How b affects the relevance score

Let y = 1 - b + b*x, where x denotes |D|/avgdl. For a given b, y is linear in x, with slope b and intercept 1 - b.

The larger b is, the more document length influences the relevance score; the smaller b is, the less it does. With a large b, a document longer than average gets a lower relevance score, and one shorter than average gets a higher score.

Intuitively, a longer document has more chances to contain qi, so for the same term frequency fi, a long document should be considered less relevant to qi than a short one.
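A minimal numeric sketch of this length normalization (the function name bm25_term_score and the sample numbers are illustrative, not from the original post):

def bm25_term_score(tf, doc_len, avgdl, idf=1.0, k1=1.5, b=0.75):
    # Per-term BM25 contribution: idf * tf*(k1+1) / (tf + k1*(1 - b + b*doc_len/avgdl))
    norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Same tf, different document lengths: the longer document scores lower,
# and the gap widens as b grows.
for b in (0.0, 0.5, 1.0):
    short = bm25_term_score(tf=3, doc_len=50, avgdl=100, b=b)
    long_doc = bm25_term_score(tf=3, doc_len=200, avgdl=100, b=b)
    print(f"b={b}: short={short:.3f}, long={long_doc:.3f}")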



How k (i.e., k1) affects the relevance score

Let y = tf*(k+1)/(tf+k), and consider how y changes as k varies for a fixed tf.



For a fixed tf, y changes only gradually with k, so k has a relatively small effect on the similarity score.
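A quick numeric check of this saturation term (values chosen for illustration):

def saturation(tf, k):
    # TF saturation term in BM25: tf*(k+1)/(tf+k)
    return tf * (k + 1) / (tf + k)

# Vary k over the commonly used range for a fixed tf.
for k in (1.2, 1.5, 2.0):
    print(f"k={k}: y={saturation(tf=3, k=k):.3f}")
# y only moves from about 1.57 to 1.80 over this range, so the score
# is not very sensitive to k.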

3 DFR (Divergence From Randomness)

Basic Randomness Models

The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words, the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:

weight(t | d) ∝ −log Prob_M(t ∈ d | Collection)

where the subscript M stands for the type of model of randomness employed to compute the probability. The basic models are derived in the following table.

Basic DFR models:

D      Divergence approximation of the binomial
P      Approximation of the binomial
BE     Bose-Einstein distribution
G      Geometric approximation of the Bose-Einstein
I(n)   Inverse Document Frequency model
I(F)   Inverse Term Frequency model
I(ne)  Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

−log Prob_P(t ∈ d | Collection) = −log( (TF choose tf) * p^tf * q^(TF − tf) )

where:

TF is the term-frequency of the term t in the Collection

tf is the term-frequency of the term t in the document d

N is the number of documents in the Collection

p = 1/N and q = 1 − p
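A minimal sketch of this binomial (P) weight; the log of the binomial coefficient is computed with lgamma to stay numerically stable (the function name and the numbers are illustrative, not from the original):

import math

def dfr_p_weight(tf, TF, N):
    # -log ProbP = -log( C(TF, tf) * p^tf * q^(TF - tf) ), with p = 1/N, q = 1 - p
    p = 1.0 / N
    q = 1.0 - p
    log_binom = math.lgamma(TF + 1) - math.lgamma(tf + 1) - math.lgamma(TF - tf + 1)
    log_prob = log_binom + tf * math.log(p) + (TF - tf) * math.log(q)
    return -log_prob

# A term occurring 10 times in a 1000-document collection, 3 of them in this document.
print(dfr_p_weight(tf=3, TF=10, N=1000))  # roughly 15.9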

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

−log Prob_G(t ∈ d | Collection) = −log( (1/(1+λ)) * (λ/(1+λ))^tf )

where λ = F/N, F being the term-frequency of the term t in the Collection (the TF above).
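And the corresponding sketch for the geometric (G) model (again, the names and numbers are illustrative):

import math

def dfr_g_weight(tf, F, N):
    # -log ProbG = -log( (1/(1+lam)) * (lam/(1+lam))^tf ), with lam = F/N
    lam = F / N
    log_prob = -math.log(1 + lam) + tf * (math.log(lam) - math.log(1 + lam))
    return -log_prob

# Same toy numbers as the binomial example: F = 10, N = 1000, tf = 3.
print(dfr_g_weight(tf=3, F=10, N=1000))  # roughly 13.9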