
A Toolkit for Language Modeling: Notes on Using SRILM

2015-02-25 19:52
References:

SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370

SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87

Paper: SRILM - An Extensible Language Modeling Toolkit

For those interested in going deeper:

SRILM source code architecture analysis: http://download.csdn.net/download/yqzhao/4546985

SRILM source code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html

SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

Two Core Modules

The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model (for example, by computing the perplexity of a test set).

I. ngram-count

The ngram-count module provides several counting functions: you can first generate a count file from the training corpus and then read that count file to build the language model, or you can do both steps in a single command.

Suppose the training corpus is named train.data and contains:

it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let me know .
No worry about that . I 'll take it and you need not wrap it up .
Do you do alterations ?
the light was red .
we want to have a table near the window .
it 's over there , just in front of the tourist information .
I twisted it playing tennis . it felt Okay after the game but then it started turning black - and - blue . is it serious ?
please input your pin number .

1. Counting: generating the count file

Command:

ngram-count -text train.data -order 3 -write train.data.count

The -text option specifies the input file; -order 3 means a trigram model (3 is also the default if -order is omitted); -write specifies the output count file.

Part of the resulting train.data.count is shown below:

please 1
please input 1
please input your 1
<s> 8
<s> it 2
<s> it 's 2
<s> the 1
<s> the light 1
<s> we 1
<s> we want 1
…...
up 1
up . 1
up . </s> 1
Do 1
Do you 1
Do you do 1

Here <s> and </s> mark the beginning and end of a sentence, respectively. The count file format is:

a_z <tab> c(a_z)

a_z: a is the first word of the n-gram, z is the last, and _ stands for zero or more words between a and z
c(a_z): the count of a_z in the training corpus
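To make the counting concrete, here is a minimal Python sketch (count_ngrams is a hypothetical helper, not SRILM code) that pads each sentence with <s> and </s> and counts every n-gram up to the given order, mirroring what ngram-count writes to the count file:

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count all n-grams up to `order`, padding each sentence with
    <s> and </s>. A sketch of what ngram-count computes."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(["please input your pin number ."])
print(counts[("please", "input", "your")])  # 1
print(counts[("<s>", "please")])            # 1
```

Running it on the full eight-sentence corpus above would reproduce entries such as "<s> 8" and "please input your 1" from train.data.count.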

2. Building a language model from the count file

Command:

ngram-count -read train.data.count -order 3 -lm train.lm

The -read option specifies the input file, here train.data.count; -order is as above; -lm specifies the output file for the trained language model, here train.lm. This command can be followed by a specific discounting algorithm, combined with backoff or interpolation: for example, appending -interpolate -kndiscount selects interpolated smoothing with modified Kneser-Ney discounting. If no discounting method is specified, Good-Turing discounting with Katz backoff is used by default; we use the default here.

Part of the resulting train.lm file:

\data\
ngram 1=75
ngram 2=106
ngram 3=2

\1-grams:
-1.770852 'll -0.03208452
-1.770852 's -0.02453138
-1.770852 , -0.4659371
-1.770852 - -0.02832437
-1.030489 . -0.5141692

…...

\2-grams:
-1.361728 'll bring
-1.361728 'll take
-1.361728 's just
-1.361728 's over
-0.1760913 , just
-1.361728 - and
-1.361728 - blue

…...
\3-grams:
-0.1760913 . I 'll
-0.1760913 <s> it 's

The file format is:
log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

Note: f(a_z) is the conditional probability P(z|a_), and bow(a_z) is the backoff weight.

The first column is the base-10 log conditional probability log10 P(z|a_), the second column is the n-gram itself, and the third column is the base-10 log backoff weight (which distributes probability mass to unseen (n+1)-grams).
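Because this ARPA-style format is plain text, it is straightforward to read back. Below is a minimal, hypothetical Python parser (assuming tab-separated fields as described above; the \data\ header and error handling are ignored):

```python
def parse_arpa(lines):
    """Parse the n-gram sections of an ARPA-format LM (simplified sketch).
    Returns {ngram tuple: (log10 prob, log10 backoff weight or None)}."""
    lm = {}
    in_ngrams = False
    for line in lines:
        line = line.strip()
        if line.endswith("-grams:"):   # e.g. "\1-grams:" starts a section
            in_ngrams = True
            continue
        if not line or line.startswith("\\"):  # blank line or "\end\"
            in_ngrams = False
            continue
        if in_ngrams:
            fields = line.split("\t")
            prob = float(fields[0])
            words = tuple(fields[1].split())
            bow = float(fields[2]) if len(fields) > 2 else None
            lm[words] = (prob, bow)
    return lm

# Entries taken from the train.lm excerpt above.
lm = parse_arpa([
    "\\1-grams:",
    "-1.770852\t'll\t-0.03208452",
    "\\2-grams:",
    "-1.361728\t'll bring",
    "\\end\\",
])
print(lm[("'ll",)])  # (-1.770852, -0.03208452)
```

Note that the highest-order entries (and some lower-order ones) carry no backoff weight, hence the None.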

3. Combining the two steps

Usually the language model is built directly from the training corpus, combining the two steps above:

ngram-count -text train.data -lm train.lm

Here -order is omitted, so it defaults to 3; no discounting algorithm is specified, so Good-Turing discounting with Katz backoff is used.

II. ngram: test-set perplexity

Suppose the test set is test.data:

we want to have a table near the window .
read a list of sentence .

First run the following command:

ngram -ppl test.data -order 3 -lm train.lm

The -ppl option does two main things: it computes the sentence log probability log10 P(T), where P(T) is the product of the probabilities of all sentences, and it computes the test-set perplexity, reported as two metrics, ppl and ppl1.

The terminal output is:

file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603

The first line of output: 2 sentences, 16 words, 3 out-of-vocabulary (OOV) words.

The second line: no zero-probability words, log10 P(T) = -17.9098, ppl = 15.6309, ppl1 = 23.8603.

ppl and ppl1 are computed as follows (see http://www.52nlp.cn/language-model-training-tools-srilm-details):

ppl = 10^{-logP(T)/(Sen+Word)};  ppl1 = 10^{-logP(T)/Word}

where Sen is the number of sentences and Word is the number of words that actually received a probability, i.e., the total word count minus OOVs and zero-probability words (here 16 - 3 - 0 = 13). The Sen term in the ppl denominator accounts for the </s> event at the end of each sentence.
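These formulas can be checked against the numbers reported above. The sketch below (a hypothetical helper, not part of SRILM) recomputes ppl and ppl1 from the summary line; the tiny discrepancy comes from the rounded logprob:

```python
def perplexity(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute SRILM's ppl and ppl1 from the -ppl summary numbers.
    Only tokens that received a probability are counted: OOVs and
    zero-probability words are excluded, and each sentence contributes
    one extra </s> event (the Sen term in the ppl denominator)."""
    counted = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (sentences + counted))
    ppl1 = 10 ** (-logprob / counted)
    return ppl, ppl1

ppl, ppl1 = perplexity(-17.9098, sentences=2, words=16, oovs=3)
print(round(ppl, 2), round(ppl1, 2))  # 15.63 23.86
```

This matches the reported ppl = 15.6309 and ppl1 = 23.8603 up to rounding.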

To get more detailed output, add the -debug option (levels 0 through 4); -debug 0 corresponds to the default output above. For example, ngram -ppl test.data -order 3 -lm train.lm -debug 1 prints:

reading 75 1-grams

reading 106 2-grams

reading 2 3-grams

we want to have a table near the window .

1 sentences, 10 words, 0 OOVs

0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697

read a list of sentence .

1 sentences, 6 words, 3 OOVs

0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379

file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603 
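As a quick sanity check, the file-level summary is simply the aggregate of the per-sentence lines in the -debug 1 output; in Python:

```python
# Per-sentence figures (logprob, words, OOVs) from the -debug 1 output above.
per_sentence = [
    (-12.4723, 10, 0),  # sentence 1
    (-5.43749, 6, 3),   # sentence 2
]

total_logprob = sum(lp for lp, _, _ in per_sentence)
total_words = sum(w for _, w, _ in per_sentence)
total_oovs = sum(o for _, _, o in per_sentence)

print(round(total_logprob, 4))  # -17.9098
print(total_words, total_oovs)  # 16 3
```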
Tags: srilm, language model, ngram