
A Toolkit for Language Modeling: Notes on Using SRILM

2015-02-25 19:52
References:

SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370

SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87

Paper: SRILM - An Extensible Language Modeling Toolkit

For those interested in going deeper:

SRILM source code architecture analysis: http://download.csdn.net/download/yqzhao/4546985

SRILM source code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html

SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

Two Core Modules

The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model (for example, by computing the perplexity of a test set).

I. ngram-count

The ngram-count module provides several counting functions: you can first generate a count file from the training corpus and then read that count file to build the language model, or you can do both steps in a single command.

Suppose the training corpus is named train.data and contains:

it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let me know .
No worry about that . I 'll take it and you need not wrap it up .
Do you do alterations ?
the light was red .
we want to have a table near the window .
it 's over there , just in front of the tourist information .
I twisted it playing tennis . it felt Okay after the game but then it started turning black - and - blue . is it serious ?
please input your pin number .

1. Counting: generating the count file

Command:

ngram-count -text train.data -order 3 -write train.data.count

The -text option specifies the input file; -order 3 means a trigram model (3 is also the default if -order is omitted); -write specifies the output count file.

Part of the resulting train.data.count is shown below:

please 1
please input 1
please input your 1
<s> 8
<s> it 2
<s> it 's 2
<s> the 1
<s> the light 1
<s> we 1
<s> we want 1
…...
up 1
up . 1
up . </s> 1
Do 1
Do you 1
Do you do 1

Here <s> and </s> mark the beginning and end of a sentence, respectively. The count file format is:

a_z <tab> c(a_z)

a_z: a is the first word of the n-gram, z is the last, and _ stands for zero or more words between a and z
c(a_z): the count of a_z in the training corpus
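To make the counting concrete, here is a minimal Python sketch (count_ngrams is a hypothetical helper, not SRILM code) that pads each sentence with <s> and </s> and counts every n-gram up to the given order, mirroring what ngram-count writes to the count file:

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count all n-grams up to `order`, padding each sentence with
    <s> and </s>. A sketch of what ngram-count computes."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(["please input your pin number ."])
print(counts[("please", "input", "your")])  # 1
print(counts[("<s>", "please")])            # 1
```

Running it on the full eight-sentence corpus above would reproduce entries such as "<s> 8" and "please input your 1" from train.data.count.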

2. Building a language model from the count file

Command:

ngram-count -read train.data.count -order 3 -lm train.lm

The -read option specifies the input file, here train.data.count; -order is as above; -lm specifies the output file for the trained language model, here train.lm. This command can be followed by a specific discounting algorithm, combined with backoff or interpolation: for example, appending -interpolate -kndiscount selects interpolated smoothing with modified Kneser-Ney discounting. If no discounting method is specified, Good-Turing discounting with Katz backoff is used by default; we use the default here.

Part of the resulting train.lm file:

\data\
ngram 1=75
ngram 2=106
ngram 3=2

\1-grams:
-1.770852 'll -0.03208452
-1.770852 's -0.02453138
-1.770852 , -0.4659371
-1.770852 - -0.02832437
-1.030489 . -0.5141692

…...

\2-grams:
-1.361728 'll bring
-1.361728 'll take
-1.361728 's just
-1.361728 's over
-0.1760913 , just
-1.361728 - and
-1.361728 - blue

…...
\3-grams:
-0.1760913 . I 'll
-0.1760913 <s> it 's

The file format is:
log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

Note: f(a_z) is the conditional probability P(z|a_), and bow(a_z) is the backoff weight.

The first column is the base-10 log conditional probability log10 P(z|a_), the second column is the n-gram itself, and the third column is the base-10 log backoff weight (which distributes probability mass to unseen (n+1)-grams).
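Because this ARPA-style format is plain text, it is straightforward to read back. Below is a minimal, hypothetical Python parser (assuming tab-separated fields as described above; the \data\ header and error handling are ignored):

```python
def parse_arpa(lines):
    """Parse the n-gram sections of an ARPA-format LM (simplified sketch).
    Returns {ngram tuple: (log10 prob, log10 backoff weight or None)}."""
    lm = {}
    in_ngrams = False
    for line in lines:
        line = line.strip()
        if line.endswith("-grams:"):   # e.g. "\1-grams:" starts a section
            in_ngrams = True
            continue
        if not line or line.startswith("\\"):  # blank line or "\end\"
            in_ngrams = False
            continue
        if in_ngrams:
            fields = line.split("\t")
            prob = float(fields[0])
            words = tuple(fields[1].split())
            bow = float(fields[2]) if len(fields) > 2 else None
            lm[words] = (prob, bow)
    return lm

# Entries taken from the train.lm excerpt above.
lm = parse_arpa([
    "\\1-grams:",
    "-1.770852\t'll\t-0.03208452",
    "\\2-grams:",
    "-1.361728\t'll bring",
    "\\end\\",
])
print(lm[("'ll",)])  # (-1.770852, -0.03208452)
```

Note that the highest-order entries (and some lower-order ones) carry no backoff weight, hence the None.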

3. Combining the two steps

Usually the language model is built directly from the training corpus, combining the two steps above:

ngram-count -text train.data -lm train.lm

Here -order is omitted, so it defaults to 3; no discounting algorithm is specified, so Good-Turing discounting with Katz backoff is used.

II. ngram: test-set perplexity

Suppose the test set is test.data:

we want to have a table near the window .
read a list of sentence .

First run the following command:

ngram -ppl test.data -order 3 -lm train.lm

The -ppl option does two main things: it computes the sentence log probability log10 P(T), where P(T) is the product of the probabilities of all sentences, and it computes the test-set perplexity, reported as two metrics, ppl and ppl1.

The terminal output is:

file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603

The first line of output: 2 sentences, 16 words, 3 out-of-vocabulary (OOV) words.

The second line: no zero-probability words, log10 P(T) = -17.9098, ppl = 15.6309, ppl1 = 23.8603.

ppl and ppl1 are computed as follows (see http://www.52nlp.cn/language-model-training-tools-srilm-details):

ppl = 10^{-logP(T)/(Sen+Word)};  ppl1 = 10^{-logP(T)/Word}

where Sen is the number of sentences and Word is the number of words that actually received a probability, i.e., the total word count minus OOVs and zero-probability words (here 16 - 3 - 0 = 13). The Sen term in the ppl denominator accounts for the </s> event at the end of each sentence.
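These formulas can be checked against the numbers reported above. The sketch below (a hypothetical helper, not part of SRILM) recomputes ppl and ppl1 from the summary line; the tiny discrepancy comes from the rounded logprob:

```python
def perplexity(logprob, sentences, words, oovs, zeroprobs=0):
    """Recompute SRILM's ppl and ppl1 from the -ppl summary numbers.
    Only tokens that received a probability are counted: OOVs and
    zero-probability words are excluded, and each sentence contributes
    one extra </s> event (the Sen term in the ppl denominator)."""
    counted = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (sentences + counted))
    ppl1 = 10 ** (-logprob / counted)
    return ppl, ppl1

ppl, ppl1 = perplexity(-17.9098, sentences=2, words=16, oovs=3)
print(round(ppl, 2), round(ppl1, 2))  # 15.63 23.86
```

This matches the reported ppl = 15.6309 and ppl1 = 23.8603 up to rounding.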

To get more detailed output, add the -debug option (levels 0 through 4); -debug 0 corresponds to the default output above. For example, ngram -ppl test.data -order 3 -lm train.lm -debug 1 prints:

reading 75 1-grams

reading 106 2-grams

reading 2 3-grams

we want to have a table near the window .

1 sentences, 10 words, 0 OOVs

0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697

read a list of sentence .

1 sentences, 6 words, 3 OOVs

0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379

file test.data: 2 sentences, 16 words, 3 OOVs

0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603 
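As a quick sanity check, the file-level summary is simply the aggregate of the per-sentence lines in the -debug 1 output; in Python:

```python
# Per-sentence figures (logprob, words, OOVs) from the -debug 1 output above.
per_sentence = [
    (-12.4723, 10, 0),  # sentence 1
    (-5.43749, 6, 3),   # sentence 2
]

total_logprob = sum(lp for lp, _, _ in per_sentence)
total_words = sum(w for _, w, _ in per_sentence)
total_oovs = sum(o for _, _, o in per_sentence)

print(round(total_logprob, 4))  # -17.9098
print(total_words, total_oovs)  # 16 3
```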
Tags: srilm, language model, ngram