A Toolkit for Language Modeling: SRILM Usage Notes
2015-02-25 19:52
References:
SRILM installation: http://blog.csdn.net/zhoubl668/article/details/7759370
SRILM usage: http://hi.baidu.com/keyever/item/8fad8918b90b8e6b3f87ce87
Paper: SRILM - An Extensible Language Modeling Toolkit
For further reading:
SRILM source code architecture: http://download.csdn.net/download/yqzhao/4546985
SRILM source code reading series: http://blog.chinaunix.net/uid/20658401/cid-67529-list-1.html
SRILM discounting algorithms: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
Two Core Modules
The SRILM toolkit has two core modules: ngram-count, which builds a language model from training data, and ngram, which evaluates a language model by computing the perplexity of a test set.
I. ngram-count
The ngram-count module provides a number of counting functions: it can generate a count file from the training corpus on its own, build a language model by reading a count file, or do both in one step. Suppose the training corpus is named train.data and contains:
it 's just down the hall . I 'll bring you some now . if there is anything else you need , just let
me know .
No worry about that . I 'll take it and you need not wrap it up .
Do you do alterations ?
the light was red .
we want to have a table near the window .
it 's over there , just in front of the tourist information .
I twisted it playing tennis . it felt Okay after the game but then it started turning black - and
- blue . is it serious ?
please input your pin number .
1. Counting: generating the count file
Command: ngram-count -text train.data -order 3 -write train.data.count
Here -text specifies the input file, -order 3 means a trigram model, and -write specifies the output count file. If -order is not given, the default is 3 (trigram).
Part of the resulting train.data.count looks like this:
please 1
please input 1
please input your 1
<s> 8
<s> it 2
<s> it 's 2
<s> the 1
<s> the light 1
<s> we 1
<s> we want 1
...
up 1
up . 1
up . </s> 1
Do 1
Do you 1
Do you do 1
Here <s> and </s> mark the beginning and end of a sentence, respectively. The count file format is:
a_z <tab> c(a_z)
a_z: a is the first word of the n-gram, z is the last word, and _ stands for zero or more words between a and z.
c(a_z): the count of a_z in the training corpus.
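The counting step can be sketched in a few lines of Python (a minimal re-implementation for illustration only, not part of SRILM; the function name is made up):

```python
from collections import Counter

def ngram_counts(sentences, order=3):
    """Count all 1- to order-grams, padding each sentence with
    <s> and </s> the way ngram-count does."""
    counts = Counter()
    for line in sentences:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["please input your pin number .",
                       "Do you do alterations ?"])
print(counts[("please", "input", "your")])  # 1
print(counts[("<s>",)])                     # 2, one per sentence
```

Running this on the two sample sentences reproduces the kind of entries seen in train.data.count above, including the sentence-boundary tokens.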
2. Building a language model from the count file
Command: ngram-count -read train.data.count -order 3 -lm train.lm
Here -read specifies the input count file (train.data.count), -order is as above, and -lm specifies the output file for the trained language model (train.lm). This command can be followed by a specific discounting algorithm, combined with back-off or interpolation, e.g. appending -interpolate -kndiscount, where -interpolate enables interpolated smoothing and -kndiscount selects modified Kneser-Ney discounting. If nothing is specified, the default is Good-Turing discounting with Katz back-off, which is what is used here.
Part of the resulting train.lm looks like this:
\data\
ngram 1=75
ngram 2=106
ngram 3=2
\1-grams:
-1.770852 'll -0.03208452
-1.770852 's -0.02453138
-1.770852 , -0.4659371
-1.770852 - -0.02832437
-1.030489 . -0.5141692
...
\2-grams:
-1.361728 'll bring
-1.361728 'll take
-1.361728 's just
-1.361728 's over
-0.1760913 , just
-1.361728 - and
-1.361728 - blue
...
\3-grams:
-0.1760913 . I 'll
-0.1760913 <s> it 's
The file format is:
log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))
Note: f(a_z) is the conditional probability P(z|a_), and bow(a_z) is the back-off weight.
The first column is the base-10 log of the conditional probability P(z|a_), the second column is the n-gram itself, and the third column is the base-10 log back-off weight (it redistributes probability mass to unseen (n+1)-grams).
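How these two columns are used at lookup time can be sketched as follows. This is a toy example with made-up log10 values, not the ones from train.lm; the recursion is the standard ARPA back-off lookup:

```python
# Toy ARPA-style tables keyed by word tuples; the log10 values here
# are illustrative, not taken from train.lm.
logprob = {("the",): -1.2, ("light",): -1.5, ("the", "light"): -0.4}
bow     = {("the",): -0.1}   # log10 back-off weights

def log10_cond_prob(context, word):
    """log10 P(word | context): use the stored n-gram if present,
    otherwise apply bow(context) and back off to a shorter context."""
    ngram = context + (word,)
    if ngram in logprob:
        return logprob[ngram]
    if not context:                 # unseen unigram: zero probability
        return float("-inf")
    # a missing back-off weight means log10(bow) = 0, i.e. a factor of 1
    return bow.get(context, 0.0) + log10_cond_prob(context[1:], word)

print(log10_cond_prob(("the",), "light"))  # -0.4, bigram found directly
print(log10_cond_prob(("a",), "light"))    # -1.5, backs off to the unigram
```

The second call illustrates why the third column exists: when a bigram is missing, its probability is reconstructed from the back-off weight of the context and a lower-order entry.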
3. Combining the two steps
Usually the language model is built directly from the training corpus, combining the two steps above:
Command: ngram-count -text train.data -lm train.lm
Here -order is omitted, so the default of 3 applies; with no discounting algorithm specified, the default Good-Turing discounting with Katz back-off is used.
II. ngram: test-set perplexity
Suppose the test set is test.data, containing:
we want to have a table near the window .
read a list of sentence .
First run the following command:
ngram -ppl test.data -order 3 -lm train.lm
The -ppl option performs two main tasks: it computes the log probability of the sentences (log10 P(T), where P(T) is the product of the probabilities of all the sentences), and it computes the perplexity of the test set, reported as two figures, ppl and ppl1.
The terminal output is:
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603
The first line reports 2 sentences, 16 words, and 3 out-of-vocabulary (OOV) words.
The second line reports no zero-probability words, logP(T) = -17.9098, ppl = 15.6309, and ppl1 = 23.8603.
ppl and ppl1 are computed as follows (per http://www.52nlp.cn/language-model-training-tools-srilm-details):
ppl = 10^{-logP(T)/(Sen+Word)}; ppl1 = 10^{-logP(T)/Word}
where Sen and Word are the numbers of sentences and words, respectively.
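Plugging the numbers from the output above into these formulas reproduces SRILM's figures; note that Word in the denominator is the word count minus OOVs (and minus any zeroprobs), which SRILM excludes:

```python
logprob_T = -17.9098      # logprob reported by SRILM
sen  = 2                  # number of sentences
word = 16 - 3             # 16 words minus 3 OOVs, which SRILM excludes

ppl  = 10 ** (-logprob_T / (sen + word))
ppl1 = 10 ** (-logprob_T / word)
print(round(ppl, 2), round(ppl1, 2))  # 15.63 23.86, matching ppl and ppl1 above
```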
To get more detailed output, add the option -debug 0-4 to the command above; -debug 0 corresponds to the default behaviour shown here. For example, ngram -ppl test.data -order 3 -lm train.lm -debug 1 prints:
reading 75 1-grams
reading 106 2-grams
reading 2 3-grams
we want to have a table near the window .
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -12.4723 ppl= 13.6096 ppl1= 17.6697
read a list of sentence .
1 sentences, 6 words, 3 OOVs
0 zeroprobs, logprob= -5.43749 ppl= 22.8757 ppl1= 64.9379
file test.data: 2 sentences, 16 words, 3 OOVs
0 zeroprobs, logprob= -17.9098 ppl= 15.6309 ppl1= 23.8603