Counting unigram and bigram frequencies
2016-07-27 14:45
http://blog.csdn.net/niuox/article/details/11395397
In natural language processing we frequently need n-gram language models.
A few concepts related to Chinese segmentation are worth knowing first, for example:
unigram: one-unit segmentation, splitting the sentence into individual characters
bigram: two-unit segmentation, taking every two adjacent characters, scanned from the start of the sentence to the end, as one unit
trigram: three-unit segmentation, taking every three adjacent characters, scanned from the start of the sentence to the end, as one unit (a short snippet below illustrates these)
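As a quick illustration (the sample string is made up and the snippet assumes Python 3, not part of the original post), character-level n-grams can be enumerated with simple slicing:

# character-level n-grams of a made-up sample sentence
s = "自然语言处理"
unigrams = list(s)                                  # ['自', '然', '语', '言', '处', '理']
bigrams = [s[i:i + 2] for i in range(len(s) - 1)]   # ['自然', '然语', '语言', '言处', '处理']
trigrams = [s[i:i + 3] for i in range(len(s) - 2)]  # ['自然语', '然语言', '语言处', '言处理']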
Let's do a simple exercise:
The input is already word-segmented text, one sentence per line.
Count the frequency of every word unigram and bigram, and write the counts to the files `data.uni` and `data.bi`, respectively.
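For instance, an input file with two made-up segmented sentences (not from the original post) might look like:

我 爱 自然 语言 处理
我 爱 机器 学习

Each line of `data.uni` then holds a word and its count (e.g. `我 2`, `自然 1`), and each line of `data.bi` holds a pair of adjacent words and its count (e.g. `我 爱 2`, `爱 自然 1`), matching the `"%s %d"` output format used in the code below.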
#!/usr/bin/env python
import sys


class NGram(object):
    def __init__(self, n):
        # n is the order of the n-gram model (1 = unigram, 2 = bigram)
        self.n = n
        self.unigram = {}
        self.bigram = {}

    # Scan the segmented sentences, count the n-grams, and write their
    # frequencies to data.uni (unigrams) or data.bi (bigrams).
    #
    # @param sentence list[str], one segmented sentence per element
    # @return None
    def scan(self, sentence):
        for line in sentence:
            self.ngram(line.split())
        # unigram counts -> data.uni
        if self.n == 1:
            try:
                fip = open("data.uni", "w")
            except IOError:
                sys.exit("failed to open data.uni")
            for word, count in self.unigram.items():
                fip.write("%s %d\n" % (word, count))
            fip.close()
        # bigram counts -> data.bi
        if self.n == 2:
            try:
                fip = open("data.bi", "w")
            except IOError:
                sys.exit("failed to open data.bi")
            for pair, count in self.bigram.items():
                fip.write("%s %d\n" % (pair, count))
            fip.close()

    # Calculate the n-grams of one tokenized sentence and update the counts.
    #
    # @param words list[str]
    # @return None
    def ngram(self, words):
        # unigram: count every single word
        if self.n == 1:
            for word in words:
                self.unigram[word] = self.unigram.get(word, 0) + 1
        # bigram: count every pair of adjacent words (sliding window of size 2)
        if self.n == 2:
            for first, second in zip(words, words[1:]):
                pair = first + " " + second
                self.bigram[pair] = self.bigram.get(pair, 0) + 1


if __name__ == "__main__":
    try:
        fip = open(sys.argv[1], "r")
    except IOError:
        sys.exit("failed to open input file")
    sentence = [line.strip() for line in fip if line.strip()]
    fip.close()

    uni = NGram(1)
    bi = NGram(2)
    uni.scan(sentence)
    bi.scan(sentence)
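To run the solution, assuming it is saved as ngram_count.py and the segmented corpus as segmented.txt (both filenames are placeholders, not from the original post), invoke:

python ngram_count.py segmented.txt

data.uni and data.bi are then written to the current directory. For reference, the same counts can be produced more compactly with collections.Counter; the following sketch assumes the same input and output formats as the exercise:

#!/usr/bin/env python
import sys
from collections import Counter

def count_ngrams(path):
    # Count word unigrams and adjacent-word bigrams over a segmented corpus,
    # where each line of the file is one already-segmented sentence.
    uni, bi = Counter(), Counter()
    with open(path) as f:
        for line in f:
            words = line.split()
            uni.update(words)
            bi.update(a + " " + b for a, b in zip(words, words[1:]))
    return uni, bi

if __name__ == "__main__":
    uni, bi = count_ngrams(sys.argv[1])
    with open("data.uni", "w") as f:
        for word, count in uni.items():
            f.write("%s %d\n" % (word, count))
    with open("data.bi", "w") as f:
        for pair, count in bi.items():
            f.write("%s %d\n" % (pair, count))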