Counting unigram and bigram frequencies
2016-07-27 14:45
http://blog.csdn.net/niuox/article/details/11395397
In natural language processing we frequently need n-gram language models.
A few concepts related to Chinese segmentation are worth knowing first, for example:
unigram: one-unit segmentation, splitting the sentence into individual characters
bigram: two-unit segmentation, taking every two adjacent characters, scanned from the start of the sentence to the end, as one unit
trigram: three-unit segmentation, taking every three adjacent characters, scanned from the start of the sentence to the end, as one unit (a short snippet below illustrates these)
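As a quick illustration (the sample string is made up and the snippet assumes Python 3, not part of the original post), character-level n-grams can be enumerated with simple slicing:

# character-level n-grams of a made-up sample sentence
s = "自然语言处理"
unigrams = list(s)                                  # ['自', '然', '语', '言', '处', '理']
bigrams = [s[i:i + 2] for i in range(len(s) - 1)]   # ['自然', '然语', '语言', '言处', '处理']
trigrams = [s[i:i + 3] for i in range(len(s) - 2)]  # ['自然语', '然语言', '语言处', '言处理']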
Let's do a simple exercise:
The input is already word-segmented text, one sentence per line.
Count the frequency of every word unigram and bigram, and write the counts to the files `data.uni` and `data.bi`, respectively.
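For instance, an input file with two made-up segmented sentences (not from the original post) might look like:

我 爱 自然 语言 处理
我 爱 机器 学习

Each line of `data.uni` then holds a word and its count (e.g. `我 2`, `自然 1`), and each line of `data.bi` holds a pair of adjacent words and its count (e.g. `我 爱 2`, `爱 自然 1`), matching the `"%s %d"` output format used in the code below.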
#!/usr/bin/env python
import sys


class NGram(object):
    def __init__(self, n):
        # n is the order of the n-gram model (1 = unigram, 2 = bigram)
        self.n = n
        self.unigram = {}
        self.bigram = {}

    # Scan the segmented sentences, count the n-grams, and write their
    # frequencies to data.uni (unigrams) or data.bi (bigrams).
    #
    # @param sentence list[str], one segmented sentence per element
    # @return None
    def scan(self, sentence):
        for line in sentence:
            self.ngram(line.split())
        # unigram counts -> data.uni
        if self.n == 1:
            try:
                fip = open("data.uni", "w")
            except IOError:
                sys.exit("failed to open data.uni")
            for word, count in self.unigram.items():
                fip.write("%s %d\n" % (word, count))
            fip.close()
        # bigram counts -> data.bi
        if self.n == 2:
            try:
                fip = open("data.bi", "w")
            except IOError:
                sys.exit("failed to open data.bi")
            for pair, count in self.bigram.items():
                fip.write("%s %d\n" % (pair, count))
            fip.close()

    # Calculate the n-grams of one tokenized sentence and update the counts.
    #
    # @param words list[str]
    # @return None
    def ngram(self, words):
        # unigram: count every single word
        if self.n == 1:
            for word in words:
                self.unigram[word] = self.unigram.get(word, 0) + 1
        # bigram: count every pair of adjacent words (sliding window of size 2)
        if self.n == 2:
            for first, second in zip(words, words[1:]):
                pair = first + " " + second
                self.bigram[pair] = self.bigram.get(pair, 0) + 1


if __name__ == "__main__":
    try:
        fip = open(sys.argv[1], "r")
    except IOError:
        sys.exit("failed to open input file")
    sentence = [line.strip() for line in fip if line.strip()]
    fip.close()

    uni = NGram(1)
    bi = NGram(2)
    uni.scan(sentence)
    bi.scan(sentence)
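To run the solution, assuming it is saved as ngram_count.py and the segmented corpus as segmented.txt (both filenames are placeholders, not from the original post), invoke:

python ngram_count.py segmented.txt

data.uni and data.bi are then written to the current directory. For reference, the same counts can be produced more compactly with collections.Counter; the following sketch assumes the same input and output formats as the exercise:

#!/usr/bin/env python
import sys
from collections import Counter

def count_ngrams(path):
    # Count word unigrams and adjacent-word bigrams over a segmented corpus,
    # where each line of the file is one already-segmented sentence.
    uni, bi = Counter(), Counter()
    with open(path) as f:
        for line in f:
            words = line.split()
            uni.update(words)
            bi.update(a + " " + b for a, b in zip(words, words[1:]))
    return uni, bi

if __name__ == "__main__":
    uni, bi = count_ngrams(sys.argv[1])
    with open("data.uni", "w") as f:
        for word, count in uni.items():
            f.write("%s %d\n" % (word, count))
    with open("data.bi", "w") as f:
        for pair, count in bi.items():
            f.write("%s %d\n" % (pair, count))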