Counting Word Frequencies in an Article with Python

2014-11-26 22:57
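I have recently been learning Python, which is well suited to scientific computing. This is beginner-level practice code for processing an article; I am sharing it in the hope that more experienced readers will offer suggestions.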

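Goal: count the words that occur most often in an English article, plot their frequencies as a bar chart, and label the chart with those words.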

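The basic requirements are met. At present a word is displayed only if it occurs at least 5 times; this condition should be changed so that it suits articles of very different lengths.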
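One way to make the display condition adapt to article length is sketched below, under assumptions that are not part of the original code. Instead of a fixed cutoff, it keeps either the N most frequent words or every word whose share of the total word count reaches a threshold. The function names `top_words` and `frequent_words` and the defaults `n=10` and `min_share=0.01` are hypothetical choices for illustration.

```python
from collections import Counter


def top_words(wordlist, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(wordlist).most_common(n)


def frequent_words(wordlist, min_share=0.01):
    """Return {word: count} for words that make up at least min_share of all words."""
    counts = Counter(wordlist)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Compare against min_share * total so the cutoff scales with article length.
    return {w: c for w, c in counts.items() if c >= min_share * total}


# Small hand-made word list standing in for the output of a tokenizer.
sample = ["python"] * 6 + ["numpy"] * 3 + ["pylab"]
print(top_words(sample, n=2))
print(frequent_words(sample, min_share=0.25))
```

A top-N rule always shows the same number of bars, while a share-based rule shows more or fewer bars depending on how concentrated the vocabulary is. Which one reads better probably depends on the article.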

import pylab
import numpy
import string

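def linetoword(line):
    # Replace every character that is not an ASCII letter or a space with a space.
    # str.replace returns a new string, so the result must be assigned back.
    for ch in line:
        if ch not in string.ascii_letters and ch != ' ':
            line = line.replace(ch, ' ')
    # split() without an argument also drops the empty strings left by repeated spaces.
    wordlist = line.split()
    newlist = []
    for word in wordlist:
        # Keep only words longer than three letters, in lowercase.
        if len(word) > 3:
            newlist.append(word.lower())
    return newlist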

def readarticle(title):
    wordlist = []
    # The with statement closes the file even if an error occurs while reading.
    with open(title, "r") as f:
        for line in f:
            wordlist.extend(linetoword(line))
    return wordlist

wordlist = readarticle("article.txt")

# Map each word to the number of times it occurs.
uniqueword = dict()

for word in wordlist:
    if word in uniqueword:
        uniqueword[word] = uniqueword[word] + 1
    else:
        uniqueword[word] = 1

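# Drop words that occur fewer than 5 times. Iterate over a copy of the items,
# because removing keys while iterating over the dict itself fails in Python 3.
for key, val in list(uniqueword.items()):
    if val < 5:
        uniqueword.pop(key)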

word = list(uniqueword.keys())
count = list(uniqueword.values())

width = 0.2
xval = numpy.arange(len(uniqueword))
# Center each bar on its x position so the tick label sits under the bar.
pylab.xticks(xval, word, rotation=45)
pylab.bar(xval, count, width=width, color='r', align='center')
pylab.title("Frequency of an article")
pylab.show()


The few lines that count word frequencies can be adapted to whatever you need. The article is read from article.txt.
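As one possible adaptation, the counting step can be written more compactly with `collections.Counter` and a regular expression from the standard library. This is a minimal sketch rather than the code above: the function name `count_words` and its `min_length` parameter are hypothetical, and it takes the article text as a string instead of reading article.txt.

```python
import re
from collections import Counter


def count_words(text, min_length=4):
    """Count lowercase words of at least min_length ASCII letters in text."""
    # [a-z]+ matches runs of ASCII letters, so punctuation and digits act as separators.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length)


counts = count_words("Python is fun. Python, with NumPy, makes plotting fun!")
print(counts.most_common(2))
```

A `Counter` returns 0 for words it has never seen, so lookups such as `counts["fun"]` do not need the `if word in ...` check used in the dictionary version.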