
Field-Based Retrieval and Querying of 500,000 Email Texts: A Python Implementation (4)

2012-09-16 20:54
Part 4 describes how to compute Top 50 statistics for the inverted index of each field.

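The script interacts with the user in the shell and uses the input to decide which database to load. It then sums the occurrence counts for each key in the dictionary and calls the sorted function to rank the totals. The code is as follows: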

import pickle

def frncy(Field_Name, Dbase_Name):
    # Pickle files should be opened in binary mode
    mydb = open(Dbase_Name, 'rb')
    mapping = pickle.load(mydb)
    mydb.close()

    # Total frequency of each token across all documents
    mapping_cnt = {}
    for key in mapping.keys():
        cnt = 0
        for doc in mapping[key]:
            # doc[1] is the number of times 'key' occurs in document doc[0]
            cnt = cnt + doc[1]
        mapping_cnt[key] = cnt

    print
    print '********Top 50 Tokens and Their Frequency for \'', Field_Name, '\'*********'

    # Sort by frequency, highest first; the slice copes with fewer than 50 tokens
    sorted_list = sorted(mapping_cnt.iteritems(), key=lambda a: a[1], reverse=True)
    for i, tmp in enumerate(sorted_list[:50]):
        print '* Token', i+1, ': ', tmp[0], '   ( Frequency:', tmp[1], ')'
    print '**********************************************************'

def main():
    print '( ------- Top 50 Tokens for Any Field -------- )'
    while True:
        field_name = raw_input('* Please input field name(\'To\'|\'From\'|\'Subject\'|\'body\', \'q\' to quit): ')
        if field_name == 'q':
            # Leave the input loop and end the program
            break
        elif field_name == 'To':
            frncy('To', 'dbase_to')
        elif field_name == 'From':
            frncy('From', 'dbase_from')
        elif field_name == 'Subject':
            frncy('Subject', 'dbase_subject')
        elif field_name == 'body':
            frncy('mail body', 'dbase_body')
        else:
            print '** Oops, no such field! Try again...'

if __name__ == '__main__':
    main()
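
The counting and ranking step can also be tried without the pickled databases. The following is only a minimal sketch: it assumes each inverted index has the shape {token: [(doc_id, count), ...]}, which is inferred from the doc[0] / doc[1] access in the listing above rather than stated in the article. The names top_tokens and sample_index are hypothetical, and collections.Counter is swapped in for the manual dictionary plus sorted call. It should run on Python 2.7 as well as Python 3.

```python
from collections import Counter


def top_tokens(mapping, n=50):
    """Return up to n (token, total_frequency) pairs, highest total first.

    Assumes mapping looks like {token: [(doc_id, count), ...]}; this shape
    is inferred from the article's listing, not confirmed by it.
    """
    totals = Counter()
    for token, postings in mapping.items():
        # Sum the per-document counts to get the token's overall frequency
        totals[token] = sum(count for _doc_id, count in postings)
    # most_common returns fewer than n pairs when the index is small
    return totals.most_common(n)


# Hypothetical toy index, for illustration only
sample_index = {
    'meeting': [('mail_1', 2), ('mail_7', 1)],
    'report': [('mail_2', 5)],
    'lunch': [('mail_3', 1)],
}
top_two = top_tokens(sample_index, n=2)
```

Here top_two holds the two most frequent tokens of the toy index. Calling top_tokens(sample_index) with the default n=50 simply returns all three entries, so no special handling is needed for fields with fewer than 50 distinct tokens.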

Result: the Top 50 tokens for the "Subject" field.
