Segmenting Chinese documents with jieba | Deduplicating a stopword list

2016-11-27 11:02, 681 views

1. Segmenting a Chinese document with jieba

# -*- coding: utf-8 -*-
# @Time    : 17-8-4 9:26 AM
# @Author  : 未来战士biubiu！！
# @FileName: test.py
# @Software: PyCharm Community Edition
# @Blog    : http://blog.csdn.net/u010105243/article/
# Python3
import jieba

# jieba.load_userdict('userdict.txt')  # optionally load a custom dictionary


# Build the stopword collection (the file holds one word per line)
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        # A set keeps the membership test in seg_sentence fast
        return {line.strip() for line in f}


# Segment one sentence and drop the stopwords
def seg_sentence(sentence, stopwords):
    sentence_seged = jieba.cut(sentence.strip())
    # Keep every token that is neither a stopword nor a tab
    words = [word for word in sentence_seged
             if word not in stopwords and word != '\t']
    return ' '.join(words)


# Load the stopwords once, not once per sentence
stopwords = stopwordslist('./test/stopwords.txt')  # path to the stopword file

with open('./test/input.txt', 'r', encoding='utf-8') as inputs, \
        open('./test/output.txt', 'w', encoding='utf-8') as outputs:
    for line in inputs:
        line_seg = seg_sentence(line, stopwords)  # the return value is a string
        outputs.write(line_seg + '\n')
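
The filtering step in seg_sentence can be tried without any input files. The following is a minimal sketch, not the author's code: remove_stopwords is a hypothetical helper, and the sample tokens are written by hand to stand in for the output of jieba.cut. Unlike the script above, which only skips tabs, it also drops every whitespace-only token.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stopwords, joined by single spaces."""
    stopword_set = set(stopwords)  # set lookup is fast for long stopword lists
    # Assumption: whitespace-only tokens are dropped as well
    kept = [tok for tok in tokens if tok not in stopword_set and tok.strip()]
    return " ".join(kept)


# Hand-written tokens standing in for the output of jieba.cut
sample_tokens = ["我", "今天", "去", "了", "图书馆"]
sample_stopwords = ["我", "了"]
result = remove_stopwords(sample_tokens, sample_stopwords)
print(result)
```

In a real run the tokens would come from jieba.cut(sentence) and the stopwords from the stopword file.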

2. Deduplicating the stopword list

Stopword lists collected from the web may contain duplicate entries. The code below removes them.

# The stopword file stores one word per line
# Python3
def stopwd_reduction(infilepath, outfilepath):
    seen = set()  # words that have already been written
    with open(infilepath, 'r', encoding='utf-8') as infile, \
            open(outfilepath, 'w', encoding='utf-8') as outfile:
        for line in infile:
            word = line.strip()
            # Skip blank lines and words that were already written
            if word and word not in seen:
                seen.add(word)
                outfile.write(word + '\n')


stopwd_reduction('./test/stopwords.txt', './test/stopword.txt')
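
When the list fits in memory, the same deduplication can be written more compactly. This is a minimal sketch under that assumption: dedupe_keep_order is a hypothetical helper, and it relies on dict preserving insertion order, which Python guarantees from 3.7 on.

```python
def dedupe_keep_order(words):
    """Drop blank entries and repeats, keeping the first occurrence of each word."""
    cleaned = (w.strip() for w in words)
    # dict.fromkeys keeps only the first occurrence of each key, in order
    return list(dict.fromkeys(w for w in cleaned if w))


# Sample list with repeats, a blank entry, and trailing whitespace
sample_words = ["的", "了", "的", "", "和", "了 "]
unique_words = dedupe_keep_order(sample_words)
print(unique_words)
```

The words could be read from the stopword file line by line and the result written back with one word per line.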

3. Stopword list

This is a Chinese stopword list that I merged for my own needs. Anyone who needs it can download it: download link
