
Python Network Programming: A First Look at Processing Web Data with NLTK and Beautiful Soup

2018-05-11 20:00

Prerequisites

Fetching web page data with Python's BeautifulSoup package requires the package to be installed on your machine. It can be installed with pip:

pip3 install BeautifulSoup4

Loading web page data: urlopen(URL)

Reading web page data in Python works much like ordinary file I/O. The following code reads the Wikipedia page for "Smog":

from urllib.request import urlopen
from bs4 import BeautifulSoup

raw = urlopen("http://en.wikipedia.org/wiki/Smog").read()
print(type(raw))
print(raw[100:200])

The data that comes back is a byte string (type bytes), not text.
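Because the result is bytes, it usually has to be decoded before it can be handled as text. The sketch below shows one way to do that, together with a request that carries an explicit User-Agent header. The helper names build_request and decode_page, the User-Agent string, and the choice of UTF-8 are assumptions made for illustration; they do not appear in the original code, and some sites may behave differently.

```python
from urllib.request import Request


def build_request(url, user_agent="example-script/0.1"):
    """Return a Request that identifies the client via a User-Agent header.

    The user_agent value is a placeholder, not a required string.
    """
    return Request(url, headers={"User-Agent": user_agent})


def decode_page(raw, encoding="utf-8"):
    """Turn the bytes returned by .read() into a str.

    errors="replace" substitutes U+FFFD for undecodable bytes
    instead of raising UnicodeDecodeError.
    """
    return raw.decode(encoding, errors="replace")


# Building the request does not touch the network; urlopen(req) would.
req = build_request("https://en.wikipedia.org/wiki/Smog")
print(req.get_header("User-agent"))

# A stand-in for the bytes that urlopen(...).read() returns.
sample = "<title>Smog - Wikipedia</title>".encode("utf-8")
print(type(sample), type(decode_page(sample)))
```

BeautifulSoup accepts the raw bytes directly, so decoding by hand is mainly useful when the page is inspected or searched without a parser.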

Processing the text: BeautifulSoup and regular expressions

Passing the data to the BeautifulSoup class turns it into a BeautifulSoup object:

soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

The following code collects the paragraph text found between the <p></p> tags of the page and stores it in the list texts:

texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])
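To make it clearer what "the text between <p></p> tags" means, here is a small offline sketch that uses only the standard library's html.parser module on a hand-written HTML snippet. It is an illustration of the idea, not how BeautifulSoup works internally; the class name ParagraphCollector and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p>...</p> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <p> elements currently open
        self.parts = []       # text fragments of the paragraph being read
        self.paragraphs = []  # finished paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <p> just closed: store its text.
                self.paragraphs.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        # Text outside any <p> (headings, menus, ...) is ignored.
        if self.depth > 0:
            self.parts.append(data)


sample_html = (
    "<html><body><h1>Smog</h1>"
    "<p>Smog is a type of <b>air pollution</b>.[1]</p>"
    "<div>menu</div>"
    "<p>It reduces visibility.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
print(collector.paragraphs)
```

As with para.text, markup nested inside a paragraph (the <b> element here) is dropped and only its text is kept, while text outside the paragraphs is skipped.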

To strip all citation markers (such as "[1]" and "[2]") from the collected text, process it as follows:

import re
regex = re.compile(r'\[[0-9]*\]')
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print(type(joined_texts))
print(joined_texts)

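The string joined_texts is first built by joining the elements of texts with newline characters. It then serves as the input text, and every Wikipedia citation marker in it is replaced with an empty string.

The resulting joined_texts can then be processed further, for example: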
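The effect of the substitution is easiest to see on a short string. The sketch below applies the same idea to a made-up sentence; the helper name strip_citations and the sample text are assumptions for illustration. It uses [0-9]+ rather than [0-9]*, so that empty brackets "[]" are left alone, which is a small deliberate difference from the pattern in the article.

```python
import re

# One or more digits between square brackets, e.g. "[1]" or "[23]".
CITATION = re.compile(r"\[[0-9]+\]")


def strip_citations(text):
    """Remove Wikipedia-style numeric citation markers from text."""
    return CITATION.sub("", text)


sample = "Smog is a type of air pollution.[1] It reduces visibility.[2][13]"
print(strip_citations(sample))
```

Markers such as "[citation needed]" or "[a]" are not numeric and would need a broader pattern.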

import nltk
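# word_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once via nltk.download()
wordlist = nltk.word_tokenize(joined_texts)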
print(wordlist[:8])
good_text = nltk.Text(wordlist)
good_text.concordance('smog')

This prints the occurrences of the word "smog" together with their surrounding context.
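For readers who want to see what a concordance is without installing NLTK, the following is a rough standard-library sketch of the idea: find every occurrence of a word and show a window of text around it. It is not NLTK's implementation, and the function name simple_concordance, the width parameter, and the sample text are invented for this example.

```python
import re


def simple_concordance(text, word, width=30):
    """Return each occurrence of word with up to width characters
    of context on either side (case-insensitive, whole words only)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten newlines so each hit prints on a single line.
        lines.append(text[start:end].replace("\n", " "))
    return lines


sample = "Smog is a type of air pollution.\nThe word smog combines smoke and fog."
for line in simple_concordance(sample, "smog", width=15):
    print(line)
```

NLTK's concordance works on the token list and aligns the hits in columns, so its output looks different, but the information shown is of the same kind.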

For more on NLTK, see:

NLTK Primer 1: An overview of text statistics, search, and word frequency counts

NLTK Primer 2: First steps in text analysis with NLTK

Writing out the processed document

Finally, the processed document can be written to a file. The output can be the word list itself:

NLTK_file = open("NLTK-Smog.txt", "w", encoding='UTF-8')
NLTK_file.write(str(wordlist))
NLTK_file.close()

or the processed text:

text_file = open("Smog-text.txt", "w", encoding='UTF-8')
text_file.write(joined_texts)
text_file.close()
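Writing str(wordlist) stores the Python representation of the list, brackets and quotes included. If the tokens are meant to be read back later, one token per line may be easier to work with. The sketch below shows that variant with a with statement, which closes the file even if an error occurs. The helper names save_tokens and load_tokens and the temporary file are assumptions for illustration, and the approach assumes that no token contains a newline.

```python
import os
import tempfile


def save_tokens(tokens, path):
    """Write one token per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(tokens))


def load_tokens(path):
    """Read back a file written by save_tokens."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split("\n")


tokens = ["Smog", "is", "a", "type", "of", "air", "pollution", "."]

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tokens.txt")
    save_tokens(tokens, path)
    restored = load_tokens(path)

print(restored == tokens)
```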

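Applying this to other languages

Using Beautiful Soup on pages in other languages (Chinese web pages, for example) still requires special handling, and the code in this article may not work as-is. For instance, Chinese pages often contain non-breaking spaces (u'\xa0') that have to be filtered out, and many Chinese websites are structured differently from typical international ones, so several cleanup passes are often needed. The following is a typical script, written to scrape a Baidu Baike page for a neighbor who is a celebrity fan: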

# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

raw = urlopen("https://baike.baidu.com/item/%E7%8E%8B%E4%BF%8A%E5%87%AF/75850").read()

# Test code to check whether the request succeeded:
#print(type(raw))
#print(raw[100:200])

text_soup = BeautifulSoup(raw, 'html.parser')
#print(text_soup)

texts = []
for para in text_soup.find_all('div'):
    text = para.text
    texts.append(text)

# Page-specific offsets: keep only the middle slice of the collected div texts
texts = texts[72:-104]
#print(texts)

regex = re.compile(r'(<cite>([^<>/].+?)</cite>)+')
joined_texts = ''.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
regex = re.compile(r'\[[0-9]*\]')
joined_texts = re.sub(regex, '', joined_texts)

words = joined_texts.split('\n')
while u'\xa0' in words:
    words.remove(u'\xa0')

while '' in words:
    words.remove('')

joined_texts = '\n'.join(words)

text_file = open(u"王俊凯.txt", "w", encoding='UTF-8')
text_file.write(joined_texts)
text_file.close()
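The line cleanup in the script above can also be packaged as a small function and tried on a short string without downloading anything. The sketch below is one possible variant; the name clean_lines and the sample text are made up for illustration. Unlike the script, which only drops lines that consist of exactly one non-breaking space, this version also replaces non-breaking spaces inside a line and trims surrounding whitespace.

```python
def clean_lines(text):
    """Drop empty lines and normalise non-breaking spaces (U+00A0)."""
    cleaned = []
    for line in text.split("\n"):
        # Turn non-breaking spaces into ordinary ones, then trim the ends.
        line = line.replace("\xa0", " ").strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Two real lines surrounded by NBSP-only and empty lines.
sample = "第一行\xa0\n\xa0\n\n  第二行  \n"
print(clean_lines(sample))
```

A single pass over the lines also avoids the repeated list scans that the while ... in ... remove loops perform.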

