Word Similarity Computation: 3. Crawling Wikipedia Articles with urllib and Parsing the HTML with BeautifulSoup
2016-03-21 10:18
For a detailed introduction, see:
http://blog.csdn.net/mmc2015/article/details/50923309
The complete code is given below for reference.
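#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: fetch Wikipedia articles and save their leading paragraphs.
import pandas as pd
import numpy as np
import urllib, urllib2
import re
from bs4 import BeautifulSoup
import sys

# Work around UnicodeEncodeError in Python 2 by making UTF-8 the default encoding.
reload(sys)
sys.setdefaultencoding("utf8")


def SaveFile(content, filename):
    # Append the text to wikiData/<filename>; the wikiData directory must already exist.
    f = open("wikiData/" + filename, "a")
    f.write(str(content) + "\n")
    f.close()


def SpideWiki(words):
    # Send a browser-like User-Agent header with every request.
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    try:
        for i in range(len(words)):
            # Each word is used directly as the article title.
            url = "https://en.wikipedia.org/wiki/" + words[i]
            request = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(request)
            wikiHtml = response.read().decode('utf-8')
            html = BeautifulSoup(str(wikiHtml), "lxml")
            # The article body lives in <div id="mw-content-text">.
            div = html.find(name='div', id='mw-content-text')
            # Keep at most three <p> elements that are direct children of that div.
            ps = div.find_all(name='p', limit=3, recursive=False)
            for p in ps:
                pText = p.get_text()
                SaveFile(pText, words[i])
            print words[i], "process over...", "==" * 20
    except urllib2.URLError as e:
        # HTTPError carries a status code; a plain URLError only carries a reason.
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason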
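The listing above targets Python 2 (urllib2, print statements). As a rough illustration only, the two core steps, building the request and pulling out the leading paragraphs, could be sketched in Python 3 as follows. The names build_request, first_paragraphs, WIKI_BASE and USER_AGENT are hypothetical and not part of the original script. The sketch also makes three assumptions that differ from the original: it percent-encodes the title, it uses the built-in "html.parser" instead of lxml, and it searches all descendants of mw-content-text rather than only direct children, since the paragraphs may be nested inside a wrapper div on some pages.

```python
from urllib.parse import quote
from urllib.request import Request

from bs4 import BeautifulSoup

WIKI_BASE = "https://en.wikipedia.org/wiki/"
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(word):
    """Build a request for the word's article (hypothetical helper)."""
    # Assumption: spaces become underscores, and the rest is percent-encoded.
    title = quote(word.replace(" ", "_"))
    return Request(WIKI_BASE + title, headers={"User-Agent": USER_AGENT})


def first_paragraphs(html, limit=3):
    """Return the text of up to `limit` non-empty <p> elements in the article body."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="mw-content-text")
    if div is None:
        return []
    texts = []
    # Assumption: search every descendant <p>, not only direct children.
    for p in div.find_all("p"):
        if len(texts) >= limit:
            break
        text = p.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            texts.append(text)
    return texts
```

To fetch a real page, the request would be passed to urllib.request.urlopen and the decoded response body handed to first_paragraphs; error handling and file output would follow the same pattern as SpideWiki and SaveFile above.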