
Learning Python: HTML Parsing

2015-09-20 00:18 · 507 views


This is a Python HTML parsing tool that extracts the content of tags matching given characteristics. When I started, I had not yet discovered the powerful BeautifulSoup library, and I also wanted to keep the program efficient (it only needs to run once), so I implemented my own utility that locates page elements by their HTML tags. It parses the page purely by string searching; it does not classify the page's elements or build a tree from them.
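For comparison, Python's standard library also ships an event-based parser, html.parser, which fires callbacks as tags are encountered. A minimal sketch (not part of the original program) that grabs the text inside a <title> tag:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collect the text found between <title> and </title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed('<html><head><title>JX3 News</title></head></html>')
print(p.title)  # JX3 News
```

The search-based approach below avoids subclassing and callbacks, at the cost of handling tag nesting by hand.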

The program targets Python 3.0 and above and consists of two parts: a web client and an HTML element parser. The implementation follows.

The webclient page fetcher

The urllib module is used here to request the page URL and return the page content.

import urllib
import urllib.error
import urllib.request

class HTMLClient:
    def GetPage(self, url):
        # Pretend to be a regular browser; some sites reject urllib's default agent.
        #user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        user_agent = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36'
        headers = { 'User-Agent' : user_agent }
        req = urllib.request.Request(url, None, headers)
        try:
            res = urllib.request.urlopen(req)
            return res.read().decode("utf-8")
        except urllib.error.HTTPError:
            return None

    def GetPic(self, url):
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib.request.Request(url, None, headers)
        try:
            res = urllib.request.urlopen(req)
            # Binary content: return the raw bytes without decoding.
            return res.read()
        except urllib.error.HTTPError:
            return None
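The request construction can be checked offline (no network needed). One detail worth knowing: urllib normalizes header names when storing them, capitalizing only the first letter. A small sketch:

```python
import urllib.request

# Build a Request with a custom User-Agent, the same technique HTMLClient uses.
req = urllib.request.Request(
    'http://xw.jx3.xoyo.com/news/',
    headers={'User-Agent': 'Mozilla/5.0 (compatible; demo)'},
)
# urllib stores header keys via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))  # Mozilla/5.0 (compatible; demo)
```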


HTML element parsing

Here we exploit the fact that HTML tags come in matched pairs. The algorithm finds a start tag, then counts occurrences of the same element tag (such as div) as it scans forward; once the number of <div> tags seen equals the number of </div> tags, the data in between is returned as one block. The code:

class Simple_Parser:
    def Parser(self, data, stag, etag):
        """Return the length of the block starting at data[0] with stag,
        up to and including its matching etag; -1 if unbalanced."""
        totallen = 0
        elen = len(etag)
        dlen = len(data)
        depth = 0
        while True:
            epos = data[totallen:].find(etag)
            if epos < 0:
                return -1
            # Every start tag seen before this end tag raises the depth;
            # the end tag itself lowers it by one.
            depth += data[totallen:epos + totallen].count(stag)
            depth -= 1
            epos += elen
            totallen += epos
            if depth == 0:
                return totallen
            if totallen >= dlen:
                return -1

    def feed(self, data, stag, etag):
        """Collect every top-level stag...etag block in data."""
        headtag = stag.split(' ')[0]  # e.g. '<div', so nested '<div>' tags also count
        itemList = []
        pseek = 0
        while True:
            npos = data[pseek:].find(stag)
            if npos < 0:
                return itemList
            nlen = self.Parser(data[pseek + npos:], headtag, etag)
            if nlen >= 0:
                itemList.append(data[pseek + npos:pseek + npos + nlen])
                pseek += npos + nlen
            else:
                return itemList
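The tag-counting idea can be demonstrated on a small nested snippet. This is a self-contained sketch of the same algorithm, compressed into one function rather than the class above:

```python
def extract_blocks(data, stag, etag):
    """Collect every top-level stag...etag block, honoring nesting."""
    blocks, seek = [], 0
    head = stag.split(' ')[0]  # '<div' also counts nested '<div>' tags
    while True:
        start = data.find(stag, seek)
        if start < 0:
            return blocks
        depth, pos = 0, start
        while True:
            end = data.find(etag, pos)
            if end < 0:
                return blocks  # unbalanced: stop here
            # Start tags between pos and the end tag raise the depth;
            # the end tag itself lowers it by one.
            depth += data.count(head, pos, end) - 1
            pos = end + len(etag)
            if depth == 0:
                blocks.append(data[start:pos])
                seek = pos
                break

html = '<div class="outer">a<div>b</div>c</div><div>next</div>'
print(extract_blocks(html, '<div class="outer">', '</div>'))
# ['<div class="outer">a<div>b</div>c</div>']
```

Note how the inner <div>b</div> pair is kept inside the extracted block instead of terminating it early, and the sibling <div>next</div> is excluded because it does not match the start tag.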


A scraping example

Finally, I picked the news page of JX3 (剑侠情缘网络版三, the JX Online 3 game) as the scraping target. The code:

class JX3_Spider:
    def Get_News(self, page):
        myparser = Simple_Parser()
        return myparser.feed(page, u'<div class="news_list news_list02">', u'</div>')

    def Get_CSS(self, page):
        myparser = Simple_Parser()
        return myparser.feed(page, u'<link ', u'/>')

if __name__ == '__main__':
    myclient = HTMLClient()
    mypage = myclient.GetPage("http://xw.jx3.xoyo.com/news/")
    jx3_spider = JX3_Spider()
    jx3_news = jx3_spider.Get_News(mypage)
    jx3_css = jx3_spider.Get_CSS(mypage)
    input('>')  # pause before writing the result
    with open("jx3_news.html", 'wb') as jx3file:
        jx3file.write(b'<head>')
        jx3file.write(b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">')
        # Keep the site's stylesheets so the saved page still renders correctly.
        for item in jx3_css:
            jx3file.write(bytes(item, 'utf-8'))
        for item in jx3_news:
            jx3file.write(bytes(item, 'utf-8'))
Tags: python crawler html dom parse