Python Learning: HTML Parsing
2015-09-20 00:18
This is a small Python HTML-parsing tool that extracts the contents of tags matching a given tag signature. At the time I had not yet discovered the powerful BeautifulSoup library, and I also wanted to keep the program fast (it only needs to run once), so I implemented my own parser that locates page elements by their HTML tags. It works purely by string searching; it does not classify page elements or build any tree. The program targets Python 3.0 and above and consists of two parts: a web client and an HTML element parser. The implementation follows.
webclient: the page fetcher
This part uses the urllib module to request a URL and return the page contents.

import urllib.request
import urllib.error

class HTMLClient:
    def GetPage(self, url):
        # Pretend to be a desktop browser so the server returns the full page.
        #user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        user_agent = ('Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 '
                      'Chrome/34.0.1847.116 Safari/537.36')
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(url, None, headers)
        try:
            res = urllib.request.urlopen(req)
            return res.read().decode('utf-8')
        except urllib.error.HTTPError:
            return None

    def GetPic(self, url):
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(url, None, headers)
        try:
            res = urllib.request.urlopen(req)
            # Return raw bytes; image data must not be decoded as text.
            return res.read()
        except urllib.error.HTTPError:
            return None
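As a quick check that this kind of request really carries the custom User-Agent, here is a minimal sketch using only the standard urllib.request API (no network access is made; the URL is simply the one used later in this post):

```python
import urllib.request

# Build the same kind of request HTMLClient.GetPage builds and confirm
# the custom User-Agent header is attached before any network call.
user_agent = ('Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36')
req = urllib.request.Request('http://xw.jx3.xoyo.com/news/',
                             None, {'User-Agent': user_agent})
# urllib.request stores header names capitalized, e.g. 'User-agent'.
print(req.get_header('User-agent'))
```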
HTML element parsing
Here we exploit the symmetry of HTML tags. The algorithm finds a start tag, then counts occurrences of that element's tag (for example, div); when the numbers of <div> and </div> tags balance, the text in between is returned as one block. The code:

class Simple_Parser:
    def Parser(self, data, stag, etag):
        """Return the length of the balanced stag...etag block at the
        start of data, or -1 if the tags never balance."""
        elen = len(etag)
        dlen = len(data)
        totallen = 0
        depth = 0
        while True:
            epos = data[totallen:].find(etag)
            if epos < 0:
                return -1
            # Every start tag seen before this end tag opens one level;
            # the end tag itself closes one.
            depth += data[totallen:totallen + epos].count(stag)
            depth -= 1
            totallen += epos + elen
            if depth == 0:
                return totallen
            if totallen >= dlen:
                return -1

    def feed(self, data, stag, etag):
        """Collect every balanced stag...etag block in data."""
        headtag = stag.split(' ')[0]   # '<div class="x">' -> '<div'
        itemList = []
        pseek = 0
        while True:
            npos = data[pseek:].find(stag)
            if npos < 0:
                return itemList
            nlen = self.Parser(data[pseek + npos:], headtag, etag)
            if nlen >= 0:
                itemList.append(data[pseek + npos:pseek + npos + nlen])
                pseek += npos + nlen
            else:
                return itemList
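To see the tag-balancing logic in action, here is a condensed, standalone sketch of the same idea applied to a string with nested divs (extract_block is a hypothetical helper written for this illustration, not part of the classes above):

```python
def extract_block(data, stag, etag):
    """Return the first balanced stag...etag block in data, or None.
    `stag` may carry attributes (e.g. '<div id="a"'); nesting depth is
    counted against the bare tag prefix (e.g. '<div')."""
    start = data.find(stag)
    if start < 0:
        return None
    head = stag.split(' ')[0]           # '<div id="a"' -> '<div'
    pos = start
    depth = 0
    while True:
        end = data.find(etag, pos)
        if end < 0:
            return None                  # unbalanced markup
        # Start tags before this end tag open levels; the end tag closes one.
        depth += data.count(head, pos, end) - 1
        pos = end + len(etag)
        if depth == 0:
            return data[start:pos]

html = '<p>x</p><div id="a"><div>inner</div>tail</div><div>b</div>'
print(extract_block(html, '<div id="a"', '</div>'))
# -> <div id="a"><div>inner</div>tail</div>
```

Note how the inner </div> does not terminate the block: the nested <div> raised the depth to 2, so scanning continues until the matching outer </div>.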
Page-scraping example
Finally, I picked the news page of 剑侠情缘网络版三 (JX Online 3) as a test crawl. The code:

class JX3_Spider:
    def Get_News(self, page):
        myparser = Simple_Parser()
        return myparser.feed(page, u'<div class="news_list news_list02">', u'</div>')

    def Get_CSS(self, page):
        myparser = Simple_Parser()
        return myparser.feed(page, u'<link ', u'/>')

if __name__ == '__main__':
    myclient = HTMLClient()
    mypage = myclient.GetPage("http://xw.jx3.xoyo.com/news/")
    jx3_spider = JX3_Spider()
    jx3_news = jx3_spider.Get_News(mypage)
    jx3_css = jx3_spider.Get_CSS(mypage)
    input('>')  # pause so the fetch can be checked before writing the file
    with open("jx3_news.html", 'wb') as jx3file:
        jx3file.write(b'<head>')
        jx3file.write(b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">')
        for item in jx3_css:
            jx3file.write(bytes(item, 'utf-8'))
        for item in jx3_news:
            jx3file.write(bytes(item, 'utf-8'))
        # the with-block closes the file; no explicit close() is needed