A Simple Look at HTMLParser in Python
2016-06-25 21:32
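Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it into a local file (the code reads it from s.html). Then try parsing the HTML to print the date, name, and location of each conference listed on the Python website.
The part to focus on is how the <time> element is handled: the `elif tag == 'time'` branch in handle_starttag and the matching `if tag == 'time'` branch in handle_endtag. When Python's HTMLParser parses the text of a page, it hands that text to handle_data one string at a time, and the text of a single element can arrive in several pieces. Take this markup: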
<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>
<span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>
<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>
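With convert_charrefs=False (the default behavior before Python 3.5), a character entity such as &ndash; (the en dash in the <time> element) is not passed to handle_data. It goes to handle_entityref instead, so unless that method is overridden the entity is skipped and the text before and after it arrives as two separate strings. From Python 3.5 on, the default convert_charrefs=True converts the entity to its character and leaves it in the text. Either way, the nested <span> tag splits the text, so the year arrives in its own handle_data call. That is why the two `time` branches work together: the start tag resets self.times, handle_data appends every piece, and the end tag joins them, which is how the year 2016 inside the <span> gets picked up.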
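To make the splitting visible, here is a minimal sketch that records every string passed to handle_data for the <time> markup, once with convert_charrefs=False and once with convert_charrefs=True. The names ChunkRecorder and data_chunks are hypothetical helpers, not part of the original code, and the &ndash; entity in the sample is an assumption about how the en dash appears in the page source.

```python
from html.parser import HTMLParser

# Sample markup; the &ndash; entity is assumed to match the page source.
SAMPLE = ('<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. '
          '<span class="say-no-more"> 2016</span></time>')


class ChunkRecorder(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def data_chunks(html, convert_charrefs):
    """Parse html and return the list of text pieces seen by handle_data."""
    parser = ChunkRecorder(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Entity goes to handle_entityref (not overridden), so it is dropped
# and the surrounding text is split into two pieces.
legacy_chunks = data_chunks(SAMPLE, convert_charrefs=False)
# Entity is converted to the en dash character and stays in the text.
modern_chunks = data_chunks(SAMPLE, convert_charrefs=True)

print(legacy_chunks)  # ['29 July ', ' 01 Aug. ', ' 2016']
print(modern_chunks)  # ['29 July – 01 Aug. ', ' 2016']
```

In both modes the year is a separate piece, because the <span> start tag always ends the current run of text.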
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    # Flags recording which element the parser is currently inside.
    in_title = False
    in_loca = False
    in_time = False

    def handle_starttag(self, tag, attrs):
        if ('class', 'event-title') in attrs:
            self.in_title = True
        elif ('class', 'event-location') in attrs:
            self.in_loca = True
        elif tag == 'time':
            # Start a fresh list; the text of <time> arrives in several pieces.
            self.in_time = True
            self.times = []

    def handle_data(self, data):
        if self.in_title:
            print('-' * 50)
            print('Title:' + data.strip())
        if self.in_loca:
            print('Location:' + data.strip())
        if self.in_time:
            # Collect every piece, including the year inside the nested <span>.
            self.times.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_title = False
        if tag == 'span':
            self.in_loca = False
        if tag == 'time':
            # </time> closes the element: join the collected pieces and print.
            self.in_time = False
            print('Time:' + '-'.join(self.times))


parser = MyHTMLParser()
# s.html holds the page source copied from the browser.
with open('s.html', encoding='utf-8') as html:
    parser.feed(html.read())
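The same collect-then-join idea can be applied to all three fields. The following is only a sketch of one possible variant, not the code above: EventParser and parse_events are hypothetical names, the parser returns a list of dicts instead of printing, and it runs on the sample markup as a string rather than on s.html. It assumes Python 3.5 or later (where convert_charrefs defaults to True) and that the en dash appears as &ndash; in the page source.

```python
from html.parser import HTMLParser

# Sample markup from the post; the &ndash; entity is an assumption
# about how the en dash appears in the page source.
SAMPLE = """
<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>
<span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>
<time datetime="2016-07-29T00:00:00+00:00">29 July &ndash; 01 Aug. <span class="say-no-more"> 2016</span></time>
"""


class EventParser(HTMLParser):
    """Hypothetical variant: collects events as dicts instead of printing."""

    def __init__(self):
        super().__init__()
        self.events = []     # one dict per event, in page order
        self.field = None    # field currently being read, or None
        self.end_tag = None  # tag whose end closes the current field
        self.parts = []      # text pieces collected for the current field

    def handle_starttag(self, tag, attrs):
        if self.field is not None:
            return  # nested tag (e.g. the year <span>): keep collecting
        if ('class', 'event-title') in attrs:
            self.events.append({})  # a title starts a new event
            self.field = 'title'
        elif ('class', 'event-location') in attrs:
            self.field = 'location'
        elif tag == 'time':
            self.field = 'time'
        if self.field is not None:
            self.end_tag = tag
            self.parts = []

    def handle_data(self, data):
        if self.field is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if self.field is not None and tag == self.end_tag:
            # Join the pieces and collapse runs of whitespace.
            text = ' '.join(''.join(self.parts).split())
            if self.events:  # ignore fields seen before any title
                self.events[-1][self.field] = text
            self.field = None


def parse_events(html):
    """Return a list of {'title', 'location', 'time'} dicts found in html."""
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


events = parse_events(SAMPLE)
print(events)
```

Returning data rather than printing makes the result easy to check or reuse. One known limitation of this sketch: a field that contains a nested element with the same tag name as its own (for example a <span> inside the location <span>) would be closed too early.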