
A Simple Look at Python's HTMLParser

2016-06-25 21:32

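Pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and save a copy (s.html in the code below), then parse the HTML and print the time, name, and location of each conference listed on the Python website.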
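If you would rather script the download than copy the source by hand, something like the following could work. It is only a sketch: save_page is a hypothetical helper name, the page is assumed to be UTF-8, and a site may reject the default urllib user agent.

```python
from pathlib import Path
from urllib.request import urlopen


def save_page(url, path, encoding="utf-8"):
    """Download url and write its body to path as text.

    Returns the number of characters written.
    The encoding is an assumption; adjust it if the page uses another one.
    """
    with urlopen(url) as response:
        html = response.read().decode(encoding)
    Path(path).write_text(html, encoding=encoding)
    return len(html)
```

Calling save_page('https://www.python.org/events/python-events/', 's.html') would then produce the file that the parser reads.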

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    # Flags recording which element the parser is currently inside
    in_title = False
    in_loca = False
    in_time = False

    def handle_starttag(self, tag, attrs):
        if ('class', 'event-title') in attrs:
            self.in_title = True
        elif ('class', 'event-location') in attrs:
            self.in_loca = True
        elif tag == 'time':
            self.in_time = True
            # Start a fresh list for the text pieces of this <time> element
            self.times = []

    def handle_data(self, data):
        if self.in_title:
            print('-' * 50)
            print('Title:' + data.strip())
        if self.in_loca:
            print('Location:' + data.strip())
        if self.in_time:
            # Text inside <time> can arrive in several pieces; keep them all
            self.times.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_title = False
        if tag == 'span':
            self.in_loca = False
        if tag == 'time':
            self.in_time = False
            print('Time:' + '-'.join(self.times))

parser = MyHTMLParser()
# s.html is the page source saved from the browser (assumed to be UTF-8)
with open('s.html', encoding='utf-8') as html:
    parser.feed(html.read())

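The parts to focus on are the time branches of handle_starttag and handle_endtag. HTMLParser hands the text of a page to handle_data piece by piece, one string at a time. Take this markup from the events page: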

　　<h3 class="event-title"><a href="/events/python-events/401/">PyOhio 2016</a></h3>

　　<span class="event-location">The Ohio Union at The Ohio State University. 1739 N. High Street, Columbus, OH 43210, USA</span>

　　<time datetime="2016-07-29T00:00:00+00:00">29 July – 01 Aug. <span class="say-no-more"> 2016</span></time>

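When the text contains a character reference (for example the dash above, if the page source writes it as &ndash;), the parser may not pass it to handle_data at all: with convert_charrefs=False, the default before Python 3.5, it goes to handle_entityref instead, and the text before and after it arrives as two separate strings. A nested tag splits the text in the same way, so the year 2016 inside the <span> is delivered separately from the rest of the date. The two time branches work together to cope with this: handle_starttag starts an empty list, handle_data appends every piece while in_time is set, and handle_endtag joins the pieces once </time> is reached, so the 2016 from the span is included.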
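To see the splitting directly, a small recorder can log each callback. This is a sketch rather than part of the original code: ChunkRecorder, record, and SAMPLE are hypothetical names, and SAMPLE assumes the dash is written as &ndash; in the page source.

```python
from html.parser import HTMLParser


class ChunkRecorder(HTMLParser):
    """Record text and entity callbacks so the splitting is visible."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False
        self.events.append(("entityref", name))


def record(html, convert_charrefs):
    """Parse html and return the list of recorded callbacks."""
    parser = ChunkRecorder(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events


# Assumed form of the page source, with the dash as an entity reference
SAMPLE = ('<time>29 July &ndash; 01 Aug. '
          '<span class="say-no-more"> 2016</span></time>')

for flag in (False, True):
    print(flag, record(SAMPLE, flag))
```

With convert_charrefs=False the date range arrives as two strings with an ndash entity callback between them; with True it arrives as one string that already contains the dash. In both cases " 2016" is a separate piece because the <span> tag interrupts the text, which is why the parser collects the pieces in a list.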
