[Learning Python Web Scraping] 1. Scraping a Local Web Page

2017-09-27 05:01, 316 views

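I'm skipping basic Python syntax here, although getting through it was hard work too.

The general idea is to locate the information you want by where it sits in the page structure, expressed as a CSS selector.

First, BeautifulSoup. It is a fantastic tool: with it you can parse pretty much any web page (at least as far as my own experience goes). It parses an HTML document into a tree (open a page's source and you can see the clearly nested levels of markup). Every node in the tree is a Python object, and all of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

I won't go into the details of each one here.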
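For readers who want a quick look at the four kinds of object anyway, here is a minimal sketch. The HTML snippet and the product name in it are made up for illustration, and the built-in `html.parser` is used only so that the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, not taken from the page scraped in this post
html = "<div class='caption'><!-- product title --><h4><a href='#'>EarPhone</a></h4></div>"
soup = BeautifulSoup(html, "html.parser")

link = soup.a                    # a Tag: an element such as <a> or <div>
text = link.string               # a NavigableString: the text inside a tag
comment = soup.div.contents[0]   # a Comment: the <!-- ... --> node

print(type(soup).__name__)     # BeautifulSoup (the whole document)
print(type(link).__name__)     # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```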

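How do you get an element's CSS selector? Right-click on the page, choose Inspect, then on the relevant line of code choose Copy selector. That gives you the selector.

For example, take a text title on this page:

Its selector is: body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a

However, if you pass that to select (a BeautifulSoup method), you pin down only this one title. To find all the titles, I stripped out the position-specific parts and ended up with: div.caption > h4 > a.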
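The difference between the two selectors can be seen in a small sketch. The HTML below is a simplified, made-up stand-in for the real page (the `thumbnail` wrapper and the product names are assumptions), and `:nth-of-type` is used in place of the copied `:nth-child` to keep the example short.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup with two product cards
html = """
<body>
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$24.99</h4>
    <h4><a href="#">EarPhone</a></h4>
  </div>
</div>
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$64.99</h4>
    <h4><a href="#">New Pocket</a></h4>
  </div>
</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Position-specific selector: pins down one title only
specific = soup.select("body > div:nth-of-type(1) > div.caption > h4:nth-of-type(2) > a")
# Generalized selector: matches the title link in every card
general = soup.select("div.caption > h4 > a")

print([a.get_text() for a in specific])  # ['EarPhone']
print([a.get_text() for a in general])   # ['EarPhone', 'New Pocket']
```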

The trickiest part here is counting the stars.

Two pieces of code deal with the stars:

stars = soup.select('div.ratings > p:nth-of-type(2)')
...
'star': len(star.find_all('span', class_='glyphicon glyphicon-star'))

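The first line finds the region that contains all the stars. But some stars are solid and some are hollow, and what I want is the number of solid ones. So I use find_all to pick out the stars whose class is glyphicon glyphicon-star (the class of a solid star). (find_all: skill acquired.)

The rest is in the full code: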
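A minimal sketch of the counting step on its own, using a made-up ratings block with three solid and two hollow stars. The class name `glyphicon-star-empty` for hollow stars and the review text are assumptions for illustration. The last line shows an alternative that matches the single class token with a CSS selector, which should give the same count here.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: 3 solid stars, 2 hollow stars
html = """
<div class="ratings">
  <p class="pull-right">65 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The second <p> inside div.ratings holds all the stars
star_area = soup.select("div.ratings > p:nth-of-type(2)")[0]

# A class_ string containing a space is compared with the whole class attribute,
# so only the solid stars match
solid = star_area.find_all("span", class_="glyphicon glyphicon-star")
all_spans = star_area.find_all("span")

# Alternative: match the single class token with a CSS selector
by_token = star_area.select("span.glyphicon-star")

print(len(solid), len(all_spans), len(by_token))  # 3 5 3
```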

from bs4 import BeautifulSoup

path = "/webcrawl/SourceCode/Plan-for-combating-master/week1/1_2/1_2answer_of_homework/1_2_homework_required/index.html"
with open(path, 'r') as f:
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(f, 'lxml')

heads = soup.select('div.caption > h4 > a')
prices = soup.select('div.caption > h4.pull-right')
articles = soup.select('div > div.caption > p')
remark_texts = soup.select('div.ratings > p.pull-right')
stars = soup.select('div.ratings > p:nth-of-type(2)')

for head, price, article, remark_text, star in zip(heads, prices, articles, remark_texts, stars):
    data = {
        'head': head.get_text(),
        'price': price.get_text(),
        'article': article.get_text(),
        'remark': remark_text.get_text(),
        # Filter by class to match only solid stars; find_all returns a list, so len() gives the count
        'star': len(star.find_all('span', class_='glyphicon glyphicon-star'))
    }
    # Print every product rated higher than 3 stars
    if data['star'] > 3:
        print(data)
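One thing to keep in mind about the loop above: zip pairs the five lists purely by position and stops at the shortest one, so a product with a missing field would silently shift or drop entries. A possible alternative, sketched here under the assumption that each product sits in its own wrapper element (the `div.thumbnail` wrapper, the `parse_card` helper, and the sample data are hypothetical), is to select each product card first and then query inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each product wrapped in its own div.thumbnail
html = """
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$24.99</h4>
    <h4><a href="#">EarPhone</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">65 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
    </p>
  </div>
</div>
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$64.99</h4>
    <h4><a href="#">New Pocket</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">12 reviews</p>
    <p>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star"></span>
      <span class="glyphicon glyphicon-star-empty"></span>
    </p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def parse_card(card):
    """Extract the fields of one product from its own card element."""
    return {
        "head": card.select_one("div.caption > h4 > a").get_text(strip=True),
        "price": card.select_one("div.caption > h4.pull-right").get_text(strip=True),
        "star": len(card.select("div.ratings span.glyphicon-star")),
    }


products = [parse_card(card) for card in soup.select("div.thumbnail")]
top_rated = [p for p in products if p["star"] > 3]
print(top_rated)
```

Because every field is looked up inside its own card, the fields of different products cannot get mixed up, at the cost of depending on the wrapper's class name.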

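In short, the essence of this work is to spot the pattern in the CSS selectors of the elements on a page, then strip out the position-specific parts so that one selector matches a whole class of items.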
