
[Python Web Scraping] Web Scraping from Scratch: Scraping Top Movie Information

2017-08-16 17:17

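This time we scrape the movie information from a Top movie chart. As before, this is a learning article: I want to share the pitfalls I hit and the new techniques I picked up along the way, and to explain my Python scraping notes in the simplest words I can.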

I. Scraping Results

Movie posters:

Movie information:

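II. Requirements Analysis

Scrape the movie information and poster images from the TOP chart on the 2345 movie site and save them to local files. The URL is:

http://dianying.2345.com/top/

The project splits into two sub-problems:

1. How to fetch the page

2. How to parse the information and save the images to a local folder

Accordingly, we define two functions:

1. getHtml()

2. saveInfo()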

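III. DOM Tree Analysis

I recommend opening the site in Google Chrome: http://dianying.2345.com/top/

Right-click and choose Inspect to view the page structure.

Find the ul list. Everything we need is hidden inside it!

Digging further down, we find that the poster information lives here:

The movie's text information lives here:

Great. Once these little rascals have been tracked down, the rest is easy: we only need to extract each of them in turn. I always get excited at this point; perhaps that is the charm of web scraping.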
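
The structure described above can be tried out offline before touching the live site. The following is a minimal sketch, not the site's real markup: SAMPLE_HTML is a hypothetical fragment that only imitates the class names mentioned in this article (picList clearfix, sTit, sIntro), list_movies is an illustrative helper name, and the built-in html.parser is used here instead of lxml so that no extra dependency is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the article.
# The real page may differ; the URLs below are placeholders.
SAMPLE_HTML = """
<ul class="picList clearfix">
  <li>
    <img src="http://example.com/poster1.jpg">
    <span class="sTit">Movie A</span>
    <span class="sIntro">Released: 2017-01-01</span>
  </li>
  <li>
    <img src="http://example.com/poster2.jpg">
    <span class="sTit">Movie B</span>
  </li>
</ul>
"""


def list_movies(html):
    """Return (title, poster URL) pairs, one per <li> in the movie list."""
    soup = BeautifulSoup(html, 'html.parser')
    movie_list = soup.find('ul', class_='picList clearfix')
    results = []
    for item in movie_list.find_all('li'):
        title = item.find('span', class_='sTit').get_text()
        poster = item.find('img')['src']
        results.append((title, poster))
    return results


for title, poster in list_movies(SAMPLE_HTML):
    print(title, poster)
```

Working against a small fragment like this makes it easy to check each find() call before running the scraper on the full page.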

IV. Walking Through the Code

Import the libraries we need:

import requests
from bs4 import BeautifulSoup

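Writing the main function:

def main():
    url = 'http://dianying.2345.com/top/'
    html = getHtml(url)   # fetch the page
    saveInfo(html)        # parse it and save the results

main()

This part is simple: pass in the URL, then call each function in turn.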

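Writing the getHtml() function:

def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()   # raise an error for 4xx/5xx responses
        r.encoding = 'gbk'     # the page is GB2312; GBK decodes it correctly
        return r.text
    except requests.RequestException:
        return 'An error occurred'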

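As usual, requests.get() fetches the page. At first I used r.encoding = r.apparent_encoding to fix the garbled text, but that turned out to be a trap: the page's actual encoding is GB2312. So here I set the encoding to GBK instead. GBK is a superset of GB2312, so it decodes GB2312 pages correctly. For the finer differences between the two, I will leave the searching to you. (Fine, I admit I am lazy.)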
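
To see why decoding a GB2312 page as GBK is safe, here is a small standalone sketch using only Python's built-in codecs. The helper names decode_page and fits_gb2312 are made up for this illustration and are not part of the scraper.

```python
def decode_page(raw_bytes, encoding='gbk'):
    """Decode raw page bytes; GBK also accepts bytes produced by GB2312."""
    return raw_bytes.decode(encoding)


def fits_gb2312(text):
    """Return True if every character of text can be encoded as GB2312."""
    try:
        text.encode('gb2312')
        return True
    except UnicodeEncodeError:
        return False


# Bytes written as GB2312 decode cleanly when read as GBK.
raw = '电影排行榜'.encode('gb2312')
print(decode_page(raw) == '电影排行榜')

# The traditional character below exists in GBK but not in GB2312.
print(fits_gb2312('电影'))
print(fits_gb2312('龍'))
```

In other words, choosing 'gbk' costs nothing for a GB2312 page and also covers characters that fall outside the smaller character set.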

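Also, remember to use the try…except structure! The try…except structure! The try…except structure! Important things get repeated as many times as I like. It makes your program more robust.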
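
One possible refinement of that try…except advice, shown here only as a sketch: catch requests.RequestException (the base class of the exceptions raised by requests) rather than everything, and return None so the caller can tell a failure apart from real HTML. The name fetch_html and the None convention are assumptions for this example, not the article's own code.

```python
import requests


def fetch_html(url, encoding='gbk', timeout=30):
    """Return the decoded page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as err:
        # Covers connection errors, timeouts, bad URLs and HTTP error codes.
        print('Request failed: {}'.format(err))
        return None
    response.encoding = encoding
    return response.text


# A URL without a scheme is rejected before any network traffic happens,
# so this call simply reports the failure and yields None.
html = fetch_html('not-a-valid-url')
if html is None:
    print('Nothing to parse')
```

Catching a specific exception class keeps unrelated bugs, such as a typo in a variable name, from being silently swallowed.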

Writing the saveInfo() function:

def saveInfo(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    move_ls = soup.find('ul', class_='picList clearfix')
    movies = move_ls.find_all('li')
    for top in movies:
        img_url = top.find('img')['src']  # poster URL of this movie
        name = top.find('span', class_='sTit').get_text()
        try:
            time = top.find('span', class_='sIntro').get_text()
        except AttributeError:  # find() returned None: no release date
            time = 'No release date available'

        try:
            actors = top.find('p', class_='pActor')
            actor = ''
            for act in actors.contents:
                actor = actor + act.string + ' '
        except (AttributeError, TypeError):  # no cast block, or an empty entry
            actor = 'No cast information available'

        # Prefer the hidden block, which holds the full synopsis
        if top.find('p', class_='pTxt pIntroHide'):
            intro = top.find('p', class_='pTxt pIntroHide').get_text()
        else:
            intro = top.find('p', class_='pTxt pIntroShow').get_text()
        print('Title: {}\t{}\n{}\n{}\n\n'.format(name, time, actor, intro))

        # Download the poster (the folder D:/movie_img/ must already exist)
        with open('D:/movie_img/' + name + '.png', 'wb+') as f:
            f.write(requests.get(img_url).content)

Note that some movies have no release date, and some are animated films and therefore list no actors. Once again the try…except structure handles these cases.
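
The same fallback idea can also be written without exceptions, by checking whether find() returned None. This is only a sketch under assumptions: SAMPLE_HTML is a hypothetical fragment reusing the class names from this article, and text_or_default and actor_names are illustrative helper names.

```python
from bs4 import BeautifulSoup

NO_DATE = 'No release date available'
NO_CAST = 'No cast information available'

# Hypothetical fragment: the second movie has neither a date nor a cast.
SAMPLE_HTML = """
<ul>
  <li>
    <span class="sTit">Movie A</span>
    <span class="sIntro">Released: 2017-01-01</span>
    <p class="pActor">Starring: <a>Actor One</a> <a>Actor Two</a></p>
  </li>
  <li>
    <span class="sTit">Movie B</span>
  </li>
</ul>
"""


def text_or_default(parent, name, class_name, default):
    """Return the text of the first matching tag, or default if absent."""
    node = parent.find(name, class_=class_name)
    return node.get_text(strip=True) if node is not None else default


def actor_names(parent, default):
    """Join every non-empty string inside the cast block with spaces."""
    node = parent.find('p', class_='pActor')
    if node is None:
        return default
    names = list(node.stripped_strings)
    return ' '.join(names) if names else default


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [
    (text_or_default(li, 'span', 'sTit', ''),
     text_or_default(li, 'span', 'sIntro', NO_DATE),
     actor_names(li, NO_CAST))
    for li in soup.find_all('li')
]
for row in rows:
    print(row)
```

Whether to use try…except or an explicit None check is largely a matter of taste; the explicit check makes it clearer which missing element triggered the default.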

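While running the script I also noticed that some movie synopses came out incomplete. Alarmed, I hurried back to the page structure and found the culprit:

Because some synopses are too long, those movies use an 'expand all' drop-down effect that hides part of the text, and the complete text is wrapped in a separate HTML element, as shown below:

The real synopsis sits in the p element with class 'pTxt pIntroHide', but some movies do not have this element, so an if statement decides which one to extract from.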

if top.find('p', class_='pTxt pIntroHide'):
    intro = top.find('p', class_='pTxt pIntroHide').get_text()
else:
    intro = top.find('p', class_='pTxt pIntroShow').get_text()
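
The same choice can be exercised on a small fragment. In the sketch below, SAMPLE_HTML is hypothetical markup that only mimics the two class names from this article, and full_intro is an illustrative helper; it looks each element up once instead of calling find() twice.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first movie has a truncated and a full synopsis,
# the second has only the short, always-visible one.
SAMPLE_HTML = """
<ul>
  <li>
    <p class="pTxt pIntroShow">A long synopsis that is cut...</p>
    <p class="pTxt pIntroHide">A long synopsis that is cut off on the page.</p>
  </li>
  <li>
    <p class="pTxt pIntroShow">A short synopsis.</p>
  </li>
</ul>
"""


def full_intro(item):
    """Prefer the hidden full synopsis; fall back to the visible one."""
    node = item.find('p', class_='pTxt pIntroHide')
    if node is None:
        node = item.find('p', class_='pTxt pIntroShow')
    return node.get_text(strip=True) if node is not None else ''


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
intros = [full_intro(li) for li in soup.find_all('li')]
for intro in intros:
    print(intro)
```

Note that the class string must match the attribute exactly, including the space between pTxt and pIntroHide.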

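Finally, we request the poster URL extracted earlier and use with open to save the image in the movie_img folder on drive D. Because an image is a binary file, it is written in 'wb+' mode.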
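
Two things can go wrong at this step: the target folder may not exist, and a movie title may contain characters that are not allowed in Windows file names. The sketch below shows one way to guard against both. The helpers safe_filename and save_image are hypothetical names introduced for this example, and plain 'wb' mode is used because the file is only written, not read back.

```python
import os
import re


def safe_filename(name):
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


def save_image(directory, name, data, extension='.png'):
    """Write raw image bytes to directory/name.extension and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, safe_filename(name) + extension)
    with open(path, 'wb') as f:
        f.write(data)
    return path


# Example call (assumes img_url and name come from the parsing step):
# save_image('D:/movie_img', name, requests.get(img_url).content)
```

Keeping the file handling in its own function also makes it easy to test without downloading anything.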

V. Full Source Code

# -*- coding: utf-8 -*-
"""
Created on Wed Aug 16 14:31:23 2017

@author: 追梦囚徒
"""

import requests
from bs4 import BeautifulSoup

def getHtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()   # raise an error for 4xx/5xx responses
        r.encoding = 'gbk'     # the page is GB2312; GBK decodes it correctly
        return r.text
    except requests.RequestException:
        return 'An error occurred'

def saveInfo(html):
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    move_ls = soup.find('ul', class_='picList clearfix')
    movies = move_ls.find_all('li')
    for top in movies:
        img_url = top.find('img')['src']  # poster URL of this movie
        name = top.find('span', class_='sTit').get_text()
        try:
            time = top.find('span', class_='sIntro').get_text()
        except AttributeError:  # find() returned None: no release date
            time = 'No release date available'

        try:
            actors = top.find('p', class_='pActor')
            actor = ''
            for act in actors.contents:
                actor = actor + act.string + ' '
        except (AttributeError, TypeError):  # no cast block, or an empty entry
            actor = 'No cast information available'

        # Prefer the hidden block, which holds the full synopsis
        if top.find('p', class_='pTxt pIntroHide'):
            intro = top.find('p', class_='pTxt pIntroHide').get_text()
        else:
            intro = top.find('p', class_='pTxt pIntroShow').get_text()
        print('Title: {}\t{}\n{}\n{}\n\n'.format(name, time, actor, intro))

        # Download the poster (the folder D:/movie_img/ must already exist)
        with open('D:/movie_img/' + name + '.png', 'wb+') as f:
            f.write(requests.get(img_url).content)

def main():
    url = 'http://dianying.2345.com/top/'
    html = getHtml(url)
    saveInfo(html)

main()

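All done! That wraps up this round of sharing. Getting started with Python really is simple, and Python really can do a lot: data analysis, data mining, machine learning, back-end web development, and more. The hard part of Python lies in how you think and how much effort you put in; you only improve by practicing and writing code constantly. That is probably true of learning any programming language. Finally, I hope this article helps me consolidate what I have learned and, even more, helps others sort out their own thinking. If it contains mistakes or shortcomings, please bear with me.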

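Finally, a quick plug for my official account! I regularly post news about big-data development in Gui'an New Area, Guizhou Province, along with practical learning material on R and Python. I look forward to your follow!

数聚贵安 (shujuguian)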
