Scraping the Douban Movie Top 250 with requests and BeautifulSoup: code and problems encountered

2015-09-06 10:10
The initial code was:
# -*-coding:utf8-*-
import requests
from bs4 import BeautifulSoup

url='http://movie.douban.com/top250'
html=requests.get(url)
soup=BeautifulSoup(html)
print soup.title
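Running it raised an error, along with a warning (omitted).

I then changed the code as follows, which solved the problem. requests.get returns a Response object, while BeautifulSoup expects the markup itself, so the fix passes html.text (the decoded response body). Naming the parser explicitly ("lxml") also avoids Beautiful Soup's warning that no parser was specified.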
# -*-coding:utf8-*-
import requests
from bs4 import BeautifulSoup

url='http://movie.douban.com/top250'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
print soup.title
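
To illustrate the fix without touching the network, here is a minimal sketch that parses a hard-coded string the same way the corrected code parses html.text. The sample markup and the variable names are assumptions made for illustration, not Douban's real page. It uses Python 3 syntax and the built-in "html.parser" so that it runs without lxml installed, whereas the post's own code is Python 2 with "lxml".

```python
from bs4 import BeautifulSoup

# Stand-in for `requests.get(url).text`: a plain string of markup (hypothetical sample).
sample_html = "<html><head><title>Sample page</title></head><body></body></html>"

# Pass the markup string, not a Response object, and name the parser explicitly.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string
print(page_title)  # prints: Sample page
```

With a real response the only difference would be that the string comes from the .text attribute of the object returned by requests.get.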
The final program:
#!/usr/bin/env python
# -*-coding:utf8-*-
import requests
import sys
from bs4 import BeautifulSoup
reload(sys)
sys.setdefaultencoding("utf-8")

# Collect the movie titles
def get_movie(soup,name):
    titles=soup.find_all(class_="title")
    for title in titles:
        if title.string[1]!='/':                   # skip alternate titles ('/' is their second character)
            name.append(title.string)
    return name

# Collect the rankings and scores
def get_number_score(soup,number,score):
    number_score=soup.find_all('em')
    for i in range(len(number_score)):
        if i%2==0:                                 # even positions are taken as the rank
            number.append(number_score[i].string)
        else:                                      # odd positions are taken as the score
            score.append(number_score[i].string)
    return number,score

name=[];number=[];score=[]                         # initialize the result lists
f=open('movie.txt','w')
# Fetch all ten pages of the Douban Top 250 (25 movies per page)
for i in range(10):
    url='http://movie.douban.com/top250?start=%s&filter=&type=' %(i*25)
    html=requests.get(url).text
    soup=BeautifulSoup(html,"lxml")
    name=get_movie(soup,name)
    (number,score)=get_number_score(soup, number, score)
# Write the results to the file
for j in range(len(name)):
    title_str='%s %s %s' %(number[j],name[j],score[j])
    f.write(title_str+'\n')

f.close()
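
Two pieces of the program above can be tried in isolation, with no network access: building the ten page URLs from the start offset, and splitting an alternating rank/score sequence. The sketch below is only an illustration in Python 3 syntax. The helper names page_urls and split_alternating are hypothetical, and the slicing approach is a swapped-in alternative to the index-parity loop in get_number_score, not the author's code.

```python
def page_urls(base="http://movie.douban.com/top250", pages=10, page_size=25):
    """Return one URL per result page, using the same query string as the program above."""
    return ["%s?start=%d&filter=&type=" % (base, i * page_size) for i in range(pages)]


def split_alternating(values):
    """Split [rank, score, rank, score, ...] into (ranks, scores) by position."""
    return values[0::2], values[1::2]


urls = page_urls()
print(urls[0])    # offset 0 is the first page
print(urls[-1])   # offset 225 is the tenth page

# "1" and "9.6" come from the sample output; "n2" and "s2" are placeholders.
ranks, scores = split_alternating(["1", "9.6", "n2", "s2"])
print(ranks, scores)
```

Slicing with a step of 2 does the same job as testing i%2 inside a loop, but it returns new lists instead of appending to lists passed in by the caller.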

The program exports the Douban Top 250 movies in the following format:
1 肖申克的救赎 9.6
x xxxxxxx x.x
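
As a rough sketch of how lines in this format could be produced under Python 3, the snippet below writes "rank title score" rows with an explicit UTF-8 encoding and a with block, so neither the sys.setdefaultencoding workaround nor a manual f.close() is needed. The function name write_ranking and the temporary file are assumptions for illustration; the single row is the one shown above.

```python
import os
import tempfile


def write_ranking(path, numbers, names, scores):
    """Write one 'rank title score' line per movie, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for number, name, score in zip(numbers, names, scores):
            f.write("%s %s %s\n" % (number, name, score))


# Demo: write the one known row to a temporary movie.txt and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "movie.txt")
write_ranking(demo_path, ["1"], ["肖申克的救赎"], ["9.6"])
with open(demo_path, encoding="utf-8") as f:
    written = f.read()
print(written, end="")
```

Iterating with zip stops at the shortest list, so a page that yields fewer scores than titles produces fewer rows instead of raising an IndexError.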