
Python crawler notes, part 1

2013-03-29 23:09
Crawl the diameizi site with Python and download its images.

Python environment: 2.7.3

Code: https://gist.github.com/zjjott/5270366

Author's discussion thread: http://tieba.baidu.com/p/2239765168?fr=itb_feed_jing#30880553662l

Target site to scrape for images: http://diameizi.diandian.com/

#coding=utf-8
import os
# Let wget spider the whole site tree and log every URL it finds -- essentially a shortcut
os.system("wget -r --spider http://diameizi.diandian.com 2>|log.txt")
filein = open('log.txt', 'r')
fileout = open('dst', 'w+')  # scratch file that ends up holding the final URL set
filelist = list(filein)
import urllib2
from bs4 import BeautifulSoup
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    req = urllib2.Request(url, None, header)
    site = urllib2.urlopen(req)
    return site.read()  # these few lines are all-purpose fetch boilerplate

try:
    dst = set()
    for p in filelist:
        if p.find('http://diameizi.diandian.com/post') > -1:
            # keep only the URL token; wget's log lines carry trailing text
            p = p[p.find('http'):].split()[0]
            dst.add(p)
    i = 0
    for p in dst:
        #if i < 191:
        #    i += 1
        #    continue  # crude resume-from-breakpoint support
        pagesoup = BeautifulSoup(getsite(p))
        pageimg = pagesoup.find_all('img')
        for href in pageimg:
            print i, href['src']
            # the way the filename is built is questionable, but the result is acceptable
            picpath = "pic/" + href['src'][-55:-13] + href['src'][-4:]
            pic = getsite(href['src'])
            picfile = open(picpath, 'wb')
            picfile.write(pic)
            i += 1
            picfile.close()
finally:
    for p in dst:
        fileout.write(p + '\n')
    fileout.close()
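The fixed-offset slicing used for picpath (href['src'][-55:-13] + href['src'][-4:]) breaks as soon as an image URL has a different length. A safer sketch, using urlparse plus os.path.basename, works on both 2.7 and 3.x (the example URL is hypothetical, not taken from the real site):

```python
import os
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def pic_filename(src):
    """Derive a local filename from an image URL instead of slicing fixed offsets."""
    path = urlparse(src).path      # drop scheme, host, and any query string
    return os.path.basename(path)  # last path component, e.g. 'abc123.jpg'

# hypothetical example URL
print(pic_filename("http://example.com/uploads/2013/abc123.jpg"))  # abc123.jpg
```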


The log.txt generated above looks roughly like this:

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:10--  http://diameizi.diandian.com/ Resolving diameizi.diandian.com (diameizi.diandian.com)... 113.31.29.120, 113.31.29.121
Connecting to diameizi.diandian.com (diameizi.diandian.com)|113.31.29.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30502 (30K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:11--  http://diameizi.diandian.com/ Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `diameizi.diandian.com/index.html'

0K .......... .......... .........                        94.6K=0.3s

2013-03-29 23:00:12 (94.6 KB/s) - `diameizi.diandian.com/index.html' saved [30502]

Loading robots.txt; please ignore errors.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/robots.txt Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 209 [text/plain]
Saving to: `diameizi.diandian.com/robots.txt'

0K                                                       100% 20.8M=0s

2013-03-29 23:00:12 (20.8 MB/s) - `diameizi.diandian.com/robots.txt' saved [209/209]

Removing diameizi.diandian.com/robots.txt.
Removing diameizi.diandian.com/index.html.

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/rss Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
Remote file exists but does not contain any link -- not retrieving.

Removing diameizi.diandian.com/rss.
unlink: No such file or directory

Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 82303 (80K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive Reusing existing connection to diameizi.diandian.com:80.


The script then mines this text file for the post URLs it needs.
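That extraction step can be sketched against sample log lines in the shape wget writes; a regex is a bit more robust than string slicing. The /post URL below is a hypothetical example, since the excerpt above only shows non-post pages:

```python
import re

# Sample lines shaped like wget's log output
log_lines = [
    "--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive Reusing existing connection",
    "--2013-03-29 23:01:02--  http://diameizi.diandian.com/post/2013/some-post",
    "Length: 82303 (80K) [text/html]",
]

post_urls = set()
for line in log_lines:
    # match only post URLs, stopping at the first whitespace
    m = re.search(r'http://diameizi\.diandian\.com/post\S*', line)
    if m:
        post_urls.add(m.group(0))

print(post_urls)  # only the /post URL survives
```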

The code above has not been tested successfully yet, probably because of the 2.7.3 environment; the examples it was based on appear to target Python 3.x, so there are some discrepancies.
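For anyone on Python 3, the getsite helper translates almost mechanically: urllib2 became urllib.request, and read() now returns bytes. A minimal, untested-against-the-site sketch:

```python
import urllib.request

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}

def build_request(url):
    # urllib2.Request -> urllib.request.Request in Python 3
    return urllib.request.Request(url, None, header)

def getsite(url):
    # urllib2.urlopen -> urllib.request.urlopen; returns bytes, so decode HTML before parsing
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read()

# No network needed to check the request is built with our header
req = build_request("http://diameizi.diandian.com/")
print(req.full_url)
print(req.get_header("User-agent"))
```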