Python Crawler Learning, Part 1
2013-03-29 23:09
Crawling the diameizi site with Python and downloading its images.
Python environment: 2.7.3
Code: https://gist.github.com/zjjott/5270366
Author discussion thread: http://tieba.baidu.com/p/2239765168?fr=itb_feed_jing#30880553662l
Target image site: http://diameizi.diandian.com/
```python
#coding=utf-8
import os
# Grab the whole site tree with wget's spider mode -- essentially a shortcut
os.system("wget -r --spider http://diameizi.diandian.com 2>|log.txt")
filein = open('log.txt', 'r')
fileout = open('dst', 'w+')  # file that ends up holding the final list of post URLs
filelist = list(filein)
import urllib2
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    req = urllib2.Request(url, None, header)
    site = urllib2.urlopen(req)
    return site.read()  # these few lines are boilerplate for fetching any page

dst = set()  # defined before the try block so the finally clause can always see it
try:
    for p in filelist:
        if p.find('http://diameizi.diandian.com/post') > -1:
            p = p[p.find('http'):].strip()  # strip() drops the trailing newline from the log line
            dst.add(p)
    i = 0
    for p in dst:
        # if i < 191:
        #     i += 1
        #     continue  # resume-after-interruption section
        pagesoup = BeautifulSoup(getsite(p))
        pageimg = pagesoup.find_all('img')
        for href in pageimg:
            print i, href['src']
            # The naming scheme is crude, but the result is acceptable
            picpath = "pic/" + href['src'][-55:-13] + href['src'][-4:]
            pic = getsite(href['src'])
            picfile = open(picpath, 'wb')
            picfile.write(pic)
            i += 1
            picfile.close()
finally:
    for p in dst:
        fileout.write(p + '\n')
    fileout.close()
```
The log.txt produced above looks roughly like this:
```
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:10--  http://diameizi.diandian.com/
Resolving diameizi.diandian.com (diameizi.diandian.com)... 113.31.29.120, 113.31.29.121
Connecting to diameizi.diandian.com (diameizi.diandian.com)|113.31.29.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30502 (30K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:11--  http://diameizi.diandian.com/
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `diameizi.diandian.com/index.html'

     0K .......... .......... .........   94.6K=0.3s

2013-03-29 23:00:12 (94.6 KB/s) - `diameizi.diandian.com/index.html' saved [30502]

Loading robots.txt; please ignore errors.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/robots.txt
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 209 [text/plain]
Saving to: `diameizi.diandian.com/robots.txt'

     0K                                   100% 20.8M=0s

2013-03-29 23:00:12 (20.8 MB/s) - `diameizi.diandian.com/robots.txt' saved [209/209]

Removing diameizi.diandian.com/robots.txt.
Removing diameizi.diandian.com/index.html.
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/rss
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
Remote file exists but does not contain any link -- not retrieving.
Removing diameizi.diandian.com/rss.
unlink: No such file or directory
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 82303 (80K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.
```
The script then picks out the post URLs it needs from this text file.
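The URL-filtering step can be sketched in isolation. A minimal version, assuming log lines shaped like the wget output above (the helper name `extract_post_urls` is mine, not from the original script):

```python
def extract_post_urls(log_lines):
    """Collect unique post URLs from wget spider-mode log lines."""
    prefix = 'http://diameizi.diandian.com/post'
    urls = set()
    for line in log_lines:
        pos = line.find(prefix)
        if pos > -1:
            # Keep everything from 'http' onward, up to the next whitespace
            urls.add(line[pos:].split()[0])
    return urls
```

Using a set here deduplicates for free, since wget logs the same URL several times (once when checking, once when retrieving).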
The code above has not run successfully yet, probably because of the 2.7.3 platform. The example it was adapted from appears to target Python 3.x, so there are some discrepancies.
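For reference, the Python 2-specific parts map onto the Python 3 standard library fairly directly. A minimal, untested sketch of the two pieces that change (`urllib2` becomes `urllib.request`, and the filename slicing is pulled into a helper for clarity -- `pic_name` is my name, not the original author's):

```python
import urllib.request  # Python 3 replacement for urllib2

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) '
                         'Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    """Fetch a URL and return its raw bytes (urllib2.Request -> urllib.request.Request)."""
    req = urllib.request.Request(url, None, HEADERS)
    with urllib.request.urlopen(req) as site:
        return site.read()

def pic_name(src):
    """Derive a local path from an image URL, using the same crude slicing as the original."""
    return "pic/" + src[-55:-13] + src[-4:]
```

The `print i, href['src']` statements would also need to become `print(i, href['src'])` calls; everything else (BeautifulSoup, sets, file handling) works unchanged in Python 3.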