Playing with Python crawlers: urllib2, advanced topics

2016-06-17 09:40

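Earlier we built a headers dict and passed it in when constructing the request. Some servers, however, check whether the Referer in the headers points to their own site, and refuse to respond if it does not. To get around this anti-hotlinking check, we can add a Referer to the headers, as follows: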

import urllib
import urllib2

url = "http://..."
values = {"username": "1357200562@qq.com", "password": "123"}
data = urllib.urlencode(values)   # encode the form fields as the POST body
user_agent = "Mozilla/4.0..."
# The Referer makes the request look as if it came from the site itself
headers = {"User-Agent": user_agent, "Referer": "http://..."}
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
print response.read()
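
For readers on Python 3, where the functionality of urllib2 lives in urllib.request, the same idea can be sketched roughly as below. The helper name build_request, the example.com URLs, and the form values are placeholders chosen for illustration and do not come from the original post. The request is only constructed here, not sent, so the headers can be inspected first.

```python
import urllib.parse
import urllib.request


def build_request(url, form, user_agent, referer):
    """Build a POST request carrying User-Agent and Referer headers.

    Hypothetical helper for illustration; nothing is sent until the
    request is passed to urllib.request.urlopen().
    """
    data = urllib.parse.urlencode(form).encode("utf-8")  # body must be bytes
    headers = {"User-Agent": user_agent, "Referer": referer}
    return urllib.request.Request(url, data=data, headers=headers)


request = build_request(
    "http://example.com/login",                  # placeholder URL
    {"username": "user", "password": "secret"},  # placeholder form fields
    "Mozilla/5.0",
    "http://example.com/",                       # Referer pointing at the site itself
)
```

Passing the resulting object to urllib.request.urlopen() would send it. Whether a given server accepts the Referer depends on that server's checks.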

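Some other header fields:

        Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content of the HTTP body

        application/xml: used for XML RPC calls, such as RESTful/SOAP calls

        application/json: used for JSON RPC calls

        application/x-www-form-urlencoded: used when a browser submits a web form

        When using a RESTful or SOAP service provided by a server, a wrong Content-Type setting causes the server to refuse the request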
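
To make the link between body encoding and Content-Type concrete, here is a minimal sketch, assuming Python 3 and urllib.request. It encodes a payload once as a web form and once as JSON, labeling each with the matching Content-Type. The helper names form_request and json_request are hypothetical, and no request is actually sent.

```python
import json
import urllib.parse
import urllib.request


def form_request(url, fields):
    """POST fields the way a browser form would (hypothetical helper)."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(url, data=body, headers=headers)


def json_request(url, payload):
    """POST a JSON document and label it as JSON (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers)
```

The point of the design is that the header and the encoder are chosen together, so the body never disagrees with its declared type.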

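Proxy settings:

By default urllib2 uses the environment variable http_proxy to set the HTTP proxy. Suppose a website counts how many requests a single IP makes within a period of time and blocks you once there are too many. You can set up proxy servers to do the work for you, switching to a different proxy every so often, as follows: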

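import urllib2

enable_proxy = True
# Route HTTP traffic through the given proxy
proxy_handler = urllib2.ProxyHandler({"http": "http://some-proxy.com:8080"})
# An empty dict disables proxies, including any taken from the environment
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
# Make this opener the default used by urllib2.urlopen()
urllib2.install_opener(opener)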
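
The snippet above installs a single proxy. Switching proxies every so often could be sketched as follows, assuming Python 3 and urllib.request. The helper name rotating_openers and the proxy addresses are hypothetical placeholders, and the rotation policy (simple round-robin) is an assumption rather than something the original post specifies.

```python
import itertools
import urllib.request


def rotating_openers(proxy_urls):
    """Yield openers forever, each bound to the next proxy in the list.

    Hypothetical helper: the proxy addresses are supplied by the caller.
    """
    for proxy_url in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler({"http": proxy_url})
        yield urllib.request.build_opener(handler)


# Placeholder addresses; replace them with proxies you are allowed to use.
openers = rotating_openers(["http://proxy-a.example:8080",
                            "http://proxy-b.example:8080"])
opener = next(openers)   # call next(openers) again to move to the next proxy
```

Each opener can be used directly through opener.open(url), which avoids changing the global default with install_opener on every switch.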

Timeout settings:

The timeout sets how long to wait for the server, so that a slow response does not hold up the rest of the program.

import urllib2

url = "http://..."
response = urllib2.urlopen(url, timeout=30)   # timeout in seconds
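
Setting a timeout only helps if the program also handles the exception that a slow server triggers. A minimal sketch, assuming Python 3 and urllib.request, might look like this. The helper name fetch_or_none is hypothetical, and returning None on timeout is one possible policy among several.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=30):
    """Return the response body, or None if the server is too slow.

    Hypothetical helper for illustration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except socket.timeout:
        # Raised when the connection is open but the reply takes too long.
        return None
    except urllib.error.URLError as error:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(error.reason, socket.timeout):
            return None
        raise
```

Other failures, such as a refused connection, are re-raised so that they are not mistaken for a slow server.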

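Using the HTTP PUT and DELETE methods:

PUT and POST are broadly similar: both send data to the server. With PUT the client usually specifies where the resource is stored, whereas with POST the server decides the location. DELETE removes a resource.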

import urllib2

# url and data are assumed to be defined as in the earlier examples
request = urllib2.Request(url, data=data)
request.get_method = lambda: "PUT"   # or "DELETE"
response = urllib2.urlopen(request)
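
In Python 3 (3.3 and later), urllib.request.Request accepts a method argument, so overriding get_method is no longer needed. The sketch below assumes Python 3; the URL and body are placeholders, and the requests are only constructed, not sent.

```python
import urllib.request

url = "http://example.com/resource/1"   # placeholder URL

# PUT carries a body; the data must be bytes.
put_request = urllib.request.Request(url, data=b"new content", method="PUT")

# DELETE normally has no body.
delete_request = urllib.request.Request(url, method="DELETE")
```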

Using DebugLog:
The debug log can be switched on as follows to make debugging easier. It is not used very often:
import urllib2

# debuglevel=1 prints the request and response headers as they are exchanged
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen("http://....")

URLError:

We use a try-except statement to catch the exceptions HTTPError and URLError. HTTPError is a subclass of URLError, so it has to be caught first:

import urllib2

request = urllib2.Request("http://...")
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    # The server answered, but with an error status code
    print e.code
except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
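
The same pattern can be wrapped in a reusable function. The sketch below assumes Python 3, where the exceptions live in urllib.error. The helper names describe_failure and fetch, and the wording of the messages, are hypothetical.

```python
import urllib.error
import urllib.request


def describe_failure(error):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so test for it first.
    if isinstance(error, urllib.error.HTTPError):
        return "HTTP error %d" % error.code
    if isinstance(error, urllib.error.URLError):
        return "failed to reach server: %s" % error.reason
    raise TypeError("not a urllib error")


def fetch(url):
    """Return the body, or a description of what went wrong."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.URLError as error:
        # Catches HTTPError too, since it derives from URLError.
        return describe_failure(error)
```

Keeping the classification in its own function makes it possible to check the logic with hand-built exception objects, without touching the network.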
