
The urllib2 Library: A Translation of the Official Documentation

2015-10-21 18:37

Tags: translation

Author: Michael Foord

Introduction:

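urllib2 is a Python module for fetching URLs. It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many URL schemes (a scheme is identified by the string before the colon; for example, ftp is the scheme of ftp://python.org/), using the network protocol associated with each scheme (such as FTP or HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations, urlopen is very easy to use. But as soon as you run into errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference for HTTP is RFC 2616. It is a technical document and is not meant to be easy to read. This tutorial aims to illustrate the use of urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 documentation, but to supplement it.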

Fetching URLs:

The simplest way to use urllib2 is as follows:

import urllib2
response=urllib2.urlopen('http://www.python.org/')
html=response.read()

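Many uses of urllib2 will be that simple (note that instead of an http: URL we could have used a URL starting with ftp:, file:, and so on). However, this tutorial concentrates on HTTP and aims to explain the more complicated cases.

HTTP is based on requests and responses: the client makes requests and the server sends responses. urllib2 mirrors this with a Request object, which represents the HTTP request you are making. In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can, for example, call read() on it. (Translator's note: for background on file-like objects, see duck typing and polymorphism.)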

import urllib2

url = 'http://www.voidspace.org.uk'  # the URL we will fetch
req = urllib2.Request(url)           # create the request object
# next, pass the request to urlopen
response = urllib2.urlopen(req)
# the response is file-like, so read() returns the content fetched from the URL
the_page = response.read()

Note that urllib2 uses the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

req=urllib2.Request('ftp://example.com/')
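A Request object can be built and inspected without touching the network, which makes it easy to check what is about to be sent. The following is a minimal sketch rather than part of the original tutorial. The fallback import of urllib.request is an assumption added so that the snippet also runs on Python 3, where urllib2 was merged into that module.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

# Build requests for two different URL schemes through the same interface.
http_req = urllib2.Request('http://www.voidspace.org.uk')
ftp_req = urllib2.Request('ftp://example.com/')

# Nothing has been sent yet; the objects only describe the requests.
for req in (http_req, ftp_req):
    print(req.get_full_url())
```

Only a call to urlopen (or to an opener's open method) actually contacts the server.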

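In the case of HTTP, Request objects allow you to do two extra things. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data, or about the request itself, to the server. This information is sent as HTTP headers. Let's look at each of these in turn.

Data:

Sometimes you want to send data to a URL (often the URL refers to a CGI script or another web application). With HTTP, this is usually done with a POST request. This is what your browser often does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is not done by urllib2 itself but by the urlencode function from the urllib library.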

import urllib
import urllib2

# the URL we will post to
url = 'http://www.voidspace.com.uk/'
# the form fields to send, stored in a dict
dict_data = {'user_name': 'aibilim', 'password': 'xxxx', 'language': 'python'}
# encode dict_data in the standard form format before passing it to Request
data_pass = urllib.urlencode(dict_data)
# passing a data argument makes this a POST request
req = urllib2.Request(url, data_pass)
# send the request and get the response
response = urllib2.urlopen(req)
the_page = response.read()
# print the page
print the_page
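To make "encoded in a standard way" concrete, the sketch below prints what urlencode produces. It is an illustration only: the field values are sample data, and the fallback import from urllib.parse is an assumption so that the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # assumption: Python 3 fallback

# A list of pairs keeps the field order predictable.
fields = [('user_name', 'aibilim'), ('language', 'python')]
encoded = urlencode(fields)
print(encoded)

# Characters with a special meaning in URLs are escaped.
escaped = urlencode([('q', 'a b&c')])
print(escaped)
```

The result is a single string of name=value pairs joined by &, which is the form that Request expects for its data argument.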

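Note that other encodings are sometimes required.

If you do not pass the data argument, urllib2 uses a GET request.

One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way. The HTTP standard makes it clear that POSTs are intended to always cause side effects and that GET requests should never cause them, but nothing prevents a GET request from having side effects, nor a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.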

import urllib
import urllib2

url = 'http://www.baidu.com'
data_dict = {}  # an empty dict
# add the query fields to the dict
data_dict['user_name'] = 'aibilim'
data_dict['password'] = 'xxxx'
data_dict['language'] = 'Python'
# encode data_dict in the standard way
data_pass = urllib.urlencode(data_dict)
# print the URL before appending the encoded data
print url
# append the encoded data after a '?'
new_url = url + '?' + data_pass
# build the request; no data argument is passed, so this is a GET
req = urllib2.Request(new_url)
response = urllib2.urlopen(req)
the_page = response.read()
print new_url
print the_page
# remember: the dict must be encoded before it is used in a POST or GET request

Notice that the new URL is built by appending a ? to the original URL, followed by the encoded data.
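Whether urllib2 sends a GET or a POST depends only on whether the Request carries a data argument, and this can be checked offline with get_method. The sketch below is illustrative; the fallback imports are an assumption so that it also runs on Python 3.

```python
try:
    import urllib2
    from urllib import urlencode
except ImportError:
    # assumption: Python 3 fallback
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.baidu.com'
encoded = urlencode([('language', 'Python')])

# Data in the URL, no data argument: a GET request.
get_req = urllib2.Request(url + '?' + encoded)
# Same data passed as the data argument: a POST request.
# encode() yields bytes, which Python 3 requires for the request body.
post_req = urllib2.Request(url, encoded.encode('ascii'))

print(get_req.get_method())
print(post_req.get_method())
```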

Translated on 2015-10-18 23:15:43. To be continued.

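Headers:

We will look at HTTP headers here to show how to add headers to your HTTP request. Some websites [2] dislike being browsed by programs, or send different versions of a page to different browsers. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release; mine, for example, is Python-urllib/2.7), which may confuse the site or simply not work (translator's note: I take this to mean that the site does not answer our request). A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a browser.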

import urllib
import urllib2

url = 'http://www.baidu.com'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
data_dict = {'user_name': 'aibilim', 'password': 'xxxx', 'language': 'Python'}
# build the headers; Request expects a dict
header = {'User-Agent': user_agent}
# encode data_dict with urllib.urlencode
data_pass = urllib.urlencode(data_dict)
# create the request from the URL, the data, and the headers
req = urllib2.Request(url, data_pass, header)
response = urllib2.urlopen(req)
the_page = response.read()
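Headers can be checked on the Request object before anything is sent. The sketch below is an illustration, not part of the original tutorial; the Accept-Language header is a made-up addition, and the fallback import is an assumption for Python 3. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'.

```python
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
req = urllib2.Request('http://www.baidu.com',
                      headers={'User-Agent': user_agent})
# Headers can also be added after the request has been created.
req.add_header('Accept-Language', 'en')

# Request stores header names capitalized as 'User-agent',
# so look them up using that spelling.
print(req.get_header('User-agent'))
print(req.has_header('Accept-language'))
```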

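The response object also has two useful methods. They are described in the section on info and geturl, which we will come to after looking at exception handling.

Handling Exceptions:

urlopen raises URLError when it cannot handle a response (built-in exceptions such as ValueError and TypeError may also be raised). HTTPError is the subclass of URLError that is raised in the specific case of HTTP URLs.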
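The relationship between the two exception classes can be confirmed directly. This is a small sketch under the assumption that the fallback import from urllib.error is acceptable on Python 3; the error message is made up for illustration.

```python
try:
    from urllib2 import URLError, HTTPError
except ImportError:
    from urllib.error import URLError, HTTPError  # assumption: Python 3 fallback

# HTTPError derives from URLError, so "except URLError" also catches it.
print(issubclass(HTTPError, URLError))

# A URLError carries the cause of the failure in its reason attribute.
# The message here is made up for illustration.
err = URLError('no route to host (simulated)')
print(err.reason)
```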

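URLError:

Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a reason attribute (translator's note: I take the name to mean that it explains the error), which is a tuple containing an error code and a text error message.

For example: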

import urllib2

req = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason
# the result:
# [Errno 11002] getaddrinfo failed

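PS: the URL here differs from the one in the documentation. When I ran the example, the documentation's http://www.pretend_server.org link returned content instead of raising an exception, so I replaced it. The documentation also omits the urllib2 prefix in the calls.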

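HTTPError:

Every HTTP response from the server contains a numeric status code. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a redirection that asks the client to fetch the resource from a different URL, urllib2 handles that for you). For those it cannot handle, urlopen raises an HTTPError. Typical errors include 404 (page not found), 403 (request forbidden), and 401 (authentication required).

All the HTTP error codes are listed in section 10 of RFC 2616.

The HTTPError instance raised has an integer code attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers deal with redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary that maps response codes to their descriptions. It is reproduced here for convenience: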

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),

200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),

300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),

400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),

500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}

#####################################################
#  Self-explanatory, so not translated one by one   #
#####################################################
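The dictionary can be used to turn a numeric code into a readable message. The sketch below is illustrative: describe is a hypothetical helper name, and the fallback import from http.server is an assumption for Python 3, where the BaseHTTPServer module was renamed.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler  # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler  # assumption: Python 3 fallback

def describe(code):
    """Return 'code short-message', or a fallback for unknown codes."""
    entry = BaseHTTPRequestHandler.responses.get(code)
    if entry is None:
        return '%d (unknown status code)' % code
    # entry is a (shortmessage, longmessage) pair
    return '%d %s' % (code, entry[0])

print(describe(404))
print(describe(299))
```

Using get with a fallback avoids a KeyError when a server sends a code that the table does not list.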

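When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response for the page returned (just as response represented the fetched page earlier). This means that, as well as the code attribute, it also has the read, geturl, and info methods.

Note that URLError, mentioned above, has a reason attribute; do not confuse the two.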

import urllib2

request = urllib2.Request('http://www.python.org/fish.html')
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    print e.code
    print e.read()
# part of the result:
# 404
# <!doctype html>
# <!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
# <!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
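The fact that an HTTPError behaves like a response can be tried out without a server by constructing one by hand. This is only a sketch: the error body is made up, building the exception manually is a stand-in for what urlopen would raise, and the fallback import is an assumption for Python 3.

```python
import io
try:
    from urllib2 import HTTPError
except ImportError:
    from urllib.error import HTTPError  # assumption: Python 3 fallback

# Simulate the error a server might send; the body is made up.
body = io.BytesIO(b'<html>no fish here</html>')
error = HTTPError('http://www.python.org/fish.html', 404, 'Not Found', {}, body)

try:
    raise error
except HTTPError as e:
    status = e.code
    page = e.read()          # the error page, read like a normal response
    final_url = e.geturl()   # the URL the error belongs to

print(status)
print(page)
print(final_url)
```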

Solutions:

There are two basic approaches to handling HTTPError and URLError. I recommend the second.

1.

import urllib2

url = 'http://www.xxxxx.com'
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # the server answered, but with an error status
    print e.code
    print 'we can not fulfill the request \n'
except urllib2.URLError as e:
    # no HTTP response at all
    print e.reason
    print 'we can not reach a server'
else:
    print('No problem')

Note that except HTTPError must come first. HTTPError is a subclass of URLError, so if except URLError came first it would also catch every HTTPError.

2.

import urllib2

url = 'http://www.python.org/fish.html'
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    # check 'code' first: only HTTPError has it, whereas some versions
    # of HTTPError also provide a reason attribute
    if hasattr(e, 'code'):
        print 'we can not fulfill the request '
        print 'The error code is %s' % e.code
    elif hasattr(e, 'reason'):
        print 'we can not reach a server '
        print 'The reason is %s' % e.reason
else:
    print('No problem')
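The same decision can be wrapped in a small function and exercised offline with hand-built exceptions. This is a sketch under stated assumptions: explain is a hypothetical helper name, the exceptions are simulated rather than raised by urlopen, and the fallback import is for Python 3. It uses isinstance instead of hasattr, which is an alternative way to make the same distinction.

```python
try:
    from urllib2 import URLError, HTTPError
except ImportError:
    from urllib.error import URLError, HTTPError  # assumption: Python 3 fallback

def explain(exc):
    """Turn a URLError (or HTTPError) into a short message."""
    if isinstance(exc, HTTPError):
        # the server answered, but with an error status
        return 'server could not fulfil the request: %s' % exc.code
    # no HTTP response at all
    return 'failed to reach a server: %s' % exc.reason

# Simulated errors so the sketch runs offline.
http_error = HTTPError('http://www.python.org/fish.html', 404, 'Not Found', {}, None)
url_error = URLError('getaddrinfo failed')
print(explain(http_error))
print(explain(url_error))
```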

info and geturl

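The response object returned by urlopen (or the HTTPError instance) has two useful methods: info() and geturl().

geturl: returns the real URL of the page fetched. This is useful because urlopen may have followed a redirect, so the URL of the page fetched may differ from the URL passed in the request.

info: returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an instance of httplib.HTTPMessage.

Typical headers include Content-length, Content-type, and so on. See the Quick Reference to HTTP Headers for a concise list of headers with explanations of their meaning and use.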
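Both methods can be tried without a network connection by fetching a local file through a file: URL. This is a sketch, not the tutorial's own example: it assumes a POSIX-style absolute path from tempfile, and the fallback import is for Python 3. For a file: URL the headers are synthesized locally by urllib2 rather than sent by a server, but the calls are the same as for an HTTP response.

```python
import os
import tempfile
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

# Write a small local file to fetch; assumes a POSIX-style absolute path.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello')
os.close(fd)

response = urllib2.urlopen('file://' + path)
try:
    final_url = response.geturl()   # the URL that was actually fetched
    headers = response.info()       # dictionary-like header object
    length = headers['Content-length']
    content_type = headers['Content-type']
finally:
    response.close()
    os.remove(path)

print(final_url)
print(length)
print(content_type)
```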

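Openers and Handlers

When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers.

Openers use handlers. All the heavy lifting is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, file, and so on), or how to handle some other aspect of opening a URL, such as HTTP redirections or HTTP cookies.

You will want to create your own opener if you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector and then call add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, a convenience function that creates an opener object with a single function call. build_opener adds several handlers by default, but also provides a quick way to add more handlers or to override the default ones.

Other kinds of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.

install_opener makes an opener object the global default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function. There is in fact no need to call install_opener, except as a convenience. (Translator's note: we can call open ourselves; there is no need to make the opener global and then call urlopen.)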
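The effect of build_opener can be inspected by listing the handlers of the opener it returns. This is a minimal sketch: HTTPCookieProcessor is chosen here only as an example of an extra handler, the handlers attribute is read purely for inspection, and the fallback import is an assumption for Python 3.

```python
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

# build_opener adds the default handlers plus the ones passed in.
cookie_handler = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookie_handler)

handler_names = [h.__class__.__name__ for h in opener.handlers]
print(handler_names)

# The opener can now be used directly, without install_opener:
# response = opener.open('http://www.voidspace.org.uk')
```

Keeping the opener in a variable and calling its open method leaves the global default untouched, which may be preferable when only part of a program needs the custom behaviour.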

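Basic Authentication:

To illustrate creating and installing a handler we will use HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (together with the 401 error code) requesting authentication. The header specifies the authentication scheme and a realm, and looks like this:

WWW-Authenticate: SCHEME realm="REALM"

For example:

WWW-Authenticate: Basic realm="cPanel Users"

The client should then retry the request, including the appropriate username and password for the realm as a header in the request. This is Basic Authentication. To simplify the process, we can create an instance of HTTPBasicAuthHandler and an opener that uses this handler.

HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to usernames and passwords. If you know what the realm is (from the authentication header sent by the server), you can use an HTTPPasswordMgr. Often we do not care what the realm is. In that case it is convenient to use HTTPPasswordMgrWithDefaultRealm. This lets you specify a default username and password for a URL, which are supplied unless you provide an alternative combination for a specific realm. We indicate this by passing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to add_password() will also match.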

import urllib2

# placeholder values for illustration; replace them with your own
username = 'joe'
password = 'secret'
a_url = 'http://example.com/foo/index.html'

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# add the username and password;
# if we knew the realm, we could use it instead of None
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# install the opener;
# now all calls to urllib2.urlopen use our opener
urllib2.install_opener(opener)

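Note: in the example above we only supplied our HTTPBasicAuthHandler to build_opener. By default, openers have handlers for the normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the http: scheme component, the hostname, and optionally the port number), e.g. "http://example.com/", or an authority (that is, the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter includes a port number). The authority, if present, must not contain the userinfo component; for example, "joe:password@example.com" is not correct.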
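The matching rule for "deeper" URLs can be checked offline by asking the password manager directly, using its find_user_password method. The sketch below is illustrative only: the credentials and URLs are placeholders, and the fallback import is an assumption for Python 3.

```python
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# 'joe' and 'secret' are placeholder credentials.
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL below the top-level URL matches, whatever realm the server reports.
deeper = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/foo/bar/spam.html')
# A URL outside the top-level URL does not match.
outside = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/other/')

print(deeper)
print(outside)
```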

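Proxies:

urllib2 auto-detects your proxy settings and uses them. This is done through ProxyHandler, which is part of the normal handler chain. Normally that is a good thing, but occasionally it gets in the way. One solution is to create our own ProxyHandler with no proxies defined. The steps are similar to setting up the Basic Authentication handler above.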

proxy_support = urllib2.ProxyHandler({})
opener=urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
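One way to confirm that proxy handling is really disabled is to look at the handlers the opener ends up with. This is a sketch based on inspecting the opener's handlers attribute, with the fallback import as an assumption for Python 3; it does not install the opener globally.

```python
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2  # assumption: Python 3 fallback

# An empty mapping means "use no proxies", whatever the environment says.
proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)

# Passing our own ProxyHandler suppresses the default one, and a handler
# with no proxies registers nothing, so no proxy handling remains.
has_proxy = any(isinstance(h, urllib2.ProxyHandler) for h in opener.handlers)
print(has_proxy)
```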

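Note: currently urllib2 does not support fetching https locations through a proxy. This can, however, be enabled by extending urllib2, as shown in the recipe.

Sockets and Layers:

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This is useful in applications that have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set a global default timeout for all sockets: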

import socket
import urllib2

# timeout in seconds
time_out = 10
socket.setdefaulttimeout(time_out)

# this call to urllib2.urlopen now uses the default timeout
# that we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
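Because the default timeout is global, it affects every socket created afterwards, so it can be worth restoring the previous value once the work is done. The following is a minimal sketch of that pattern and makes no network calls. As a side note beyond the text above, from Python 2.6 onwards urlopen also accepts an optional timeout argument, which avoids the global setting altogether.

```python
import socket

previous = socket.getdefaulttimeout()   # None means "no timeout"
socket.setdefaulttimeout(10)            # timeout in seconds
try:
    current = socket.getdefaulttimeout()
    print(current)
    # Sockets created from here on, including those used by urllib2,
    # start with this timeout.
finally:
    # Restore the earlier setting so other code is not affected.
    socket.setdefaulttimeout(previous)
```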

Footnotes

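This document was reviewed and revised by John Lee.

[1] For an introduction to the CGI protocol, see Writing Web Applications in Python.

[2] Google, for example.

[3] Browser sniffing is a very bad practice for website design; building sites using web standards is much more sensible. Unfortunately, many sites still send different versions to different browsers.

[4] The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'

[5] For details of HTTP headers, see Quick Reference to HTTP Headers.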

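Translator's note:

As a complete beginner learning to write crawlers, I found that every tutorial requires knowing how to use the urllib2 library, so I worked through the documentation myself to understand the library better. My understanding may be seriously off in places, so corrections are very welcome. Some terms, such as cookie and realm, are left untranslated because I could not find suitable words for them; I have set them aside for now and may revisit them once I understand networking better. The article may contain a number of typos, for which I apologise. Feel free to point them out in the replies so that I can correct them, and thank you again. Finally, a small complaint: the CSDN editor is rather hard to use and does not come close to the 作业部落 editor.

Completed 2015-10-20 21:36:36, by 莫利斯安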
