Python urllib urllib2

2016-12-22 00:00
Summary: urllib and urllib2 are two commonly used Python modules. They can be used to fetch web pages and are the basis for writing crawlers.

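urllib2 is an extension of urllib.

Similarities and differences:

The most commonly used functions, urllib.urlopen and urllib2.urlopen, are similar, but their parameters differ, for example in how timeouts and proxies are specified.

urllib accepts only a URL string to fetch a resource, while urllib2 accepts either a URL string or a Request object. Headers can be set on a Request object, whereas urllib.urlopen offers no way to set headers.

urllib has the urlencode function for encoding parameters and urllib2 does not, so the two modules are often used together.

Overall, urllib2 offers more features, including a range of handlers and openers.

There is also the httplib module, which provides the lowest-level HTTP request methods, for example GET, POST and PUT.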

Reference: http://blog.csdn.net/column/details/why-bug.html

Basic usage:

import urllib2
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html

Using a Request object:

import urllib2
req = urllib2.Request('http://www.baidu.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

Sending form data:

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'

values = {'name' : 'WHY',
          'location' : 'SDU',
          'language' : 'Python' }

data = urllib.urlencode(values)  # encode the form values
req = urllib2.Request(url, data)  # build the request and attach the form data
response = urllib2.urlopen(req)  # send it and receive the response
the_page = response.read()  # read the response body
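
The effect of passing data can be checked without sending anything. The sketch below is only an illustration: it assumes a try/except import so that the same lines run on Python 2 (urllib, urllib2) and Python 3 (urllib.parse, urllib.request), and it uses a list of pairs instead of a dict so that the encoded order is predictable.

```python
try:  # Python 2
    from urllib import urlencode
    from urllib2 import Request
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.parse import urlencode
    from urllib.request import Request

url = 'http://www.someserver.com/register.cgi'
values = [('name', 'WHY'), ('location', 'SDU'), ('language', 'Python')]

encoded = urlencode(values)      # 'name=WHY&location=SDU&language=Python'
body = encoded.encode('ascii')   # Python 3 expects bytes for the request body

get_req = Request(url)           # no data: the request will be a GET
post_req = Request(url, body)    # data present: the request will be a POST

print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent over the network here; the Request objects are only constructed and inspected.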

# The data can also be sent with GET by appending the encoded values to the URL
import urllib2
import urllib

data = {}

data['name'] = 'WHY'
data['location'] = 'SDU'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values
# prints something like name=WHY&language=Python&location=SDU (key order may vary)

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values

response = urllib2.urlopen(full_url)
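
As a rough check of the GET approach, the query string can be built and then parsed back. The helper name build_get_url is hypothetical, and the try/except import is an assumption made so the sketch runs on both Python 2 and Python 3.

```python
try:  # Python 2
    from urllib import urlencode
    from urlparse import urlsplit, parse_qs
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.parse import urlencode, urlsplit, parse_qs


def build_get_url(base_url, params):
    """Hypothetical helper: append urlencoded params to base_url."""
    return base_url + '?' + urlencode(params)


# A list of pairs keeps the parameter order fixed.
params = [('name', 'Somebody Here'), ('language', 'Python')]
full_url = build_get_url('http://www.example.com/example.cgi', params)

# Parsing the query string back recovers the original values.
decoded = parse_qs(urlsplit(full_url).query)
print(full_url)
```

Note that urlencode turns the space in 'Somebody Here' into '+', and parse_qs reverses that.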

Setting headers in an HTTP request:

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'WHY',
          'location' : 'SDU',
          'language' : 'Python' }

headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
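
One detail worth knowing when reading headers back from a Request: the object stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'. The following sketch, which assumes the same Python 2/3 import fallback and adds a made-up Accept-Language header for illustration, inspects the stored headers without sending the request.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.request import Request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = Request('http://www.someserver.com/cgi-bin/register.cgi',
              headers={'User-Agent': user_agent})
# Headers can also be added after construction (illustrative header).
req.add_header('Accept-Language', 'en')

# header_items() returns (name, value) pairs as stored on the request.
stored = dict(req.header_items())
print(sorted(stored))
```

Lookups with get_header or has_header therefore need the capitalized spelling, for example req.get_header('User-agent').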


The following examples show openers and handlers in use:

from urllib2 import Request, urlopen, URLError, HTTPError

old_url = 'http://t.cn/RIxkRnO'
req = Request(old_url)
response = urlopen(req)
print 'Old url :' + old_url
print 'Real url :' + response.geturl()

Here the URL returned by response.geturl() differs from old_url because the request was redirected.

Viewing the response headers with info():

from urllib2 import Request, urlopen, URLError, HTTPError

old_url = 'http://www.baidu.com'
req = Request(old_url)
response = urlopen(req)
print 'Info():'
print response.info()

An example of an opener and a handler:

# -*- coding: utf-8 -*-
import urllib2

# Create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password

top_level_url = "http://example.com/foo/"

# If we knew the realm, we could use it instead of ``None``.
# password_mgr.add_password(None, top_level_url, username, password)
password_mgr.add_password(None, top_level_url, 'why', '1223')

# Create a new handler
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (an OpenerDirector instance)
opener = urllib2.build_opener(handler)

a_url = 'http://www.baidu.com/'

# Use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# From now on, all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
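
To see which URLs the stored credentials apply to, the password manager can be queried directly. This is a minimal sketch under the assumption that a Python 2/3 import fallback is acceptable; the realm string 'some realm' is made up, and opener.handlers is an implementation attribute rather than a documented interface.

```python
try:  # Python 2
    from urllib2 import (HTTPBasicAuthHandler,
                         HTTPPasswordMgrWithDefaultRealm, build_opener)
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.request import (HTTPBasicAuthHandler,
                                HTTPPasswordMgrWithDefaultRealm, build_opener)

password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.com/foo/', 'why', '1223')

# A URL below the registered prefix matches, whatever realm is asked for,
# because the default-realm manager falls back to the None entry.
inside = password_mgr.find_user_password('some realm',
                                         'http://example.com/foo/bar.html')

# A URL on a different host has no stored credentials.
outside = password_mgr.find_user_password(None, 'http://www.baidu.com/')

opener = build_opener(HTTPBasicAuthHandler(password_mgr))
has_auth = any(isinstance(h, HTTPBasicAuthHandler) for h in opener.handlers)
print(inside, outside, has_auth)
```

In other words, the credentials registered for http://example.com/foo/ would not be offered when the opener fetches http://www.baidu.com/.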

Below are some tips:

Proxy settings:

import urllib2
enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
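
The proxy switch can be wrapped in a small function and checked without any network access. The names make_opener and configured_proxies are hypothetical, the import fallback for Python 3 is an assumption, and the sketch reads the handlers and proxies attributes, which are implementation details rather than documented interfaces.

```python
try:  # Python 2
    from urllib2 import ProxyHandler, build_opener
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.request import ProxyHandler, build_opener


def make_opener(proxy_url=None):
    """Hypothetical helper: use proxy_url for HTTP, or no proxy if None."""
    proxies = {'http': proxy_url} if proxy_url else {}
    # Passing an explicit ProxyHandler replaces the default one, so proxy
    # settings from the environment are not picked up.
    return build_opener(ProxyHandler(proxies))


def configured_proxies(opener):
    """Collect the proxy mappings of ProxyHandler instances on the opener."""
    found = {}
    for handler in opener.handlers:
        if isinstance(handler, ProxyHandler):
            found.update(handler.proxies)
    return found


proxied = make_opener('http://some-proxy.com:8080')
direct = make_opener()
print(configured_proxies(proxied))
print(configured_proxies(direct))
```

Returning the opener instead of installing it globally keeps the choice local to the caller; urllib2.install_opener can still be applied to the result if a process-wide default is wanted.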

Timeout settings:

Before Python 2.6:

import urllib2
import socket
socket.setdefaulttimeout(10) # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10) # another way to do the same

Python 2.6 and later:

import urllib2
response = urllib2.urlopen('http://www.google.com', timeout=10)

Adding a header to a Request:

import urllib2
request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Redirects:

import urllib2
my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
# The request was redirected if the final URL differs from the requested one
redirected = response.geturl() != my_url
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected

import urllib2
class RedirectHandler(urllib2.HTTPRedirectHandler):
    # These overrides return nothing, so the redirect is not followed
    # and opener.open() raises HTTPError for the 301/302 response
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
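
The behaviour of a handler that declines redirects can be tried against a throwaway local server instead of a public short URL. Everything in this sketch is an assumption made for illustration: the NoRedirectHandler and AlwaysRedirect classes, the fetch_redirect_status function, the loopback address and the target URL. It overrides redirect_request rather than the http_error_30x methods, which is a slightly different technique from the one above.

```python
import threading

try:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
    from urllib2 import (HTTPError, HTTPRedirectHandler, ProxyHandler,
                         build_opener)
except ImportError:  # Python 3 (assumed equivalent module layout)
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.error import HTTPError
    from urllib.request import HTTPRedirectHandler, ProxyHandler, build_opener


class NoRedirectHandler(HTTPRedirectHandler):
    """Decline every redirect so the 3xx response surfaces as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


class AlwaysRedirect(BaseHTTPRequestHandler):
    """Hypothetical test server: answers every GET with a 302."""

    def do_GET(self):
        self.send_response(302)
        self.send_header('Location', 'http://127.0.0.1:9/target')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_redirect_status():
    """Return (status code, Location header) of the unfollowed redirect."""
    server = HTTPServer(('127.0.0.1', 0), AlwaysRedirect)  # port 0: any free port
    server.timeout = 5  # do not wait forever if no request arrives
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    # ProxyHandler({}) keeps the loopback request away from any system proxy.
    opener = build_opener(ProxyHandler({}), NoRedirectHandler)
    try:
        opener.open('http://127.0.0.1:%d/' % server.server_port, timeout=5)
    except HTTPError as e:
        return e.code, e.headers.get('Location')
    finally:
        worker.join()
        server.server_close()


print(fetch_redirect_status())
```

Because the redirect is declined, the caller sees the 302 and its Location header and can decide what to do with the target URL.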

Cookies:

import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = '+item.name
    print 'Value = '+item.value

HTTP PUT and DELETE methods:

import urllib2
# uri and data are assumed to be defined elsewhere
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)
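
An alternative to patching get_method on each instance is a small Request subclass. The class name MethodRequest, the resource URL and the body are hypothetical, and the import fallback is an assumption so the sketch runs on both Python 2 and Python 3. On Python 3.3 and later, Request already accepts a method argument, so the subclass is mainly useful with urllib2.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3 (assumed equivalent module layout)
    from urllib.request import Request


class MethodRequest(Request):
    """Hypothetical Request subclass that accepts an explicit HTTP method."""

    def __init__(self, *args, **kwargs):
        # Remove our keyword before handing the rest to Request.
        self._method = kwargs.pop('method', None)
        Request.__init__(self, *args, **kwargs)

    def get_method(self):
        # Fall back to the default GET/POST choice when no method was given.
        return self._method or Request.get_method(self)


uri = 'http://www.example.com/resource'  # placeholder URL
put_req = MethodRequest(uri, data=b'payload', method='PUT')
delete_req = MethodRequest(uri, method='DELETE')
plain_req = MethodRequest(uri)

print(put_req.get_method())
print(delete_req.get_method())
print(plain_req.get_method())
```

The requests are only constructed here; passing one to urlopen would send it with the chosen method.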

Getting the HTTP status code:

import urllib2
try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError as e:
    print e.code

Debug logging:

import urllib2
httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')