python urllib, urlparse, urllib2, cookielib
2016-04-22 11:46
1. The urllib module
1.urllib.urlopen(url[,data[,proxies]])
Opens a URL and returns a file-like object that supports the usual file operations. This example opens Google:

```python
import urllib

f = urllib.urlopen('http://www.google.com.hk/')
firstLine = f.readline()  # read the first line of the HTML page
```

The object returned by urlopen provides these methods:
- read([bytes]): read all bytes, or up to bytes bytes
- readline(): read one line
- readlines(): read all lines
- fileno(): return the file descriptor
- close(): close the URL connection
- info(): return an httplib.HTTPMessage object holding the headers sent by the remote server
- getcode(): return the HTTP status code (for an HTTP request, 200 means the request completed successfully; 404 means the URL was not found)
- geturl(): return the requested URL
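The same file-like interface can be exercised without the network by pointing urlopen at a local file through a file:// URL. A minimal sketch (the fallback import is the Python 3 equivalent, urllib.request):

```python
import os
import tempfile

try:
    from urllib import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent

# Write a small local "page" to open.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'w') as f:
    f.write('<html>hello</html>\n')

page = urlopen('file://' + path)
first_line = page.readline()  # first line of the local file
page.close()
os.remove(path)
```

On Python 3 the object yields bytes rather than str, but the method set (read, readline, info, geturl, close) is the same one listed above.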
2.urllib.urlretrieve(url[,filename[,reporthook[,data]]])
The urlretrieve method downloads the HTML file located at url to your local disk. If filename is not given, the data is saved to a temporary file. urlretrieve() returns a 2-tuple (filename, mime_hdrs).

Saving to a temporary file:

```python
filename = urllib.urlretrieve('http://www.google.com.hk/')
type(filename)  # <type 'tuple'>
print filename[0]
print filename[1]
```

Output:

```
'/tmp/tmp8eVLjq'
<httplib.HTTPMessage instance at 0xb6a363ec>
```

Saving to a local file:

```python
filename = urllib.urlretrieve('http://www.baidu.com/',
                              filename='/home/dzhwen/python文件/Homework/urllib/google.html')
print type(filename)
print filename[0]
print filename[1]
```

Output:

```
<type 'tuple'>
'/home/dzhwen/python\xe6\x96\x87\xe4\xbb\xb6/Homework/urllib/google.html'
<httplib.HTTPMessage instance at 0xb6e2c38c>
```
The reporthook parameter is used like this:

```python
def process(blk, blk_size, total_size):
    print('%d/%d - %.02f%%' % (blk * blk_size, total_size,
                               float(blk * blk_size) / total_size * 100))

def download():
    filename, fileinfo = urllib.urlretrieve('http://cnblogs.com', 'index.html',
                                            reporthook=process)
```

Output:

```
0/46164 - 0.00%
8192/46164 - 17.75%
16384/46164 - 35.49%
24576/46164 - 53.24%
32768/46164 - 70.98%
40960/46164 - 88.73%
49152/46164 - 106.47%
```

Because blk * blk_size can exceed total_size (the last block is rarely full), the function can be rewritten to clamp at 100%:

```python
def process(blk, blk_size, total_size):
    if total_size == -1:
        print "can't determine the file size, now retrieved", blk * blk_size
    else:
        percentage = int((blk * blk_size * 100.0) / total_size)
        if percentage >= 100:
            print('%d/%d - %d%%' % (total_size, total_size, 100))
        else:
            print('%d/%d - %d%%' % (blk * blk_size, total_size, percentage))
```

Output after the change:

```
0/46238 - 0%
8192/46238 - 17%
16384/46238 - 35%
24576/46238 - 53%
32768/46238 - 70%
40960/46238 - 88%
46238/46238 - 100%
```
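The clamping logic above can be sketched as a pure function and exercised against urlretrieve on a local file:// URL, so no network is needed. This is a minimal sketch; the guarded import covers the Python 3 location of urlretrieve:

```python
import os
import tempfile

try:
    from urllib import urlretrieve          # Python 2
except ImportError:
    from urllib.request import urlretrieve  # Python 3 equivalent

def progress_line(blk, blk_size, total_size):
    """Format one progress line, clamping at 100% as in the text above."""
    if total_size == -1:
        return "can't determine the file size, now retrieved %d" % (blk * blk_size)
    done = min(blk * blk_size, total_size)
    return '%d/%d - %d%%' % (done, total_size, done * 100 // total_size)

lines = []
def hook(blk, blk_size, total_size):
    lines.append(progress_line(blk, blk_size, total_size))

# Create a small 1000-byte local file and "download" it.
fd, src = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'x' * 1000)
dst = src + '.copy'
urlretrieve('file://' + src, dst, reporthook=hook)
os.remove(src)
os.remove(dst)
```

The hook is first called with block 0 before any data is read, then once per block read, so for a 1000-byte file the recorded lines run from `0/1000 - 0%` to `1000/1000 - 100%`.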
3.urllib.urlcleanup()
Clears the cache left behind by urllib.urlretrieve().

4. urllib.quote(url) and urllib.quote_plus(url)
Encodes data so that it fits inside a URL string, i.e. so that it can be printed and accepted by web servers. quote leaves '/' unescaped by default, while quote_plus escapes '/' as well (and encodes spaces as '+').

```python
urllib.quote('http://www.baidu.com')
```

Result:

```
'http%3A//www.baidu.com'
```

```python
urllib.quote_plus('http://www.baidu.com')
```

Result:

```
'http%3A%2F%2Fwww.baidu.com'
```
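A small sketch of the quote / quote_plus difference on a URL containing a space, with a guarded import for the Python 3 location (urllib.parse):

```python
try:
    from urllib import quote, quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus  # Python 3 equivalent

url = 'http://www.baidu.com/s?wd=a b'
q = quote(url)        # keeps '/' by default, escapes the space as %20
qp = quote_plus(url)  # escapes '/' too, and encodes the space as '+'
print(q)   # http%3A//www.baidu.com/s%3Fwd%3Da%20b
print(qp)  # http%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Da+b
```

quote is meant for path components (hence '/' is safe); quote_plus is meant for query-string values, where '+' is the conventional space encoding.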
5. urllib.unquote(url) and urllib.unquote_plus(url)
The inverses of the functions in section 4.

6. urllib.urlencode(query)
Joins the key/value pairs of query with '&' into a URL-encoded query string. Combined with urlopen, this implements the GET and POST methods.

GET:

```python
import urllib

params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://python.org/query?%s" % params)
print f.read()
```

POST:

```python
import urllib

params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://python.org/query", params)
f.read()
```
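urlencode itself can be shown offline. A minimal sketch, using a list of (key, value) pairs so the output order is deterministic (dict iteration order varies across interpreters), with a guarded import for Python 3:

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3 equivalent

params = urlencode([('spam', 1), ('eggs', 2), ('bacon', 0)])
print(params)  # spam=1&eggs=2&bacon=0

# For GET, append it to the URL; for POST, pass it as the data argument.
get_url = 'http://python.org/query?%s' % params
```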
2. The urlparse module
1.urlparse
Purpose: parse a URL into its components.

```python
def parse_html():
    url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    result = urlparse.urlparse(url)
    # params = urlparse.parse_qs(result.query)
    print result
    # print params
```

Result:

```
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980', fragment='')
```

As shown, this returns a ParseResult object containing the scheme, host, path, params and query string.
2. parse_qs
```python
import urllib
import urlparse

def parse_html():
    url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    result = urlparse.urlparse(url)
    params = urlparse.parse_qs(result.query)
    # print result
    print params

if __name__ == '__main__':
    # demo()
    # demo2()
    parse_html()
```

Result:

```
{'wd': ['python'], 'rsv_spt': ['1'], 'rsv_iqid': ['0xad2dc5550032146a'], 'inputT': ['22'], 'f': ['8'], 'rsv_enter': ['1'], 'rsv_bp': ['0'], 'rsv_idx': ['2'], 'tn': ['baiduhome_pg'], 'rsv_sug4': ['4980'], 'rsv_sug7': ['100'], 'rsv_sug1': ['5'], 'issp': ['1'], 'rsv_sug3': ['7'], 'rsv_sug2': ['0'], 'ie': ['utf-8']}
```
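The two steps can be sketched together on a shorter URL; the guarded import covers Python 3, where the urlparse module became urllib.parse:

```python
try:
    from urlparse import urlparse, parse_qs      # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs  # Python 3 equivalent

url = 'https://www.baidu.com/s?wd=python&ie=utf-8&inputT=22'
result = urlparse(url)
params = parse_qs(result.query)
print(result.netloc)  # www.baidu.com
print(params['wd'])   # ['python']  -- parse_qs values are always lists
```

Values come back as lists because a query string may repeat a key (e.g. `a=1&a=2` yields `{'a': ['1', '2']}`).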
3. The urllib2 module
urllib2 offers more powerful features, such as cookie management, but it does not fully replace urllib: for example, urllib.urlencode has no counterpart in urllib2.

3.1 urllib2.urlopen()
Purpose: open a URL.
Parameters: url, data=None, timeout=&lt;object&gt;

```python
import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester/sllsl'
    try:
        s = urllib2.urlopen(url, timeout=3)
    except urllib2.HTTPError, e:
        print e
    else:
        print s.read(100)

if __name__ == '__main__':
    demo()
```

Result:

```
<!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8"/> <title>”温故而知新“
```

If the url is changed to an unknown address:

```python
url = 'http://www.cnblogs.com/hester/asdfas'
```

Result:

```
HTTP Error 404: Not Found
```
3.2 urllib2.Request()
Purpose: add or modify HTTP headers.
Parameters: url, data, headers

```python
import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester'
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request(url, headers=headers)
    s = urllib2.urlopen(req)
    print s.read(100)
    print req.headers
    s.close()

if __name__ == '__main__':
    demo()
```

Result:

```
<!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8"/> <title>”温故而知新“
{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
```
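A Request can also be built and inspected without ever sending it, which shows where the capitalized header keys in the output above come from. A minimal offline sketch (urllib2 became urllib.request in Python 3, hence the guarded import):

```python
try:
    from urllib2 import Request         # Python 2
except ImportError:
    from urllib.request import Request  # Python 3 equivalent

headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
req = Request('http://www.cnblogs.com/hester', headers=headers)

print(req.get_full_url())  # http://www.cnblogs.com/hester
# Request stores header keys with str.capitalize(), e.g. 'User-agent',
# which is exactly what the printed req.headers dict above shows.
print(req.headers)
```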
3.3 urllib2.build_opener()
Purpose: create an opener.
Parameters: a list of handlers, e.g. ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor, HTTPSHandler.
Returns: an OpenerDirector.

```python
import urllib
import urllib2

def request_post_debug():
    data = {'username': 'hester_ge', 'password': 'xxxxxxx'}
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request('http://www.cnblogs.com/hester',
                          data=urllib.urlencode(data), headers=headers)
    opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
    s = opener.open(req)
    print s.read(100)
    s.close()

if __name__ == '__main__':
    request_post_debug()
```

Result:
```
send: 'POST /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 35\r\nHost: www.cnblogs.com\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nusername=hester_ge&password=xxxxxxx'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 03 Jul 2016 08:28:37 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 14096
header: Connection: close
header: Vary: Accept-Encoding
header: Cache-Control: private, max-age=10
header: Expires: Sun, 03 Jul 2016 08:28:45 GMT
header: Last-Modified: Sun, 03 Jul 2016 08:28:35 GMT
header: X-UA-Compatible: IE=10
<!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8"/> <title>”温故而知新“
```
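Creating the opener itself needs no network, only calling open() does. A minimal offline sketch (guarded import for the Python 3 module name):

```python
try:
    import urllib2 as request         # Python 2
except ImportError:
    import urllib.request as request  # Python 3 equivalent

# build_opener chains the default handlers plus the ones we pass in.
opener = request.build_opener(request.HTTPHandler(debuglevel=1))
print(type(opener).__name__)  # OpenerDirector

# opener.open(url) would then perform the request, printing the raw
# send/reply/header lines shown above because debuglevel=1 is set.
```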
3.4 urllib2.install_opener
Purpose: install the created opener as the global default used by urllib2.urlopen.

```python
import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester'
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request(url, headers=headers)
    s = urllib2.urlopen(req)
    print s.read(100)
    print req.headers
    s.close()

def install_opener():
    opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1),
                                  urllib2.HTTPSHandler(debuglevel=1))
    urllib2.install_opener(opener)

if __name__ == '__main__':
    demo()
```

Result:

```
<!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8"/> <title>”温故而知新“
{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
```

If the main block is changed to:

```python
if __name__ == '__main__':
    install_opener()
    demo()
```

Result:
```
send: 'GET /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.cnblogs.com\r\nConnection: close\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 03 Jul 2016 08:39:31 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 14096
header: Connection: close
header: Vary: Accept-Encoding
header: Cache-Control: private, max-age=10
header: Expires: Sun, 03 Jul 2016 08:39:41 GMT
header: Last-Modified: Sun, 03 Jul 2016 08:39:31 GMT
header: X-UA-Compatible: IE=10
<!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8"/> <title>”温故而知新“
{'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
```
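The effect of install_opener can be checked offline: it stores the opener as the module-level default that urlopen will use from then on. A sketch, assuming the private module attribute `_opener` (an implementation detail in both Python 2 and 3):

```python
try:
    import urllib2 as request         # Python 2
except ImportError:
    import urllib.request as request  # Python 3 equivalent

opener = request.build_opener(request.HTTPHandler(debuglevel=1))
request.install_opener(opener)

# From now on request.urlopen(...) routes through this opener, so every
# request would print its debug trace. The installed opener is kept in
# the private module attribute _opener (implementation detail).
print(request._opener is opener)
```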
4. The cookielib module
HTTP is stateless, so the server cannot tell whether two requests come from the same machine; cookies serve as the identifier. The client browser sends a request to the server; the server parses it and sends back a response whose Set-Cookie header carries the cookie, which the browser then stores. Two classes are used here:
- cookielib.CookieJar: provides an interface for parsing and storing cookies
- urllib2.HTTPCookieProcessor: handles cookies automatically

```python
#encoding=utf8
import urllib2
import cookielib

def handler_cookie():
    cookiejar = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookiejar=cookiejar)
    opener = urllib2.build_opener(handler, urllib2.HTTPHandler(debuglevel=1))
    s = opener.open('http://www.douban.com/')
    print s.read(100)
    s.close()
    print '=' * 20
    print cookiejar._cookies
    print '=' * 20
    # the second request automatically carries the cookie
    s2 = opener.open('http://www.douban.com/')
    print s2.read(100)
    s2.close()

if __name__ == '__main__':
    handler_cookie()
```

Result:
```
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.douban.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Sun, 03 Jul 2016 10:01:41 GMT
header: Content-Type: text/html
header: Content-Length: 178
header: Connection: close
header: Location: https://www.douban.com/
header: Server: dae
<!DOCTYPE HTML> <html lang="zh-cms-Hans" class=""> <head> <meta charset="UTF-8"> <meta name="descrip
====================
{'.douban.com': {'/': {'ll': Cookie(version=0, name='ll', value='"118163"', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), 'bid': Cookie(version=0, name='bid', value='dDz4rCqWvcQ', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}
====================
send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.douban.com\r\nCookie: ll="118163"; bid=dDz4rCqWvcQ\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Sun, 03 Jul 2016 10:01:42 GMT
header: Content-Type: text/html
header: Content-Length: 178
header: Connection: close
header: Location: https://www.douban.com/
header: Server: dae
<!DOCTYPE HTML> <html lang="zh-cms-Hans" class=""> <head> <meta charset="UTF-8"> <meta name="descrip
```

Note that the second request automatically carries `Cookie: ll="118163"; bid=dDz4rCqWvcQ` in its headers.
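The jar's behaviour can be shown offline by wiring a CookieJar into an opener and planting a cookie by hand; its fields mirror the Cookie(...) entries printed above. A sketch (cookielib became http.cookiejar in Python 3, hence the guarded imports; the 'bid' value is copied from the douban session above purely for illustration):

```python
try:
    import cookielib                    # Python 2
    import urllib2 as request
except ImportError:
    import http.cookiejar as cookielib  # Python 3 equivalent
    import urllib.request as request

jar = cookielib.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))

# Hand-build a cookie like the 'bid' one douban set in the session above.
c = cookielib.Cookie(
    version=0, name='bid', value='dDz4rCqWvcQ',
    port=None, port_specified=False,
    domain='.douban.com', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True, secure=False, expires=None,
    discard=True, comment=None, comment_url=None, rest={})
jar.set_cookie(c)

print(len(jar))                 # the jar now holds one cookie
print([ck.name for ck in jar])  # ['bid']
# Any opener.open('http://www.douban.com/') call would now attach it
# as a Cookie: header automatically, as in the second request above.
```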