
Python Crawler Study Notes: Libraries Used by Web Crawlers

2017-04-10 16:20
Python libraries commonly used for fetching pages: urllib, urllib2, requests, httplib2, and so on (urllib and urllib2 ship with Python 2; requests and httplib2 are third-party packages).

Requests:
import requests
response = requests.get(url)
content = response.content
print "response headers:", response.headers
print "content:", content

Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = urllib2.urlopen(url).read()      

print "response headers:", response.headers
print "content:", content

Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content

In addition, for a URL that carries query fields, a GET request normally appends the data to the URL itself: a ? separates the URL from the data, and multiple parameters are joined with &.

data = {'data1':'XXXXX', 'data2':'XXXXX'}

Requests: data is a dict (or json)
import requests
response = requests.get(url=url, params=data)

Urllib2: data must be URL-encoded into a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
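For illustration, a quick sketch of the URL that requests builds from the params dict (httpbin.org is used here purely as a hypothetical test endpoint; any URL behaves the same way):

import requests

data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
response = requests.get('http://httpbin.org/get', params=data)
print "final url:", response.url  # something like http://httpbin.org/get?data1=XXXXX&data2=XXXXX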

Handling logins:

Logging in through a form is a POST request.

data = {'data1':'XXXXX', 'data2':'XXXXX'}

Requests: data is a dict (or json)
import requests
response = requests.post(url=url, data=data)

Urllib2: data must be URL-encoded into a string
import urllib, urllib2
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)

Logging in with cookies:

import requests

requests_session = requests.session()

response = requests_session.post(url=url_login, data=data)

If there is a captcha:

response_captcha = requests_session.get(url=url_login, cookies=cookies)

response1 = requests.get(url_login) # not logged in

response2 = requests_session.get(url_login) # logged in, because the session already holds the response cookies!
response3 = requests_session.get(url_results) # logged in, because the session already holds the response cookies!
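Putting the pieces together, a minimal sketch of the session-based login flow, assuming url_login and url_results are defined as above and that the form field names are placeholders:

import requests

requests_session = requests.session()
login_data = {'username': 'XXXXX', 'password': 'XXXXX'}  # hypothetical form field names
requests_session.post(url=url_login, data=login_data)    # the session stores the login cookies
response = requests_session.get(url_results)             # later requests reuse those cookies automatically
print "content:", response.content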

Dealing with anti-crawler mechanisms:

Use a proxy: useful when a site limits access by IP, and it can also get around the captcha that appears after "too frequent clicks".

proxies = {'http':'http://XX.XX.XX.XX:XXXX'}

Requests:
import requests
response = requests.get(url=url, proxies=proxies)

Urllib2:
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener) # install the opener; every later call to urlopen() will go through it
response = urllib2.urlopen(url)
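Note that the proxies dict above only covers http URLs; if the target site is served over https, the mapping usually needs an 'https' entry as well. A small sketch with placeholder proxy addresses:

import requests

proxies = {
    'http': 'http://XX.XX.XX.XX:XXXX',
    'https': 'http://XX.XX.XX.XX:XXXX',
}
response = requests.get(url=url, proxies=proxies)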

Setting delays: useful when a site limits request frequency:

With both Requests and Urllib2 you can use the sleep() function from the time module.

import time 

time.sleep(1)
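A minimal sketch of pacing a crawl with sleep(), assuming url_list is a hypothetical list of pages to fetch:

import time
import requests

for url in url_list:
    response = requests.get(url)
    print "fetched:", url, response.status_code
    time.sleep(1)  # wait one second between requests to stay under the rate limit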

Pretending to be a browser, and dealing with "anti-hotlinking" (Referer checks)

Some sites check whether a request really comes from a browser or from an automated client. In that case, add a User-Agent header to make the request look like it comes from a browser. Some sites also check whether the request carries a Referer header and whether that Referer is valid; in that case add a Referer header as well.

headers = {'User-Agent':'XXXXX'} # pretend to be a browser; useful for sites that reject crawlers

headers = {'Referer':'XXXXX'}

headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}

Requests:
response = requests.get(url=url, headers=headers)

Urllib2:
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)

Handling dropped connections (retrying):

def multi_session(session, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return session.post(*arg)  # return as soon as one POST succeeds
            except:
                print '.',                 # print a dot for each failed attempt
                retryTimes -= 1

Or:

def multi_open(opener, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return opener.open(*arg)   # return as soon as one request succeeds
            except:
                print '.',                 # print a dot for each failed attempt
                retryTimes -= 1
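A hypothetical usage sketch of the two helpers above, assuming url_login, url and data are defined as in the earlier examples:

import requests
import urllib2

session = requests.session()
response = multi_session(session, url_login, data)  # keeps retrying the POST until it succeeds

opener = urllib2.build_opener()
response = multi_open(opener, url)                  # keeps retrying the GET until it succeeds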