
Python Extensions: Web Crawler Basics

2016-03-19 00:59

URL manager
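The post leaves this section empty. As a gloss, here is a minimal sketch of one common URL-manager design, using two in-memory sets to keep pending URLs separate from crawled ones; the class and method names are illustrative, not from the post:

```python
class UrlManager(object):
    """Tracks which URLs still need crawling and which are done."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Ignore empty URLs and anything already seen, so no page is fetched twice
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and hand it out
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```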

Web page downloader

Ways to download a page with urllib2

1. The simplest approach

import urllib2

response = urllib2.urlopen('http://www.baidu.com')  # make the request directly (the URL needs its scheme)
print response.getcode()  # status code; 200 means success
cont = response.read()    # read the downloaded content


2. Adding data and an HTTP header

request = urllib2.Request(url)  # build a Request object
request.add_data('a=1')  # attach POST data; add_data takes a single urlencoded string
request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header to masquerade as a Mozilla browser
response = urllib2.urlopen(request)  # send the request and read the result


3. Adding handlers for special scenarios

HTTPCookieProcessor: for sites that require login, handled via cookies

ProxyHandler: for sites that can only be reached through a proxy

HTTPSHandler: for HTTPS-encrypted access

HTTPRedirectHandler: for pages that redirect between URLs


Example:

import urllib2, cookielib

cj = cookielib.CookieJar()  # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener with cookie support
urllib2.install_opener(opener)  # install it so every urllib2 request gains cookie handling
urllib2.urlopen('http://www.baidu.com')  # fetch the page with cookies enabled
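The same build_opener pattern extends to the other handlers listed above. As an illustration (not from the original post), a minimal ProxyHandler sketch; the proxy address here is a placeholder:

```python
import urllib2

# Placeholder proxy address; substitute one you can actually reach.
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # from now on, urlopen goes through the proxy
response = urllib2.urlopen('http://www.baidu.com')
```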


4. A complete example

```python
#coding:utf8
import urllib2
import cookielib

url = 'http://www.baidu.com'

print 'test1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'test2'
request = urllib2.Request(url)
request.add_header('user-agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'test3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
```

Web page parsers

1. Regular expressions (fuzzy string matching)

2. Bundled with Python: html.parser (structured parsing into a DOM tree)

3. Beautiful Soup (structured parsing)

Installation
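Beautiful Soup 4 can be installed with `pip install beautifulsoup4`.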

How it works

Create a BeautifulSoup object, which automatically turns the HTML into a DOM tree.

Search for nodes with find_all or find (find returns only the first match), by tag name, attribute, or text.

Access each node's name, attributes, and text.

Example:

```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string
soup = BeautifulSoup(html_doc,               # the HTML document string
                     'html.parser',          # the HTML parser to use
                     from_encoding='utf8')   # the document's encoding

# Search the tree: find_all(name, attrs, string)

node.name        # the node's tag name
node['href']     # access an attribute
node.get_text()  # the node's text
```

A complete example

```python
#coding:utf8
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'get a'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'get regex'
link_node = soup.find('a', href=re.compile(r"ill"))  # regex match; the r prefix keeps backslashes from being escaped
print link_node.name, link_node['href'], link_node.get_text()

print 'by class'
link_node = soup.find('p', class_='title')  # class_ needs the underscore because class is a Python keyword
print link_node.name, link_node.get_text()
```

4. lxml (structured parsing)
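The post doesn't expand on lxml. A minimal sketch of the idea, reusing the html_doc string from the example above; lxml is a third-party package, and this usage is an illustration rather than the post's own code:

```python
# Requires the third-party lxml package (pip install lxml).
from lxml import html

doc = html.fromstring(html_doc)  # parse the HTML string into an element tree
for a in doc.xpath('//a'):       # XPath query for every <a> element
    print a.get('href'), a.text_content()
```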

Further topics

Sites that require login, CAPTCHAs, AJAX, server-side anti-crawler defenses, multithreading, distributed crawling.

A simple crawler: https://github.com/su526664687/Simple-Spider.git