
Python Extensions: Web Crawler Basics

2016-03-19 00:59

URL manager
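The post leaves this section empty. As a gloss, here is a minimal sketch of one common URL-manager design, using two in-memory sets to keep pending URLs separate from crawled ones; the class and method names are illustrative, not from the post:

```python
class UrlManager(object):
    """Tracks which URLs still need crawling and which are done."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Ignore empty URLs and anything already seen, so no page is fetched twice
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set and hand it out
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```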

Web page downloader

Ways to download a page with urllib2

1. The simplest approach

import urllib2

response = urllib2.urlopen('http://www.baidu.com')  # make the request directly (the URL needs its scheme)
print response.getcode()  # status code; 200 means success
cont = response.read()    # read the downloaded content


2. Adding data and an HTTP header

request = urllib2.Request(url)  # build a Request object
request.add_data('a=1')  # attach POST data; add_data takes a single urlencoded string
request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header to masquerade as a Mozilla browser
response = urllib2.urlopen(request)  # send the request and read the result


3. Adding handlers for special scenarios

HTTPCookieProcessor: for sites that require login, handled via cookies

ProxyHandler: for sites that can only be reached through a proxy

HTTPSHandler: for HTTPS-encrypted access

HTTPRedirectHandler: for pages that redirect between URLs


Example:

import urllib2, cookielib

cj = cookielib.CookieJar()  # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener with cookie support
urllib2.install_opener(opener)  # install it so every urllib2 request gains cookie handling
urllib2.urlopen('http://www.baidu.com')  # fetch the page with cookies enabled
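The same build_opener pattern extends to the other handlers listed above. As an illustration (not from the original post), a minimal ProxyHandler sketch; the proxy address here is a placeholder:

```python
import urllib2

# Placeholder proxy address; substitute one you can actually reach.
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # from now on, urlopen goes through the proxy
response = urllib2.urlopen('http://www.baidu.com')
```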


4. A complete example

```python
#coding:utf8
import urllib2
import cookielib

url = 'http://www.baidu.com'

print 'test1'
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'test2'
request = urllib2.Request(url)
request.add_header('user-agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'test3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
```

Web page parsers

1. Regular expressions (fuzzy string matching)

2. Bundled with Python: html.parser (structured parsing into a DOM tree)

3. Beautiful Soup (structured parsing)

Installation
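Beautiful Soup 4 can be installed with `pip install beautifulsoup4`.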

How it works

Create a BeautifulSoup object, which automatically turns the HTML into a DOM tree.

Search for nodes with find_all or find (find returns only the first match), by tag name, attribute, or text.

Access each node's name, attributes, and text.

Example:

```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string
soup = BeautifulSoup(html_doc,               # the HTML document string
                     'html.parser',          # the HTML parser to use
                     from_encoding='utf8')   # the document's encoding

# Search the tree: find_all(name, attrs, string)

node.name        # the node's tag name
node['href']     # access an attribute
node.get_text()  # the node's text
```

A complete example

```python
#coding:utf8
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'get a'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'get regex'
link_node = soup.find('a', href=re.compile(r"ill"))  # regex match; the r prefix keeps backslashes from being escaped
print link_node.name, link_node['href'], link_node.get_text()

print 'by class'
link_node = soup.find('p', class_='title')  # class_ needs the underscore because class is a Python keyword
print link_node.name, link_node.get_text()
```

4. lxml (structured parsing)
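The post doesn't expand on lxml. A minimal sketch of the idea, reusing the html_doc string from the example above; lxml is a third-party package, and this usage is an illustration rather than the post's own code:

```python
# Requires the third-party lxml package (pip install lxml).
from lxml import html

doc = html.fromstring(html_doc)  # parse the HTML string into an element tree
for a in doc.xpath('//a'):       # XPath query for every <a> element
    print a.get('href'), a.text_content()
```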

Further topics

Sites that require login, CAPTCHAs, AJAX, server-side anti-crawler defenses, multithreading, distributed crawling.

A simple crawler: https://github.com/su526664687/Simple-Spider.git