Python Extensions: Web Crawler Basics
2016-03-19 00:59
URL Manager
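The post gives no code for the URL manager, so here is a minimal sketch of the usual design: an in-memory manager with two sets, one for URLs waiting to be crawled and one for URLs already crawled. The class and method names (`UrlManager`, `add_new_url`, etc.) are illustrative assumptions, not from the original post.

```python
# Minimal in-memory URL manager (illustrative sketch; names are assumptions).
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs not yet crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Skip empty URLs and URLs we have already seen
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Pop one pending URL and mark it as crawled
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```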
Web Page Downloader
Ways to download a page with urllib2:
1. The simple method
```python
import urllib2

response = urllib2.urlopen('http://www.baidu.com')  # make the request directly (the URL needs a scheme)
print response.getcode()   # status code; 200 means success
cont = response.read()     # read the downloaded content
```
2. Adding data and an HTTP header
```python
import urllib
import urllib2

request = urllib2.Request(url)                      # create a Request object
request.add_data(urllib.urlencode({'a': '1'}))      # attach form data (note: add_data takes one argument, so key/value pairs must be urlencoded; this turns the request into a POST)
request.add_header('User-Agent', 'Mozilla/5.0')     # add an HTTP header, masquerading as a Mozilla browser
response = urllib2.urlopen(request)                 # send the request and get the response
```
3. Adding handlers for special scenarios
- HTTPCookieProcessor — for sites that require login, handled via cookies
- ProxyHandler — for sites that can only be reached through a proxy
- HTTPSHandler — for HTTPS-encrypted sites
- HTTPRedirectHandler — for pages that redirect between URLs
Example (cookies):
```python
import urllib2, cookielib

cj = cookielib.CookieJar()                                      # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener with cookie support
urllib2.install_opener(opener)                                  # install the opener so urllib2 gains cookie handling
urllib2.urlopen('http://www.baidu.com')                         # fetch the page with cookies enabled
```
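The other handlers are wired in exactly the same way. A minimal sketch using ProxyHandler; the proxy address here is a placeholder, not from the original post:

```python
import urllib2

# Route requests through an HTTP proxy (the address is a placeholder).
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
```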
4. Full example
```python
#coding:utf8
import urllib2
import cookielib

url = 'http://www.baidu.com'

print 'test1'   # plain request
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print 'test2'   # request with a custom User-Agent header
request = urllib2.Request(url)
request.add_header('user-agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'test3'   # request with cookie handling
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
```
Web Page Parser
1. Regular expressions (fuzzy string matching)
2. html.parser, which ships with Python (structured parsing into a DOM tree)
3. Beautiful Soup (structured parsing)
Installation
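Beautiful Soup 4 is installed with pip:

```
pip install beautifulsoup4
```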
How it works
1. Create a BeautifulSoup object, which automatically parses the HTML into a DOM tree.
2. Search for nodes with find_all, or with find (which returns only the first match); you can search by tag name, attributes, or text.
3. Access each node's name, attributes, and text.
Example:
```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML string
soup = BeautifulSoup(
    html_doc,               # the HTML string
    'html.parser',          # the HTML parser to use
    from_encoding='utf8')   # the encoding of the HTML document

# Search for nodes: find_all(name, attrs, string)

node.name          # get the tag name
node['href']       # access an attribute
node.get_text()    # get the text content
```
Complete example
```python
#coding:utf8
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'get a'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get lacie'
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'get regex'
# Regex match; the r prefix keeps backslashes from being treated as escapes
link_node = soup.find('a', href=re.compile(r"ill"))
print link_node.name, link_node['href'], link_node.get_text()

print 'by class'
# class needs a trailing underscore because class is a Python keyword
link_node = soup.find('p', class_='title')
print link_node.name, link_node.get_text()
```
4. lxml (structured parsing)
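The post does not show lxml usage; below is a minimal sketch, assuming lxml is installed (pip install lxml). The sample HTML and XPath query are illustrative only:

```python
# A minimal lxml sketch (assumes lxml is installed: pip install lxml)
from lxml import etree

tree = etree.HTML("<p class='title'><b>The Dormouse's story</b></p>")
print tree.xpath("//p[@class='title']/b/text()")  # ["The Dormouse's story"]
```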
Extensions
Further topics: sites that require login, CAPTCHAs, AJAX-loaded content, server-side anti-scraping measures, multithreading, and distributed crawling. A simple crawler implementation is available at: https://github.com/su526664687/Simple-Spider.git