
A Small Python Crawler

2016-02-05 17:53, 423 views

This script walks through all the links reachable from a web page. I learned the approach from a tutorial online.

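import re
import urllib.request
from collections import deque

que = deque()   # URLs waiting to be fetched (FIFO, so the crawl is breadth-first)
vis = set()     # URLs that have already been queued

url = 'http://news.dbanotes.net/'
que.append(url)
vis.add(url)

# Compile the link pattern once instead of on every page
com = re.compile(r'href="(.+?)"')

with open('G:/1.txt', 'w', encoding='utf-8') as f:
    while que:
        url = que.popleft()

        try:
            resp = urllib.request.urlopen(url, timeout=10)
            # Skip anything that is not an HTML page (images, feeds, ...)
            if 'html' not in (resp.getheader('Content-Type') or ''):
                continue
            data = resp.read().decode('utf-8')
        except Exception:
            # Unreachable URL, timeout, or a page that is not valid UTF-8
            continue

        for i in com.findall(data):
            # Follow absolute links only, and queue each link at most once
            if i not in vis and i.startswith('http'):
                vis.add(i)
                que.append(i)
                f.write(i + '\n')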
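
The regular expression above only matches double-quoted href attributes, and the crawler ignores relative links such as /about. One possible alternative, shown here only as a rough sketch, is to parse the page with the standard library's html.parser and resolve each link against the page URL with urllib.parse.urljoin. The names LinkCollector and extract_links are made up for this example and are not part of the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in html, as absolute URLs (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Small offline check: one relative link, one single-quoted absolute link
sample = '<a href="/about">About</a> <a href=\'http://example.com/x\'>X</a>'
print(extract_links(sample, 'http://news.dbanotes.net/'))
```

In the crawler, a call like extract_links(data, url) could stand in for com.findall(data), assuming the rest of the loop stays the same.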
