A Small Python Crawler
2016-02-05 17:53
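Starting from one web page, this walks every link it can reach, breadth-first. I learned the approach from a tutorial online.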
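import re
import urllib.request
from collections import deque

start_url = 'http://news.dbanotes.net/'
que = deque([start_url])   # URLs waiting to be fetched (breadth-first order)
vis = {start_url}          # URLs already queued, so none is fetched twice
href_re = re.compile(r'href="(.+?)"')  # compiled once, outside the loop

# Note: there is no page limit, so this runs until the queue is empty
# or you stop it by hand.
with open('G:/1.txt', 'w', encoding='utf-8') as f:
    while que:
        url = que.popleft()
        try:
            # The timeout is an added safeguard against hanging servers.
            with urllib.request.urlopen(url, timeout=10) as resp:
                # Skip anything that is not an HTML page (images, PDFs, ...).
                if 'html' not in (resp.getheader('Content-Type') or ''):
                    continue
                data = resp.read().decode('utf-8')
        except Exception:
            # Network or HTTP errors, bad URLs, or pages that are not UTF-8.
            continue
        for link in href_re.findall(data):
            # Only absolute links are followed; relative ones are ignored.
            if 'http' in link and link not in vis:
                vis.add(link)
                que.append(link)
                f.write(link + '\n')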
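The crawler only follows links that already contain 'http', so relative links such as /about are skipped. One possible way to pick those up as well is to resolve each href against the URL of the page it came from. The sketch below is only an illustration of that idea: the helper name extract_links and the example.com URLs are my own assumptions, not part of the code above, and it keeps the same simple href regex rather than using a real HTML parser.

```python
import re
from urllib.parse import urljoin, urldefrag

# Same simple pattern as the crawler above; it only matches double-quoted hrefs.
HREF_RE = re.compile(r'href="(.+?)"')


def extract_links(html, base_url):
    """Return the absolute http(s) links in html, in order, without duplicates.

    extract_links is a hypothetical helper written for this example.
    """
    seen = set()
    links = []
    for href in HREF_RE.findall(html):
        # Resolve relative links against the page URL and drop any #fragment,
        # so /about and /about#team count as the same page.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(('http://', 'https://')) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


if __name__ == '__main__':
    page = ('<a href="/about">About</a> '
            '<a href="http://example.com/x#top">X</a> '
            '<a href="mailto:me@example.com">Mail</a>')
    for link in extract_links(page, 'http://example.com/index.html'):
        print(link)
```

In the loop above, the inner for statement could then iterate over extract_links(data, url) instead of the raw regex matches. Checking the scheme with startswith also avoids queueing things like mailto: addresses that merely contain the text 'http'.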