The simplest way to scrape email addresses with Python
2015-06-18 15:48
http://www.jb51.net/article/57161.htm
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys


def getIPAddFromFile(fobj):
    # Dotted-quad IPv4 addresses; each octet is limited to 0-255.
    regex = re.compile(r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b')
    ipadds = regex.findall(fobj)
    print(ipadds)
    return ipadds


def getPhoneNumFromFile(fobj):
    # 11-digit mobile numbers starting with 1; the lookarounds keep the
    # pattern from matching inside a longer run of digits.
    regex = re.compile(r'(?<!\d)1\d{10}(?!\d)')
    phonenums = regex.findall(fobj)
    print(phonenums)
    return phonenums


def getMailAddFromFile(fobj):
    # Email addresses; TLDs longer than four letters are not matched.
    regex = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b", re.IGNORECASE)
    mails = regex.findall(fobj)
    print(mails)
    return mails


def getUrlFromFile(fobj):
    # http:// and https:// URLs, including percent-encoded characters.
    regex = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", re.IGNORECASE)
    urls = regex.findall(fobj)
    print(urls)
    return urls


def main(filePath):
    # Read the file as text so the str patterns apply; bytes that are not
    # valid UTF-8 are ignored.
    with open(filePath, encoding='utf-8', errors='ignore') as f:
        fobj = f.read()
    urllist = getUrlFromFile(fobj)
    mailList = getMailAddFromFile(fobj)
    phoneNum = getPhoneNumFromFile(fobj)
    ipaddlist = getIPAddFromFile(fobj)


if __name__ == '__main__':
    main(sys.argv[1])
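
To try the patterns without a file on disk, they can be run against an in-memory string. The following is only a minimal sketch: the sample text and the names `SAMPLE`, `MAIL_RE`, `IP_RE`, `PHONE_RE` and `URL_RE` are made up for illustration and are not part of the original script. The patterns closely follow those in the script above, and the phone pattern uses digit lookarounds so it does not fire inside longer digit runs.

```python
import re

# Hypothetical sample text, invented for this example only.
SAMPLE = (
    "Contact alice@example.com or bob.smith@mail.example.org. "
    "Server 192.168.1.10 is mirrored at https://example.com/docs?id=1 "
    "and the hotline is 13800138000."
)

# Email addresses (TLD limited to 2-4 letters, as in the script).
MAIL_RE = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")
# IPv4 addresses with every octet in 0-255.
IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)
# 11-digit numbers starting with 1, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")
# http:// and https:// URLs.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

mails = MAIL_RE.findall(SAMPLE)
ips = IP_RE.findall(SAMPLE)
phones = PHONE_RE.findall(SAMPLE)
urls = URL_RE.findall(SAMPLE)

print(mails)
print(ips)
print(phones)
print(urls)
```

Each `findall` call returns a plain list of strings, so the results can be filtered or written out directly.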
# -*- coding: utf-8 -*-
import re
import urllib.request


def getHtml(url):
    # Download the page and decode it to text (UTF-8 is assumed here).
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8', errors='ignore')
    return html


def getImg(html):
    # Despite its name, this function returns the email addresses in the page.
    # Alternative pattern that also covers phone numbers:
    #p=re.compile('[^\._-][\w\.-]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+$|^0\d{2,3}\d{7,8}$|^1[358]\d{9}$|^147\d{8}')
    regex = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b", re.IGNORECASE)
    imglist = regex.findall(html)
    return imglist


html = getHtml("http://tieba.baidu.com/p/3827945043")
print(getImg(html))
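
Forum threads often repeat the same address in quotes and replies, so a de-duplication step may be useful after extraction. The following is a rough sketch under that assumption: `unique_mails` and `SAMPLE_HTML` are hypothetical names and data that do not appear in the original post, and the example works on an in-memory string rather than a downloaded page.

```python
import re

# Same email pattern as in the script above.
MAIL_RE = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")


def unique_mails(html):
    """Return the email addresses in html, lower-cased, in first-seen order."""
    seen = set()
    result = []
    for mail in MAIL_RE.findall(html):
        key = mail.lower()  # treat addresses that differ only in case as equal
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical forum snippet, invented for this example only.
SAMPLE_HTML = """
<div class="post">Send it to Test.User@Example.com please</div>
<div class="post">+1, test.user@example.com</div>
<div class="post">mine is 123456@qq.com</div>
"""

mails = unique_mails(SAMPLE_HTML)
print(mails)
```

Lower-casing the whole address is a simplification: domain names are case-insensitive, while the local part may in principle be case-sensitive on some mail servers.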