Crawling user info from the top-answer questions of a Zhihu topic with Python
2016-01-12 21:22
Today I tried my own crawler code and scraped the location info of the users under all top-answer ("essence") questions in the Zhihu [homosexuality] topic.
Code:
__author__ = 'yang'
# -*- coding: utf-8 -*-
import configparser
import re
import time

import requests


def curTime():
    return '\n<!--' + time.strftime('%Y-%m-%d %H:%M:%S') + '-->'


def loginInfo():
    # test.ini holds the Zhihu account, password and browser cookies
    config = configparser.ConfigParser()
    config.read('test.ini')
    cookies = dict(config.items('COOKIES'))
    username = config.get('USER', 'username')
    password = config.get('USER', 'password')
    return username, password, cookies


def create_session():
    username, password, cookies = loginInfo()
    session = requests.session()
    login_data = {'email': username, 'password': password}
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36',
        'Host': 'www.zhihu.com',
        'Referer': 'http://www.zhihu.com/',
    }
    r = session.post('http://www.zhihu.com/login/email', data=login_data, headers=header)
    if r.json()['r'] == 1:
        print('Login failed, reason is:')
        for m in r.json()['data']:
            print(r.json()['data'][m])
        print('Use cookies to login...')
        has_cookies = any(k != '__name__' and v != '' for k, v in cookies.items())
        if not has_cookies:
            raise ValueError('Please fill in the cookies section of test.ini')
        r = session.get('http://www.zhihu.com/login/email', cookies=cookies)
        with open('login.html', 'w') as fp:
            fp.write(r.text)
    return session, cookies


def writeFile(name, content):
    with open(name, 'w') as fp:
        fp.write(content)


if __name__ == '__main__':
    requests_session, requests_cookies = create_session()
    with open('tong.html', 'w') as fp:
        fp.write(curTime())
    # Download every page of the topic's top answers
    for page in range(0, 49):
        url = 'https://www.zhihu.com/topic/19552984/top-answers?page=' + str(page)
        content = requests_session.get(url, cookies=requests_cookies).text
        with open('tong.html', 'a') as fp:
            fp.write(content)

    # Match the question links in the downloaded pages
    questionRegex = re.compile(r'<a class="question_link.*?href="(.*?)">')
    with open('tong.html') as fp:
        questionLinks = questionRegex.findall(fp.read())
    print(questionLinks)
    with open('resultLink.html', 'w') as fp:
        fp.write('\n'.join(questionLinks))

    # Collect the author links from each question page
    usrRegex = re.compile(r'<a class="author-link.*?href="(.*?)">')
    for link in questionLinks:
        url = 'https://www.zhihu.com' + link.strip()
        page = requests_session.get(url, cookies=requests_cookies).text
        usrLinks = usrRegex.findall(page)
        with open('usrLinks.html', 'a') as fp:
            fp.write('\n'.join(usrLinks) + '\n')

    # Deduplicate the user links
    with open('usrLinks.html') as fp:
        links = list(set(line.strip() for line in fp if line.strip()))

    # Visit each user's profile page and extract the location field
    locationRegex = re.compile(r'<span class="location item.*?title="(.*?)"')
    for link in links:
        url = 'https://www.zhihu.com' + link
        page = requests_session.get(url, cookies=requests_cookies).text
        location = locationRegex.findall(page)
        if location:
            with open('locations.html', 'a') as fp:
                fp.write('\n'.join(location) + '\n')
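To see what the two link-extraction regexes actually pull out, here is a minimal offline sketch: the HTML snippet is hand-written for illustration (real Zhihu markup may differ), and the closing `Counter` line only stands in for how the scraped `locations.html` could be tallied afterwards.

```python
import re
from collections import Counter

# Hand-written snippet for illustration only; real Zhihu pages may differ.
sample = ('<a class="question_link" href="/question/12345678">Q1</a>'
          '<a class="author-link" href="/people/some-user">u</a>')

questionRegex = re.compile(r'<a class="question_link.*?href="(.*?)">')
usrRegex = re.compile(r'<a class="author-link.*?href="(.*?)">')

print(questionRegex.findall(sample))  # ['/question/12345678']
print(usrRegex.findall(sample))       # ['/people/some-user']

# Once locations.html is filled, a frequency count is one line
# (the list below is a made-up stand-in for the scraped values):
locations = ['Beijing', 'Shanghai', 'Beijing']
print(Counter(locations).most_common())  # [('Beijing', 2), ('Shanghai', 1)]
```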