Python Parallel Crawler
2018-01-19 12:34
Python Parallelization
Covered below: 1) an introduction to parallelization and 2) how to use map.
1) Introduction to parallelization
- [x] Multiple threads process tasks at the same time
- [x] Efficient
- [x] Fast
2) Using map
The map function single-handedly takes care of iterating over the sequence, passing each element as an argument, and collecting the results:

```python
from multiprocessing.dummy import Pool

pool = Pool(4)  # typically the number of CPU cores
results = pool.map(crawl_func, url_list)  # crawl_func fetches one URL
```
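Since pool.map has the same calling convention as the builtin map, it is easy to try on a trivial function first. A minimal runnable sketch (the square function and pool size are illustrative, not from the original post):

```python
# -*- coding: utf-8 -*-
from multiprocessing.dummy import Pool  # thread pool with the Pool API

def square(x):
    return x * x

pool = Pool(4)                         # 4 worker threads (illustrative)
results = pool.map(square, range(10))  # same shape as map(square, range(10))
pool.close()                           # no more tasks will be submitted
pool.join()                            # wait for the workers to finish
print results                          # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```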
A timing comparison: fetch the first 20 pages of a Tieba thread sequentially, then again through a two-thread pool.

```python
# -*- coding: utf-8 -*-
from multiprocessing.dummy import Pool as ThreadPool
import requests
import time

def getsource(url):
    html = requests.get(url)  # fetch the page; only the elapsed time matters here

urls = []
for i in range(1, 21):
    newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
    urls.append(newpage)

# single-threaded: fetch the 20 pages one after another
time1 = time.time()
for i in urls:
    print i
    getsource(i)
time2 = time.time()
print u'single-threaded: ' + str(time2 - time1)

# parallel: the same 20 pages through a pool of 2 worker threads
pool = ThreadPool(2)
time3 = time.time()
results = pool.map(getsource, urls)
pool.close()
pool.join()
time4 = time.time()
print u'parallel: ' + str(time4 - time3)
```
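multiprocessing.dummy is a thread pool that mirrors the multiprocessing.Pool API. Threads are the right tool for a crawler, since the workers spend most of their time waiting on the network; but the same code runs on real processes by changing one import. A hedged sketch, assuming getsource and urls as defined above:

```python
# same Pool API, real processes instead of threads; worth trying only when
# the per-item work is CPU-bound rather than network-bound
from multiprocessing import Pool

pool = Pool(4)                       # pool size is illustrative
results = pool.map(getsource, urls)  # getsource/urls from the script above
pool.close()
pool.join()
```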
5. Hands-on: a Baidu Tieba crawler
Target site: http://tieba.baidu.com/p/3522395718
Target content: the author, text, and timestamp of every reply on the first 20 pages
Knowledge involved: Requests to fetch the pages, XPath to extract the content, map to run the crawler across multiple threads
tiebaspider.py
```python
# -*- coding: utf-8 -*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import json  # the reply metadata is embedded in the page as JSON
import sys

# force the default encoding to utf-8 (Python 2)
reload(sys)
sys.setdefaultencoding('utf-8')

def towrite(contentdict):
    f.writelines(u'reply time: ' + str(contentdict['topic_reply_time']) + '\n')
    f.writelines(u'reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
    f.writelines(u'reply author: ' + str(contentdict['user_name']) + '\n\n')

def spider(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    content_field = selector.xpath('//div[@class="l_post l_post_bright "]')
    item = {}
    for each in content_field:
        # each is already an XPath result node, so its attribute is read with a
        # bare @; the data-field attribute holds JSON (undo any leftover HTML
        # escaping of the quotes before parsing)
        reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot;', '"'))
        author = reply_info['author']['user_name']
        content = each.xpath('div[@class="d_post_content_main"]/div/cc/'
                             'div[@class="d_post_content j_d_post_content "]/text()')[0]
        reply_time = reply_info['content']['date']
        print content
        print reply_time
        print author
        item['user_name'] = author
        item['topic_reply_content'] = content
        item['topic_reply_time'] = reply_time
        towrite(item)

if __name__ == "__main__":
    pool = ThreadPool(2)
    f = open('content.txt', 'a')
    page = []
    for i in range(1, 21):
        newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
        page.append(newpage)
    results = pool.map(spider, page)
    pool.close()
    pool.join()
    f.close()
```
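One caveat: towrite runs on the worker threads but writes to the single shared file handle f, so lines from different replies can interleave in content.txt. A hedged variation (the lock is our addition, not part of the original script) serializes the writes:

```python
import threading

write_lock = threading.Lock()  # our addition: one writer at a time

def towrite(contentdict):
    # holding the lock keeps the three lines of one reply together
    with write_lock:
        f.writelines(u'reply time: ' + str(contentdict['topic_reply_time']) + '\n')
        f.writelines(u'reply content: ' + unicode(contentdict['topic_reply_content']) + '\n')
        f.writelines(u'reply author: ' + str(contentdict['user_name']) + '\n\n')
```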