Scrapy Study Notes (4): Distributed Crawling of JD Product Details, Comments, and Comment Summaries
2017-12-12 16:30
Goal: distributed crawling of JD product details, comments, and comment summaries
Powered by:
Python 3.6
Scrapy 1.4
pymysql
json
redis
Project address: https://github.com/Dengqlbq/JDSpider
Step 1: Introduction
This post focuses on the code itself; the reasoning behind the code is described in a separate post: http://blog.csdn.net/sinat_34200786/article/details/78954617
Step 2: Overall Architecture
Analyzing the target yields the following requirements:
- specify a keyword and crawl the ids of the matching products
- crawl product details
- crawl product comments
Implementing everything in a single Spider would make the code bloated, so the project is split into four parts:
JDSpider
    ProjectStart
    JDUrlsSpider
    JDDetailSpider
    JDCommentSpider
- ProjectStart: specifies the keyword and pushes the urls for the requested number of pages
- JDUrlsSpider: extracts all product ids from each page and builds the detail-urls and comment-urls
- JDDetailSpider: extracts product details from the detail-urls
- JDCommentSpider: extracts product comments from the comment-urls
The spiders communicate through a server-side Redis instance, which mainly passes the detail-urls and comment-urls between them.
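All four parts rely on scrapy-redis for this. A minimal sketch of the settings.py entries each sub-project needs (HOST and PASS are placeholders; the actual values live in the repo):

# settings.py excerpt shared by the sub-projects (a sketch; HOST/PASS are placeholders)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'              # schedule requests through redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # dedupe across spider instances
REDIS_HOST = 'HOST'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': 'PASS'}  # the pipeline below reads REDIS_PARAMS['password']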
Step 3: ProjectStart
Specify a keyword and push the urls for the requested number of pages; a "page" here means one page of results when browsing products on JD.
# JDSpider/ProjectStart/Test.py
import redis
from urllib import parse

# Redis configuration
r = redis.Redis(host='HOST', port=6379, password='PASS')

# Edit keywords and page_count as needed
keywords = '手机'
page_count = 100

keywords = parse.quote(keywords)
current_page = 1
start_index = 1
url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt' \
      '=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0'

for i in range(page_count):
    # Feed the page url to JDUrlsSpider
    r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
    current_page += 2
    start_index += 60
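current_page advances by 2 and start_index by 60 per push because, as Step 4 explains, JD splits each visible page into two internally numbered halves of about 30 items each. As a quick sanity check, appending a print to Test.py shows the first generated url:

# Optional check at the end of Test.py: print the first generated url
print(url.format(keywords, keywords, 1, 1))
# -> https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1
#    &rt=1&stop=1&vt=2&wq=%E6%89%8B%E6%9C%BA&page=1&s=1&click=0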
Step 4: JDUrlsSpider
Extract all product ids from each page and build the detail-urls and comment-urls. Create the project:
cd JDSpider
scrapy startproject JDUrls
When you browse a page of products, JD returns only half of them up front; the other half is loaded asynchronously once you scroll to the bottom of the page.
So to collect every product id on a page, the spider must also construct that asynchronous request itself.
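The spider below reads the template for that asynchronous request from its settings as HIDE_URL; the real value lives in the repo's JDUrls/settings.py. An assumed shape, based on JD's s_new.php endpoint at the time, where {0} is the quoted keyword, {1} the internal page number, and {2} a comma-separated list of the product ids already shown:

# JDSpider/JDUrls/settings.py excerpt (an assumed value, for illustration)
HIDE_URL = ('https://search.jd.com/s_new.php?keyword={0}&enc=utf-8'
            '&page={1}&scrolling=y&show_items={2}')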
# JDSpider/JDUrls/spiders/JDUrlsSpider.py
from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re


class JDUrlsSpider(RedisSpider):
    # Extract all product ids on the given page and build detail-related
    # and comment-related urls
    name = 'JDUrlsSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDUrlsSpider'
    settings = get_project_settings()
    hide_url = settings['HIDE_URL']

    def parse(self, response):
        # Product ids that are not hidden on the page
        nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]'
                              '[@data-sku]/@data-sku').extract()
        keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]

        # The hidden products belong to the same visible page, but the
        # asynchronous request uses a different internal page number
        page = re.findall(r'page=(\d+)', response.url)[0]
        page = int(page) + 1

        # Comma-separated list of the ids already shown, required by the
        # asynchronous request
        s = ','.join(nums)

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
        yield scrapy.Request(self.hide_url.format(keyword, page, s),
                             callback=self.get_hidden)

    def get_hidden(self, response):
        # Product ids hidden on the page
        nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()
        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
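Both callbacks yield a JDUrlsItem, which needs only a single field. A minimal sketch of JDUrls/items.py consistent with the spider above:

# JDSpider/JDUrls/items.py (a minimal sketch)
import scrapy


class JDUrlsItem(scrapy.Item):
    num_list = scrapy.Field()  # all product ids found in one response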
After the product ids are extracted, the pipeline builds the detail-urls and comment-urls and stores them in the server-side Redis:
# JDSpider/JDUrls/pipelines.py
import redis
from scrapy.utils.project import get_project_settings


class JDUrlsPipeline(object):
    def __init__(self):
        self.settings = get_project_settings()
        self.detail_url = self.settings['GOODS_DETAIL_URL']
        self.comment_url = self.settings['COMMENT_URL']
        self.r = redis.Redis(host=self.settings['REDIS_HOST'],
                             port=self.settings['REDIS_PORT'],
                             password=self.settings['REDIS_PARAMS']['password'])

    def process_item(self, item, spider):
        # Build detail-related and comment-related urls from the product ids
        # and push them to the server-side Redis database
        for n in item['num_list']:
            self.r.lpush('JDDetailSpider', self.detail_url.format(n))
            self.r.lpush('JDCommentSpider', self.comment_url.format(n))
        return item
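The pipeline only takes effect once it is enabled in the project settings, in the usual Scrapy way:

# JDSpider/JDUrls/settings.py excerpt
ITEM_PIPELINES = {
    'JDUrls.pipelines.JDUrlsPipeline': 300,
}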
Step 5: JDDetailSpider
Extract product details from the detail-urls. JDUrlsSpider has already stored the detail-urls in the server-side Redis, so JDDetailSpider only needs to fetch urls from Redis and crawl the details.
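Before launching the spider you can check that JDUrlsSpider has actually filled the queue; a quick sketch reusing the connection placeholders from Test.py:

import redis

r = redis.Redis(host='HOST', port=6379, password='PASS')
print(r.llen('JDDetailSpider'))          # number of pending detail-urls
print(r.lrange('JDDetailSpider', 0, 2))  # peek at the first few entries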
Create the project:
cd JDSpider
scrapy startproject JDDetail
The product detail fields to crawl are as follows (the comments give the intended MySQL column types):
# JDSpider/JDDetail/items.py
import scrapy


class JDDetailItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()                # TINYTEXT
    price = scrapy.Field()               # FLOAT
    owner = scrapy.Field()               # TINYTEXT
    jd_sel = scrapy.Field()              # TINYINT
    global_buy = scrapy.Field()          # TINYINT
    flag = scrapy.Field()                # TINYINT
    comment_count = scrapy.Field()       # INT
    good_count = scrapy.Field()          # INT
    default_good_count = scrapy.Field()  # INT
    general_count = scrapy.Field()       # INT
    poor_count = scrapy.Field()          # INT
    after_count = scrapy.Field()         # INT
    good_rate = scrapy.Field()           # FLOAT
    general_rate = scrapy.Field()        # FLOAT
    poor_rate = scrapy.Field()           # FLOAT
    average_score = scrapy.Field()       # FLOAT
    num = scrapy.Field()                 # TINYTEXT
When crawling the details, the price and the comment-summary data are loaded asynchronously, so the spider has to construct two extra requests.
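PRICE_URL and COMMENT_EXCERPT_URL are read from JDDetail/settings.py; the templates below are assumptions matching JD's public endpoints at the time of writing and the parsing done in get_price() and get_comment(), shown as a standalone demo using requests:

# Standalone demo of the two asynchronous endpoints (assumed url templates)
import json
import requests

PRICE_URL = 'https://p.3.cn/prices/mgets?skuIds=J_{0}'
COMMENT_EXCERPT_URL = ('https://club.jd.com/comment/'
                       'productCommentSummaries.action?referenceIds={0}')

num = '5089253'  # a hypothetical product id
price_json = json.loads(requests.get(PRICE_URL.format(num)).text)
print(price_json[0]['p'])  # the price, as read in get_price()

summary_json = json.loads(requests.get(COMMENT_EXCERPT_URL.format(num)).text)
print(summary_json['CommentsCount'][0]['GoodRate'])  # as read in get_comment()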
# JDSpider/JDDetail/spiders/JDDetailSpider.py
from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json


class JDDetailSpider(RedisSpider):
    # Extract the details of the given product
    name = 'JDDetailSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDDetailSpider'
    settings = get_project_settings()
    comment_url = settings['COMMENT_EXCERPT_URL']
    price_url = settings['PRICE_URL']

    def parse(self, response):
        item = JDDetailItem()

        # Global-buy (overseas) products live under the hk subdomain
        global_buy = 'hk' in response.url

        # Product name
        raw_name = re.findall(r'<div class="sku-name">(.*?)</div>',
                              response.text, re.S)[0].strip()
        jd_sel = '京东精选' in raw_name

        # Strip extra markup from the name, e.g. a possible "京东精选" tag
        name = raw_name.split('>')[-1].strip()

        # Global-buy pages mark up the shop name differently
        if not global_buy:
            owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown fr"]'
                                        '/div[@class="item"]/div[@class="name"]'
                                        '/a/text()').extract()
        else:
            owner_list = response.xpath('//div[@class="shopName"]/strong/span'
                                        '/a/text()').extract()

        # Sold by JD itself?
        if len(owner_list) == 0:
            owner = '自营'
            flag = True
        else:
            owner = owner_list[0]
            flag = '自营' in owner

        num = re.findall(r'(\d+)', response.url)[0]

        item['name'] = name
        item['owner'] = owner
        item['flag'] = flag
        item['global_buy'] = global_buy
        item['jd_sel'] = jd_sel
        item['num'] = num

        # Request the price json
        price_request = scrapy.Request(self.price_url.format(num),
                                       callback=self.get_price)
        price_request.meta['item'] = item
        yield price_request

    def get_price(self, response):
        item = response.meta['item']
        price_json = json.loads(response.text)
        item['price'] = price_json[0]['p']
        num = item['num']

        # Request the comment-summary json
        comment_request = scrapy.Request(self.comment_url.format(num),
                                         callback=self.get_comment)
        comment_request.meta['item'] = item
        yield comment_request

    def get_comment(self, response):
        item = response.meta['item']
        comment_json = json.loads(response.text)
        comment_json = comment_json['CommentsCount'][0]
        item['comment_count'] = comment_json['CommentCount']
        item['good_count'] = comment_json['GoodCount']
        item['default_good_count'] = comment_json['DefaultGoodCount']
        item['general_count'] = comment_json['GeneralCount']
        item['poor_count'] = comment_json['PoorCount']
        item['after_count'] = comment_json['AfterCount']
        item['good_rate'] = comment_json['GoodRate']
        item['general_rate'] = comment_json['GeneralRate']
        item['poor_rate'] = comment_json['PoorRate']
        item['average_score'] = comment_json['AverageScore']
        yield item
Step 6: JDCommentSpider
Extract product comments from the comment-urls. JDUrlsSpider has already stored the comment-urls in the server-side Redis, so JDCommentSpider only needs to fetch urls from Redis and crawl the comments.
Create the project:
cd JDSpider
scrapy startproject JDComment
The comment fields to crawl are as follows:
# JDSpider/JDComment/items.py
import scrapy


class JDCommentItem(scrapy.Item):
    good_num = scrapy.Field()  # TINYTEXT: id of the commented product
    content = scrapy.Field()   # TEXT: full comment text
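The type comments mirror the intended MySQL columns, and pymysql from the "Powered by" list handles storage. A minimal sketch of a storage pipeline, assuming a database jd with a table comment(good_num, content); enable it through ITEM_PIPELINES as in Step 4:

# JDSpider/JDComment/pipelines.py (a sketch; database and table names are assumptions)
import pymysql


class JDCommentPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', user='root', password='PASS',
                                    db='jd', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute('INSERT INTO comment (good_num, content) VALUES (%s, %s)',
                            (item['good_num'], item['content']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()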
The json returned by the initial comment-url contains only 10 comments, but its maxPage field indicates how many pages of comments can be fetched, so a simple loop collects the remaining pages.
# JDSpider/JDComment/spiders/JDCommentSpider.py
from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re


class JDCommentSpider(RedisSpider):
    # Extract the comments of the given product (full comments, not excerpts)
    name = 'JDCommentSpider'
    allow_domains = ['www.jd.com']
    redis_key = 'JDCommentSpider'
    settings = get_project_settings()
    comment_url = settings['COMMENT_URL']

    def parse(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]
        max_page_num = comment_json['maxPage']

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item

        # The first page was just parsed; request the remaining comment pages
        for i in range(2, max_page_num):
            yield scrapy.Request(self.comment_url.format(good_number, i),
                                 callback=self.get_leftover)

    def get_leftover(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item
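The spider formats COMMENT_URL with two slots, the product id and the page number, and its regex expects a productId parameter. An assumed shape consistent with both (the real value lives in JDComment/settings.py; the single-slot COMMENT_URL used by the JDUrls pipeline presumably hardcodes the first page):

# JDSpider/JDComment/settings.py excerpt (an assumed value, for illustration)
COMMENT_URL = ('https://club.jd.com/comment/productPageComments.action'
               '?productId={0}&score=0&sortType=5&page={1}&pageSize=10')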
Step 7: Launch the Spiders
cd ProjectStart
python Test.py
cd JDUrls
scrapy crawl JDUrlsSpider
cd JDDetail
scrapy crawl JDDetailSpider
(This is a distributed crawler; you can run more than one JDDetailSpider instance.)
cd JDComment
scrapy crawl JDCommentSpider
(This is a distributed crawler; you can run more than one JDCommentSpider instance.)
Results