
Scrapy Study Notes (4): Distributed Crawling of JD Product Details, Comments and Comment Summaries


Goal: distributed crawling of JD product details, comments and comment summaries

Powered by:

Python 3.6

Scrapy 1.4

pymysql

json

redis

Project repository: https://github.com/Dengqlbq/JDSpider

Step 1: Background

This post focuses on the implementation; the reasoning behind the design is described in a separate post.

Design notes: http://blog.csdn.net/sinat_34200786/article/details/78954617

Step 2: Overall Architecture

Analyzing the target reveals the following requirements:

Crawl the product ids for a given keyword
Crawl the product details
Crawl the product comments


Implementing all of this in a single Spider would make the code bloated, so the whole project is split into four parts

JDSpider

ProjectStart

JDUrlsSpider

JDDetailSpider

JDCommentSpider

ProjectStart         generates the search-page urls for a given keyword and page count
JDUrlsSpider         extracts every product id from a page and builds detail-urls and comment-urls
JDDetailSpider       crawls product details from the detail-urls
JDCommentSpider      crawls product comments from the comment-urls


The Spiders communicate through a server-side redis instance; mainly, the detail-urls and comment-urls are passed between them.
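Each sub-project talks to that shared redis through scrapy-redis. The post never shows the settings file, so the following is only a minimal sketch of the entries the later code relies on (HOST and PASS are placeholders, as in Test.py below):

# JDSpider/<sub-project>/settings.py (excerpt; a sketch, not the repository's actual file)

# Schedule requests and deduplicate fingerprints through the shared redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

REDIS_HOST = 'HOST'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': 'PASS'}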

Step 3: ProjectStart

Generate the urls for a given keyword and a given number of pages.

Here a 'page' means one page of JD search results for the keyword.

# JDSpider/ProjectStart/Test.py

import redis
from urllib import parse

# Redis configuration
r = redis.Redis(host='HOST', port=6379, password='PASS')

# Set keywords and page_count as needed, e.g. '手机' (mobile phones)
keywords = '手机'
page_count = 100

keywords = parse.quote(keywords)
current_page = 1
start_index = 1

url = 'https://search.jd.com/Search?keyword={0}&enc=utf-8&qrst=1&rt' \
      '=1&stop=1&vt=2&wq={1}&page={2}&s={3}&click=0'

for i in range(page_count):
    # Feed the start urls to JDUrlsSpider
    r.lpush('JDUrlsSpider', url.format(keywords, keywords, current_page, start_index))
    # One visible search page maps to two internal pages of 30 items each,
    # so the page parameter advances by 2 and the start index s by 60
    current_page += 2
    start_index += 60


Step 4: JDUrlsSpider

Extract every product id from a page and build the detail-urls and comment-urls.

Create the project:

cd JDSpider
scrapy startproject JDUrls


When you open a page of search results, JD initially returns only half of the products; the other half is loaded asynchronously once you scroll to the bottom of the page.

So, to collect every product id on a page, the spider must construct that asynchronous request itself.

# JDSpider/JDUrls/spiders/JDUrlsSpider.py

from scrapy_redis.spiders import RedisSpider
from JDUrls.items import JDUrlsItem
from scrapy.utils.project import get_project_settings
import scrapy
import re


class JDUrlsSpider(RedisSpider):
    # Collect every product id on the given pages and build
    # detail-related and comment-related urls from them
    name = 'JDUrlsSpider'
    allowed_domains = ['jd.com']
    redis_key = 'JDUrlsSpider'

    settings = get_project_settings()
    hide_url = settings['HIDE_URL']

    def parse(self, response):
        # Product ids that are visible in the page as delivered
        nums = response.xpath('//ul[@class="gl-warp clearfix"]/li[@class="gl-item"]'
                              '[@data-sku]/@data-sku').extract()

        keyword = re.findall(r'keyword=(.*?)&enc', response.url)[0]

        # The hidden half of the same visible page is requested
        # with the next internal page number
        page = re.findall(r'page=(\d+)', response.url)[0]
        page = int(page) + 1

        # The lazy-load request expects a comma-separated list of the visible ids
        s = ','.join(nums)

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item

        yield scrapy.Request(self.hide_url.format(keyword, page, s),
                             callback=self.get_hidden)

    def get_hidden(self, response):
        # Product ids from the asynchronously loaded half of the page
        nums = response.xpath('//li[@class="gl-item"][@data-sku]/@data-sku').extract()

        item = JDUrlsItem()
        item['num_list'] = nums
        yield item
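HIDE_URL is read from the settings and is not shown in the post. Judging from its three format arguments (keyword, internal page number, comma-separated id list), it is presumably JD's s_new.php scroll endpoint; a plausible value, offered only as an assumption:

# JDSpider/JDUrls/settings.py (excerpt; assumed value)

# Lazy-load endpoint for the hidden half of a search page:
# {0} = keyword, {1} = internal page number, {2} = ids already shown
HIDE_URL = 'https://search.jd.com/s_new.php?keyword={0}&enc=utf-8&qrst=1&rt=1' \
           '&stop=1&vt=2&page={1}&scrolling=y&show_items={2}'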


With the product ids extracted, the pipeline builds the detail-urls and comment-urls and pushes them to the server-side redis:

# JDSpider/JDUrls/pipelines.py

import redis
from scrapy.utils.project import get_project_settings


class JDUrlsPipeline(object):

    def __init__(self):
        self.settings = get_project_settings()
        self.detail_url = self.settings['GOODS_DETAIL_URL']
        self.comment_url = self.settings['COMMENT_URL']

        self.r = redis.Redis(host=self.settings['REDIS_HOST'],
                             port=self.settings['REDIS_PORT'],
                             password=self.settings['REDIS_PARAMS']['password'])

    def process_item(self, item, spider):
        # Turn each product id into a detail-related url and a comment-related
        # url and push them to the server-side redis lists
        for n in item['num_list']:
            self.r.lpush('JDDetailSpider', self.detail_url.format(n))
            self.r.lpush('JDCommentSpider', self.comment_url.format(n))
        return item
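The pipeline only runs if it is registered in the sub-project's settings. A minimal sketch (the priority value 300 is an arbitrary choice):

# JDSpider/JDUrls/settings.py (excerpt)

ITEM_PIPELINES = {
    'JDUrls.pipelines.JDUrlsPipeline': 300,
}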


Step 5: JDDetailSpider

Crawl the product details from the detail-urls.

JDUrlsSpider has already pushed the detail-urls to the server-side redis, so JDDetailSpider only needs to pop urls from redis and crawl the details.

Create the project:

cd JDSpider
scrapy startproject JDDetail


The product-detail fields to crawl are listed below (the comments give the intended MySQL column types):

# JDSpider/JDDetail/items.py

import scrapy


class JDDetailItem(scrapy.Item):
    # define the fields for your item here like:

    # TINYTEXT
    name = scrapy.Field()
    # FLOAT
    price = scrapy.Field()
    # TINYTEXT
    owner = scrapy.Field()
    # TINYINT
    jd_sel = scrapy.Field()
    # TINYINT
    global_buy = scrapy.Field()
    # TINYINT
    flag = scrapy.Field()
    # INT
    comment_count = scrapy.Field()
    # INT
    good_count = scrapy.Field()
    # INT
    default_good_count = scrapy.Field()
    # INT
    general_count = scrapy.Field()
    # INT
    poor_count = scrapy.Field()
    # INT
    after_count = scrapy.Field()
    # FLOAT
    good_rate = scrapy.Field()
    # FLOAT
    general_rate = scrapy.Field()
    # FLOAT
    poor_rate = scrapy.Field()
    # FLOAT
    average_score = scrapy.Field()
    # TINYTEXT
    num = scrapy.Field()
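The column-type comments above suggest the items end up in MySQL via pymysql (listed under "Powered by"), but the post never shows that pipeline. Below is a minimal sketch under two assumptions: a jd_detail table whose columns match the field names, and hypothetical MYSQL_* settings entries:

# JDSpider/JDDetail/pipelines.py (a sketch; table and setting names are assumptions)

import pymysql
from scrapy.utils.project import get_project_settings


class JDDetailPipeline(object):

    def __init__(self):
        settings = get_project_settings()
        # MYSQL_* are hypothetical setting names, not confirmed by the post
        self.conn = pymysql.connect(host=settings['MYSQL_HOST'],
                                    user=settings['MYSQL_USER'],
                                    password=settings['MYSQL_PASS'],
                                    db=settings['MYSQL_DB'],
                                    charset='utf8mb4')

    def process_item(self, item, spider):
        # Insert every field of the item into the assumed jd_detail table
        data = dict(item)
        cols = ', '.join(data.keys())
        marks = ', '.join(['%s'] * len(data))
        with self.conn.cursor() as cursor:
            cursor.execute('INSERT INTO jd_detail ({0}) VALUES ({1})'.format(cols, marks),
                           list(data.values()))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()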


When crawling the details, the price and the comment-summary data are loaded asynchronously, so two extra requests have to be constructed.

# JDSpider/JDDetail/spiders/JDDetailSpider.py

from scrapy_redis.spiders import RedisSpider
from JDDetail.items import JDDetailItem
from scrapy.utils.project import get_project_settings
import scrapy
import re
import json


class JDDetailSpider(RedisSpider):
    # Crawl the details of the given products
    name = 'JDDetailSpider'
    # 'jd.hk' covers global-buy pages; '3.cn' covers the price
    # endpoint (see the settings sketch after this block)
    allowed_domains = ['jd.com', 'jd.hk', '3.cn']
    redis_key = 'JDDetailSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_EXCERPT_URL']
    price_url = settings['PRICE_URL']

    def parse(self, response):
        item = JDDetailItem()

        # Global-buy (overseas) products live under jd.hk urls
        if 'hk' in response.url:
            global_buy = True
        else:
            global_buy = False

        # Product name
        raw_name = re.findall(r'<div class="sku-name">(.*?)</div>',
                              response.text, re.S)[0].strip()
        if '京东精选' in raw_name:
            jd_sel = True
        else:
            jd_sel = False

        # Strip extra markup from the name, e.g. a possible "京东精选" badge
        name_list = raw_name.split('>')
        name = name_list[len(name_list) - 1].strip()

        # Global-buy pages mark the shop name differently
        if not global_buy:
            owner_list = response.xpath('//div[@class="J-hove-wrap EDropdown fr"]'
                                        '/div[@class="item"]/div[@class="name"]'
                                        '/a/text()').extract()
        else:
            owner_list = response.xpath('//div[@class="shopName"]/strong/span/a/text()').extract()

        # Is the product sold by JD itself ("自营")?
        if len(owner_list) == 0:
            owner = '自营'
            flag = True
        else:
            owner = owner_list[0]
            if '自营' in owner:
                flag = True
            else:
                flag = False

        num = re.findall(r'(\d+)', response.url)[0]

        item['name'] = name
        item['owner'] = owner
        item['flag'] = flag
        item['global_buy'] = global_buy
        item['jd_sel'] = jd_sel
        item['num'] = num

        # Request the price json
        price_request = scrapy.Request(self.price_url.format(num), callback=self.get_price)
        price_request.meta['item'] = item
        yield price_request

    def get_price(self, response):
        item = response.meta['item']

        price_json = json.loads(response.text)
        item['price'] = price_json[0]['p']
        num = item['num']

        # Request the comment-summary json
        comment_request = scrapy.Request(self.comment_url.format(num), callback=self.get_comment)
        comment_request.meta['item'] = item
        yield comment_request

    def get_comment(self, response):
        item = response.meta['item']

        comment_json = json.loads(response.text)
        comment_json = comment_json['CommentsCount'][0]

        item['comment_count'] = comment_json['CommentCount']
        item['good_count'] = comment_json['GoodCount']
        item['default_good_count'] = comment_json['DefaultGoodCount']
        item['general_count'] = comment_json['GeneralCount']
        item['poor_count'] = comment_json['PoorCount']
        item['after_count'] = comment_json['AfterCount']
        item['good_rate'] = comment_json['GoodRate']
        item['general_rate'] = comment_json['GeneralRate']
        item['poor_rate'] = comment_json['PoorRate']
        item['average_score'] = comment_json['AverageScore']

        yield item
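PRICE_URL and COMMENT_EXCERPT_URL again live in the omitted settings file. The way the responses are parsed (price_json[0]['p'], CommentsCount[0]) matches JD's public price and comment-summary endpoints of that era, so plausible values, offered as assumptions, would be:

# JDSpider/JDDetail/settings.py (excerpt; assumed values)

# Price json: returns a list like [{"id": "J_123", "p": "1999.00", ...}]
PRICE_URL = 'https://p.3.cn/prices/mgets?skuIds=J_{0}'

# Comment summary json: returns {"CommentsCount": [{"CommentCount": ..., ...}]}
COMMENT_EXCERPT_URL = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={0}'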


Step 6: JDCommentSpider

Crawl the product comments from the comment-urls.

JDUrlsSpider has already pushed the comment-urls to the server-side redis, so JDCommentSpider only needs to pop urls from redis and crawl the comments.

Create the project:

cd JDSpider
scrapy startproject JDComment


The comment fields to crawl are as follows:

# JDSpider/JDComment/items.py

import scrapy


class JDCommentItem(scrapy.Item):

    # TINYTEXT
    good_num = scrapy.Field()
    # TEXT
    content = scrapy.Field()


The json returned by the initial comment-url contains only 10 comments, but its maxPage field gives the number of pages that can be fetched, so a simple loop retrieves the remaining comments.

# JDSpider/JDComment/spiders/JDCommentSpider.py

from scrapy_redis.spiders import RedisSpider
from JDComment.items import JDCommentItem
from scrapy.utils.project import get_project_settings
import scrapy
import json
import re


class JDCommentSpider(RedisSpider):
    # Crawl the comments of the given products (full comments, not excerpts)
    name = 'JDCommentSpider'
    allowed_domains = ['jd.com']
    redis_key = 'JDCommentSpider'

    settings = get_project_settings()
    comment_url = settings['COMMENT_URL']

    def parse(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]
        max_page_num = comment_json['maxPage']

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item

        # Fetch the remaining comment pages
        for i in range(2, max_page_num):
            yield scrapy.Request(self.comment_url.format(good_number, i),
                                 callback=self.get_leftover)

    def get_leftover(self, response):
        comment_json = json.loads(response.text)
        good_number = re.findall(r'productId=(\d+)', response.url)[0]

        for com in comment_json['comments']:
            item = JDCommentItem()
            item['good_num'] = good_number
            item['content'] = com['content']
            yield item
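COMMENT_URL is the same setting JDUrlsPipeline used earlier. Since the spider pulls productId out of the url and maxPage/comments out of the json, it is presumably JD's paginated full-comment endpoint; a plausible shape ({0} = sku id, {1} = page number):

# JDSpider/JDComment/settings.py (excerpt; assumed value)

COMMENT_URL = 'https://sclub.jd.com/comment/productPageComments.action' \
              '?productId={0}&score=0&sortType=5&page={1}&pageSize=10'

Note that JDUrlsPipeline formats this template with only the sku id, so the value actually stored in the repository's settings may handle the page placeholder differently; the sketch above only illustrates the endpoint's shape.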


Step 7: Launching the Spiders

cd ProjectStart
python Test.py


cd JDUrls
scrapy crawl JDUrlsSpider


cd JDDetail
scrapy crawl JDDetailSpider
(This is a distributed crawler, so you can run more than one JDDetailSpider instance)


cd JDComment
scrapy crawl JDCommentSpider
(This is a distributed crawler, so you can run more than one JDCommentSpider instance)


Results

[Screenshots of the scraped data omitted]
