Scraping JD mechanical keyboard review counts with Scrapy and plotting the results
2018-02-01 08:23
Introduction
I recently wanted to learn more about mechanical keyboards, so I used Scrapy to crawl JD's mechanical keyboard listings and then used Python to analyze and plot the review counts by shop name.
Analysis
Before writing the crawler, we need to understand how JD serves its mechanical keyboard search pages.
1. Open JD and search for 机械键盘 (mechanical keyboard):

```
# page url
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=fdac35af19ef4c7bbe23defb205b1b59
```
2. View the page source
The source shows that only 30 items are rendered by default; once you scroll past item 30 in the browser, the page automatically loads the next 30 items via Ajax. Checking the developer tools reveals the URL used for this asynchronous load:

```
# last 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=2&s=27&scrolling=y&log_id=1517196404.59517&tpl=1_M&show_items=3378484,6218105,3204859,2629440,3491212,2991278,1832316,4103095,5028795,2694404,3034311,1543721098,3606368,1792545,4911552,10494209225,2818591,2155852,1882111,3491218,584773,2942614,4285176,4873773,4106737,3204891,1495945,5259880,12039586866,3093295
```
Note:
- the URL contains "page=2"
- the show_items value is the list of "data-sku" attributes of the first 30 items from the page source
- once the Ajax request returns the last 30 items, the page's content is complete
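The relationship between the page source and the Ajax URL can be sketched as follows. This is a minimal illustration with a shortened, made-up SKU list (a real page yields 30 values from `//li[@class="gl-item"]/@data-sku`), and the query string is trimmed to the parameters discussed above:

```python
# -*- coding: utf-8 -*-
# Build the Ajax URL for the last 30 items from the data-sku values
# of the first 30 items (illustrative 3-SKU list instead of 30).
sku_list = ['3378484', '6218105', '3204859']
show_items = ','.join(sku_list)

base = ('https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8'
        '&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&scrolling=y&tpl=1_M')
ajax_url = base + '&page=2&show_items=' + show_items
print(ajax_url)
```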
3. Analyze pagination
Click page 2 and inspect the URLs:

```
# page 2, first 30 items
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=3&s=57&click=0
# page 2, last 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=4&s=84&scrolling=y&log_id=1517225828.64245&tpl=1_M&show_items=14689611523,1365181,3890366,3086129,5455802,4237668,3931658,3491228,1654797409,2361918,5442762,4237678,5225170,4960228,4237662,3931616,3491188,5009394,10151123711,4838698,4911578,1543721097,3093301,4838762,1836476,5910288,1135833,4277018,5028785,1324969
```
Click page 3 and inspect the URLs:

```
# page 3, first 30 items
https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=5&s=110&click=0
# page 3, last 30 items
https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&page=6&s=137&scrolling=y&log_id=1517225931.50937&tpl=1_M&show_items=5965870,3093297,14758401114,4825074,1247140,4911566,3634890,3212216,2329142,5155156,5225170,1812788,613970,5391428,1836460,1771658520,1308971,2512327,15428123588,2512333,3176567,6039820,10048750474,3093303,3724961,338871,10235508261,2144773,1939376,1543721095
```

Comparing these, the page parameter of the visible search pages grows through the odd numbers 1, 3, 5, …, while the page parameter in the Ajax requests for the last 30 items grows through the even numbers 2, 4, 6, ….
This gives us the crawling strategy: scrape the first 30 items on the current page, collect their data-sku values, then simulate the Ajax request to fetch the last 30 items; once the whole page has been scraped, turn to the next page in the same way, until the last page.
Implementation
1. Define the item
vim items.py

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst

# Convert the review count from a string to a float, expanding 万 (10,000)
# into an absolute number for later analysis.
def filter_comment(x):
    s = x.strip('+')
    if s[-1] == u'万':
        return float(s[:-1]) * 10000
    else:
        return float(s)

class KeyboardItem(scrapy.Item):
    # shop name
    shopname = scrapy.Field(input_processor=MapCompose(unicode.strip), output_processor=TakeFirst())
    # product name
    band = scrapy.Field(output_processor=TakeFirst())
    # price
    price = scrapy.Field(output_processor=TakeFirst())
    # review count
    comment = scrapy.Field(input_processor=MapCompose(filter_comment), output_processor=TakeFirst())
```
Here:
- filter_comment converts the review count from a string to a float and expands 万 (10,000) into an absolute number, which makes later analysis easier; some review counts are given in units of 万, e.g. 1.5万.
- MapCompose(unicode.strip) strips surrounding whitespace.
- output_processor=TakeFirst() takes the first extracted value for each field; without it, shopname, price, band, and comment would all be lists.
Without these processors, the resulting JSON file would look like this:

```
[
  {"comment": ["1.2万+"], "band": ["新盟游戏", "机械键盘"], "price": ["129.00"], "shopname": ["罗技G官方旗舰店"]},
  ......
]
```
After processing:

```
[
  {"comment": 12000.0, "band": "新盟游戏", "price": "129.00", "shopname": "罗技G官方旗舰店"},
  ......
]
```
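A quick sanity check of the conversion, as a self-contained sketch of filter_comment with a few sample inputs:

```python
# -*- coding: utf-8 -*-
# Strip the trailing '+' and expand 万 (10,000) into an absolute number.
def filter_comment(x):
    s = x.strip('+')
    if s[-1] == u'万':
        return float(s[:-1]) * 10000
    return float(s)

print(filter_comment(u'1.2万+'))  # 12000.0
print(filter_comment(u'3.4万'))   # 34000.0
print(filter_comment('129'))      # 129.0
```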
The second format is much easier to work with when doing numerical analysis with pandas.
Crawler implementation
1. Write the spider
vim keyboard.py

```python
# -*- coding: utf-8 -*-
# JD search: mechanical keyboards
import scrapy
from jingdong.items import KeyboardItem
from scrapy.loader import ItemLoader

class KeyboardSpider(scrapy.Spider):
    name = 'keyboard'
    allowed_domains = ['jd.com']
    #start_urls = ['https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf']
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}

    def start_requests(self):
        # overridden so we can attach headers
        yield scrapy.Request(url='https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&wq=机械键盘&pvid=361c7116408b4a10b5e769e3fd25bbbf',
                             meta={'pagenum': 1}, headers=self.headers, callback=self.parse_first30)

    def parse_first30(self, response):
        # scrape the first 30 items
        pagenum = response.meta['pagenum']
        print 'Entering mechanical keyboard page ' + str(pagenum) + ', first 30 items'
        for eachitem in response.xpath('//li[@class="gl-item"]'):
            load = ItemLoader(item=KeyboardItem(), selector=eachitem)
            info = load.nested_xpath('div')
            info.add_xpath('shopname', 'div[@class="p-shop"]/span/a/@title')
            info.add_xpath('band', 'div[@class="p-name p-name-type-2"]/a/em/text()')
            info.add_xpath('price', 'div[@class="p-price"]/strong/i/text()')
            info.add_xpath('comment', 'div[@class="p-commit"]/strong/a/text()')
            yield load.load_item()
        # collect the data-sku values of the first 30 items
        skulist = response.xpath('//li[@class="gl-item"]/@data-sku').extract()
        skustring = ','.join(skulist)
        # the last 30 items use the even page number
        pagenum_more = pagenum * 2
        baseurl = 'https://search.jd.com/s_new.php?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&s=28&scrolling=y&log_id=1517052655.49883&tpl=1_M&'
        # Ajax url for the last 30 items
        ajaxurl = baseurl + 'page=' + str(pagenum_more) + '&show_items=' + skustring.encode('utf-8')
        yield scrapy.Request(ajaxurl, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_next30)

    def parse_next30(self, response):
        # scrape the last 30 items
        pagenum = response.meta['pagenum']
        print 'Entering mechanical keyboard page ' + str(pagenum) + ', last 30 items'
        for eachitem in response.xpath('//li[@class="gl-item"]'):
            load = ItemLoader(item=KeyboardItem(), selector=eachitem)
            info = load.nested_xpath('div')
            info.add_xpath('shopname', 'div[@class="p-shop"]/span/a/@title')
            info.add_xpath('band', 'div[@class="p-name p-name-type-2"]/a/em/text()')
            info.add_xpath('price', 'div[@class="p-price"]/strong/i/text()')
            info.add_xpath('comment', 'div[@class="p-commit"]/strong/a/text()')
            yield load.load_item()
        pagenum = pagenum + 1
        # page parameter for the next visible page (odd numbers)
        nextreal_num = pagenum * 2 - 1
        # next page url
        next_page = 'https://search.jd.com/Search?keyword=机械键盘&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=机械键盘&s=56&click=0&page=' + str(nextreal_num)
        yield scrapy.Request(next_page, meta={'pagenum': pagenum}, headers=self.headers, callback=self.parse_first30)
```

Note that the original post had `'&show_items'` without the `=`, which would produce a malformed query string; the URL construction above includes the `=`.
Note: the logical page number is passed along via meta. For example, on page 1 (pagenum = 1, first 30 items shown), the Ajax URL for the last 30 items uses page = pagenum*2 = 2; to reach page 2, pagenum is incremented to 2 and the next URL uses page = nextreal_num = pagenum*2 - 1 = 3.
2. Run the spider:

```
scrapy crawl keyboard -o keyboard.json
```

Sample output:

```
[
  {"comment": 120000.0, "band": "新盟游戏", "price": "129.00"},
  {},
  {},
  {"comment": 15000.0, "band": "罗技(Logitech)G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
  {"comment": 9900.0, "band": "ikbc c104 樱桃轴", "price": "389.00", "shopname": "ikbc京东自营旗舰店"},
  {"comment": 11000.0, "band": "美商海盗船(USCorsair)Gaming系列 K70 LUX RGB 幻彩背光", "price": "1299.00", "shopname": "美商海盗船京东自营旗舰店"},
  {"comment": 34000.0, "band": "达尔优(dareu)108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
  {"comment": 74000.0, "band": "雷柏(Rapoo) V700S合金版 混光", "price": "189.00", "shopname": "雷柏京东自营官方旗舰店"},
  {"comment": 8100.0, "band": "罗技(Logitech)G610 Cherry轴全尺寸背光", "price": "599.00", "shopname": "罗技G官方旗舰店"},
  {"comment": 26000.0, "band": "雷蛇(Razer)BlackWidow X 黑寡妇蜘蛛X幻彩版 悬浮式游戏", "price": "799.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
  {"comment": 74000.0, "band": "雷柏(Rapoo) V500PRO 混光", "price": "169.00", "shopname": "雷柏京东自营官方旗舰店"},
  {"comment": 150000.0, "band": "前行者游戏背光发光牧马人", "price": "65.00", "shopname": "敏涛数码专营店"},
  {"comment": 11000.0, "band": "樱桃(Cherry)MX-BOARD 2.0 G80-3800 游戏办", "price": "389.00"},
  {"comment": 12000.0, "band": "美商海盗船(USCorsair)STRAFE 惩戒者 ", "price": "699.00", "shopname": "美商海盗船京东自营旗舰店"},
  {"comment": 6700.0, "band": "罗技(Logitech)G413", "price": "449.00", "shopname": "罗技G官方旗舰店"},
  {"comment": 120000.0, "band": "新盟游戏", "price": "89.00", "shopname": "敏涛数码专营店"},
  {"comment": 26000.0, "band": "雷蛇(Razer)BlackWidow X 黑寡妇蜘蛛X 竞技版87键 悬浮式游戏", "price": "299.00", "shopname": "雷蛇RAZER京东自营旗舰店"},
  {"comment": 110000.0, "band": "达尔优(dareu)108键", "price": "199.00", "shopname": "达尔优京东自营旗舰店"},
  {"comment": 61000.0, "band": "狼蛛(AULA)F2008混光跑马 ", "price": "129.00", "shopname": "狼蛛外设京东自营官方旗舰店"},
  .......
]
```

Scientific computing
With the data scraped, we use Python for the analysis: group the review counts by shop name and plot them.
vim keyboard_analyse.py

```python
#!/home/yanggd/miniconda2/envs/science/bin/python
# -*- coding: utf-8 -*-
import json

import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame

filename = 'keyboard.json'
# build a DataFrame from the json file
with open(filename) as f:
    pop_data = json.load(f)
df = DataFrame(pop_data)
group_shopname = df.groupby('shopname')
group = group_shopname.mean()
#print group
# font settings so Chinese labels render correctly
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['simhei']
plt.rcParams['axes.unicode_minus'] = False
# bar chart
group.plot(kind='bar')
plt.xlabel(u"店铺名")
plt.ylabel(u"评论量")
plt.show()
```

Run it:

```
python keyboard_analyse.py
```
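The grouping step can be checked on a tiny inline dataset. This is a minimal sketch using a few rows modeled on the sample output above; note that rows without a shopname (like the empty {} items in the JSON) are dropped by groupby by default:

```python
# -*- coding: utf-8 -*-
import pandas as pd

# a few rows modeled on the scraped data
data = [
    {"shopname": u"罗技G官方旗舰店", "comment": 15000.0},
    {"shopname": u"罗技G官方旗舰店", "comment": 8100.0},
    {"shopname": u"达尔优京东自营旗舰店", "comment": 34000.0},
]
df = pd.DataFrame(data)
# mean review count per shop; rows with missing shopname would be dropped here
group = df.groupby('shopname').mean()
print(group)
```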