
My Big Data Journey -- python3 + Ajax + selenium in practice: crawling 南瓜屋 (a must-read for beginners)

2019-05-07 16:38 · 246 views

南瓜屋 (Pumpkin House)

  1. Install selenium, and install PhantomJS.
  2. Once the installation tests out, first skim the basics of how selenium is used.
  3. With that understood, we can start on the first demo.

First, take a look at the 南瓜屋 pages, and maybe read a few stories while you are at it. Relax a little, and suddenly the whole day is gone, ha ha. Just kidding, of course not.

Day Two

Ahem~ Today we get down to analyzing the pages. Start with the home page: as a rule, a crawler first looks for a pattern in the URLs. So scroll down, keep scrolling, scroll some more... a link to a second page never appears, which strongly suggests the content is loaded in the background by Ajax or JS. Sure enough, under Network > XHR in the developer tools, the URL pattern shows up:

https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=2&page=1&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=2&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=3&per_page=10
https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=1&page=4&per_page=10

Spot the pattern?
Apart from the first request, which has action=2, the only thing that changes from URL to URL is page. As a test, change action to 1 in the first URL and request it again: it still works.
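As a quick sanity check, the paging pattern can be reproduced with plain string formatting. The user_id and session_id below are the session values captured in the browser above; yours will differ.

```python
# Reproduce the paging pattern: only the page parameter varies.
# user_id and session_id are the captured session values; yours will differ.
BASE = ('https://story.hao.360.cn/api/recommend/storyList'
        '?user_id=8a6b83d87bd37e3fff9d8c5480e1c191'
        '&session_id=afb30b4dbfad9cb60a5a25bb834a80b9'
        '&action=1&page={page}&per_page=10')

urls = [BASE.format(page=p) for p in range(1, 5)]
for u in urls:
    print(u)
```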
But to collect the full text of each story we also need the detail pages. Their URLs look like this:

https://story.hao.360.cn/story/MtTcQkq5NHOBPD

Simple enough: the only tricky part is the trailing string of characters. A closer look shows that the list API above returns a data payload containing each story's id, which is exactly that string.
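To illustrate, here is a minimal sketch of pulling the ids out of the list API's JSON and turning them into detail URLs. The payload below is a hand-made stand-in (the second id is made up); what matters is that the real response nests the stories under data -> data, which is what the crawler relies on.

```python
import json

# Hand-made stand-in for the storyList response; the real API nests the
# stories under data -> data, each carrying an 'id' field.
payload = json.dumps({
    'data': {
        'data': [
            {'id': 'MtTcQkq5NHOBPD', 'title': 'story one'},
            {'id': 'AbCdEfGh123456', 'title': 'story two'},  # hypothetical id
        ]
    }
})

stories = json.loads(payload)['data']['data']
detail_urls = ['https://story.hao.360.cn/story/' + s['id'] for s in stories]
for u in detail_urls:
    print(u)
```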

That completes the URL analysis; time to write the code. The code below crawls only one page; widen the range yourself if you want more.
You could also parse with other tools, such as XPath, BeautifulSoup, or re; here I will treat it as an introduction to selenium.
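For comparison, a quick sketch of the `re` route on a hand-written HTML fragment. The snippet itself is made up, but the class names mirror the selectors used in the crawler below.

```python
import re

# Made-up fragment mimicking the detail page's structure; the class names
# (title, username) mirror the selectors used by the selenium crawler.
html = '''
<h1 class="title">A small story</h1>
<span class="username">pumpkin_reader</span>
'''

title = re.search(r'class="title">([^<]+)<', html).group(1)
username = re.search(r'class="username">([^<]+)<', html).group(1)
print(title, username)
```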

Pumpkin_House.py
-----------------------------------------------------------
import json
import time
import random

import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException


class PumpKinHouse():
    def __init__(self):
        # Pool of User-Agent strings: rotating them makes the crawler look
        # like different browsers, which helps it dodge simple anti-crawling checks
        user_agent = ['Mozilla/5.0 (Windows NT 6.1; rv:50.0) Gecko/20100101 Firefox/50.0',
                      'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
                      'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                      'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                      'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                      'Mozilla/5.0 (iPad; CPU OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0 Mobile/14B100 Safari/602.1',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
                      'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
                      ]
        # Pick one User-Agent at random for this run
        user_agent = random.choice(user_agent)

        self.headers = {'User-Agent': user_agent,                    # browser disguise
                        'Connection': 'keep-alive',                  # reuse the connection
                        'Host': 'story.hao.360.cn',
                        'Referer': 'https://story.hao.360.cn/plaza'  # extra disguise
                        }
        self.fp = open('story.txt', 'a+', encoding='utf-8')

    def req_url(self, url):
        try:
            response = requests.get(url=url, headers=self.headers)
        except requests.exceptions.ConnectionError:
            # Back off for a moment, then retry the same URL
            time.sleep(2)
            return self.req_url(url=url)
        return response

    def get_id(self, response):
        # The list API nests the stories under data -> data; collect their ids
        id_list = []
        data = json.loads(response.text)['data']['data']
        for ids in data:
            id_list.append(ids['id'])
        return id_list

    def get_detail(self, url):
        browser = webdriver.PhantomJS()
        browser.get(url)
        browser.implicitly_wait(2)
        try:
            title = browser.find_element_by_css_selector('.title').text
            username = browser.find_element_by_class_name('username').text
            times = browser.find_element_by_css_selector('.time.fr').text
            content = browser.find_element_by_css_selector('.content.clearfix').text

            print(title)
            self.fp.write('\n'.join([title, username, times, content]))
            self.fp.write('\n' + '=' * 50 + '\n')
        except WebDriverException:
            time.sleep(2)
            self.get_detail(url=url)
        finally:
            browser.quit()  # always release the PhantomJS process

    def main(self):
        # Only page 1 here; widen the range to crawl more pages
        urls = ['https://story.hao.360.cn/api/recommend/storyList?user_id=8a6b83d87bd37e3fff9d8c5480e1c191&session_id=afb30b4dbfad9cb60a5a25bb834a80b9&action=2&page={}&per_page=10'.format(i) for i in range(1, 2)]
        for url in urls:
            response = self.req_url(url=url)
            id_list = self.get_id(response=response)
            for id in id_list:
                detail_url = 'https://story.hao.360.cn/story/' + id
                self.get_detail(detail_url)


if __name__ == '__main__':
    PKH = PumpKinHouse()
    PKH.main()

The resulting story.txt looks like this. Not bad at all.
