
Python web scraping: using Selenium for the first time

2017-07-21 22:13

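I've spent the last two days playing with Selenium, and just getting the thing installed took a lot of effort.

It really is quite handy, though.

To get comfortable with Selenium, I followed in the footsteps of more experienced people and practiced on one of their projects.

州的先生's Zhihu articles (https://www.zhihu.com/people/zmister/pins/posts) are worth a look. They are all fairly basic and easy to follow.

This time the goal is to use Selenium to scrape the posts (说说) in a friend's Qzone (QQ空间).

For the details of working with Selenium, see 《selenium webdriver(python)第三版》, which can be found online. It also explains how to install Selenium.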

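Approach:

Selenium is a functional test automation tool for web applications. It drives a real browser directly, so the page is operated just as a real user would operate it.

I use Chrome.

1. First, open the friend's Qzone. Once the link is loaded, a login screen appears, and Selenium has to simulate a person's actions to log in.

For this step, right-click in the page and choose Inspect Element. It takes you straight to the element you want (a very handy trick that I had hardly used before and picked up this time).

2. Use Inspect Element to locate the part of the page that holds the posts, and then the data can be scraped.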
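
The login check in step 1 boils down to asking whether an element is present on the page. Here is a minimal sketch of that check, assuming a Selenium-style driver whose find_elements(by, value) returns an empty list when nothing matches. `element_exists` is a hypothetical helper name, and "id" is the string value of Selenium's By.ID.

```python
def element_exists(driver, by, value):
    """Return True if at least one element matches the locator.

    `driver` is assumed to be a Selenium WebDriver (or any object with a
    compatible find_elements(by, value) method). find_elements returns an
    empty list instead of raising when nothing matches, so no try/except
    is needed.
    """
    return len(driver.find_elements(by, value)) > 0


# Intended use with a real driver (not executed here):
#   if element_exists(driver, "id", "login_frame"):
#       ...fill in the login form...
```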

Code:

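from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time
# from bs4 import BeautifulSoup  # only needed for the commented-out experiment below

# Note: find_element(By.ID, ...) and switch_to.frame() replace the older
# find_element_by_id() and switch_to_frame() spellings, which newer Selenium
# releases no longer provide.

driver = webdriver.Chrome()
driver.maximize_window()


def get_friend_shuoshuo(qq):
    url = 'http://user.qzone.qq.com/{}/311'.format(qq)
    driver.get(url)
    time.sleep(3)  # wait for the page to load so the script runs reliably
    print(url)

    # A login iframe is present when we are not logged in yet.
    try:
        driver.find_element(By.ID, 'login_frame')
        login_needed = True
    except NoSuchElementException:
        login_needed = False

    if login_needed:
        driver.switch_to.frame('login_frame')
        driver.find_element(By.ID, 'switcher_plogin').click()  # switch to account/password login
        driver.find_element(By.ID, 'u').clear()
        driver.find_element(By.ID, 'u').send_keys('your QQ number')
        driver.find_element(By.ID, 'p').clear()
        driver.find_element(By.ID, 'p').send_keys('your password')
        driver.find_element(By.ID, 'login_button').click()
        time.sleep(3)
        driver.switch_to.default_content()  # back to the top-level document

    driver.implicitly_wait(3)  # implicit wait: element lookups poll for up to 3 s

    # This element is used as the check for whether we may view the space.
    try:
        driver.find_element(By.ID, 'QM_OwnerInfo_Icon')
        has_access = True
    except NoSuchElementException:
        has_access = False
        print("Sorry, you don't have permission to view this friend's Qzone")

    if has_access:
        # Switch into the iframe that holds the posts. Find the iframe tag
        # with Inspect Element and pass its id.
        driver.switch_to.frame('app_canvas_frame')
        content = driver.find_elements(By.CSS_SELECTOR, 'pre.content')
        stime = driver.find_elements(By.CSS_SELECTOR, 'a.c_tx.c_tx3')

        count = 0
        for con, sti in zip(content, stime):
            count += 1
            print('Post #%d' % count)
            print('Content: ' + con.text)
            print('Time: ' + sti.text)
            print('\n\n')
    else:
        print('Something went wrong!')

    # pages = driver.page_source
    # soup = BeautifulSoup(pages, 'html.parser')
    # print(soup)

    print("========== Done ================")


qq = input('Enter the QQ number to visit: ')
get_friend_shuoshuo(qq)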
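
The script above types the QQ number and password in as string literals. One way to keep real credentials out of the source file is to read them from environment variables. This is only a sketch: the variable names QZONE_USER and QZONE_PASSWORD and the helper `load_credentials` are made up for illustration.

```python
import os


def load_credentials(env=None):
    """Read the QQ number and password from environment variables.

    QZONE_USER and QZONE_PASSWORD are hypothetical names chosen for this
    sketch. `env` defaults to os.environ and can be replaced by a plain
    dict, which makes the function easy to try out.
    """
    env = os.environ if env is None else env
    try:
        return env["QZONE_USER"], env["QZONE_PASSWORD"]
    except KeyError as missing:
        name = missing.args[0]
        raise RuntimeError("environment variable not set: " + name) from None
```

The returned pair could then be passed to the two send_keys() calls in place of the literals.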

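Problems I ran into:

1. I didn't understand the line driver.switch_to_frame('login_frame') at first (current Selenium releases spell it driver.switch_to.frame('login_frame')). It moves the driver into a frame nested inside the page.

In Inspect Element, the frame is whatever sits under an iframe tag, and the argument is that iframe's id attribute.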
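
Every lookup made after a frame switch happens inside that frame, so it helps to make the switch back explicit. Here is a small sketch, assuming a Selenium driver that exposes switch_to.frame() and switch_to.default_content(). `inside_frame` is a hypothetical helper, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_id):
    """Run a block inside the given iframe, then return to the top document."""
    driver.switch_to.frame(frame_id)
    try:
        yield driver
    finally:
        # Always leave the frame again, even if the block raised.
        driver.switch_to.default_content()


# Intended use with a real driver (not executed here):
#   with inside_frame(driver, "login_frame"):
#       ...find and fill in the login form...
```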

2. driver.find_elements_by_css_selector() (driver.find_elements(By.CSS_SELECTOR, ...) in current Selenium releases) returns a list.
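
Because both lookups return plain lists, the script pairs them up with zip(). Note that zip() stops at the shorter list, so a post without a matching timestamp would be dropped silently. The sketch below uses made-up strings in place of the element texts, and `pair_posts` is a hypothetical helper.

```python
from itertools import zip_longest


def pair_posts(contents, times, missing="(unknown)"):
    """Pair each post text with its timestamp, padding the shorter list."""
    return list(zip_longest(contents, times, fillvalue=missing))


contents = ["first post", "second post", "third post"]  # made-up sample data
times = ["10:01", "10:05"]

print(list(zip(contents, times)))   # two pairs: the third post is dropped
print(pair_posts(contents, times))  # three pairs: the third gets "(unknown)"
```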

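3. Once the script was written, I wanted to add paging so that all of a friend's posts could be scraped.

But I found that the markup shown by Inspect Element differs from the markup shown by View Source, and only the Inspect Element version matches what is displayed on the page.

I had no way of getting hold of the Inspect Element markup, so I tried driver.page_source together with BeautifulSoup.

However, driver.page_source sometimes returned the Inspect Element markup and sometimes the View Source markup.

That left me confused. It seems to have something to do with data loaded dynamically by JavaScript, and after a long search online I still hadn't found a good solution.

I'll add this feature in a few days, once I've learned how to handle dynamic data.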

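Solution:

Yesterday my plan was to parse the Inspect Element markup with BeautifulSoup and then use a regular expression to pick out the id = "pager_next_\d+" part, but since I couldn't get hold of that markup I had to drop the idea and use a more conventional approach.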
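
For reference, had the rendered markup been available as a string, the regular-expression idea could have looked roughly like this. The HTML fragment is made up for illustration and the real page may differ. `find_next_pager_id` is a hypothetical helper.

```python
import re

# Made-up fragment standing in for the rendered pager markup.
sample_html = '<a href="javascript:;" id="pager_next_3" class="pager_link">next</a>'

PAGER_ID = re.compile(r'id\s*=\s*"(pager_next_\d+)"')


def find_next_pager_id(html):
    """Return the first pager_next_<number> id found in html, or None."""
    match = PAGER_ID.search(html)
    return match.group(1) if match else None


print(find_next_pager_id(sample_html))
```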

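Inspect Element shows that the id of the "next page" button is pager_next_ followed by a number.

So driver.find_element_by_id().click() (driver.find_element(By.ID, ...).click() in current Selenium releases) can locate and click it, which gives us paging.

The number after pager_next_ changes, though. It starts at 0 and goes up by 1 each time a page number or "next page" is clicked.

We only need to go through the pages one at a time, so a counter that is incremented by 1 on each click is enough.

It's worth noting that a wait is needed after every page turn. Otherwise errors can occur because the page hasn't finished loading.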
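
A fixed time.sleep() after each page turn either wastes time or turns out too short on a slow connection. An alternative is to poll until the new page is actually there. Selenium provides WebDriverWait for this purpose. The sketch below shows the same idea in plain Python so it can be tried without a browser, and `wait_until` is a hypothetical helper name.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)


# Intended use with a real driver (not executed here):
#   wait_until(lambda: driver.find_elements("css selector", "pre.content"))
```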

Revised code:

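from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException
from selenium.webdriver.common.by import By
import time
# from bs4 import BeautifulSoup  # only needed for the commented-out experiment below

# Note: find_element(By.ID, ...) and switch_to.frame() replace the older
# find_element_by_id() and switch_to_frame() spellings, which newer Selenium
# releases no longer provide.

driver = webdriver.Chrome()
driver.maximize_window()

count = 0  # running number of the posts printed so far, across all pages


def next_page():  # print the posts on the current page
    global count

    content = driver.find_elements(By.CSS_SELECTOR, 'pre.content')
    stime = driver.find_elements(By.CSS_SELECTOR, 'a.c_tx.c_tx3')

    for con, sti in zip(content, stime):
        count += 1
        print('Post #%d' % count)
        print('Content: ' + con.text)
        print('Time: ' + sti.text)
        print('\n\n')


def get_friend_shuoshuo(qq):
    url = 'http://user.qzone.qq.com/{}/311'.format(qq)
    driver.get(url)
    time.sleep(3)
    print(url)

    # A login iframe is present when we are not logged in yet.
    try:
        driver.find_element(By.ID, 'login_frame')
        login_needed = True
    except NoSuchElementException:
        login_needed = False

    if login_needed:
        driver.switch_to.frame('login_frame')
        driver.find_element(By.ID, 'switcher_plogin').click()  # switch to account/password login
        driver.find_element(By.ID, 'u').clear()
        driver.find_element(By.ID, 'u').send_keys('your QQ number')
        driver.find_element(By.ID, 'p').clear()
        driver.find_element(By.ID, 'p').send_keys('your password')
        driver.find_element(By.ID, 'login_button').click()
        time.sleep(3)
        driver.switch_to.default_content()  # back to the top-level document

    driver.implicitly_wait(2)

    try:
        driver.find_element(By.ID, 'QM_OwnerInfo_Icon')
        has_access = True
    except NoSuchElementException:
        has_access = False
        print("Sorry, you don't have permission to view this friend's Qzone")

    # pages = driver.page_source
    # print(pages)
    # soup = BeautifulSoup(pages, 'html.parser')
    # print(soup)
    # pnext = soup.find('div', attrs={'id': 'pager'})
    # print(str(pnext) + '\n')

    # The button id is pager_next_ plus a number that starts at 0 and grows
    # by 1 on every click of a page number or of "next page". We only ever
    # click "next page", so the counter count1 is enough to follow it.
    count1 = 0

    if has_access:
        # Switch into the iframe that holds the posts. Find the iframe tag
        # with Inspect Element and pass its id.
        driver.switch_to.frame('app_canvas_frame')
        while True:
            next_page()
            try:
                driver.find_element(By.ID, 'pager_next_' + str(count1)).click()
            except WebDriverException:
                break  # button missing or not clickable: stop paging
            count1 += 1
            time.sleep(5)  # give the next page time to load, otherwise errors occur

    # Build a "name=value;" string from the browser's cookies.
    cookie_str = ''
    for c in driver.get_cookies():
        cookie_str += "{0}={1};".format(c['name'], c['value'])
    print('Cookies:', cookie_str)
    print("========== Done ================")


qq = input('Enter the QQ number to visit: ')
get_friend_shuoshuo(qq)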
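
The cookie loop at the end of the script can be condensed with str.join. The sketch assumes cookies in the shape that driver.get_cookies() returns (a list of dicts with name and value keys). `cookies_to_header` is a hypothetical helper and the sample values are made up. Unlike the loop above, it separates the pairs with "; " and leaves no trailing semicolon.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a "name=value; name=value" string."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Made-up cookies in the shape returned by driver.get_cookies()
sample = [{"name": "uin", "value": "o10001"}, {"name": "skey", "value": "@abc"}]
print(cookies_to_header(sample))
```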
