
Python Crawler (Part 2): Scraping Zhihu Q&A

2017-08-13 00:23
Zhihu is said to host high-quality Q&A. Having started learning web scraping only a few days ago, I decided to run a crawling experiment on its Q&A content.

On the Zhihu home page, you enter a keyword to search for questions, then click a question to see the answers posted by other users.

Following that workflow, the crawler works in two steps:

1. Search for questions by keyword (e.g. Java), which gives the URL https://www.zhihu.com/search?type=content&q=java; crawl that page for all the questions and their question IDs (the ID extraction is sketched just after this list);

2. Using the questions and IDs from step 1, build URLs like https://www.zhihu.com/question/31437847 and crawl each question page for all of its answers.
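Step 1 hinges on pulling the question ID out of each search result's answer link. Here is a minimal sketch of that extraction; the answer ID in the example href is hypothetical, the path shape matches what Zhihu's search results used at the time:

import re

# A search-result entry links to one answer of the question; the
# question ID is the first path segment after /question/.
href = '/question/31437847/answer/52133948'  # hypothetical example href
pattern = re.compile('/question/(.*?)/answer/(.*?)$', re.S)
question_id = re.findall(pattern, href)[0][0]
print('https://www.zhihu.com/question/' + question_id)
# -> https://www.zhihu.com/question/31437847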

The full code is below (also at https://github.com/tianyunzqs/crawler/tree/master/zhihu):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from urllib import request, parse, error
from bs4 import BeautifulSoup

keyword_list = ['svm', '支持向量机', 'libsvm']
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/39.0.2171.95 Safari/537.36'
headers = {'User-Agent': user_agent}

fout = open("E:/python_file/zhihu.txt", "w", encoding="utf-8")

# Step 1: search each keyword and collect {question title: question URL}.
keyword_question_url_list = {}
for keyword in keyword_list:
    print(keyword)
    url = 'https://www.zhihu.com/search?type=content&q=' + parse.quote(keyword)
    try:
        req = request.Request(url, headers=headers)
        response = request.urlopen(req, timeout=5)
        content = response.read().decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        all_div = soup.find_all('li', attrs={'class': re.compile('item clearfix.*?')})
        question_url_list = {}
        for e_div in all_div:
            # Question results: the title links to /question/<id>.
            title = e_div.find_all('a', attrs={'class': 'js-title-link',
                                               'target': '_blank',
                                               'href': re.compile('/question/[0-9]+')})
            if title:
                title = title[0].text
                # The <link itemprop="url"> points at one answer of the
                # question; extract the question ID from its path.
                _id = e_div.find_all('link', attrs={'itemprop': 'url',
                                                    'href': re.compile('/question/[0-9]+/answer/[0-9]+')})
                href = _id[0].attrs.get('href')
                pattern = re.compile('/question/(.*?)/answer/(.*?)$', re.S)
                items = re.findall(pattern, href)
                question_id = items[0][0]
                question_url_list[title] = 'https://www.zhihu.com/question/' + question_id
            else:
                # Column (zhuanlan) results carry the full article URL in href.
                title_id = e_div.find_all('a', attrs={'class': 'js-title-link',
                                                      'target': '_blank',
                                                      'href': re.compile('https://zhuanlan.zhihu.com/p/[0-9]+')})
                if title_id:
                    title = title_id[0].text
                    href = title_id[0].attrs.get('href')
                    question_url_list[title] = href
                else:
                    continue
        keyword_question_url_list[keyword] = question_url_list
        # for q, d in question_url_list.items():
        #     print(q, d)
    except Exception:
        continue

# Step 2: fetch each question page and write its answers to the file.
for keyword, question_url_list in keyword_question_url_list.items():
    for question, url in question_url_list.items():
        fout.write(question + "\n")
        try:
            req = request.Request(url, headers=headers)
            with request.urlopen(req, timeout=5) as response:
                content = response.read().decode('utf-8')
            soup = BeautifulSoup(content, 'html.parser')
            all_div = soup.find_all('div', attrs={'class': 'List-item'})
            for e_div in all_div:
                answer = e_div.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                                       'itemprop': 'text'})
                if answer:
                    fout.write(answer[0].text + "\n")
        except error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)

fout.close()


Known issue:
The program above handles step 1 well, but in step 2 it only retrieves the first two answers to each question.

According to http://www.cnblogs.com/buzhizhitong/p/5697526.html, this should be solvable with Selenium + PhantomJS, since Zhihu loads the remaining answers dynamically via JavaScript; I'll try that later.
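A minimal sketch of that idea, assuming the Selenium Python bindings and the (now-deprecated) PhantomJS driver are installed; the scroll count and sleep interval here are guesses that would need tuning:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.zhihu.com/question/31437847'
driver = webdriver.PhantomJS()  # requires the phantomjs binary on PATH
driver.get(url)

# Zhihu loads further answers via JavaScript as the page scrolls, so
# scroll to the bottom repeatedly and give the content time to load.
for _ in range(10):  # scroll count is a guess; tune as needed
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Reuse the same answer selectors as in the main script above.
for div in soup.find_all('div', attrs={'class': 'List-item'}):
    answer = div.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                         'itemprop': 'text'})
    if answer:
        print(answer[0].text)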