
Web Scraping Practice on “Zhibu Gongzuo” (支部工作), Part 2

2018-02-27 21:32, 288 views

4. Reading the news
Open the first n items under “Important News” one by one, then display and save the text of each item for later study.

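URL: https://zhibugongzuo.com/News/ImportantIndex
4.1 Getting the news list
First we need the links of the n news items to read. The index page shows only 8 items at a time; to see more you have to click “加载更多” (“Load more”). As long as the list holds fewer than n items, keep clicking “Load more” until the page holds at least n. The HTML of the “Load more” link is:

<a id="j-news-more" class="zbgz-infolist-more" href="javascript:;">加载更多</a>

This time we click it a different way, by having selenium run a JavaScript snippet:

    js2 = 'document.getElementById("j-news-more").click();'
    driver.execute_script(js2)

The source of the news list looks like this: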

<div class="zbgz-infolist-item">
...
<div class="zbgz-infolist-item-container">
<div class="zbgz-infolist-item-title"><a href="/News/Show/7619" title="5年两会，习近平这些话语历久弥新" target="_blank">5年两会，习近平这些话语历久弥新</a></div>
...
</div>
</div>
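
The href in the markup above is site-relative (/News/Show/7619), so it has to be combined with the host before it can be opened. A minimal sketch using the standard library's urljoin, which also leaves an href alone if it is already absolute; the names LIST_URL and absolute_link are made up for this illustration and do not come from the original script:

```python
from urllib.parse import urljoin

# The list page the link was found on (URL taken from the article).
LIST_URL = "https://www.zhibugongzuo.com/News/ImportantIndex"


def absolute_link(href, base=LIST_URL):
    """Resolve an href found on the list page into a full URL (hypothetical helper)."""
    return urljoin(base, href)


print(absolute_link("/News/Show/7619"))
```

Plain string concatenation with the host gives the same result for this site; urljoin is just a little more forgiving if the markup ever changes.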

As shown, every news entry sits inside a div with the class 'zbgz-infolist-item-container'. BeautifulSoup's findAll function returns a list of all elements on the page that match a condition:

    pageSource = driver.page_source
    soup = BeautifulSoup(pageSource, 'lxml')
    NewsList = soup.findAll("div", {"class": "zbgz-infolist-item-container"})

The title and link of each item are then easy to read:

    for news in NewsList:
        title = news.a.get_text()
        url = 'https://www.zhibugongzuo.com' + news.a["href"]

4.2 Opening a news item and getting the body text
After opening a news item, you can see that its content sits inside <div class="indexinfotext">. An article has only one body, so BeautifulSoup's find function is enough: it returns the first matching element, whereas findAll returns a list. Keep that difference in mind. The body text is obtained with .get_text():

    text = soup.find("div", {"class": "indexinfotext"}).get_text()

Putting it all together, the complete code that collects a list of n news items and opens them one by one is:

    import time
    from bs4 import BeautifulSoup


    def reading_news(num_news=10):
        global driver  # module-level selenium WebDriver, set up elsewhere in the script
        # Go to Important News and read num_news items
        print("Entering Important News, reading %d items" % num_news)
        web_address = "https://www.zhibugongzuo.com/News/ImportantIndex"
        driver.get(web_address)
        time.sleep(3)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        js2 = 'document.getElementById("j-news-more").click();'
        driver.execute_script(js2)
        # driver.find_element_by_id("j-news-more").click()  # load more
        time.sleep(3)
        pageSource = driver.page_source
        soup = BeautifulSoup(pageSource, 'lxml')
        NewsList = soup.findAll("div", {"class": "zbgz-infolist-item-container"})
        while len(NewsList) < num_news:  # build the news list
            driver.execute_script(js2)
            time.sleep(1)  # wait for the new items before re-reading the page
            pageSource = driver.page_source
            soup = BeautifulSoup(pageSource, 'lxml')
            NewsList = soup.findAll("div", {"class": "zbgz-infolist-item-container"})
        for i in range(0, num_news):  # open and save each item
            title = NewsList[i].a.get_text()
            url = 'https://www.zhibugongzuo.com' + NewsList[i].a["href"]
            print('Studying news item %d, %s: %s' % (i + 1, title, url))
            driver.get(url)
            print('Reading ' + title)
            pageSource = driver.page_source
            soup = BeautifulSoup(pageSource, 'lxml')
            text = soup.find("div", {"class": "indexinfotext"}).get_text()
            print(text)
            # Python 3: write as utf-8 so the Chinese text is saved correctly on any platform
            with open(title + '.txt', "w", encoding="utf-8") as f:
                f.write(text)

The full code is at: https://github.com/sh39o/scraping-zhibugongzuo-/blob/master/study_zhibu.py
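
One thing to watch in reading_news: the loading loop keeps clicking until enough items are on the page, so if the site has fewer items than requested it never exits. The following is a minimal sketch of a bounded variant, not the original author's code. The function load_until, its parameters max_clicks and pause, and the count_items callback are all hypothetical names introduced here; the driver is only assumed to offer execute_script() and page_source, as a selenium WebDriver does.

```python
import time


def load_until(driver, count_items, wanted, max_clicks=20, pause=1.0):
    """Click "load more" until enough items are present or max_clicks is used up.

    driver      -- object with execute_script() and page_source (assumed: selenium WebDriver)
    count_items -- hypothetical callback: HTML string -> number of news items in it
    Returns the number of items found when the loop stops.
    """
    click_js = 'document.getElementById("j-news-more").click();'
    clicks = 0
    found = count_items(driver.page_source)
    while found < wanted and clicks < max_clicks:
        driver.execute_script(click_js)
        clicks += 1
        time.sleep(pause)  # give the page time to append the new items
        found = count_items(driver.page_source)
    return found
```

With the parsing used in this article, count_items could be a small function that builds a BeautifulSoup object from the HTML and returns the length of the findAll result for 'zbgz-infolist-item-container'. The caller can then compare the returned count with the number requested and read only as many items as were actually loaded.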
