Scraping the Today in History site (历史上的今天) with Python and BeautifulSoup and saving the descriptions and URLs to a CSV file

2017-09-30 10:24

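Python version: Python 3.6.0 | Anaconda 4.3.1 (64-bit)

BeautifulSoup version: 4.0.0.1

Python libraries used: csv, bs4, os, urllib.request

Target site: Today in History (历史上的今天, http://www.lssdjt.com/2/29/)

Summary: the page source of the Today in History site is fairly regular, which makes it easy to scrape. This post combines the basics of web crawling in one small project, as a record of my own progress in learning Python.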

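Note: the crawl needs a starting date, which is the one shown in the URL above. February 29 was chosen because the year in which the crawl ran was not a leap year; starting from any other date would miss the February 29 page.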
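The day count behind this choice can be checked with a small sketch. It does not touch the website; it only enumerates month/day pairs with the standard library. The years 2016 (leap) and 2017 (non-leap) are arbitrary choices for illustration, and `day_paths` is a hypothetical helper, not part of the original script.

```python
from datetime import date, timedelta


def day_paths(year):
    """Return the 'month/day' string for every day of the given year."""
    current = date(year, 1, 1)
    paths = []
    while current.year == year:
        paths.append(f"{current.month}/{current.day}")
        current += timedelta(days=1)
    return paths


leap = day_paths(2016)    # assumed example of a leap year
common = day_paths(2017)  # assumed example of a non-leap year

print(len(leap), "2/29" in leap)      # 366 True
print(len(common), "2/29" in common)  # 365 False
```

A walk over the calendar of a non-leap year covers only 365 days and never reaches 2/29, which is why the script starts on that page explicitly.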

Crawl result: more than ten thousand records in total.

1. Import the required Python libraries

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import os

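urlopen: opens the web page

BeautifulSoup: turns the page into a structured object, which makes looking up elements in it much easier

csv: writes the data to a CSV file

os: gets the path of the output file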

2. Initialize the CSV file to write to:

# Initialize the output file
csvFile = open(os.getcwd() + "/files/history.csv", 'w+', newline='')
writer = csv.writer(csvFile)
writer.writerow(('description', 'url'))
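The open() call above fails if the files directory does not exist yet. The following is a minimal sketch of one way to guard against that; it is not the author's code. The helper name `open_history_csv`, the explicit UTF-8 encoding, and the temporary directory used for the demonstration are all assumptions.

```python
import csv
import os
import tempfile


def open_history_csv(base_dir):
    """Create base_dir/files if needed and open history.csv for writing."""
    out_dir = os.path.join(base_dir, "files")
    os.makedirs(out_dir, exist_ok=True)  # open() does not create directories
    path = os.path.join(out_dir, "history.csv")
    # newline='' is what the csv module documentation recommends for writers;
    # the explicit encoding is an assumption, chosen because the titles are Chinese
    csv_file = open(path, "w", newline="", encoding="utf-8")
    writer = csv.writer(csv_file)
    writer.writerow(("description", "url"))
    return path, csv_file, writer


# Demonstration in a throwaway directory instead of os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    path, csv_file, writer = open_history_csv(tmp)
    csv_file.close()
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))

print(header)  # ['description', 'url']
```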

3. Analyze the target page and crawl it:

Right-click on the page (this example uses Chrome) and choose Inspect from the context menu, or press Ctrl + Shift + I, to open the developer tools.

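Inspecting the page shows that the entries we need, together with their URLs, sit inside the <div> element whose class is "main"; under that <div> there is a <ul> element with the class "list clearfix".

To continue crawling we also need the link to the following day. The page has a "next day" button; inspecting it shows where that link lives. In the code below it is the <li> with class "r" inside the <ul> with class "bot".

With BeautifulSoup's find and findAll functions we can collect the Today in History entries and their URLs, and return the link to crawl next: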

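def WriteItems(url):
    # Download the page and parse it with the lxml parser
    html = urlopen(url)
    bsObj = BeautifulSoup(html, 'lxml')

    # The "next day" link is in <li class="r"> inside <ul class="bot">
    nextUrlClass = bsObj.find('ul', {'class': 'bot'}).find('li', {'class': 'r'})
    next_url = nextUrlClass.a['href']
    print(next_url)
    # The entries are the <li class="gong"> items of the list in <div class="main">
    page_items = bsObj.find('div', {'class': 'main'}).find('ul', {'class': 'list clearfix'})
    items = page_items.findAll('li', {'class': 'gong'})

    return next_url, items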
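The lookup logic can be tried without any network access. The sketch below runs the same find calls against a small hand-written HTML sample. The sample markup, the hrefs, the titles, and the `parse_page` helper are all made up for illustration; only the class names are taken from the function above. It uses the built-in "html.parser" so that lxml is not required, and it needs the bs4 package to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used in the article
SAMPLE_HTML = """
<div class="main">
  <ul class="list clearfix">
    <li class="gong"><a href="/e/1.htm" title="Event one">Event one</a></li>
    <li class="gong"><a href="/e/2.htm" title="Event two">Event two</a></li>
    <li class="other"><a href="/x.htm" title="Skipped">Skipped</a></li>
  </ul>
</div>
<ul class="bot">
  <li class="l"><a href="/2/28/">previous day</a></li>
  <li class="r"><a href="/3/1/">next day</a></li>
</ul>
"""


def parse_page(html):
    """Return the next-day href and a list of (title, href) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    next_url = soup.find("ul", {"class": "bot"}).find("li", {"class": "r"}).a["href"]
    event_list = soup.find("div", {"class": "main"}).find("ul", {"class": "list clearfix"})
    items = event_list.find_all("li", {"class": "gong"})
    rows = [(li.a["title"], li.a["href"]) for li in items]
    return next_url, rows


next_url, rows = parse_page(SAMPLE_HTML)
print(next_url)
print(rows)
```

Passing the full string "list clearfix" matches the exact value of the class attribute, so the sample has to list the classes in the same order as the lookup.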
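
4. After a page has been crawled, write its items to the CSV file:

def WriteToCSV(items, date):
    # dateSplit = date.split('/')
    # dateDir = dateSplit[0] + ' ' + dateSplit[1]
    # os.mkdir(os.getcwd() + "/files/images/" + dateDir)
    # Write the date as a separator row. A string passed to writerow is split
    # into characters, so with delimiter=' ' "2/29" is written as "2 / 2 9".
    writer = csv.writer(csvFile, delimiter=' ')
    writer.writerow(date)
    # Switch back to the default comma delimiter for the data rows
    writer = csv.writer(csvFile)
    for item in items:
        # try:
        #     imgUrl = item.a["rel"]
        #     imgName = item.a['title']
        #     urlretrieve(imgUrl[0], os.getcwd() + "/files/images/" + dateDir + '/' + imgName + ".jpg")
        #     print(imgUrl[0])
        # except:
        #     pass
        try:
            writer.writerow((item.a['title'], item.a['href']))
        except Exception:
            # Skip entries that have no link or no title
            pass

(Note: the commented-out parts download the images that accompany some entries. Image downloading did not go smoothly, so that code is commented out, but the program still runs. Once all the images have been downloaded they could be used to build a small showcase site; for now that is just an idea of mine.)

The try and except blocks keep the program from crashing when it runs into an error.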
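Two details of the writing step are easy to miss: how csv.writer treats a plain string, and what the except clause actually skips. The sketch below shows both with an in-memory buffer. Plain dictionaries stand in for the BeautifulSoup items, and catching only KeyError is a narrower alternative to the original handler; both are assumptions made for illustration.

```python
import csv
import io

buffer = io.StringIO()

# A plain string is treated as a sequence of one-character fields
csv.writer(buffer, delimiter=" ").writerow("2/29")

# A one-element list keeps the whole string as a single field
csv.writer(buffer).writerow(["2/29"])

# Hypothetical items; the second one has no title and is skipped
items = [{"title": "Event one", "href": "/e/1.htm"}, {"href": "/e/2.htm"}]
writer = csv.writer(buffer)
skipped = 0
for item in items:
    try:
        writer.writerow((item["title"], item["href"]))
    except KeyError:
        skipped += 1  # count the skipped entry instead of ignoring it silently

lines = buffer.getvalue().splitlines()
print(lines)    # ['2 / 2 9', '2/29', 'Event one,/e/1.htm']
print(skipped)  # 1
```

Counting the skipped entries makes it possible to tell afterwards whether the error handler hid a handful of malformed items or most of the page.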

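5. A year has at most 366 days, so a for loop with 366 iterations does the crawling. Once everything has been written, close the CSV file:

# Starting page given at the top of this post
startUrl = 'http://www.lssdjt.com/2/29/'

for _ in range(366):
    # Strip a trailing '/' so the last two parts are the month and the day
    startUrlSplit = startUrl.rstrip('/').split('/')
    startUrl, items = WriteItems(startUrl)
    date = startUrlSplit[-2] + '/' + startUrlSplit[-1]
    WriteToCSV(items, date)

csvFile.close()

Download link for all the runnable code: http://download.csdn.net/download/zhangwellyear/10003634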
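The loop depends on two small pieces of URL handling: extracting the month and day from the current URL, and turning the href of the next-day link into something urlopen can fetch. The post does not say whether that href is relative or absolute, so the sketch below handles both cases with the standard library. The helper names `date_from_url` and `resolve_next` and the example href "/3/1/" are assumptions.

```python
from urllib.parse import urljoin, urlparse


def date_from_url(url):
    """Return 'month/day' from a URL such as http://www.lssdjt.com/2/29/."""
    # Drop empty segments so a trailing slash does not shift the indices
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[-2] + "/" + parts[-1]


def resolve_next(current_url, href):
    """Turn the href of the next-day link into an absolute URL."""
    return urljoin(current_url, href)


start = "http://www.lssdjt.com/2/29/"
print(date_from_url(start))                               # 2/29
print(resolve_next(start, "/3/1/"))                       # http://www.lssdjt.com/3/1/
print(resolve_next(start, "http://www.lssdjt.com/3/1/"))  # http://www.lssdjt.com/3/1/
```

urljoin leaves an absolute href unchanged, so the same call is safe whichever form the site uses.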
