Python Web Scraper Practice

2016-12-06 21:55

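Today's scraping exercise targets the career center website of my university. It is a basic scraper, well suited to beginners.

It uses requests and BeautifulSoup.

The problems I ran into were garbled text (mojibake) and irregular URLs:

The href scraped from the page is a relative path beginning with ../, so it cannot be opened directly as a link.

The code is as follows: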

# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Download url and return its text decoded as UTF-8, or None on failure."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print("Connection failed, reason:", e)
        return None
    # requests appears to guess ISO-8859-1 for this site, which garbles the
    # Chinese text, so force UTF-8 before reading .text.
    response.encoding = 'utf-8'
    return response.text


def get_subject(url):
    """Return the first link on the home page that points to zdgz.htm."""
    html = fetch(url)
    if html is None:
        return None
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=re.compile(r'zdgz\.htm'))
    return links[0] if links else None


def enter_zdgz(base, link):
    """Return the links on the zdgz page whose title matches the December schedule."""
    # This step matters: the URL is the base joined with the href of the link.
    info = fetch(base + link['href'])
    if info is None:
        return []
    soup = BeautifulSoup(info, 'html.parser')
    return soup.find_all('a', title=re.compile(r'安排表（12月）'))


def get_html(base, link):
    """Strip the leading '../' from the href and download the schedule page."""
    href = link['href']
    # Keep only the part after '../' so it can be appended to the base URL.
    temp = href.split('../')[-1]
    print("Full schedule link:", base + temp)
    return fetch(base + temp)


def get_info(html):
    """Write the text of every matching schedule table to info.txt."""
    try:
        with open('info.txt', 'w', encoding='utf-8') as f:
            soup = BeautifulSoup(html, 'html.parser')
            infos = soup.find_all('table', style=re.compile(r'width: 565px;border-collapse: collapse'))
            for info in infos:
                f.write(info.get_text())
    except IOError as e:
        print("File error: " + str(e))


def main():
    base = 'http://jiuye.xupt.edu.cn/'
    link = get_subject(base)
    if link is None:
        return
    for test in enter_zdgz(base, link):
        print(test)
        html = get_html(base, test)
        if html is None:
            continue
        print(html)
        get_info(html)


if __name__ == '__main__':
    main()
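
The garbled-text problem mentioned at the start typically appears when UTF-8 bytes are decoded as ISO-8859-1, which is the encoding requests falls back to when a text response declares no charset in its headers. The sketch below reproduces that mistake and its reversal using plain strings. The sample text is made up, and that this is exactly what happened on this site is an assumption.

```python
# Sample text invented for this sketch.
original = '就业中心'

# What the server sends: UTF-8 encoded bytes.
raw = original.encode('utf-8')

# What you see if those bytes are decoded with the wrong codec.
garbled = raw.decode('iso-8859-1')

# ISO-8859-1 maps every byte to exactly one character, so the mistake is
# reversible: re-encode to recover the bytes, then decode them as UTF-8.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```

With requests, the same result can usually be had more directly by setting response.encoding = 'utf-8' before reading response.text, or by handing response.content (the raw bytes) to BeautifulSoup.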

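A few points in the code above are worth noting:
1. When get_subject looks up the target link, it may find more than one, because find_all() returns a list. Pick the one you need for your case. When I scraped Douban movies earlier, for example, I handled this with a for loop.

2. To get the actual URL from a tag you have found, simply use link['href'].

3. To deal with the irregular URLs, my first idea was string replacement.

For some reason calling replace on the string directly had no effect, so I switched to a regular expression. (Note that replace returns a new string and leaves the original unchanged, so its result has to be assigned.)

For today's problem, though, the regular expression did not work either. In # str=re.compile('../') each unescaped . matches any character, so ../ matches any two characters followed by /, not just a literal ../.

So in the end I used Python's string split() function, and that solved the problem. It really is a handy function.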
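
The approaches from note 3 can be compared side by side. This is a minimal sketch: the page URL and the href are made up, and urljoin from the standard library is swapped in as one more option that the original code does not use.

```python
import re
from urllib.parse import urljoin

# Hypothetical page URL and href, invented for this sketch.
page_url = 'http://example.com/news/zdgz.htm'
href = '../info/1001.htm'

# str.replace returns a new string; the original is left unchanged.
replaced = href.replace('../', '')

# Unescaped, '.' matches any character, so '../' also matches 'fo/'.
unescaped = re.sub('../', '', href)

# Escaping the dots restricts the match to a literal '../'.
escaped = re.sub(r'\.\./', '', href)

# split() as used in the article: keep what follows the '../'.
after_split = href.split('../')[1]

# urljoin resolves the relative href against the page it was found on.
absolute = urljoin(page_url, href)

for name, value in [('replace', replaced), ('unescaped regex', unescaped),
                    ('escaped regex', escaped), ('split', after_split),
                    ('urljoin', absolute)]:
    print(f'{name}: {value}')
```

Only the unescaped regex damages the path. Unlike the other options, urljoin needs the URL of the page that contained the link rather than the site root, and it also handles hrefs that do not start with ../.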

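Remaining questions:

1. In the final step, where the content to be written to the text file is extracted, the regular expression used to identify the tag is a poor fit:

infos=soup.find_all('table',style=re.compile(r'width: 565px;border-collapse: collapse'))

Matching on this exact attribute value may be precise, but what if a site with similar content uses 566px or 564px here?

2. How can the scraped content be written to the text file in a structured format? (I have actually done this before.)

3. How can the scraped content be written to a database?

4. Could the text to match be entered from the keyboard, with the program returning the result?

5. How can a Python program be turned into a standalone application?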
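
Questions 1 and 2 could be approached along the following lines. This is only a sketch under stated assumptions: the style values and table rows are made up, and the names STYLE_PATTERN, is_schedule_style and write_rows are hypothetical helpers, not part of the original scraper.

```python
import csv
import io
import re

# Accept any pixel width, and tolerate optional spaces after ':' and ';'.
STYLE_PATTERN = re.compile(r'width:\s*\d+px;\s*border-collapse:\s*collapse')


def is_schedule_style(style):
    """Return True if a style attribute value looks like the schedule table's."""
    # The value is None when a tag has no style attribute at all.
    return bool(style) and STYLE_PATTERN.search(style) is not None


def write_rows(rows, out):
    """Write each row (a list of cell strings) as one tab-separated line."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerows(rows)


# Hypothetical style values and table rows, invented for this sketch.
styles = [
    'width: 565px;border-collapse: collapse',
    'width: 566px; border-collapse: collapse',
    'width: 100%',
    None,
]
matches = [is_schedule_style(s) for s in styles]

buffer = io.StringIO()
write_rows([['Date', 'Company', 'Room'], ['12-08', 'Example Co.', 'A101']], buffer)
text = buffer.getvalue()
print(matches)
print(text)
```

With BeautifulSoup, either STYLE_PATTERN or is_schedule_style can be passed as the style argument, for example soup.find_all('table', style=STYLE_PATTERN). Writing to a real file instead of the in-memory buffer only requires passing a file opened with encoding='utf-8' and newline=''.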
