
Scraping the Yaozh (药智网) Basic Information Database of Chinese Medicinal Materials

2015-10-31 11:22

I have recently been working on a series of web scrapers. I have only just started, so the code is still rough.

Below is the code I wrote to scrape the basic information database of Chinese medicinal materials. Feedback is very welcome.

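Start by looking at the page itself. At first, the content around the </p> tags could not be read, and as a result everything after it could not be read either. To work around this, the code strips the <p> and </p> tags from the page source before parsing it.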
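The tag-stripping step can be sketched in isolation as follows. This is a minimal illustration written for Python 3 (the script in this post targets Python 2); the helper name `strip_p_tags` and the sample markup are made up for the example, and the real page's markup may differ. Unlike the script, this pattern also matches `<p>` tags that carry attributes.

```python
import re

# Matches an opening or closing <p> tag, with or without attributes.
# The \b keeps tags such as <pre> from being matched.
P_TAG = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)


def strip_p_tags(html):
    """Remove <p> and </p> tags but keep the text between them."""
    return P_TAG.sub("", html)


# Hypothetical row with badly nested <p> tags, similar in spirit to the problem page.
sample = "<tr><td><p>Gan Cao</td><td>Glycyrrhiza</p></td></tr>"
print(strip_p_tags(sample))
```

Removing the tags before parsing means the parser never has to guess how to repair the mis-nested paragraphs, which is the same idea the script relies on.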

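Next, the pieces of information turned out to be separated by general whitespace, not just newlines as I had assumed all along (many thanks to the expert who pointed this out!).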
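The difference between removing newlines and removing all whitespace can be seen in a small sketch. It is written for Python 3, and the helper name `squeeze` and the sample string are hypothetical, chosen only to illustrate the point.

```python
import re

# \s+ matches any run of spaces, tabs, carriage returns, and newlines.
WHITESPACE = re.compile(r"\s+")


def squeeze(text):
    """Delete every run of whitespace from text."""
    return WHITESPACE.sub("", text)


# Hypothetical cell text padded with mixed whitespace.
raw = "\n\t  根茎类 \r\n  甘草\t"

newline_only = raw.replace("\n", "")  # tabs, spaces, and \r are left behind
print(repr(newline_only))
print(repr(squeeze(raw)))
```

Note that this removes whitespace inside the text as well, which is acceptable for Chinese cell values but would glue English words together.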

The complete script is:

#coding=utf-8
# Note: this script targets Python 2 (urllib2).
from bs4 import BeautifulSoup
import urllib2
import re
import time


class ZYC():
    def __init__(self):
        # Pose as a browser; useful for sites that reject crawlers
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like '}

    # Fetch one page of the Chinese medicinal materials database
    def gethtml(self, yeshu):
        full_url = "http://db.yaozh.com/zhongyaocai?p=" + str(yeshu)
        req = urllib2.Request(full_url, None, self.headers)
        req_timeout = 5
        response = urllib2.urlopen(req, None, req_timeout)
        html = response.read()
        return html

    # Extract the wanted content and append it to ZYC.txt
    def getinformation(self):
        strip_p = re.compile(r"</?p>")   # matches <p> and </p>
        strip_ws = re.compile(r'\s+')    # matches any run of whitespace
        with open("ZYC.txt", "a") as f:
            for m in range(1, 11):
                # Strip the <p> tags from the page source before parsing
                html = strip_p.sub('', self.gethtml(m))
                soup = BeautifulSoup(html, "html.parser")
                Trlist = soup.find_all('tr')
                if m == 1:
                    # Header row (written once, from the first page).
                    # Iterate over the cell tags only, so the whitespace
                    # text nodes between cells are skipped.
                    for item in Trlist[0].find_all(['th', 'td'], recursive=False):
                        f.write(item.get_text(strip=True).encode('utf-8') + '|')
                # Data rows
                for te in Trlist[1:]:
                    f.write('\n')
                    for item in te.find_all(['th', 'td'], recursive=False):
                        text = strip_ws.sub('', item.get_text(strip=True))
                        f.write(text.encode('utf-8') + '|')
                # range(1, 11) covers pages 1 through 10
                print("--Collected page %d/10--" % m)
                time.sleep(5)


if __name__ == '__main__':
    ZYC().getinformation()
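The row-to-line step of the script can also be tried offline, without the network or Beautiful Soup. The following is only a sketch under stated assumptions: it is written for Python 3 using the standard library's `html.parser`, the names `RowCollector` and `table_to_lines` are hypothetical, and the sample table is invented rather than taken from the site. It joins cells with `|` instead of writing a trailing `|` after every cell as the script does.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Drop all whitespace inside the cell, as the script does.
            self._row.append("".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_lines(html):
    """Return one '|'-joined string per table row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return ["|".join(row) for row in parser.rows]


# Invented sample table; the real page layout is an assumption.
sample = """
<table>
  <tr><th>Name</th><th>Part</th></tr>
  <tr><td> Gan
      Cao </td><td>Root</td></tr>
</table>
"""
for line in table_to_lines(sample):
    print(line)
```

Running the parsing logic on a saved or hand-written table like this makes it easier to check the whitespace handling before sending requests to the site.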
