Crawling 新民网 (xinmin.cn) with Scrapy
2017-03-30 10:16
Writing the Scrapy project
1. Create the Scrapy project:

scrapy startproject xinmin
The resulting project structure:
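(The original screenshot of the layout is not reproduced here; the sketch below is the standard skeleton that scrapy startproject generates. The spider file itself is added by hand in step 3 and its name here is illustrative; middlewares.py only appears in newer Scrapy versions.)

xinmin/
    scrapy.cfg              # deploy configuration
    xinmin/
        __init__.py
        items.py            # field definitions (step 2)
        pipelines.py        # output pipeline (step 5)
        settings.py         # crawl settings (step 4)
        spiders/
            __init__.py
            xinmin_spider.py    # the spider (step 3, added by hand)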
2. Write items.py to define the fields to scrape:
# -*- coding: utf-8 -*-
import scrapy

class XinminItem(scrapy.Item):
    # publisher
    publishername = scrapy.Field()
    # category
    category = scrapy.Field()
    # title
    title = scrapy.Field()
    # body text
    text = scrapy.Field()
    # article URL
    linkurl = scrapy.Field()
    # publish time
    publishtime = scrapy.Field()
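For readers new to Scrapy: an Item behaves like a dict restricted to its declared fields, which catches field-name typos early. A quick illustration (the values are placeholders):

item = XinminItem()
item['title'] = u'some headline'   # ok: declared field
item['titel'] = u'oops'            # raises KeyError: unknown field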
3. Write the spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from xinmin.items import XinminItem

class XinminSpider(CrawlSpider):
    # initialization
    # note: no crawl Rules are defined and parse() is overridden,
    # so a plain scrapy.Spider would behave the same here
    name = "xinmin"
    allowed_domains = ["shanghai.xinmin.cn"]
    start_urls = ['http://shanghai.xinmin.cn/t/gdbd/']

    # parse all news links on a list page
    def parse(self, response):
        selector = Selector(response)
        # collect every news link on the current page
        newslink_list = selector.xpath("//div[contains(@class,'list_list')]/a/@href").extract()
        # follow each link
        for link_url in newslink_list:
            # print("news link:", link_url)
            yield scrapy.Request(link_url, callback=self.parse_item)
        # next page: check the href before concatenating, so a missing
        # "next" link does not raise a TypeError on None
        next_href = response.xpath('//a[@class="unselect"]/@href').extract_first()
        if next_href:
            next_url = self.start_urls[0] + next_href
            print('next_url', next_url)
            yield scrapy.Request(next_url, callback=self.parse)

    # extract the details of a single article
    def parse_item(self, response):
        item = XinminItem()
        selector = Selector(response)
        # article URL
        item['linkurl'] = response.url
        # category
        item['category'] = selector.xpath("//div[contains(@class,'xinminMianbaoxue')]/a[3]/text()").extract()
        # title
        item['title'] = selector.xpath("//h1[contains(@class,'article_title')]/text()").extract()
        # body text
        item['text'] = selector.xpath("//div[contains(@class,'a_p')]/p/text()").extract()
        # publish time and publisher -- these two spans shift position
        # between pages (see the open issues at the end); the XPath is
        # too brittle and will be revised in the next version
        item['publishtime'] = selector.xpath("//div[contains(@class,'info')]/span[3]/text()").extract()
        item['publishername'] = selector.xpath("//div[contains(@class,'info')]/span[4]/text()").extract()
        yield item
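The XPath expressions above can be verified interactively before launching the full crawl, using scrapy shell against one of the list pages:

scrapy shell 'http://shanghai.xinmin.cn/t/gdbd/'
>>> response.xpath("//div[contains(@class,'list_list')]/a/@href").extract()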
4. Adjust settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
COOKIES_ENABLED = False
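One setting the post does not show: the pipeline written in step 5 only runs if it is registered in ITEM_PIPELINES. A minimal sketch, assuming the default module path that scrapy startproject generates:

ITEM_PIPELINES = {
    'xinmin.pipelines.XinminPipeline': 300,
}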
5. Write the pipeline (pipelines.py):
# -*- coding: utf-8 -*-
import time

class XinminPipeline(object):
    # strip whitespace/newlines from every element and join into one string
    # (linkurl is already a string; iterating it char-by-char still works)
    @staticmethod
    def get_list(data_list):
        data_str = ''
        for i in data_list:
            i = str(i).strip()
            data_str = data_str + i
        return data_str

    # append each item to a per-day text file
    def process_item(self, item, spider):
        today = time.strftime("%Y-%m-%d", time.localtime())
        # output file path
        file_path = "../data/" + today + ".txt"
        with open(file_path, "a", encoding="utf-8") as f:
            print('writing to file...')
            f.write("linkurl:" + self.get_list(item['linkurl']) + "\n")
            f.write("title:" + self.get_list(item['title']) + "\n")
            f.write("category:" + self.get_list(item['category']) + "\n")
            f.write("publishername:" + self.get_list(item['publishername']) + "\n")
            f.write("publishtime:" + self.get_list(item['publishtime']) + "\n")
            f.write("text:" + self.get_list(item['text']) + "\n")
            f.write("\n")
            print('written successfully...')
        return item
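With all five pieces in place, the crawl is started from the project root. Note the pipeline writes to ../data/ relative to the working directory, so that directory has to exist first:

mkdir -p ../data
scrapy crawl xinmin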
Remaining issues
1. Two of the scraped fields (publisher, publish time) change position between pages and are not handled; a possible fix is sketched below.
2. Anti-crawling countermeasures are not considered.
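For the first issue, one hedged approach (not in the original post) is to stop relying on span positions and instead pick the span whose text matches a date pattern; the name pick_publishtime and the assumed YYYY-MM-DD format are illustrative:

import re

def pick_publishtime(selector):
    # scan all info spans and return the first that looks like a date,
    # assuming a YYYY-MM-DD form; survives spans shifting position
    for s in selector.xpath("//div[contains(@class,'info')]/span/text()").extract():
        if re.search(r'\d{4}-\d{2}-\d{2}', s):
            return s.strip()
    return ''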