Scraping some information from the Douban Top 250 with Python
2015-12-17 14:15
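Core spider: (a simple reference for beginners; a more advanced version will be posted later. Don't copy it verbatim, and check the XPath expressions yourself, since they may no longer work.)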
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from doubanmovie.items import DoubanmoiveItem


class MoiveSpider(CrawlSpider):
    name = "doubanmovie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/top250"]
    rules = [
        # Follow the pagination links of the Top 250 list (no callback, follow only).
        # https? accepts both schemes, since the site may redirect to HTTPS.
        Rule(LinkExtractor(allow=(r'https?://movie\.douban\.com/top250\?start=\d+.*',))),
        # Pass every movie detail page to parse_item.
        Rule(LinkExtractor(allow=(r'https?://movie\.douban\.com/subject/\d+',)),
             callback="parse_item"),
    ]

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanmoiveItem()
        # Title: text of the first <span> inside the page's <h1>.
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        # Year: the digits inside the parentheses of the second <span>.
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        return item
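
Since the patterns in the spider may go stale, it can help to check them offline before running a crawl. The following is only a minimal sketch using the standard `re` module, not part of the original project: the patterns are modeled on the spider's two rules (with `https?` added as an assumption so that both schemes match), the helper names `classify` and `extract_year` are hypothetical, and the sample URLs and title text are made up for illustration. It also assumes the link extractor applies its `allow` patterns with search semantics.

```python
import re

# Patterns modeled on the spider's two rules; https? is an assumption so that
# both http and https URLs are accepted.
LIST_PAGE = re.compile(r'https?://movie\.douban\.com/top250\?start=\d+.*')
DETAIL_PAGE = re.compile(r'https?://movie\.douban\.com/subject/\d+')

# Same pattern the spider passes to .re() to pull the year out of "(1994)".
YEAR = re.compile(r'\((\d+)\)')


def classify(url):
    """Hypothetical helper: report which rule a URL would fall under."""
    if LIST_PAGE.search(url):
        return 'list'
    if DETAIL_PAGE.search(url):
        return 'detail'
    return None


def extract_year(text):
    """Hypothetical helper: return every parenthesized run of digits in text."""
    return YEAR.findall(text)


if __name__ == '__main__':
    # Made-up sample inputs, for illustration only.
    print(classify('https://movie.douban.com/top250?start=25&filter='))
    print(classify('http://movie.douban.com/subject/1234567/'))
    print(classify('https://movie.douban.com/review/best/'))
    print(extract_year('(1994)'))
```

If a URL copied from the live site comes back as `None` here, the corresponding rule in the spider most likely needs updating as well.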