
Learning the Scrapy Framework - Setting Up the Development Environment

2017-12-24 17:02

Installation

sudo pip3 install scrapy

Test whether the installation succeeded



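(The screenshot that originally appeared here is missing.) A quick way to check the install is to ask Scrapy for its version; if the package is installed correctly, this prints a version string instead of a "command not found" error:

```shell
scrapy version
```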
Create a project



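(The screenshot that originally appeared here is missing.) The project is created with Scrapy's startproject command; the project name myscrapy matches the `from myscrapy.items import MyItem` import used in the spider below:

```shell
scrapy startproject myscrapy
```

This generates a scrapy.cfg file and a myscrapy/ package containing items.py, settings.py, pipelines.py, and a spiders/ directory.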
Create a spider
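(The screenshot that originally appeared here is missing.) A spider skeleton can be generated with the genspider command; the name and domain below match the spider defined next, though the file can also be written by hand:

```shell
cd myscrapy
scrapy genspider myspider docs.scrapy.org
```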

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

# class MyscrapyItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass

class MyItem(scrapy.Item):
    # h1 heading
    h1 = scrapy.Field()
    # h2 heading
    h2 = scrapy.Field()


spiders/myspider.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import scrapy
from myscrapy.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['docs.scrapy.org']
    start_urls = ['https://docs.scrapy.org/en/latest/intro/tutorial.html']

    def parse(self, response):
        # response.body is bytes; response.text is the decoded page
        print('----------\n' + response.text + '----------\n')

        items = []
        # h1: there is only one
        h1 = response.xpath('//h1/text()').extract()[0]
        h1item = MyItem()
        h1item['h1'] = h1
        items.append(h1item)

        # h2: there are several
        h2_list = response.xpath('//div[@class="section"]/h2/text()').extract()
        for h2 in h2_list:
            h2item = MyItem()
            h2item['h2'] = h2
            items.append(h2item)

        return items


Run the spider and save the data as a JSON file

scrapy crawl myspider -o scrapy.json

When the run finishes, if the scrapy.json file is empty, check the log; it reports an error: the connection was refused
Connection was refused by other side: 111: Connection refused.

Steps to troubleshoot this problem:
1. Set a User-Agent in settings.py
2. Set DOWNLOAD_DELAY in settings.py
3. If the two steps above still do not work, run the spider with sudo:
sudo scrapy crawl myspider -o scrapy.json
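Steps 1 and 2 correspond to settings like the following in settings.py (the exact User-Agent string and delay value are only examples; any common browser UA and a small delay will do):

```python
# settings.py

# Identify as a regular browser instead of Scrapy's default User-Agent,
# which some sites reject outright
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')

# Wait (in seconds) between requests so the target site is less likely
# to refuse connections
DOWNLOAD_DELAY = 2
```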

The crawl succeeds!
