
URLError: <urlopen error [Errno 10051] >

2016-09-06 14:58
While writing a small, simple crawler, I ran into the following error when running it from the command line:

Traceback (most recent call last):
  File "E:\Anaconda2\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "E:\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "E:\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "E:\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "E:\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "E:\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10051] >


After searching on Baidu, I found the cause:

That particular error message is being generated by boto (boto 2.38.0 py27_0), which is used to connect to Amazon S3. Scrapy doesn't have this enabled by default.

Solution:

Add the following line to settings.py:

DOWNLOAD_HANDLERS = {'s3': None,}
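If you would rather not touch the project-wide settings.py, the same override can also be applied per spider. This is a minimal sketch, assuming a Scrapy version that supports the custom_settings class attribute (Scrapy 1.0+); the spider name below is just a placeholder:

from scrapy.spider import Spider

class MySpider(Spider):
    # hypothetical spider, shown only to illustrate the per-spider override
    name = "my_spider"
    custom_settings = {
        'DOWNLOAD_HANDLERS': {'s3': None},
    }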

As an extra, here is a very simple spider for 海投网 (xjh.haitou.cc), a site that lists campus recruitment talks:
items.py:
import scrapy
from scrapy.item import Item,Field

class XuanjianghuiItem(Item):
    # define the fields for your item here like:
    title = Field()
    holdTime = Field()
settings.py:
BOT_NAME = 'XuanJiangHui'
SPIDER_MODULES = ['XuanJiangHui.spiders']
NEWSPIDER_MODULE = 'XuanJiangHui.spiders'
DOWNLOAD_HANDLERS = {'s3': None,}
ITEM_PIPELINES = {
    'XuanJiangHui.pipelines.XuanjianghuiPipeline': 300,
}

pipelines.py:
import codecs

class XuanjianghuiPipeline(object):
    def __init__(self):
        # write scraped items to a UTF-8 text file
        self.file = codecs.open('F://XuanJiangHui.txt', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title'].strip()
        holdTime = item['holdTime']
        self.file.write(title + '\n' + holdTime)
        self.file.write('\r\n')
        self.file.write('\r\n')
        return item
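Note that the pipeline above never closes its output file. As a hedged alternative sketch (the class name and output path are illustrative, not part of the original project), Scrapy's standard open_spider/close_spider pipeline hooks can manage the file and write one JSON object per line instead of plain text:

import codecs
import json

class JsonLinesPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = codecs.open('F://XuanJiangHui.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes, so the file is flushed and closed
        self.file.close()

    def process_item(self, item, spider):
        # one JSON object per line; keep non-ASCII characters readable
        line = json.dumps(dict(item), ensure_ascii=False) + u'\n'
        self.file.write(line)
        return item

If you use it, remember to register it in ITEM_PIPELINES instead of (or alongside) the original pipeline.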
XuanJiangHui.py (the spider):
# -*- coding:utf-8 -*-

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from XuanJiangHui.items import XuanjianghuiItem

class XuanjianghuiSpider(Spider):
    name = "XuanJiangHui"
    download_delay = 1
    start_urls = [
        "http://xjh.haitou.cc/wh/uni-1",
        "http://xjh.haitou.cc/bj/uni-13",
        "http://xjh.haitou.cc/cd/uni-147",
        "http://xjh.haitou.cc/hf/uni-47",
        "http://xjh.haitou.cc/gz/uni-32",
        "http://xjh.haitou.cc/gz/uni-34",
        "http://xjh.haitou.cc/gz/uni-36"
    ]
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        # each table row is one recruitment talk
        for tr in sel.xpath('//div[@id="w0"]//tbody/tr'):
            item = XuanjianghuiItem()
            title = tr.xpath('./td[@class="cxxt-title"]/a/@title')
            holdTime = tr.xpath('./td[@class="text-left cxxt-holdtime"]/span[@class="hold-ymd"]/text()')
            item['title'] = title.extract()[0]
            item['holdTime'] = holdTime.extract()[0]
            yield item
        # follow the "next page" link, if present
        urls = sel.xpath('//*[@id="w0"]/ul/li[@class="next"]/a/@href').extract()
        for url in urls:
            url = "http://xjh.haitou.cc" + url
            yield Request(url, headers=self.header, callback=self.parse)
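With the four files in place, run the spider from the project directory using the name declared in the spider class:

scrapy crawl XuanJiangHui

Alternatively, the custom pipeline could be dropped and the items exported with Scrapy's built-in feed export, e.g. scrapy crawl XuanJiangHui -o talks.json (the output filename here is arbitrary).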