
Scrapy documentation study notes (Scrapy tutorial)

2017-06-21 18:48
Studying the Scrapy documentation: the key points, extracted.

Official documentation:

https://docs.scrapy.org/en/latest/topics/spiders.html

The workflow below goes from opening PuTTY, connecting to the Ubuntu virtual environment and entering the workspace, through to successfully scraping a website's content.

docker start mybuntu

Start Docker first, then connect to the virtual environment with PuTTY. Once inside Ubuntu, create the new project there.


Creating a project

scrapy startproject tutorial


This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py


Our first Spider

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)


How to run our spider

Run the spider from the command line, at the project's top-level directory:

scrapy crawl quotes


A 200 response in the log means the request succeeded.

A shortcut to the start_requests method

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)


Extracting data

scrapy shell 'http://quotes.toscrape.com/page/1/'
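One note from the official tutorial: always enclose the URL in quotes when launching the shell, otherwise URLs containing & will not work; on Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"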


You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>


Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]


Note: at this point you may run into an encoding problem. Mine raised UnicodeEncodeError: 'ascii' codec can't encode characters in position 68-73: ordinal not in range(128).

First, run

import sys
sys.getdefaultencoding()


to check Python's default encoding. On Python 3 it is always utf-8, but check it if you want to be sure. I am running Python 3.4 on Linux and the site I was scraping is utf-8 as well, yet the UnicodeEncodeError still appeared: the error actually comes from printing non-ASCII text to a terminal whose locale makes Python encode stdout as ASCII, so neither the default string encoding nor the page's encoding is the culprit.
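A rough way to see this (the value of sys.stdout.encoding depends on your terminal's locale; under a plain C/POSIX locale it falls back to ASCII):

>>> import sys
>>> sys.getdefaultencoding()      # Python 3 string default: always utf-8
'utf-8'
>>> sys.stdout.encoding           # what print() actually uses; ASCII here triggers the error
'ANSI_X3.4-1968'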


The fix: run

vi ~/.bashrc


(vi opens the file in the vi text editor; .bashrc is the configuration file for shell environment variables.)

Inside that file, add the line

export PYTHONIOENCODING=UTF-8

After saving and exiting, run

source ~/.bashrc

and the new setting takes effect.

Running the crawl and shell commands above should now work without the error.
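As a quick sanity check (assuming the variable was exported as above), a fresh Python process should now report UTF-8 for standard output:

python3 -c "import sys; print(sys.stdout.encoding)"
UTF-8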

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()
['Quotes to Scrape']


Or, to get the whole element including its tags:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']


When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'


As an alternative, you could’ve written:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'


Using .extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.
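For instance, with a made-up selector that matches nothing on the page, the two approaches behave like this:

>>> response.css('h3.does-not-exist::text').extract_first() is None
True
>>> response.css('h3.does-not-exist::text')[0].extract()
Traceback (most recent call last):
  ...
IndexError: list index out of range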

You can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']


XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'


Why use XPath?

Using XPath, you're able to select things like: select the link that contains the text "Next Page". This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
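For instance, selecting the "Next" pagination link by its visible text is easy in XPath but not expressible in CSS alone. A minimal sketch against the quotes.toscrape.com markup (result as seen on page 1):

>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'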

Extracting quotes and authors

In [1]: response
Out[1]: <200 http://quotes.toscrape.com/page/1/> 
In [3]: response.css("div.quote")[0]
Out[3]: <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>

In [4]: quote = response.css("div.quote")[0]

In [8]: title=quote.css("span.text::text").extract_first()
In [9]: title
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [10]: author=quote.css("small.author::text").extract_first()
In [11]: author
Out[11]: 'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:
In [12]: tags=quote.css("div.tags a.tag::text").extract()
In [13]: tags
Out[13]: ['change', 'deep-thoughts', 'thinking', 'world']

We can now iterate over all the quote elements and put them together into Python dictionaries:
In [14]: for quote in response.css("div.quote"):
...:     text=quote.css("span.text::text").extract_first()
...:     author=quote.css("small.author::text").extract_first()
...:     tags=quote.css("div.tags a.tag::text").extract()
...:     print(dict(text=text,author=author,tags=tags))
...:
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
...


Extracting data in our spider

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


If you run this spider, it will output the extracted data in the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}


Storing the scraped data

scrapy crawl quotes -o quotes.json


That will generate a quotes.json file containing all scraped items, serialized in JSON.

However, if you run the command twice without deleting the file first, the second run appends to it and you end up with a broken JSON file.

You can also use other formats, like JSON Lines:

scrapy crawl quotes -o quotes.jl


JSON Lines is append-friendly: new records can simply be added to the end of the file, so running the command twice doesn't cause the problem above.
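Since each line of a .jl file is a standalone JSON object, reading the results back is straightforward. A small sketch, assuming the quotes.jl produced above:

import json

with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)              # one scraped item per line
        print(item['author'], item['tags'])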

Following links

That is, extract the links found on a page and then follow them to scrape the linked pages as well.

For example, markup like this:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>


We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'


but we want the attribute href:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
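The same attribute can also be pulled out with the XPath syntax introduced earlier, if you prefer it:

>>> response.xpath('//li[@class="next"]/a/@href').extract_first()
'/page/2/'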


Now modify the spider so that it recursively follows the link to the next page, extracting data from each page:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)


After extracting the data, the parse method looks for the link to the next page, builds an absolute URL with urljoin (because the extracted href may be a relative path), and yields a new Request for the next page, registering itself as the callback to handle that page's extraction. This keeps the spider going until it has crawled all the pages.
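You can see what urljoin does directly in the shell: it resolves the (possibly relative) href against response.url. A quick illustration from page 1:

>>> response.url
'http://quotes.toscrape.com/page/1/'
>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'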

A shortcut for creating Requests

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)


Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. You can also pass it a selector instead of a string, which lets the loop be written even more compactly:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)


For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)


The scraped .json or .jl files end up in the project's root directory (the directory where you run the crawl command).