Scrapy documentation study notes (Scrapy tutorial)
2017-06-21 18:48
Scrapy documentation study: key points
Official documentation:
https://docs.scrapy.org/en/latest/topics/spiders.html
These notes cover the whole process, from opening PuTTY and connecting to the Ubuntu virtual environment and entering the workspace, through successfully scraping the content of a website.
First start the container with docker start mybuntu, then connect to the virtual environment with PuTTY. Once inside Ubuntu, create a new project.
Creating a project
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Our first Spider
Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
How to run our spider
Run the spider from the shell, in the project's top-level directory:
scrapy crawl quotes
A response status of 200 means the request succeeded.
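If you want to check the status programmatically, the response object exposes it as response.status (a minimal sketch of a parse callback, not part of the tutorial spider):

def parse(self, response):
    # response.status holds the HTTP status code of the reply; 200 means success
    self.log('Fetched %s with status %d' % (response.url, response.status))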
A shortcut to the start_requests method
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
Extracting data
scrapy shell 'http://quotes.toscrape.com/page/1/'
You will see something like:
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
Using the shell, you can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
Note: at this point you may run into an encoding problem. In my case it raised UnicodeEncodeError: 'ascii' codec can't encode characters in position 68-73: ordinal not in range(128).
First, run
import sys
sys.getdefaultencoding()
to check Python's default encoding. On Python 3 the default is always utf-8, so you normally don't need to check, but you can verify it if you like. My environment was Python 3.4 on Linux and the site I was scraping was also utf-8 encoded, yet the UnicodeEncodeError still occurred.
Solution: run
vi ~/.bashrc
(vi opens the file in the vi text editor; .bashrc is the shell's environment-variable configuration file)
Inside the file, add the line
export PYTHONIOENCODING=UTF-8
After saving and exiting, run
source ~/.bashrc
to make the new setting take effect.
After that, re-running the crawl commands above no longer raises the error.
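To double-check that the setting took effect, you can inspect the interpreter's I/O encoding in a new shell (a quick sketch; the exact value reported depends on your environment):

>>> import sys
>>> sys.stdout.encoding        # should report 'UTF-8' once PYTHONIOENCODING is exported
'UTF-8'
>>> sys.getdefaultencoding()   # always 'utf-8' on Python 3
'utf-8'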
To extract the text from the title above, you can do:
>>> response.css('title::text').extract()
['Quotes to Scrape']
To get the full element, including its tags:
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
When you know you just want the first result, as in this case, you can do:
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
As an alternative, you could’ve written:
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
However, using .extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.
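For example (a sketch; the h9 selector is just a placeholder for something that matches nothing on the page):

>>> response.css('h9::text').extract_first()   # no match: returns None, no exception
>>> response.css('h9::text')[0].extract()      # no match: raises IndexError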
You can also use the re() method to extract using regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
XPath: a brief intro
Besides CSS, Scrapy selectors also support using XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
Why use XPath?
Using XPath, you're able to select things like: select the link that contains the text "Next Page". This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
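For example, on quotes.toscrape.com the pagination link can be selected by its visible text (a sketch; "Next" is the text of the next-page link on this particular site):

>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'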
Extracting quotes and authors
In [1]: response
Out[1]: <200 http://quotes.toscrape.com/page/1/>

In [3]: response.css("div.quote")[0]
Out[3]: <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>

In [4]: quote = response.css("div.quote")[0]

In [8]: title = quote.css("span.text::text").extract_first()

In [9]: title
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [10]: author = quote.css("small.author::text").extract_first()

In [11]: author
Out[11]: 'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

In [12]: tags = quote.css("div.tags a.tag::text").extract()

In [13]: tags
Out[13]: ['change', 'deep-thoughts', 'thinking', 'world']

We can now iterate over all the quote elements and put them together into a Python dictionary:

In [14]: for quote in response.css("div.quote"):
    ...:     text = quote.css("span.text::text").extract_first()
    ...:     author = quote.css("small.author::text").extract_first()
    ...:     tags = quote.css("div.tags a.tag::text").extract()
    ...:     print(dict(text=text, author=author, tags=tags))
    ...:
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
...
Extracting data in our spider
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
If you run this spider, it will output the extracted data with the log:
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
Storing the scraped data
scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON.
However, if you run the command twice without deleting the file first, the second run will corrupt the original JSON file.
You can also use other formats, like JSON Lines:
scrapy crawl quotes -o quotes.jl
New records are simply appended, so running the command twice does not cause the problem described above.
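Because every line of a .jl file is a standalone JSON object, reading it back is simple (a minimal sketch; assumes quotes.jl was produced by the crawl above):

import json

# Each line is one scraped item; appending later runs never breaks earlier lines.
with open('quotes.jl') as f:
    items = [json.loads(line) for line in f]

print(len(items), items[0]['author'])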
Following links
We can scrape the links on a page and then follow them to scrape the linked pages as well. For example, the source of the "Next" button looks like this:
<ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">→</span></a> </li> </ul>
We can try extracting it in the shell:
>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href:
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
Now modify the spider to recursively follow the link to the next page, extracting data from it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
After extracting the data, the parse method looks for the link to the next page, builds an absolute URL with the urljoin method (because the scraped href attribute may be a relative path), then yields a request for the next page with itself registered as the callback to handle that page's data extraction, so the crawl keeps going through all the pages.
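The urljoin step is what turns the relative href into a full URL. For example, in the shell (assuming response was fetched from page 1 of quotes.toscrape.com):

>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'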
A shortcut for creating Requests
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
response.follow supports relative URLs directly, so there is no need to call urljoin. It also accepts a selector instead of a string, so the code can be simplified further:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
The scraped json or jl files are saved in the project's root directory (where the crawl command was run).