
scrapy - tutorial

2014-05-08 13:04

Install

sudo apt-get install libxml2-dev libxslt1-dev libffi-dev

sudo pip install lxml

git clone git://github.com/scrapy/scrapy.git

cd /path/to/scrapy/

sudo python setup.py install

Usage

[nixawk@core tutorial]$ scrapy -h

Scrapy 0.25.1 - project: tutorial

Usage:

  scrapy <command> [options] [args]

Available commands:

  bench         Run quick benchmark test

  check         Check spider contracts

  crawl         Run a spider

  deploy        Deploy project in Scrapyd target

  edit          Edit spider

  fetch         Fetch a URL using the Scrapy downloader

  genspider     Generate new spider using pre-defined templates

  list          List available spiders

  parse         Parse URL (using its spider) and print the results

  runspider     Run a self-contained spider (without creating a project)

  settings      Get settings values

  shell         Interactive scraping console

  startproject  Create new project

  version       Print Scrapy version

  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

Start a new project

[nixawk@core ~]$ scrapy startproject tutorial

2015-01-20 03:07:20+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: scrapybot)

2015-01-20 03:07:20+0000 [scrapy] INFO: Optional features available: ssl, http11

2015-01-20 03:07:20+0000 [scrapy] INFO: Overridden settings: {}

New Scrapy project 'tutorial' created in:

    /home/notfound/tutorial

You can start your first spider with:

    cd tutorial

    scrapy genspider example example.com

Files

[nixawk@core share]$ tree ./tutorial/

./tutorial/

├── scrapy.cfg

└── tutorial

    ├── __init__.py

    ├── __init__.pyc

    ├── items.py

    ├── items.pyc

    ├── pipelines.py

    ├── settings.py

    ├── settings.pyc

    └── spiders

        ├── __init__.py

        ├── __init__.pyc

        ├── tutorial_spider.py

        └── tutorial_spider.pyc

2 directories, 12 files

Demo – a simple spider

[nixawk@core tutorial]$ cat ./tutorial/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class TutorialItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title = scrapy.Field()

    link = scrapy.Field()

    desc = scrapy.Field()
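A `scrapy.Item` behaves like a dict restricted to its declared fields. As an illustration of that declarative Field pattern only (not Scrapy's actual implementation; the class names here are invented), a minimal pure-Python sketch:

```python
# Minimal sketch of Scrapy's declarative Item/Field pattern,
# without Scrapy itself. Names are illustrative only.

class Field(dict):
    """Placeholder for per-field metadata, like scrapy.Field."""


class Item:
    """Dict-like container that only accepts declared fields."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Collect class attributes that are Field instances.
        cls.fields = {
            name: value
            for name, value in vars(cls).items()
            if isinstance(value, Field)
        }

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("unknown field: %r" % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class TutorialItem(Item):
    title = Field()
    link = Field()
    desc = Field()


item = TutorialItem()
item['title'] = 'Preface'
print(item['title'])  # -> Preface
```

Assigning to a key that was never declared as a `Field` raises `KeyError`, which is what catches typos early when scraping.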
[nixawk@core tutorial]$ cat ./tutorial/spiders/tutorial_spider.py

import scrapy

from tutorial.items import TutorialItem

from pprint import pprint

class TutorialSpider(scrapy.Spider):

    name = "tutorial"

    allowed_domains = ["learnpythonthehardway.org"]

    start_urls = [

        "http://learnpythonthehardway.org/book/"

    ]

    def parse(self, response):

        # response.selector

        # response.selector.xpath()

        # response.selector.css()

        # response.xpath()

        # response.css()

        for sel in response.xpath('//ul[@class="simple"]'):

            item = TutorialItem()

            item['title'] = sel.xpath(

                'li/a[@class="reference external"]/text()').extract()

            item['link'] = sel.xpath(

                'li/a[@class="reference external"]/@href').extract()

            # This demo only pretty-prints each item; a real spider
            # would "yield item" so pipelines and exporters see it.
            pprint(item)
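The XPath expressions in `parse()` can be tried outside Scrapy too. A sketch using the stdlib's `xml.etree.ElementTree` on a hand-made fragment that mirrors the page's structure (the fragment is invented for illustration, not the real page):

```python
# Try the spider's XPath logic outside Scrapy, using only the stdlib.
# The HTML fragment below is invented to mirror the structure of
# http://learnpythonthehardway.org/book/ -- it is not the real page.
import xml.etree.ElementTree as ET

html = """
<div>
  <ul class="simple">
    <li><a class="reference external" href="preface.html">Preface</a></li>
    <li><a class="reference external" href="intro.html">Introduction</a></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
for ul in root.findall('.//ul[@class="simple"]'):
    # Same relative paths the spider uses with response.xpath().
    titles = [a.text for a in ul.findall('li/a[@class="reference external"]')]
    links = [a.get('href') for a in ul.findall('li/a[@class="reference external"]')]
    print(titles)  # ['Preface', 'Introduction']
    print(links)   # ['preface.html', 'intro.html']
```

Note that ElementTree only supports a subset of XPath and requires well-formed markup; Scrapy's selectors (built on lxml) are far more tolerant of real-world HTML.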

Result

[nixawk@core tutorial]$ scrapy crawl tutorial

2015-01-20 03:00:11+0000 [scrapy] INFO: Scrapy 0.25.1 started (bot: tutorial)

2015-01-20 03:00:11+0000 [scrapy] INFO: Optional features available: ssl, http11

2015-01-20 03:00:11+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}

/usr/lib/python2.7/site-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.

  verifyHostname, VerificationError = _selectVerifyImplementation()

2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState

2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
ChunkedTransferMiddleware, DownloaderStats

2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware

2015-01-20 03:00:17+0000 [scrapy] INFO: Enabled item pipelines:

2015-01-20 03:00:17+0000 [tutorial] INFO: Spider opened

2015-01-20 03:00:17+0000 [tutorial] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2015-01-20 03:00:17+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

2015-01-20 03:00:18+0000 [tutorial] DEBUG: Crawled (200) <GET http://learnpythonthehardway.org/book/> (referer: None)

{'link': [u'preface.html',

          u'intro.html',

          u'ex0.html',

          u'ex1.html',

          u'ex2.html',

          u'ex3.html',

          u'ex4.html',

          u'ex5.html',

          u'ex6.html',

          u'ex7.html',

          u'ex8.html',

          u'ex9.html',

          u'ex10.html',

          u'ex11.html',

          u'ex12.html',

          u'ex13.html',

          u'ex14.html',

          u'ex15.html',

          u'ex16.html',

          u'ex17.html',

          u'ex18.html',

          u'ex19.html',

          u'ex20.html',

          u'ex21.html',

          u'ex22.html',

          u'ex23.html',

          u'ex24.html',

          u'ex25.html',

          u'ex26.html',

          u'ex27.html',

          u'ex28.html',

          u'ex29.html',

          u'ex30.html',

          u'ex31.html',

          u'ex32.html',

          u'ex33.html',

          u'ex34.html',

          u'ex35.html',

          u'ex36.html',

          u'ex37.html',

          u'ex38.html',

          u'ex39.html',

          u'ex40.html',

          u'ex41.html',

          u'ex42.html',

          u'ex43.html',

          u'ex44.html',

          u'ex45.html',

          u'ex46.html',

          u'ex47.html',

          u'ex48.html',

          u'ex49.html',

          u'ex50.html',

          u'ex51.html',

          u'ex52.html',

          u'advice.html',

          u'next.html',

          u'appendixa.html'],

 'title': [u'Preface',

           u'Introduction: The Hard Way Is Easier',

           u'Exercise 0: The Setup',

           u'Exercise 1: A Good First Program',

           u'Exercise 2: Comments And Pound Characters',

           u'Exercise 3: Numbers And Math',

           u'Exercise 4: Variables And Names',

           u'Exercise 5: More Variables And Printing',

           u'Exercise 6: Strings And Text',

           u'Exercise 7: More Printing',

           u'Exercise 8: Printing, Printing',

           u'Exercise 9: Printing, Printing, Printing',

           u'Exercise 10: What Was That?',

           u'Exercise 11: Asking Questions',

           u'Exercise 12: Prompting People',

           u'Exercise 13: Parameters, Unpacking, Variables',

           u'Exercise 14: Prompting And Passing',

           u'Exercise 15: Reading Files',

           u'Exercise 16: Reading And Writing Files',

           u'Exercise 17: More Files',

           u'Exercise 18: Names, Variables, Code, Functions',

           u'Exercise 19: Functions And Variables',

           u'Exercise 20: Functions And Files',

           u'Exercise 21: Functions Can Return Something',

           u'Exercise 22: What Do You Know So Far?',

           u'Exercise 23: Read Some Code',

           u'Exercise 24: More Practice',

           u'Exercise 25: Even More Practice',

           u'Exercise 26: Congratulations, Take A Test!',

           u'Exercise 27: Memorizing Logic',

           u'Exercise 28: Boolean Practice',

           u'Exercise 29: What If',

           u'Exercise 30: Else And If',

           u'Exercise 31: Making Decisions',

           u'Exercise 32: Loops And Lists',

           u'Exercise 33: While Loops',

           u'Exercise 34: Accessing Elements Of Lists',

           u'Exercise 35: Branches and Functions',

           u'Exercise 36: Designing and Debugging',

           u'Exercise 37: Symbol Review',

           u'Exercise 38: Doing Things To Lists',

           u'Exercise 39: Dictionaries, Oh Lovely Dictionaries',

           u'Exercise 40: Modules, Classes, And Objects',

           u'Exercise 41: Learning To Speak Object Oriented',

           u'Exercise 42: Is-A, Has-A, Objects, and Classes',

           u'Exercise 43: Gothons From Planet Percal #25',

           u'Exercise 44: Inheritance Vs. Composition',

           u'Exercise 45: You Make A Game',

           u'Exercise 46: A Project Skeleton',

           u'Exercise 47: Automated Testing',

           u'Exercise 48: Advanced User Input',

           u'Exercise 49: Making Sentences',

           u'Exercise 50: Your First Website',

           u'Exercise 51: Getting Input From A Browser',

           u'Exercise 52: The Start Of Your Web Game',

           u'Advice From An Old Programmer',

           u'Next Steps',

           u'Appendix A: Command Line Crash Course']}

2015-01-20 03:00:18+0000 [tutorial] INFO: Closing spider (finished)

2015-01-20 03:00:18+0000 [tutorial] INFO: Dumping Scrapy stats:

    {'downloader/request_bytes': 229,

     'downloader/request_count': 1,

     'downloader/request_method_count/GET': 1,

     'downloader/response_bytes': 4297,

     'downloader/response_count': 1,

     'downloader/response_status_count/200': 1,

     'finish_reason': 'finished',

     'finish_time': datetime.datetime(2015, 1, 20, 3, 0, 18, 468030),

     'log_count/DEBUG': 1,

     'log_count/INFO': 3,

     'response_received_count': 1,

     'scheduler/dequeued': 1,

     'scheduler/dequeued/memory': 1,

     'scheduler/enqueued': 1,

     'scheduler/enqueued/memory': 1,

     'start_time': datetime.datetime(2015, 1, 20, 3, 0, 17, 501193)}

2015-01-20 03:00:18+0000 [tutorial] INFO: Spider closed (finished)
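The log line "Enabled item pipelines:" is empty because the generated pipelines.py is a stub. A hedged sketch of a minimal pipeline (the `process_item` hook is Scrapy's real pipeline interface; the class name and the in-memory list are illustrative only):

```python
# A minimal item pipeline sketch for tutorial/pipelines.py.
# process_item() is the standard Scrapy pipeline hook; the class
# name and the in-memory list are just for illustration.

class CollectItemsPipeline(object):

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Called once per item the spider yields; must return the
        # item (or raise DropItem) so later pipelines can run.
        self.items.append(dict(item))
        return item
```

To activate it, register the class in settings.py, e.g. `ITEM_PIPELINES = {'tutorial.pipelines.CollectItemsPipeline': 300}`, and make the spider `yield` its items instead of only pretty-printing them.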