
Notes on the Scrapy Crawling Framework

2016-02-04 09:46

1. Starting a Scrapy spider

Besides the usual scrapy crawl command, you can also start Scrapy from a script through its API.

2. Locating elements with XPath

Firebug (a Firefox plugin)

XPath Helper for Chrome

Several other Firefox plugins
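
An expression copied from one of these tools can be sanity-checked outside the browser before it goes into a spider. The following is only a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports a limited subset of XPath (Scrapy's own selectors accept full XPath 1.0 and tolerate real-world HTML). The markup and the expression are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a real page.
SAMPLE = """
<html><body>
  <div class="post"><a class="title" href="/a">First</a></div>
  <div class="post"><a class="title" href="/b">Second</a></div>
</body></html>
""".strip()

# Hypothetical expression, as it might be copied from a browser tool.
EXPR = ".//div[@class='post']/a[@class='title']"

root = ET.fromstring(SAMPLE)
anchors = root.findall(EXPR)

# Collect the attribute and the text of every matched element.
links = [a.get("href") for a in anchors]
titles = [a.text for a in anchors]
```

Inside a spider callback the same kind of expression would normally be passed to response.xpath(...).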

3. Crawling pages that require login
http://outofmemory.cn/code-snippet/16528/scrapy-again-to-code
4. Random User-Agent

Configure a downloader middleware (DownloaderMiddleware).
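
A minimal sketch of such a middleware follows. It assumes a hand-maintained list of User-Agent strings (the values here are placeholders) and a hypothetical module path myproject.middlewares. A downloader middleware is a plain class, so nothing needs to be imported from Scrapy.

```python
import random

# Placeholder User-Agent strings; replace them with real, current values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/3.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header with a randomly chosen value.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue handling the request normally.
        return None


# To enable it, something like this would go in settings.py
# (the module path is hypothetical):
#
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
#     "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
# }
```

Setting the built-in UserAgentMiddleware to None disables it, so only the random value is ever applied.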

5. Storing items in a database (MySQL as an example)

# This pipeline does not create the table; the websites table must already exist.

import datetime
import logging

import MySQLdb.cursors
from twisted.enterprise import adbapi

logger = logging.getLogger(__name__)


class SQLStorePipeline(object):

    def __init__(self):
        # Twisted connection pool; queries run in worker threads.
        self.dbpool = adbapi.ConnectionPool(
            'MySQLdb', db='mydb', user='myuser', passwd='mypass',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        # Run the database query in the thread pool.
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # Create the record only if it does not exist yet.
        # This whole block runs in its own thread.
        tx.execute("select * from websites where link = %s", (item['link'][0],))
        result = tx.fetchone()
        if result:
            logger.debug("Item already stored in db: %s", item)
        else:
            tx.execute(
                "insert into websites (link, created) "
                "values (%s, %s)",
                (item['link'][0], datetime.datetime.now()))
            logger.debug("Item stored in db: %s", item)

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure describing the error.
        logger.error(failure.getTraceback())


6. Running Scrapy from a script
http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/practices.html#run-from-script