
Larbin learning (3): how to limit the scope of crawling

2010-04-07 14:00
1 In larbin.conf, set noExternalLinks to stop larbin from crawling external sites, that is, any site other than those given as start URLs.

# do you want to follow external links,

noExternalLinks
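To make the idea of an "external" link concrete, here is a minimal sketch of the kind of check such an option implies. It assumes that "external" simply means "a different host name than the page the link was found on". The helpers hostOf and isExternalLink are hypothetical names made up for this illustration and are not taken from larbin's source, whose own test may well differ (for example in how it treats ports or upper/lower case).

```cpp
#include <string>

// Hypothetical helper (not part of larbin): extract the host part of an
// absolute URL such as "http://example.com:8080/path".
// Returns an empty string if the URL has no "://" separator.
std::string hostOf(const std::string& url) {
    const std::string sep = "://";
    std::size_t start = url.find(sep);
    if (start == std::string::npos) return "";
    start += sep.size();
    // The host ends at the first '/', ':', '?' or '#' after the scheme.
    std::size_t end = url.find_first_of("/:?#", start);
    if (end == std::string::npos) return url.substr(start);
    return url.substr(start, end - start);
}

// Hypothetical check: a link counts as external when its host differs
// from the host of the page it was found on (case-sensitive comparison,
// which is a simplification).
bool isExternalLink(const std::string& pageUrl, const std::string& linkUrl) {
    return hostOf(pageUrl) != hostOf(linkUrl);
}
```

With this sketch, isExternalLink("http://example.com/a", "http://other.org/") is true, while a link from http://example.com/a to http://example.com/b is not external and would still be followed.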
2 How to improve the crawling speed

[Quoted from] http://hi.baidu.com/hustwk/blog/item/fd3325dde12598dc8c1029ef.html

1. In larbin.conf, set waitDuration to 1. This gives up on being polite ^_^, but most sites can actually tolerate a value of 1;
2. In types.h, change maxUrlsBySite to 254;
3. In main.cc, modify the code as follows:
// see if we should read again urls in fifowait
// when now is a multiple of 30: set both read counters to the current fifo lengths
if ((global::now % 30) == 0) {
    global::readPriorityWait = global::URLsPriorityWait->getLength();
    global::readWait = global::URLsDiskWait->getLength();
}
// 15 ticks later: reset both read counters to zero
if ((global::now % 30) == 15) {
    global::readPriorityWait = 0;
    global::readWait = 0;
}
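The effect of the two modulo tests can be seen with a small stand-alone sketch. It assumes that global::now advances by one per second and that the loop sees every value. The names WaitAction, actionAt and countRefills are hypothetical and exist only for this illustration, and the period of 300 mentioned afterwards is just a comparison value, not something taken from the quoted post.

```cpp
// Which branch of the snippet above would fire at a given clock value.
enum class WaitAction { Refill, Reset, None };

// Hypothetical model of the two tests: refill when now is a multiple of
// `period`, reset half a period later, otherwise do nothing.
WaitAction actionAt(long now, long period) {
    if (now % period == 0) return WaitAction::Refill;
    if (now % period == period / 2) return WaitAction::Reset;
    return WaitAction::None;
}

// Count how often the refill branch fires while the clock advances one
// tick at a time over [start, start + duration).
int countRefills(long start, long duration, long period) {
    int refills = 0;
    for (long now = start; now < start + duration; ++now) {
        if (actionAt(now, period) == WaitAction::Refill) ++refills;
    }
    return refills;
}
```

Under these assumptions, countRefills(0, 300, 30) gives 10 refills in five minutes, whereas a period of 300 would give only 1, so a shorter period lets the waiting URLs be picked up again much more often. If the real loop skips a second now and then, a refill or reset can be missed for that cycle.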