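Building Your Own Search Engine with the ChilkatDotNet Component

2008-01-02 15:39
ChilkatDotNet is a powerful .NET component that can be used for web-crawling and search tasks, and it is worth a look if you are interested in that area. Below I use it as the basis for a tool that collects email addresses from web pages.

After installing ChilkatDotNet you will find a DLL file in the installation directory. Add a reference to that DLL in your project and you can start building your program.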

Getting Started

This is a very simple "getting started" example for spidering a web site. As you'll see in future examples, the Chilkat Spider library can be used to crawl the Web. For now, we'll concentrate on spidering a single site.

Download Chilkat .NET for 2.0 Framework

Download Chilkat .NET for 1.0/1.1 Framework

// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            // Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

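The loop above leaves each page's HTML in the LastHtml property, which is the natural input for the email-collecting tool mentioned in the introduction. The following is a minimal sketch of that extraction step, written against plain .NET so that it runs without Chilkat. The class name EmailHarvester, the method ExtractEmails, and the sample HTML are illustrative assumptions, and the regular expression is deliberately simple rather than a complete address grammar.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailHarvester
{
    // Simplified pattern: local part, "@", then a dotted domain.
    // It is not a full RFC 5322 matcher and can miss or over-match edge cases.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
        RegexOptions.Compiled);

    // Returns the distinct addresses found in the HTML, in first-seen order.
    // Addresses that differ only by letter case are treated as duplicates.
    public static List<string> ExtractEmails(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        if (string.IsNullOrEmpty(html)) return result;

        foreach (Match m in EmailPattern.Matches(html))
        {
            if (seen.Add(m.Value)) result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical page content; in the crawl loop this would be spider.LastHtml.
        string html = "<a href=\"mailto:support@chilkatsoft.com\">support@chilkatsoft.com</a>";
        foreach (string email in ExtractEmails(html))
        {
            Console.WriteLine(email);
        }
    }
}
```

Inside the crawl loop you would pass spider.LastHtml to ExtractEmails after each successful CrawlNext call and merge the results into one list.
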
Extract HTML Title, Description, Keywords

This example expands on the "getting started" example by showing how to access the HTML title, description, and keywords within each page spidered. These are the contents of the META tags for keywords, description, and title found in the HTML header.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        textBox1.Refresh();

        // The HTML META keywords, title, and description are available in these properties:
        textBox1.Text += spider.LastHtmlTitle + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlDescription + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlKeywords + "\r\n";
        textBox1.Refresh();

        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            // Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

Fetch robots.txt for a Site

The Chilkat Spider library is robots.txt compliant. It automatically fetches a site's robots.txt file and adheres to it. It will not download pages denied by robots.txt. Pages excluded by robots.txt will not appear in the Spider's "unspidered" list. This example shows how to explicitly download and review the robots.txt for a given site.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
string robotsText;
robotsText = spider.FetchRobotsText();
textBox1.Text += robotsText + "\r\n";
textBox1.Refresh();


Avoid URLs Matching Any of a Set of Patterns

Demonstrates how to use "avoid patterns" to prevent spidering any URL that matches a wildcarded pattern. This example avoids URLs containing the substrings "java", "python", or "perl".


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*");
spider.AddAvoidPattern("*python*");
spider.AddAvoidPattern("*perl*");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            // Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
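
To make the wildcard behaviour concrete, here is a small sketch of how a pattern such as "*java*" can be tested against a URL using .NET regular expressions. This is not Chilkat's implementation: the helper name Matches, the case-insensitive comparison, and the sample URLs are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Treats '*' as "any run of characters" and every other character literally.
    // Case-insensitive matching is an assumption made for this sketch.
    public static bool Matches(string pattern, string url)
    {
        string regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
        return Regex.IsMatch(url, regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string[] avoid = { "*java*", "*python*", "*perl*" };
        // Hypothetical URLs used only to exercise the patterns.
        string[] urls =
        {
            "http://www.chilkatsoft.com/java.asp",
            "http://www.chilkatsoft.com/downloads.asp"
        };

        foreach (string url in urls)
        {
            bool skip = Array.Exists(avoid, p => Matches(p, url));
            Console.WriteLine(url + " -> " + (skip ? "avoid" : "crawl"));
        }
    }
}
```

Because the pattern is escaped before the asterisks are expanded, characters such as "?" and "." in a pattern like "*?id=*" or "*.co.uk*" are matched literally.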


Setting a Maximum Response Size

The MaxResponseSize property protects your spider from downloading a page that is too large. By default, MaxResponseSize = 300,000 bytes. Setting it to 0 indicates that there is no maximum. You may set it to a number indicating the maximum number of bytes to download. URLs with response sizes larger than this will be skipped.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// This example demonstrates setting the MaxResponseSize property
// Do not download anything with a response size greater than 100,000 bytes.
spider.MaxResponseSize = 100000;


Setting a Maximum URL Length

The MaxUrlLen property prevents the spider from retrieving URLs that grow too long. The default value of MaxUrlLen is 300.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// This example demonstrates setting the MaxUrlLen property
// Do not add URLs longer than 250 characters to the "unspidered" queue:
spider.MaxUrlLen = 250;
// ...


Using the Disk Cache

The Chilkat Spider component has disk caching capabilities. To set up a disk cache, create a new directory anywhere on your local hard drive and set the CacheDir property to the path. For example, you might create "c:/spiderCache/". The UpdateCache property controls whether downloaded pages are saved to the cache. The FetchFromCache property controls whether the cache is first checked for pages. The LastFromCache property tells whether the last URL fetched came from cache or not.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// Set our cache directory and make sure saving-to-cache and fetching-from-cache
// are both turned on:
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;
// If you run this code twice, you'll find that the 2nd run is extremely fast
// because the pages will be retrieved from cache.
// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");
// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");
// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            // Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }
    // Sleep 1 second before spidering the next URL.
    // The reason for waiting a short time before the next fetch is to prevent
    // undue stress on the web server. However, if the last page was retrieved
    // from cache, there is no need to pause.
    if (spider.LastFromCache != true) {
        spider.SleepMs(1000);
    }
}

Crawling the Web

If the Chilkat Spider component only crawls a single site, how do you crawl the Web? The answer is simple: as you crawl a site, the spider collects outbound links and makes them accessible to you. You may then instantiate an instance of the Spider object for each site, and crawl it. The task of keeping track of what sites you've already crawled is left to you (for now). This example retrieves the home page of http://www.joelonsoftware.com/ and displays the outbound links.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// The Initialize method may be called with just the domain name,
// such as "www.joelonsoftware.com" or a full URL. If you pass only
// the domain name, you must add URLs to the unspidered list by calling
// AddUnspidered. Otherwise, the URL you pass to Initialize is the 1st
// URL in the unspidered list.
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
int i;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}

Get Referenced Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray domainList = new Chilkat.StringArray();
// Set the Unique property so that duplicates are not added.
domainList.Unique = true;
// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}
// Display the domains.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
}
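
For comparison, the same "unique domains" idea can be sketched with the standard .NET types System.Uri and HashSet, which may help when post-processing a list of links outside the spider. The class name DomainCollector, the method UniqueDomains, and the sample links are assumptions for illustration; the sketch does not claim to reproduce exactly what GetDomain returns for every input.

```csharp
using System;
using System.Collections.Generic;

public static class DomainCollector
{
    // Returns each distinct host name once, in first-seen order.
    // Strings that are not absolute URLs are skipped.
    public static List<string> UniqueDomains(IEnumerable<string> urls)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var domains = new List<string>();

        foreach (string url in urls)
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) continue;
            if (seen.Add(uri.Host)) domains.Add(uri.Host);
        }
        return domains;
    }

    public static void Main()
    {
        // Hypothetical outbound links.
        string[] links =
        {
            "http://www.joelonsoftware.com/articles/",
            "http://www.joelonsoftware.com/about/",
            "http://www.chilkatsoft.com/"
        };

        foreach (string domain in UniqueDomains(links))
        {
            Console.WriteLine(domain);
        }
    }
}
```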

Get Base Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs and display the base domain of each one.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray domainList = new Chilkat.StringArray();
// Set the Unique property so that duplicates are not added.
domainList.Unique = true;
// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");
bool success;
success = spider.CrawlNext();
// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}
// Display each domain followed by its base domain.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
    textBox1.Text += spider.GetBaseDomain(domainList.GetString(i))
        + "\r\n" + "\r\n";
    textBox1.Refresh();
}

GetBaseDomain

The GetBaseDomain method is a utility function that converts a domain into a "domain base", which is useful for grouping URLs. For example: abc.chilkatsoft.com, xyz.chilkatsoft.com, and blog.chilkatsoft.com all have the same base domain: chilkatsoft.com. Things get more complicated when considering country domains (.au, .uk, .se, .cn, etc.) and government, state, and .us domains. Also, domains such as blogspot, tripod, geocities, wordpress, etc., are treated specially so that "xyz.blogspot.com" has a base domain of "xyz.blogspot.com". Note: If you find other domains that should be treated similarly to blogspot.com, send a request to support@chilkatsoft.com.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
textBox1.Text += spider.GetBaseDomain("www.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blog.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.news.com.au") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blogs.bbc.co.uk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("xyz.blogspot.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.heaids.org.za") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.hec.gov.pk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.e-mrs.org") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("cra.curtin.edu.au") + "\r\n";
textBox1.Refresh();
// Prints:
//chilkatsoft.com
//chilkatsoft.com
//news.com.au
//bbc.co.uk
//xyz.blogspot.com
//heaids.org.za
//hec.gov.pk
//e-mrs.org
//curtin.edu.au
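
The grouping rule described above can be approximated in a few lines. The sketch below is not Chilkat's algorithm, whose internal rules are not documented here: it uses a tiny hand-written table of two-label suffixes and hosting domains (both assumptions, covering only the examples in this section), and the names BaseDomainSketch and BaseDomainOf are hypothetical. A real tool would more likely consult a complete list such as the Public Suffix List.

```csharp
using System;
using System.Collections.Generic;

public static class BaseDomainSketch
{
    // Illustrative tables only; they cover just the examples in this section.
    private static readonly HashSet<string> TwoLabelSuffixes = new HashSet<string>
        { "com.au", "co.uk", "org.za", "gov.pk", "edu.au" };
    private static readonly HashSet<string> HostingDomains = new HashSet<string>
        { "blogspot.com" };

    public static string BaseDomainOf(string domain)
    {
        string[] labels = domain.ToLowerInvariant().Split('.');
        if (labels.Length <= 2) return string.Join(".", labels);

        string lastTwo = labels[labels.Length - 2] + "." + labels[labels.Length - 1];
        // Keep one extra label when the last two labels are a known suffix
        // or a hosting domain whose subdomains belong to different owners.
        bool special = TwoLabelSuffixes.Contains(lastTwo) || HostingDomains.Contains(lastTwo);
        int keep = special ? 3 : 2;
        return string.Join(".", labels, labels.Length - keep, keep);
    }

    public static void Main()
    {
        string[] samples = { "blog.chilkatsoft.com", "www.news.com.au", "xyz.blogspot.com" };
        foreach (string s in samples)
        {
            Console.WriteLine(s + " -> " + BaseDomainOf(s));
        }
    }
}
```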

CanonicalizeUrl

The CanonicalizeUrl method is a utility function that canonicalizes a URL into a standard form to avoid duplicates. For example, "http://www.chilkatsoft.com/" and "http://www.chilkatsoft.com/default.asp" are the same URL.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// Does a DNS lookup to find the default domain, which may or may not include the "www." depending on the DNS results.
// Also domain names are converted to lowercase:
textBox1.Text += spider.CanonicalizeUrl("http://www.ChilkatSoft.com/") + "\r\n";
textBox1.Refresh();
// CanonicalizeUrl will drop the HTML fragment:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();
// If a username/password is in the URL, it gets dropped:
textBox1.Text += spider.CanonicalizeUrl("http://username:password@www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();
// Port 80 and 443 are dropped:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com:80/purchase2.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("https://www.paypal.com:443/") + "\r\n";
textBox1.Refresh();
// Removes default pages:
// default.asp, index.html, index.htm, default.html, default.htm,
// index.php, index.asp, default.php, .cfm, .aspx, .php3, .pl, .cgi, .txt, .shtml, .phtml
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.php") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.pl") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.htm") + "\r\n";
textBox1.Refresh();
// Output:
//http://chilkatsoft.com/
//http://chilkatsoft.com/purchase2.asp
//http://chilkatsoft.com/purchase2.asp
//http://chilkatsoft.com/purchase2.asp
//https://www.paypal.com/
//http://chilkatsoft.com/
//http://chilkatsoft.com/
//http://chilkatsoft.com/
//http://chilkatsoft.com/
//http://chilkatsoft.com/
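
Most of these normalization steps can be reproduced with System.Uri, which is a convenient way to see what each one does. The sketch below is an approximation under stated assumptions, not the library's implementation: it performs no DNS lookup, so it keeps "www." where the output above drops it, and its default-page list contains only the file names spelled out in the comments above. The names UrlCanonicalizer and Canonicalize are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class UrlCanonicalizer
{
    // Only the default page names listed explicitly in the comments above.
    private static readonly HashSet<string> DefaultPages =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "default.asp", "index.html", "index.htm", "default.html",
            "default.htm", "index.php", "index.asp", "default.php"
        };

    public static string Canonicalize(string url)
    {
        var uri = new Uri(url);

        // Rebuilding from scheme, host, and port drops any user name and password.
        string authority = uri.Scheme + "://" + uri.Host.ToLowerInvariant();
        if (!uri.IsDefaultPort) authority += ":" + uri.Port;

        // Remove a trailing default page such as /index.htm.
        string path = uri.AbsolutePath;
        int lastSlash = path.LastIndexOf('/');
        if (DefaultPages.Contains(path.Substring(lastSlash + 1)))
        {
            path = path.Substring(0, lastSlash + 1);
        }

        // The fragment is never appended, so "#buyZip" and similar anchors are dropped.
        return authority + path + uri.Query;
    }

    public static void Main()
    {
        Console.WriteLine(Canonicalize("http://username:password@www.chilkatsoft.com:80/purchase2.asp#buyZip"));
        Console.WriteLine(Canonicalize("http://www.ChilkatSoft.com/index.htm"));
    }
}
```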

Avoiding Outbound Links Matching Patterns

The spider accumulates outbound links when crawling. Your program may specify any number of "avoid patterns" to prevent any link matching at least one of the wildcarded patterns from being added.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some avoid patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");
bool success;
success = spider.CrawlNext();
// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}
// The output:
//http://www.cheese.com/
//http://www.cheesediaries.com/
//http://www.WisDairy.com/
//http://www.newenglandcheese.com
//http://www.ilovecheese.com
//http://www.cheesefromspain.com
//http://www.realcaliforniacheese.com/
//http://www.frencheese.co.uk/
//http://www.cheesesociety.org/
//http://www.specialcheese.com/queso.htm
//http://www.franceway.com/cheese/intro.htm
//http://www.foodsubs.com/Chesfirm.html
//http://www.cheeseboard.co.uk/
//http://www.thecheeseweb.com/
//http://www.vtcheese.com/
//http://www.coldbacon.com/cheese.html
//http://www.norwegiancheeses.co.uk/
//http://www.reluctantgourmet.com/cheese.htm
//http://www.lancewood.co.za/
//http://www.switzerlandcheese.ca
//http://www.frenchcheese.dk/
//http://www.dolcevita.com/cuisine/cheese/cheese.htm
//http://cheeseisland.net/
//http://www.cheestrings.ca/
//http://www.dreamcheese.co.uk
//http://hgic.clemson.edu/factsheets/HGIC3506.htm
//http://www.epicurious.com/cooking/how_to/food_dictionary/entry?id=1815
//http://www.mousetrapcheese.co.uk
//http://taquitos.net/yum/gc.shtml
//http://www.greek-recipe.com/static/greek-cheese
//http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
//http://www.dairyfarmers.org/engl/recipes/4_1.asp
//http://www.prairieridgecheese.com/wischeesguid.html
//http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Food/Cheese
//http://dmoz.org/about.html
//http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Food/Cheese
// Do it again, but this time with avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");
// Add some avoid patterns:
spider.AddAvoidOutboundLinkPattern("*dmoz.org*");
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.co.uk*");
success = spider.CrawlNext();
textBox1.Text += "-----------------------" + "\r\n";
// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}
// Output:
//http://www.cheese.com/
//http://www.cheesediaries.com/
//http://www.WisDairy.com/
//http://www.newenglandcheese.com
//http://www.ilovecheese.com
//http://www.cheesefromspain.com
//http://www.realcaliforniacheese.com/
//http://www.cheesesociety.org/
//http://www.specialcheese.com/queso.htm
//http://www.franceway.com/cheese/intro.htm
//http://www.foodsubs.com/Chesfirm.html
//http://www.thecheeseweb.com/
//http://www.vtcheese.com/
//http://www.coldbacon.com/cheese.html
//http://www.reluctantgourmet.com/cheese.htm
//http://www.lancewood.co.za/
//http://www.switzerlandcheese.ca
//http://www.frenchcheese.dk/
//http://www.dolcevita.com/cuisine/cheese/cheese.htm
//http://cheeseisland.net/
//http://www.cheestrings.ca/
//http://hgic.clemson.edu/factsheets/HGIC3506.htm
//http://taquitos.net/yum/gc.shtml
//http://www.greek-recipe.com/static/greek-cheese
//http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html
//http://www.dairyfarmers.org/engl/recipes/4_1.asp
//http://www.prairieridgecheese.com/wischeesguid.html

Must-Match Patterns

You may restrict the spider to only follow links that match any one of a set of "must-match" wildcard patterns. The AddMustMatchPattern method can be called repeatedly to add must-match patterns.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some must-match patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
bool success;
success = spider.CrawlNext();
// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}
// The output:
//http://www.backpacker.com
//http://www.cmc.org
//http://www.backpacking.net
//http://www.thebackpacker.com/
//http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping
//http://www.trailspace.com/
//http://www.catskillhikes.com/
//http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm
//http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html
//http://www.yetizone.com/
//http://www.backpackingfun.com
//http://www.freezerbagcooking.com/
//http://www.spadout.com/backpacking/
//http://sierrabackpacker.com
//http://www.abovecalifornia.com/
//http://www.personal.psu.edu/faculty/r/p/rpc1/bbb/
//http://www.thebackpackersguide.com
//http://www.journeywest.com/WB/index.html
//http://www.johann-sandra.com/backpackdir.htm
//http://www.geocities.com/amytys/
//http://www.cloudwalkersbasecamp.com
//http://www.netbackpacking.com
//http://members.tripod.com/~stooges/
//http://www.thebackpackingsite.com
//http://www.thruhikers.com/
//http://www.redcompservices.com/AT/
//http://members.aol.com/CMorHiker/backpack
//http://mywebpages.comcast.net/midwestpacker/
//http://www.midwesthiker.com/
//http://www.WeBackpack.com
//http://www.michiganhiker.com
//http://www.host33.com/backpack/
//http://www.wilderness-backpacking.com
//http://www.thetravelmonkey.net
//http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Outdoors/Hiking/Backpacking
//http://dmoz.org/about.html
//http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Outdoors/Hiking/Backpacking
//http://dmoz.org
//http://dmoz.org/profiles/cdog.html
//http://dmoz.org/profiles/justinwp.html
// Do it again, but this time with must-match and avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
// Add some must-match patterns:
spider.AddMustMatchPattern("*.com/*");
spider.AddMustMatchPattern("*.net/*");
// Add some avoid-patterns:
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");
success = spider.CrawlNext();
textBox1.Text += "-----------------------" + "\r\n";
textBox1.Refresh();
// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}
// Output:
//http://www.thebackpacker.com/
//http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping
//http://www.trailspace.com/
//http://www.catskillhikes.com/
//http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm
//http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html
//http://www.yetizone.com/
//http://www.freezerbagcooking.com/
//http://www.spadout.com/backpacking/
//http://www.abovecalifornia.com/
//http://www.journeywest.com/WB/index.html
//http://www.johann-sandra.com/backpackdir.htm
//http://www.geocities.com/amytys/
//http://www.thruhikers.com/
//http://www.redcompservices.com/AT/
//http://www.midwesthiker.com/
//http://www.host33.com/backpack/

A Simple Web Crawler

This demonstrates a very simple web crawler using the Chilkat Spider component.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();
Chilkat.StringArray seenDomains = new Chilkat.StringArray();
Chilkat.StringArray seedUrls = new Chilkat.StringArray();
seenDomains.Unique = true;
seedUrls.Unique = true;
seedUrls.Append("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");
// Set our outbound URL exclude patterns
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");
// Use a cache so we don't have to re-fetch URLs previously fetched.
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;
while (seedUrls.Count > 0) {
    string url;
    url = seedUrls.Pop();
    spider.Initialize(url);
    // Spider 5 URLs of this domain.
    // but first, save the base domain in seenDomains
    string domain;
    domain = spider.GetDomain(url);
    seenDomains.Append(spider.GetBaseDomain(domain));
    int i;
    bool success;
    for (i = 0; i <= 4; i++) {
        success = spider.CrawlNext();
        if (success != true) {
            break;
        }
        // Display the URL we just crawled.
        textBox1.Text += spider.LastUrl + "\r\n";
        // If the last URL was retrieved from cache,
        // we won't wait. Otherwise we'll wait 1 second
        // before fetching the next URL.
        if (spider.LastFromCache != true) {
            spider.SleepMs(1000);
        }
    }
    // Add the outbound links to seedUrls, except
    // for the domains we've already seen.
    for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
        url = spider.GetOutboundLink(i);
        domain = spider.GetDomain(url);
        string baseDomain;
        baseDomain = spider.GetBaseDomain(domain);
        if (!seenDomains.Contains(baseDomain)) {
            seedUrls.Append(url);
        }
        // Don't let our list of seedUrls grow too large.
        if (seedUrls.Count > 1000) {
            break;
        }
    }
}
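
The bookkeeping in this crawler (a pool of seed URLs plus a set of domains already visited) can be tried out without any network access. The sketch below replaces the spider with a caller-supplied function that returns outbound links, and runs it over a small in-memory link graph. Everything in it is an assumption for illustration: the names FrontierSketch and Crawl, the example hosts, and the choice to group by full host name instead of base domain. Unlike the code above, it also re-checks the host when a URL is taken from the pool, so a host queued twice is still visited only once.

```csharp
using System;
using System.Collections.Generic;

public static class FrontierSketch
{
    // Visits at most maxSites start URLs, one per previously unseen host.
    // fetchLinks stands in for "crawl this site and return its outbound links".
    public static List<string> Crawl(
        string seed, Func<string, IEnumerable<string>> fetchLinks, int maxSites)
    {
        var seenHosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pending = new Stack<string>();
        var visited = new List<string>();
        pending.Push(seed);

        while (pending.Count > 0 && visited.Count < maxSites)
        {
            string url = pending.Pop();
            if (!seenHosts.Add(new Uri(url).Host)) continue; // host already crawled
            visited.Add(url);

            foreach (string link in fetchLinks(url))
            {
                if (!seenHosts.Contains(new Uri(link).Host)) pending.Push(link);
            }
        }
        return visited;
    }

    public static void Main()
    {
        // Hypothetical link graph standing in for real pages.
        var graph = new Dictionary<string, string[]>
        {
            { "http://a.example/", new[] { "http://b.example/", "http://c.example/", "http://a.example/x" } },
            { "http://b.example/", new[] { "http://a.example/", "http://c.example/y" } },
            { "http://c.example/", new string[0] }
        };
        Func<string, IEnumerable<string>> fetch =
            u => graph.ContainsKey(u) ? graph[u] : new string[0];

        foreach (string u in Crawl("http://a.example/", fetch, 10))
        {
            Console.WriteLine(u);
        }
    }
}
```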