
Building Your Own Search Engine with the ChilkatDotNet Component

2008-01-02 15:39

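ChilkatDotNet is a very powerful .NET component that can be used for web-search tasks such as crawling pages; interested readers may want to explore it further. Below, I use it to write a tool that collects email addresses from web pages.

After installing ChilkatDotNet, you will find a DLL file in the installation directory. Add a reference to that DLL in your project and you can start building your program.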

Getting Started

This is a very simple "getting started" example for spidering a web site. As you'll see in future examples, the Chilkat Spider library can be used to crawl the Web. For now, we'll concentrate on spidering a single site.

Download Chilkat .NET for 2.0 Framework

Download Chilkat .NET for 1.0/1.1 Framework

// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
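Since the goal stated at the top is a tool that collects email addresses, here is a minimal sketch of the extraction step. It assumes you pass it the HTML string of a crawled page (for example the value of the LastHtml property mentioned in the comments above). The class name EmailHarvester and the regular expression are illustrative assumptions, not part of the Chilkat library, and the pattern is deliberately simple rather than a full address grammar.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Chilkat: pulls email-like strings out of HTML.
public static class EmailHarvester
{
    // Simple pattern: local part, "@", domain labels, and a final label of 2+ letters.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Returns the distinct addresses found in html, in order of first appearance.
    public static List<string> Extract(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match m in EmailPattern.Matches(html))
        {
            // HashSet.Add returns false for an address we have already recorded.
            if (seen.Add(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

Inside the crawl loop you could call EmailHarvester.Extract(spider.LastHtml) after each successful CrawlNext and append the results to your own list.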
Extract HTML Title, Description, Keywords

This example expands on the "getting started" example by showing how to access the HTML title, description, and keywords within each page spidered. These are the contents of the META tags for keywords, description, and title found in the HTML header.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        textBox1.Refresh();

        // The HTML META keywords, title, and description are available in these properties:
        textBox1.Text += spider.LastHtmlTitle + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlDescription + "\r\n";
        textBox1.Refresh();
        textBox1.Text += spider.LastHtmlKeywords + "\r\n";
        textBox1.Refresh();

        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

Fetch robots.txt for a Site

The Chilkat Spider library is robots.txt compliant. It automatically fetches a site's robots.txt file and adheres to it. It will not download pages denied by robots.txt. Pages excluded by robots.txt will not appear in the Spider's "unspidered" list. This example shows how to explicitly download and review the robots.txt for a given site.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

spider.Initialize("www.chilkatsoft.com");

string robotsText;
robotsText = spider.FetchRobotsText();

textBox1.Text += robotsText + "\r\n";
textBox1.Refresh();
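If you want to review the fetched robots.txt in code rather than by eye, the following sketch may help. It is a simplified reader that collects only the Disallow prefixes applying to all user agents ("*"); it ignores Allow lines and wildcard rules, and it is not how the Chilkat library evaluates robots.txt internally. The class name RobotsRules and both method names are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Chilkat: a simplified robots.txt reader.
public static class RobotsRules
{
    // Collects the Disallow path prefixes in groups that apply to "User-agent: *".
    public static List<string> DisallowedPrefixes(string robotsText)
    {
        var prefixes = new List<string>();
        bool inWildcardGroup = false;
        bool lastWasAgent = false;
        foreach (string rawLine in robotsText.Split('\n'))
        {
            // Strip comments and surrounding whitespace (including a trailing '\r').
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // Consecutive User-agent lines belong to the same group.
                bool isWildcard = value == "*";
                inWildcardGroup = lastWasAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                lastWasAgent = true;
            }
            else
            {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field == "disallow" && inWildcardGroup && value.Length > 0)
                {
                    prefixes.Add(value);
                }
            }
        }
        return prefixes;
    }

    // True when path does not start with any disallowed prefix.
    public static bool IsAllowed(string path, List<string> prefixes)
    {
        foreach (string prefix in prefixes)
        {
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

You could pass robotsText from the example above to DisallowedPrefixes and then check individual paths with IsAllowed.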

Avoid URLs Matching Any of a Set of Patterns

Demonstrates how to use "avoid patterns" to prevent spidering any URL that matches a wildcarded pattern. This example avoids URLs containing the substrings "java", "python", or "perl".


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Avoid URLs matching these patterns:
spider.AddAvoidPattern("*java*");
spider.AddAvoidPattern("*python*");
spider.AddAvoidPattern("*perl*");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}
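To make the idea of a "wildcarded pattern" concrete, here is a small sketch of how a pattern such as "*java*" can be tested against a URL. It translates the pattern into a regular expression in which "*" stands for any run of characters. The class name Wildcard is hypothetical, and case-sensitive matching is an assumption of this sketch; the source does not say how the Chilkat library compares case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Chilkat: illustrates '*' wildcard matching.
public static class Wildcard
{
    // True when the whole of text matches pattern; '*' matches any run of characters.
    public static bool IsMatch(string pattern, string text)
    {
        // Escape every regex metacharacter, then turn the escaped '*' back into ".*".
        string regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
        return Regex.IsMatch(text, regex, RegexOptions.Singleline);
    }

    // True when url matches at least one of the avoid patterns.
    public static bool ShouldAvoid(IEnumerable<string> avoidPatterns, string url)
    {
        foreach (string pattern in avoidPatterns)
        {
            if (IsMatch(pattern, url)) return true;
        }
        return false;
    }
}
```

With the three patterns from the example, a URL such as http://www.chilkatsoft.com/java.asp would be avoided, while the home page would not.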

Setting a Maximum Response Size

The MaxResponseSize property protects your spider from downloading a page that is too large. By default, MaxResponseSize = 300,000 bytes. Setting it to 0 indicates that there is no maximum. You may set it to a number indicating the maximum number of bytes to download. URLs with response sizes larger than this will be skipped.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// This example demonstrates setting the MaxResponseSize property.
// Do not download anything with a response size greater than 100,000 bytes.
spider.MaxResponseSize = 100000;

Setting a Maximum URL Length

The MaxUrlLen property prevents the spider from retrieving URLs that grow too long. The default value of MaxUrlLen is 300.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// This example demonstrates setting the MaxUrlLen property.
// Do not add URLs longer than 250 characters to the "unspidered" queue:
spider.MaxUrlLen = 250;

// ...

Using the Disk Cache

The Chilkat Spider component has disk caching capabilities. To set up a disk cache, create a new directory anywhere on your local hard drive and set the CacheDir property to the path. For example, you might create "c:/spiderCache/". The UpdateCache property controls whether downloaded pages are saved to the cache. The FetchFromCache property controls whether the cache is first checked for pages. The LastFromCache property tells whether the last URL fetched came from cache or not.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// Set our cache directory and make sure saving-to-cache and fetching-from-cache
// are both turned on:
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;

// If you run this code twice, you'll find that the 2nd run is extremely fast
// because the pages will be retrieved from cache.

// The spider object crawls a single web site at a time. As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
int i;
for (i = 0; i <= 9; i++) {
    bool success;
    success = spider.CrawlNext();
    if (success == true) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    // The reason for waiting a short time before the next fetch is to prevent
    // undue stress on the web server. However, if the last page was retrieved
    // from cache, there is no need to pause.
    if (spider.LastFromCache != true) {
        spider.SleepMs(1000);
    }
}

Crawling the Web

If the Chilkat Spider component only crawls a single site, how do you crawl the Web? The answer is simple: as you crawl a site, the spider collects outbound links and makes them accessible to you. You may then create a Spider object for each site and crawl it. The task of keeping track of what sites you've already crawled is left to you (for now). This example retrieves the home page of http://www.joelonsoftware.com/ and displays the outbound links.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The Initialize method may be called with just the domain name,
// such as "www.joelonsoftware.com" or a full URL. If you pass only
// the domain name, you must add URLs to the unspidered list by calling
// AddUnspidered. Otherwise, the URL you pass to Initialize is the 1st
// URL in the unspidered list.
spider.Initialize("www.joelonsoftware.com");

spider.AddUnspidered("http://www.joelonsoftware.com/");

bool success;
success = spider.CrawlNext();

// Display the outbound links collected from the page.
int i;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}

Get Referenced Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

Chilkat.StringArray domainList = new Chilkat.StringArray();

// Set the Unique property so that duplicates are not added.
domainList.Unique = true;

// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");

bool success;
success = spider.CrawlNext();

// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}

// Display the domains.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
}
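The same "unique domains" idea can be sketched with only the .NET base class library, which may be useful if you want to post-process a list of links outside the spider. This is an illustration under assumptions: the class name DomainCollector is hypothetical, System.Uri is used in place of GetDomain, and a HashSet plays the role of a StringArray with Unique set to true.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Chilkat: collects distinct host names from URLs.
public static class DomainCollector
{
    // Returns each host once, lower-cased, in order of first appearance.
    // Strings that are not absolute URLs are skipped.
    public static List<string> UniqueHosts(IEnumerable<string> urls)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var hosts = new List<string>();
        foreach (string url in urls)
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) continue;
            string host = uri.Host.ToLowerInvariant();
            if (seen.Add(host))
            {
                hosts.Add(host);
            }
        }
        return hosts;
    }
}
```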

Get Base Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs and display the base domain of each.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

Chilkat.StringArray domainList = new Chilkat.StringArray();

// Set the Unique property so that duplicates are not added.
domainList.Unique = true;

// Crawl the home page of joelonsoftware.com and get the outbound URLs
spider.Initialize("www.joelonsoftware.com");
spider.AddUnspidered("http://www.joelonsoftware.com/");

bool success;
success = spider.CrawlNext();

// Build a list of unique domains.
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    url = spider.GetOutboundLink(i);
    domainList.Append(spider.GetDomain(url));
}

// Display each domain followed by its base domain.
for (i = 0; i <= domainList.Count - 1; i++) {
    textBox1.Text += domainList.GetString(i) + "\r\n";
    textBox1.Refresh();
    textBox1.Text += spider.GetBaseDomain(domainList.GetString(i))
        + "\r\n" + "\r\n";
    textBox1.Refresh();
}

GetBaseDomain

The GetBaseDomain method is a utility function that converts a domain into a "domain base", which is useful for grouping URLs. For example: abc.chilkatsoft.com, xyz.chilkatsoft.com, and blog.chilkatsoft.com all have the same base domain: chilkatsoft.com. Things get more complicated when considering country domains (.au, .uk, .se, .cn, etc.) and government, state, and .us domains. Also, domains such as blogspot, tripod, geocities, wordpress, etc., are treated specially so that "xyz.blogspot.com" has a base domain of "xyz.blogspot.com". Note: If you find other domains that should be treated similarly to blogspot.com, send a request to support@chilkatsoft.com.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

textBox1.Text += spider.GetBaseDomain("www.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blog.chilkatsoft.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.news.com.au") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("blogs.bbc.co.uk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("xyz.blogspot.com") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.heaids.org.za") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.hec.gov.pk") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("www.e-mrs.org") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.GetBaseDomain("cra.curtin.edu.au") + "\r\n";
textBox1.Refresh();

// Prints:
// chilkatsoft.com
// chilkatsoft.com
// news.com.au
// bbc.co.uk
// xyz.blogspot.com
// heaids.org.za
// hec.gov.pk
// e-mrs.org
// curtin.edu.au
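To show why base domains are harder than "keep the last two labels", here is a rough heuristic that reproduces the outputs listed above. It is only a sketch: the class name BaseDomain, the short list of second-level labels, and the list of hosting domains are assumptions chosen for this illustration. The real GetBaseDomain method may use different and more complete rules, and a production tool would normally consult a maintained suffix list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Chilkat: a rough base-domain heuristic.
public static class BaseDomain
{
    // Labels that commonly sit directly under a two-letter country code (assumed list).
    private static readonly HashSet<string> SecondLevelLabels =
        new HashSet<string> { "com", "co", "org", "gov", "edu", "net", "ac" };

    // Hosting domains whose subdomains are treated as separate sites (assumed list).
    private static readonly HashSet<string> HostingDomains =
        new HashSet<string> { "blogspot.com", "tripod.com", "geocities.com", "wordpress.com" };

    public static string Of(string host)
    {
        string[] labels = host.ToLowerInvariant().Split('.');
        int n = labels.Length;
        if (n <= 2) return string.Join(".", labels);

        int keep = 2;
        // e.g. news.com.au, bbc.co.uk: country code preceded by a generic label.
        if (labels[n - 1].Length == 2 && SecondLevelLabels.Contains(labels[n - 2]))
        {
            keep = 3;
        }
        // e.g. xyz.blogspot.com: keep the user's subdomain.
        if (HostingDomains.Contains(labels[n - 2] + "." + labels[n - 1]))
        {
            keep = 3;
        }
        // Join the last 'keep' labels.
        return string.Join(".", labels, n - keep, keep);
    }
}
```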

CanonicalizeUrl

The CanonicalizeUrl method is a utility function that canonicalizes a URL into a standard form to avoid duplicates. For example, "http://www.chilkatsoft.com/" and "http://www.chilkatsoft.com/default.asp" are the same URL.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// Does a DNS lookup to find the default domain, which may or may not include the "www." depending on the DNS results.
// Also domain names are converted to lowercase:
textBox1.Text += spider.CanonicalizeUrl("http://www.ChilkatSoft.com/") + "\r\n";
textBox1.Refresh();

// CanonicalizeUrl will drop the HTML fragment:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();

// If a username/password is in the URL, it gets dropped:
textBox1.Text += spider.CanonicalizeUrl("http://username:password@www.chilkatsoft.com/purchase2.asp#buyZip") + "\r\n";
textBox1.Refresh();

// Port 80 and 443 are dropped:
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com:80/purchase2.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("https://www.paypal.com:443/") + "\r\n";
textBox1.Refresh();

// Removes default pages:
// default.asp, index.html, index.htm, default.html, default.htm,
// index.php, index.asp, default.php, .cfm, .aspx, .php3, .pl, .cgi, .txt, .shtml, .phtml
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.asp") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.php") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.pl") + "\r\n";
textBox1.Refresh();
textBox1.Text += spider.CanonicalizeUrl("http://www.chilkatsoft.com/index.htm") + "\r\n";
textBox1.Refresh();

// Output:
// http://chilkatsoft.com/
// http://chilkatsoft.com/purchase2.asp
// http://chilkatsoft.com/purchase2.asp
// http://chilkatsoft.com/purchase2.asp
// https://www.paypal.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
// http://chilkatsoft.com/
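Several of the normalization steps described in the comments above can be sketched with System.Uri alone. The following is an illustration, not the library's implementation: the class name UrlCanonicalizer and its list of default page names are assumptions, and the sketch performs no DNS lookup, so unlike the output shown above it leaves a leading "www." in place.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Chilkat: a partial URL canonicalizer.
public static class UrlCanonicalizer
{
    // Assumed set of file names treated as a directory's default page.
    private static readonly HashSet<string> DefaultPages =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "index.html", "index.htm", "index.asp", "index.php",
            "default.html", "default.htm", "default.asp", "default.php"
        };

    public static string Canonicalize(string url)
    {
        var uri = new Uri(url);

        // Drop a trailing default page name, keeping the directory's slash.
        string path = uri.AbsolutePath;
        int slash = path.LastIndexOf('/');
        if (DefaultPages.Contains(path.Substring(slash + 1)))
        {
            path = path.Substring(0, slash + 1);
        }

        // Lower-case the host and omit the port when it is the scheme's default.
        string authority = uri.Host.ToLowerInvariant();
        if (!uri.IsDefaultPort)
        {
            authority += ":" + uri.Port;
        }

        // User name, password, and fragment are left out by construction.
        return uri.Scheme + "://" + authority + path + uri.Query;
    }
}
```

Rebuilding the string from individual Uri properties keeps the sketch explicit about which parts survive: scheme, host, non-default port, path, and query.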

Avoiding Outbound Links Matching Patterns

The spider accumulates outbound links when crawling. Your program may specify any number of "avoid patterns" to prevent any link matching at least one of the wildcarded patterns from being added.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some avoid patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");

bool success;
success = spider.CrawlNext();

// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}

// The output:

//http://www.cheese.com/

//http://www.cheesediaries.com/

//http://www.WisDairy.com/

//http://www.newenglandcheese.com

//http://www.ilovecheese.com

//http://www.cheesefromspain.com

//http://www.realcaliforniacheese.com/

//http://www.frencheese.co.uk/

//http://www.cheesesociety.org/

//http://www.specialcheese.com/queso.htm

//http://www.franceway.com/cheese/intro.htm

//http://www.foodsubs.com/Chesfirm.html

//http://www.cheeseboard.co.uk/

//http://www.thecheeseweb.com/

//http://www.vtcheese.com/

//http://www.coldbacon.com/cheese.html

//http://www.norwegiancheeses.co.uk/

//http://www.reluctantgourmet.com/cheese.htm

//http://www.lancewood.co.za/

//http://www.switzerlandcheese.ca

//http://www.frenchcheese.dk/

//http://www.dolcevita.com/cuisine/cheese/cheese.htm

//http://cheeseisland.net/

//http://www.cheestrings.ca/

//http://www.dreamcheese.co.uk

//http://hgic.clemson.edu/factsheets/HGIC3506.htm

//http://www.epicurious.com/cooking/how_to/food_dictionary/entry?id=1815

//http://www.mousetrapcheese.co.uk

//http://taquitos.net/yum/gc.shtml

//http://www.greek-recipe.com/static/greek-cheese

//http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html

//http://www.dairyfarmers.org/engl/recipes/4_1.asp

//http://www.prairieridgecheese.com/wischeesguid.html

//http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Food/Cheese

//http://dmoz.org/about.html

//http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Food/Cheese

// Do it again, but this time with avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Food/Cheese/");

// Add some avoid patterns:
spider.AddAvoidOutboundLinkPattern("*dmoz.org*");
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.co.uk*");

success = spider.CrawlNext();

textBox1.Text += "-----------------------" + "\r\n";

// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
}

// Output:

//http://www.cheese.com/

//http://www.cheesediaries.com/

//http://www.WisDairy.com/

//http://www.newenglandcheese.com

//http://www.ilovecheese.com

//http://www.cheesefromspain.com

//http://www.realcaliforniacheese.com/

//http://www.cheesesociety.org/

//http://www.specialcheese.com/queso.htm

//http://www.franceway.com/cheese/intro.htm

//http://www.foodsubs.com/Chesfirm.html

//http://www.thecheeseweb.com/

//http://www.vtcheese.com/

//http://www.coldbacon.com/cheese.html

//http://www.reluctantgourmet.com/cheese.htm

//http://www.lancewood.co.za/

//http://www.switzerlandcheese.ca

//http://www.frenchcheese.dk/

//http://www.dolcevita.com/cuisine/cheese/cheese.htm

//http://cheeseisland.net/

//http://www.cheestrings.ca/

//http://hgic.clemson.edu/factsheets/HGIC3506.htm

//http://taquitos.net/yum/gc.shtml

//http://www.greek-recipe.com/static/greek-cheese

//http://www.park.org/Netherlands/pavilions/food_and_markets/cheese/introduction.html

//http://www.dairyfarmers.org/engl/recipes/4_1.asp

//http://www.prairieridgecheese.com/wischeesguid.html

Must-Match Patterns

You may restrict the spider to only follow links that match any one of a set of "must-match" wildcard patterns. The AddMustMatchPattern method can be called repeatedly to add must-match patterns.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// First, we'll get the outbound links for a page in the
// Google directory. Then we'll add some must-match patterns
// and then re-fetch, to see it work...
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");

bool success;
success = spider.CrawlNext();

// Display the outbound links
int i;
string url;
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}

// The output:

//http://www.backpacker.com

//http://www.cmc.org

//http://www.backpacking.net

//http://www.thebackpacker.com/

//http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping

//http://www.trailspace.com/

//http://www.catskillhikes.com/

//http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm

//http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html

//http://www.yetizone.com/

//http://www.backpackingfun.com

//http://www.freezerbagcooking.com/

//http://www.spadout.com/backpacking/

//http://sierrabackpacker.com

//http://www.abovecalifornia.com/

//http://www.personal.psu.edu/faculty/r/p/rpc1/bbb/

//http://www.thebackpackersguide.com

//http://www.journeywest.com/WB/index.html

//http://www.johann-sandra.com/backpackdir.htm

//http://www.geocities.com/amytys/

//http://www.cloudwalkersbasecamp.com

//http://www.netbackpacking.com

//http://members.tripod.com/~stooges/

//http://www.thebackpackingsite.com

//http://www.thruhikers.com/

//http://www.redcompservices.com/AT/

//http://members.aol.com/CMorHiker/backpack

//http://mywebpages.comcast.net/midwestpacker/

//http://www.midwesthiker.com/

//http://www.WeBackpack.com

//http://www.michiganhiker.com

//http://www.host33.com/backpack/

//http://www.wilderness-backpacking.com

//http://www.thetravelmonkey.net

//http://dmoz.org/cgi-bin/add.cgi?where=Recreation/Outdoors/Hiking/Backpacking

//http://dmoz.org/about.html

//http://dmoz.org/cgi-bin/apply.cgi?where=Recreation/Outdoors/Hiking/Backpacking

//http://dmoz.org

//http://dmoz.org/profiles/cdog.html

//http://dmoz.org/profiles/justinwp.html

// Do it again, but this time with must-match and avoid patterns.
spider.Initialize("directory.google.com");
spider.AddUnspidered("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");

// Add some must-match patterns:
spider.AddMustMatchPattern("*.com/*");
spider.AddMustMatchPattern("*.net/*");

// Add some avoid-patterns:
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");

success = spider.CrawlNext();

textBox1.Text += "-----------------------" + "\r\n";
textBox1.Refresh();

// Display the outbound links
for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
    textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
    textBox1.Refresh();
}

// Output:

//http://www.thebackpacker.com/

//http://www.rei.com/online/store/LearnShareArticlesList?categoryId=Camping

//http://www.trailspace.com/

//http://www.catskillhikes.com/

//http://gorp.away.com/gorp/location/asia/nepal/favpicks.htm

//http://www.backpackinglight.com/cgi-bin/backpackinglight/index.html

//http://www.yetizone.com/

//http://www.freezerbagcooking.com/

//http://www.spadout.com/backpacking/

//http://www.abovecalifornia.com/

//http://www.journeywest.com/WB/index.html

//http://www.johann-sandra.com/backpackdir.htm

//http://www.geocities.com/amytys/

//http://www.thruhikers.com/

//http://www.redcompservices.com/AT/

//http://www.midwesthiker.com/

//http://www.host33.com/backpack/
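The two lists of links above are consistent with a simple rule: keep a link only if it matches at least one must-match pattern and none of the avoid patterns. The sketch below implements that rule over a plain list of strings. It is an interpretation of the listed output, not a statement about how the Chilkat library works internally; the class name LinkFilter and its method names are hypothetical, and matching is assumed to be case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Chilkat: combines must-match and avoid patterns.
public static class LinkFilter
{
    // True when the whole of text matches pattern; '*' matches any run of characters.
    private static bool WildcardMatch(string pattern, string text)
    {
        string regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
        return Regex.IsMatch(text, regex, RegexOptions.Singleline);
    }

    private static bool MatchesAny(IEnumerable<string> patterns, string text)
    {
        foreach (string pattern in patterns)
        {
            if (WildcardMatch(pattern, text)) return true;
        }
        return false;
    }

    // Keeps a link if it matches some must-match pattern (or none were given)
    // and matches no avoid pattern. Order of the input is preserved.
    public static List<string> Filter(
        IEnumerable<string> links, ICollection<string> mustMatch, ICollection<string> avoid)
    {
        var kept = new List<string>();
        foreach (string link in links)
        {
            bool passesMustMatch = mustMatch.Count == 0 || MatchesAny(mustMatch, link);
            if (passesMustMatch && !MatchesAny(avoid, link))
            {
                kept.Add(link);
            }
        }
        return kept;
    }
}
```

Note that "*.com/*" requires a slash after ".com", which would explain why http://www.backpacker.com (no trailing slash) is absent from the second list.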

A Simple Web Crawler

This demonstrates a very simple web crawler using the Chilkat Spider component.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

Chilkat.StringArray seenDomains = new Chilkat.StringArray();
Chilkat.StringArray seedUrls = new Chilkat.StringArray();

seenDomains.Unique = true;
seedUrls.Unique = true;

seedUrls.Append("http://directory.google.com/Top/Recreation/Outdoors/Hiking/Backpacking/");

// Set our outbound URL exclude patterns
spider.AddAvoidOutboundLinkPattern("*?id=*");
spider.AddAvoidOutboundLinkPattern("*.mypages.*");
spider.AddAvoidOutboundLinkPattern("*.personal.*");
spider.AddAvoidOutboundLinkPattern("*.comcast.*");
spider.AddAvoidOutboundLinkPattern("*.aol.*");
spider.AddAvoidOutboundLinkPattern("*~*");

// Use a cache so we don't have to re-fetch URLs previously fetched.
spider.CacheDir = "c:/spiderCache/";
spider.FetchFromCache = true;
spider.UpdateCache = true;

while (seedUrls.Count > 0) {
    string url;
    url = seedUrls.Pop();
    spider.Initialize(url);

    // Spider 5 URLs of this domain,
    // but first, save the base domain in seenDomains
    string domain;
    domain = spider.GetDomain(url);
    seenDomains.Append(spider.GetBaseDomain(domain));

    int i;
    bool success;
    for (i = 0; i <= 4; i++) {
        success = spider.CrawlNext();
        if (success != true) {
            break;
        }

        // Display the URL we just crawled.
        textBox1.Text += spider.LastUrl + "\r\n";

        // If the last URL was retrieved from cache,
        // we won't wait. Otherwise we'll wait 1 second
        // before fetching the next URL.
        if (spider.LastFromCache != true) {
            spider.SleepMs(1000);
        }
    }

    // Add the outbound links to seedUrls, except
    // for the domains we've already seen.
    for (i = 0; i <= spider.NumOutboundLinks - 1; i++) {
        url = spider.GetOutboundLink(i);
        domain = spider.GetDomain(url);
        string baseDomain;
        baseDomain = spider.GetBaseDomain(domain);
        if (!seenDomains.Contains(baseDomain)) {
            seedUrls.Append(url);
        }

        // Don't let our list of seedUrls grow too large.
        if (seedUrls.Count > 1000) {
            break;
        }
    }
}
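The bookkeeping in the crawler above (a list of pending seeds plus a list of domains already seen) can be studied without any network access. The sketch below runs the same idea over a small in-memory link graph. Everything in it is hypothetical: the class name FrontierDemo, the method CrawlOrder, and the graph itself stand in for real fetches. It also differs from the listing in two ways: it uses a first-in, first-out queue instead of taking URLs with Pop, and it marks a host as seen when it is queued rather than when it is crawled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo, not part of Chilkat: crawl-frontier bookkeeping on a fake web.
public static class FrontierDemo
{
    // Visits hosts breadth-first starting at seedHost, never visiting a host twice,
    // and stops once maxHosts hosts have been visited. linksByHost maps each host
    // to the hosts its pages link to.
    public static List<string> CrawlOrder(
        IDictionary<string, List<string>> linksByHost, string seedHost, int maxHosts)
    {
        var seen = new HashSet<string>();
        var frontier = new Queue<string>();
        var order = new List<string>();

        seen.Add(seedHost);
        frontier.Enqueue(seedHost);

        while (frontier.Count > 0 && order.Count < maxHosts)
        {
            string host = frontier.Dequeue();
            order.Add(host);

            List<string> outbound;
            if (!linksByHost.TryGetValue(host, out outbound)) continue;

            foreach (string next in outbound)
            {
                // Queue only hosts we have not already queued or visited.
                if (seen.Add(next))
                {
                    frontier.Enqueue(next);
                }
            }
        }
        return order;
    }
}
```

The maxHosts limit plays a role similar to the 1000-entry cap on seedUrls: it keeps an otherwise open-ended crawl bounded.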
