Dissecting The Nutch Crawler - Command "generate": net.nutch.tools.FetchListTool
2006-08-04 23:00
Original English source: DissectingTheNutchCrawler
Please credit the source when reposting this article: http://blog.csdn.net/pwlazy
> Usage: FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
FetchListTool is used to create one or more "segments". From the tutorial:
<blockquote>
Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
a "fetchlist": file that names the pages to be fetched
the "fetcher output": set of files containing the fetched pages
the "index" is a Lucene-format index of the fetcher output
</blockquote>
Within CrawlTool.main(), FetchListTool.main() is invoked once per "depth" value with two arguments: (dir + "/db", dir + "/segments"). After processing args, it creates an instance of itself, calls "flt.emitFetchList()", then returns.
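The per-depth invocation can be pictured with a small shell sketch. This is only an illustration under stated assumptions: `generate_stub` is a hypothetical stand-in that prints its arguments instead of running FetchListTool, and `run_generate_rounds`, `crawl_dir`, and `depth` are names invented for the example. Whatever else CrawlTool.main() does in each round is left out.

```shell
#!/bin/sh
# Hypothetical stand-in for FetchListTool.main(): it only prints the
# two arguments it receives, it does not touch any database.
generate_stub() {
    printf 'generate %s %s\n' "$1" "$2"
}

# Sketch of the loop described above: one "generate" call per depth
# value, always with <dir>/db and <dir>/segments as arguments.
# usage: run_generate_rounds <crawl_dir> <depth>
run_generate_rounds() {
    crawl_dir=$1
    depth=$2
    i=0
    while [ "$i" -lt "$depth" ]; do
        generate_stub "$crawl_dir/db" "$crawl_dir/segments"
        i=$((i + 1))
    done
}
```

Calling `run_generate_rounds crawl 3` prints the same `generate crawl/db crawl/segments` line three times, once per round.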
Let's run FetchListTool to see what it changes on disk. Note that we have to specify the webdb directory, plus another directory where segments are written to.
Note that no changes occurred under the webdb dir ("spam"), but a new segments directory was created, and data+index files created therein.
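One way to check the "no changes" observation without comparing listings by eye is to snapshot the directory before and after a command. The sketch below is an assumption-laden illustration, not part of Nutch: `snapshot_dir` and `unchanged_by` are hypothetical helper names, and the comparison relies on `cksum`, so it detects added, removed, or modified files but ignores timestamps.

```shell
#!/bin/sh
# Hypothetical helper: prints one "checksum size path" line for every
# regular file under the given directory, in a stable order.
snapshot_dir() {
    find "$1" -type f -exec cksum {} + | LC_ALL=C sort
}

# Hypothetical helper: runs a command and succeeds only if the watched
# directory has the same files with the same contents afterwards.
# usage: unchanged_by <dir> <command> [args...]
# e.g.:  unchanged_by spam bin/nutch generate spam spam_segments
unchanged_by() {
    watched_dir=$1
    shift
    before=$(snapshot_dir "$watched_dir")
    "$@" >/dev/null 2>&1
    after=$(snapshot_dir "$watched_dir")
    [ "$before" = "$after" ]
}
```

The exit status of `unchanged_by` can then drive a message, for example `unchanged_by spam bin/nutch generate spam spam_segments && echo "webdb untouched"`.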
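The "generate" command corresponds to the net.nutch.tools.FetchListTool class.
It generates the segments that are to be fetched.
The class is invoked as follows:
FetchListTool <db_dir> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
FetchListTool creates one or more segments. As the tutorial explains:
Each segment is a set of pages that are fetched and indexed as a unit. Segment data consists of the following types:
"fetchlist": a file that names the pages to be fetched
"fetcher output": the set of files containing the fetched pages
"index": a Lucene-format index of the fetcher output
In CrawlTool.main(), FetchListTool.main() is invoked once per depth with two arguments, dir + "/db" and dir + "/segments" (translator's note: dir is the value of the -dir argument passed when invoking CrawlTool). After processing its arguments, it creates an instance of its own class, calls emitFetchList(), and returns.
Let's run FetchListTool and see what it changes on disk. Note that we specify both the webdb directory and the segments directory.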
$ bin/nutch generate spam spam_segments
$ find spam -type f | xargs ls -l
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbreadlock
-rw-r--r-- 1 kangas users 0 Oct 25 20:18 spam/dbwritelock
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByMD5/index
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/data
-rw-r--r-- 1 kangas users 16 Oct 25 20:18 spam/webdb/linksByURL/index
-rw-r--r-- 1 kangas users 89 Oct 25 20:18 spam/webdb/pagesByMD5/data
-rw-r--r-- 1 kangas users 97 Oct 25 20:18 spam/webdb/pagesByMD5/index
-rw-r--r-- 1 kangas users 115 Oct 25 20:18 spam/webdb/pagesByURL/data
-rw-r--r-- 1 kangas users 58 Oct 25 20:18 spam/webdb/pagesByURL/index
-rw-r--r-- 1 kangas users 17 Oct 25 20:18 spam/webdb/stats
$ find spam_segments/ -type f | xargs ls -l
-rw-r--r-- 1 kangas users 113 Oct 25 20:18 spam_segments/20041026001828/fetchlist/data
-rw-r--r-- 1 kangas users 40 Oct 25 20:18 spam_segments/20041026001828/fetchlist/index
The result shows that nothing changed under the webdb directory, but a new segments directory was created, along with its data and index files.
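Because the new segment gets a generated name, later steps usually need to look it up. The following is a minimal sketch of one way to do that, assuming segment names are fixed-width timestamps like 20041026001828 in the listing above, so that sorting them as text also sorts them by time. `latest_segment` is a hypothetical helper name, not a Nutch command.

```shell
#!/bin/sh
# Hypothetical helper: prints the newest segment directory found under
# the given segments directory, or nothing if there is none.
# Assumes timestamp-style names, so text order equals time order.
# usage: latest_segment <segments_dir>
latest_segment() {
    segments_dir=$1
    for entry in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -d "$entry" ] && printf '%s\n' "${entry%/}"
    done | LC_ALL=C sort | tail -n 1
}
```

With the listing above, `latest_segment spam_segments` would print `spam_segments/20041026001828`, since that is the only segment present.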