Dissecting The Nutch Crawler - Command "inject": net.nutch.db.WebDBInjector
2006-08-04 18:44
Original English source: DissectingTheNutchCrawler
Please credit the source when reposting this article: http://blog.csdn.net/pwlazy
> "inject: inject new urls into the database"
> Usage: WebDBInjector <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]
WebDBInjector.main() accepts two input-type options. "-urlfile" parses a simple list of URLs with one URL per line. "-dmozfile" is for parsing DMOZ RDF files, which is useful for bootstrapping a whole-web database.
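The "-urlfile" format is simple enough to illustrate in a few lines. The following is a minimal sketch of reading a one-URL-per-line list, not the actual WebDBInjector code: the class name UrlFileSketch, the method readUrls, the example URLs, and the choice to trim whitespace and skip blank lines are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: reads a seed list with one URL per line.
public class UrlFileSketch {

    // Returns the non-blank lines of the input, trimmed, in file order.
    static List<String> readUrls(Reader in) throws IOException {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String url = line.trim();
                // Skipping blank lines is an assumption of this sketch.
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file such as spam_url.txt.
        String sample = "http://www.example.com/\n\n  http://www.example.org/page.html  \n";
        for (String url : readUrls(new StringReader(sample))) {
            System.out.println(url);
        }
    }
}
```

A real injector would presumably also normalize and filter each URL before storing it; the sketch stops at reading the list.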
Let's see how it works. Create a file with one URL, then run "bin/nutch inject":
$ vi spam_url.txt
$ bin/nutch inject spam -urlfile spam_url.txt
$ find spam -type f | xargs ls -l
-rw-r--r-- 1 kangas users 0 Oct 25 18:57 spam/dbreadlock
-rw-r--r-- 1 kangas users 0 Oct 25 18:57 spam/dbwritelock
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByMD5/data
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByMD5/index
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByURL/data
-rw-r--r-- 1 kangas users 16 Oct 25 18:57 spam/webdb/linksByURL/index
-rw-r--r-- 1 kangas users 89 Oct 25 18:57 spam/webdb/pagesByMD5/data
-rw-r--r-- 1 kangas users 97 Oct 25 18:57 spam/webdb/pagesByMD5/index
-rw-r--r-- 1 kangas users 115 Oct 25 18:57 spam/webdb/pagesByURL/data
-rw-r--r-- 1 kangas users 58 Oct 25 18:57 spam/webdb/pagesByURL/index
-rw-r--r-- 1 kangas users 17 Oct 25 18:57 spam/webdb/stats
We can see that a new "stats" file was created, and the data/index files in the "pagesBy..." directories were modified.
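To check which files an inject run touched, one option is to record file sizes before and after and compare the two listings. The sketch below is a rough Java equivalent of the find/ls pipeline above; the class name DbDirSnapshot and the method snapshot are hypothetical and not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: records the size of every regular file under a directory.
public class DbDirSnapshot {

    // Maps each regular file's path (relative to root) to its size in bytes, sorted by path.
    static Map<String, Long> snapshot(Path root) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)) {
                    sizes.put(root.relativize(p).toString(), Files.size(p));
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DbDirSnapshot <db_dir>");
            return;
        }
        // Print "size<TAB>relative path" for each file, e.g. for the "spam" directory.
        snapshot(Paths.get(args[0]))
            .forEach((file, size) -> System.out.println(size + "\t" + file));
    }
}
```

Running it once before and once after "bin/nutch inject" and comparing the output shows new files and size changes. A file rewritten with the same size would not show up, so this is only a coarse check.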