Basic Nutch Commands

2016-04-25 11:11
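1: Export the data Nutch stores in HBase to text files: ./nutch readdb -dump /data/nutch_db/1108 -crawlId TestCrawl -content
This runs a MapReduce job. /data/nutch_db/1108 is the job's output path, and TestCrawl is the first part (the prefix) of the HBase table name.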
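To make the table-name remark concrete, here is a small illustrative sketch. `table_name` is a hypothetical helper, not part of Nutch, and the `webpage` suffix assumes the default schema name has not been changed in the configuration.

```shell
# Hypothetical helper: derive the HBase table name for a given crawl ID.
# Assumption: the default schema name "webpage" is in use.
table_name() {
    crawl_id=$1
    if [ -n "$crawl_id" ]; then
        # A non-empty crawl ID is used as a prefix of the table name.
        printf '%s_webpage\n' "$crawl_id"
    else
        printf 'webpage\n'
    fi
}

table_name TestCrawl
```

Under these assumptions, the dump command above would read from the table TestCrawl_webpage.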
2: Commands listed in the usage output of the nutch script:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
3: Usage of the crawl script: ./crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
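4: Running a crawl step by step. inject puts the seed URLs into the HBase table whose name starts with the crawl ID, e.g. bbs_webpage for a crawl ID of bbs:
    ./nutch inject /nutch/urls/seed.txt -crawlId 7day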
    ./nutch generate -topN 5 -crawlId 7day
    ./nutch fetch -all -crawlId 7day -threads 20
    ./nutch parse -all -crawlId 7day
    ./nutch updatedb -all -crawlId 7day
    ./nutch index -D solr.server.url=http://192.168.4.129:8983/solr/ -all -crawlId 7day
    ./nutch readdb -crawlId 7day_beijing -dump /home/nutch_output/beijing_1/
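The inject / generate / fetch / parse / updatedb steps above can be repeated for several rounds. The following is only a rough sketch of such a loop, not the logic of the bundled crawl script. `NUTCH`, `DRY_RUN`, `run` and `crawl_rounds` are names made up for this example, and the indexing step is left out. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a multi-round crawl built from the step-by-step commands above.
# NUTCH and DRY_RUN are illustrative variables, not Nutch settings.
NUTCH=${NUTCH:-./nutch}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# crawl_rounds <seedFile> <crawlId> <rounds>
crawl_rounds() {
    seed=$1
    crawl_id=$2
    rounds=$3
    # Seed URLs are injected once, before the first round.
    run "$NUTCH" inject "$seed" -crawlId "$crawl_id" || return 1
    i=1
    while [ "$i" -le "$rounds" ]; do
        run "$NUTCH" generate -topN 5 -crawlId "$crawl_id" || return 1
        run "$NUTCH" fetch -all -crawlId "$crawl_id" -threads 20 || return 1
        run "$NUTCH" parse -all -crawlId "$crawl_id" || return 1
        run "$NUTCH" updatedb -all -crawlId "$crawl_id" || return 1
        i=$((i + 1))
    done
}

crawl_rounds /nutch/urls/seed.txt 7day 2
```

Setting DRY_RUN=0 makes the sketch execute the commands instead of printing them; a failing step stops the loop.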
5: The gora-hbase-mapping.xml file defines the column families and the meaning of each column.
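A quick way to see which column families a mapping file declares is to extract them with grep and sed. This is a minimal sketch: `list_families` is a hypothetical helper, and it assumes families are written as `<family name="...">` elements, each with the name attribute first.

```shell
# Hypothetical helper: print the column family names declared in a mapping file.
# Assumption: families appear as <family name="..."> elements.
list_families() {
    grep -o '<family name="[^"]*"' "$1" | sed 's/.*name="\([^"]*\)"/\1/'
}
```

It could be called as `list_families conf/gora-hbase-mapping.xml`, assuming the file sits in the conf directory of the Nutch installation.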