Nutch Basic Commands
2016-04-25 11:11
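1: Export the data Nutch stores in HBase to text files: ./nutch readdb -dump /data/nutch_db/1108 -crawlId TestCrawl -content
This runs a MapReduce job. /data/nutch_db/1108 is the job's output path, and TestCrawl is the first part (the prefix) of the HBase table name.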
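To make the table naming concrete, here is a small sketch that derives the HBase table name from a crawl ID. It assumes the default Nutch 2.x schema name `webpage`, so a crawl ID `X` maps to the table `X_webpage`. `hbase_table_for` is a hypothetical helper written for this note, not something Nutch provides.

```shell
#!/bin/sh
# Hypothetical helper: print the HBase table name used for a given crawl ID.
# Assumption: the default schema name "webpage" is in use, so the table is
# "<crawlId>_webpage", or plain "webpage" when no crawl ID is given.
hbase_table_for() {
  crawl_id="${1:-}"
  if [ -z "$crawl_id" ]; then
    echo "webpage"
  else
    echo "${crawl_id}_webpage"
  fi
}

hbase_table_for TestCrawl
```

With the crawl ID from the command above this prints TestCrawl_webpage, which is the table to look for in the HBase shell if the dump comes out empty.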
2: Commands offered by the nutch script:
inject          inject new urls into the database
hostinject      creates or updates an existing host table from a text file
generate        generate new batches to fetch from crawl db
fetch           fetch URLs marked during generate
parse           parse URLs marked during fetch
updatedb        update web table after parsing
updatehostdb    update host table after parsing
readdb          read/dump records from page database
readhostdb      display entries from the hostDB
elasticindex    run the elasticsearch indexer
solrindex       run the solr indexer on parsed batches
solrdedup       remove duplicates from solr
parsechecker    check the parser for a given url
indexchecker    check the indexing filters for a given url
plugin          load a plugin and run one of its classes main()
nutchserver     run a (local) Nutch server on a user defined port
junit           runs the given JUnit test
or
CLASSNAME       run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
3: Run a whole crawl with the crawl script: ./crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
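4: Run the crawl one step at a time:
./nutch inject /nutch/urls/seed.txt -crawlId 7day
(inject puts the seed URLs into the HBase table whose name starts with the crawl ID, for example bbs_webpage for a crawl ID of bbs)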
./nutch generate -topN 5 -crawlId 7day
./nutch fetch -all -crawlId 7day -threads 20
./nutch parse -all -crawlId 7day
./nutch updatedb -all -crawlId 7day
./nutch index -D solr.server.url=http://192.168.4.129:8983/solr/ -all -crawlId 7day
./nutch readdb -crawlId 7day_beijing -dump /home/nutch_output/beijing_1/
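The generate, fetch, parse and updatedb steps above are normally repeated for several rounds. The sketch below wraps them in a loop, using the same options as the commands above. `NUTCH`, `CRAWL_ID`, `ROUNDS` and `DRY_RUN` are illustrative shell variables chosen for this example, not Nutch settings. By default the script only prints the commands it would run. Inject (once, before the loop) and index (after it) are left out to keep the sketch short.

```shell
#!/bin/sh
# Sketch: repeat generate/fetch/parse/updatedb for a number of rounds.
# All variables below are assumptions made for this example.
NUTCH="${NUTCH:-./nutch}"
CRAWL_ID="${CRAWL_ID:-7day}"
ROUNDS="${ROUNDS:-2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it and stop on failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@" || exit 1
  fi
}

crawl_rounds() {
  round=1
  while [ "$round" -le "$ROUNDS" ]; do
    run "$NUTCH" generate -topN 5 -crawlId "$CRAWL_ID"
    run "$NUTCH" fetch -all -crawlId "$CRAWL_ID" -threads 20
    run "$NUTCH" parse -all -crawlId "$CRAWL_ID"
    run "$NUTCH" updatedb -all -crawlId "$CRAWL_ID"
    round=$((round + 1))
  done
}

crawl_rounds
```

Setting DRY_RUN=0 executes the commands instead of printing them, and the script stops at the first step that fails so a broken round is not followed by an update of the web table.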
5: The file gora-hbase-mapping.xml defines the column families and the meaning of each column.