
Nutch v1.9 Source Code Analysis (6): The Plugin System

2015-01-20 09:44

1 Nutch plugin system

1.1 Why use a plugin mechanism?

1) Separation of concerns, which gives flexibility at both compile time and run time.

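The plugin mechanism is also an application of the "microkernel" pattern: from the kernel's point of view, plugins implement peripheral, auxiliary functionality. At compile time, the mechanism separates the core code that drives the crawl workflow from the functional interfaces that the workflow uses (the extension points). This follows the principle of programming to abstractions and lets core developers and plugin developers work in parallel; the two sides are decoupled through the nutch-default.xml and nutch-site.xml configuration files. Plugins are also compiled independently of the core, so the core builds faster and its logic can be verified sooner. At run time, a plugin is "self-describing" through its plugin.xml file, "self-contained" because its jar and dependent resources live in a folder of their own, and dynamically loaded through PluginClassLoader. Together these properties make a plugin something you can copy, paste, and use out of the box.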
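To make the "self-describing" idea concrete, here is a minimal sketch that reads a plugin descriptor and lists which implementation class is bound to which extension point. The embedded XML is written from memory in the style of the index-basic plugin.xml and should be treated as approximate. The class PluginDescriptorSketch and its method are hypothetical and are not Nutch's own descriptor parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PluginDescriptorSketch {

    // Approximate descriptor in the style of index-basic's plugin.xml (assumption, not copied from Nutch).
    public static final String SAMPLE =
        "<plugin id=\"index-basic\" name=\"Basic Indexing Filter\" version=\"1.0.0\">"
      + "<runtime><library name=\"index-basic.jar\"><export name=\"*\"/></library></runtime>"
      + "<requires><import plugin=\"nutch-extensionpoints\"/></requires>"
      + "<extension id=\"org.apache.nutch.indexer.basic\""
      + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
      + "<implementation id=\"BasicIndexingFilter\""
      + " class=\"org.apache.nutch.indexer.basic.BasicIndexingFilter\"/>"
      + "</extension></plugin>";

    /** Returns implementation class -> extension point for every extension in the descriptor. */
    public static Map<String, String> extensions(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> result = new LinkedHashMap<>();
        NodeList exts = doc.getElementsByTagName("extension");
        for (int i = 0; i < exts.getLength(); i++) {
            Element ext = (Element) exts.item(i);
            NodeList impls = ext.getElementsByTagName("implementation");
            for (int j = 0; j < impls.getLength(); j++) {
                Element impl = (Element) impls.item(j);
                // The descriptor alone tells us which class serves which extension point.
                result.put(impl.getAttribute("class"), ext.getAttribute("point"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        extensions(SAMPLE).forEach((cls, point) -> System.out.println(cls + " -> " + point));
    }
}
```

The point of the sketch is that nothing outside the descriptor is needed to discover what a plugin contributes, which is what allows a plugin folder to be dropped in without touching the core.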

2) Not every user of the crawler (as opposed to its developers) needs every feature

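For example, Nutch ships several kinds of plugins, among them protocol and parser plugins. Protocol plugins handle different network protocols (http, ftp, file, and so on); parser plugins parse web content in different formats (html, pdf, js, swf, and so on). Not every Nutch user needs all of these protocols and file formats at once. The plugin mechanism lets each user enable only the plugins they need, instead of facing an all-or-nothing choice.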
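In Nutch this selection is driven by configuration: the plugin.includes and plugin.excludes properties hold regular expressions that are matched against plugin ids. The sketch below imitates that idea in isolation. The class PluginSelectionSketch, the list of installed ids, and the sample expression are illustrative assumptions, not Nutch code or its shipped default value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginSelectionSketch {

    /** Keeps ids that fully match `includes` and do not fully match `excludes` (empty excludes = exclude nothing). */
    public static List<String> select(List<String> ids, String includes, String excludes) {
        Pattern inc = Pattern.compile(includes);
        Pattern exc = excludes.isEmpty() ? null : Pattern.compile(excludes);
        List<String> active = new ArrayList<>();
        for (String id : ids) {
            boolean included = inc.matcher(id).matches();
            boolean excluded = exc != null && exc.matcher(id).matches();
            if (included && !excluded) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Hypothetical set of plugin folders found on disk.
        List<String> installed = Arrays.asList(
            "protocol-http", "protocol-ftp", "parse-html", "parse-swf", "index-basic", "indexer-solr");
        // Illustrative expression: enable http, html parsing, basic indexing, and the Solr writer only.
        String includes = "protocol-http|parse-(html|tika)|index-(basic|anchor)|indexer-solr";
        System.out.println(select(installed, includes, ""));
        // prints [protocol-http, parse-html, index-basic, indexer-solr]
    }
}
```

Because the whole id must match, index-(basic|anchor) does not accidentally enable indexer-solr; each plugin has to be named by its own alternative.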

3) Maintainability

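This is a typical case of separation of concerns and of programming to interfaces (abstractions). Developers of the Nutch engine only need to focus on the core engine; for the common functional components the engine needs, they design plugin interfaces and hand the implementation to plugin developers. Plugin developers, in turn, only need to focus on the plugin itself and do not need to know how the engine invokes it. The two groups do not interfere with each other and can work in parallel (provided, of course, that the interface stays stable).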
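The division of labor can be sketched in a few lines. Everything here is hypothetical: TitleFilter stands in for an extension point, TrimFilter and TruncateFilter stand in for plugins, and runFilters stands in for the engine. None of these names come from Nutch.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtensionPointSketch {

    /** Hypothetical extension point: the only contract the engine knows about. */
    public interface TitleFilter {
        String filter(String title);
    }

    /** Hypothetical plugin 1, written without any knowledge of the engine. */
    public static class TrimFilter implements TitleFilter {
        @Override
        public String filter(String title) {
            return title.trim();
        }
    }

    /** Hypothetical plugin 2: cuts titles longer than `max` characters. */
    public static class TruncateFilter implements TitleFilter {
        private final int max;

        public TruncateFilter(int max) {
            this.max = max;
        }

        @Override
        public String filter(String title) {
            return title.length() > max ? title.substring(0, max) : title;
        }
    }

    /** The "engine": applies whatever filters it was handed, in order, via the interface only. */
    public static String runFilters(List<TitleFilter> filters, String title) {
        String current = title;
        for (TitleFilter f : filters) {
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<TitleFilter> filters = new ArrayList<>();
        filters.add(new TrimFilter());
        filters.add(new TruncateFilter(5));
        System.out.println(runFilters(filters, "  Nutch plugins  ")); // prints Nutch
    }
}
```

As long as the TitleFilter signature does not change, either side can be rewritten without the other noticing, which is the stability condition mentioned above.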

1.2 Which plugins are built in?

Table 3: Plugins bundled with Nutch

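Each entry lists the plugin folder, then each extension with its extension point and description, then the configurable items as tab-separated rows of key, purpose, and default value.

Plugin folder: creativecommons
Extension: CCIndexingFilter
Extension point: IndexingFilter
Description: At indexing time, parses content published under a Creative Commons license and adds its metadata to the NutchDocument.
Extension: CCParseFilter
Extension point: HtmlParseFilter
Description: While parsing the HTML DOM, adds Creative Commons related metadata to the ParseResult.

Plugin folder: feed
Extension: FeedIndexingFilter
Extension point: IndexingFilter
Description: At indexing time, parses RSS feed content and adds its metadata to the NutchDocument.
Extension: FeedParser
Extension point: Parser
Description: Parses RSS feed content into a ParseResult.

Plugin folder: headings
Extension: HeadingsParseFilter
Extension point: HtmlParseFilter
Description: Extracts the h1 and h2 headings from the HTML DOM and adds them to the metadata.
Configurable items (key, purpose, default value):
headings	Comma-separated list of headings to retrieve from the document	h1,h2
headings.multivalued	Whether to support multivalued headings	false

Plugin folder: index-anchor
Extension: AnchorIndexingFilter
Extension point: IndexingFilter
Description: At indexing time, adds the anchor text of inlinks to the NutchDocument.
Configurable items (key, purpose, default value):
anchorIndexingFilter.deduplicate	Whether identical anchor texts are deduplicated	false

Plugin folder: index-basic
Extension: BasicIndexingFilter
Extension point: IndexingFilter
Description: Adds the following fields to the NutchDocument: domain, host, url, content, title, cache, tstamp. The domain, title, and content fields are controlled by the configuration parameters listed below.
Configurable items (key, purpose, default value):
indexer.add.domain	Whether to add the domain field to the NutchDocument	false
indexer.max.title.length	Length beyond which the title field is truncated	100
indexer.max.content.length	Length beyond which the content field is truncated	-1

Plugin folder: index-metadata
Extension: MetaDataIndexer
Extension point: IndexingFilter
Description: Extracts metadata with the specified keys from the CrawlDb metadata, the parse metadata, and the content metadata.
Configurable items (key, purpose, default value):
index.db.md	Which keys of the CrawlDb metadata to add to the NutchDocument	none
index.parse.md	Which keys of the parse metadata to add to the NutchDocument, as a comma-separated list of keys	metatag.description,metatag.keywords
index.content.md	Which keys of the content metadata to add to the NutchDocument	none

Plugin folder: index-more
Extension: MoreIndexingFilter
Extension point: IndexingFilter
Description: Adds or resets metadata under a fixed set of keys, such as lastModified, contentLength, type, and title.
Configurable items (key, purpose, default value):
moreIndexingFilter.indexMimeTypeParts	Whether to split the MIME type into its sub-parts	true
moreIndexingFilter.mapMimeTypes	Whether to read conf/contenttype-mapping.txt and take the type mapping from it	false

Plugin folder: index-static
Extension: StaticFieldIndexer
Extension point: IndexingFilter
Description: Adds constant fields with their constant values to the NutchDocument.
Configurable items (key, purpose, default value):
index.static	Fields and values passed as a comma-separated list of field:value pairs; the values of a multivalued field are separated by spaces	none

Plugin folder: indexer-dummy
Extension: DummyIndexWriter
Extension point: IndexWriter
Description: Writes the indexer's activity, such as CRUD operations on documents, to the file named by dummy.path. Mainly useful for debugging.
Configurable items (key, purpose, default value):
dummy.path	Path of the output file	none

Plugin folder: indexer-elastic
Extension: ElasticIndexWriter
Extension point: IndexWriter
Description: Supports CRUD operations on NutchDocuments against the Elastic full-text search engine.
Configurable items (key, purpose, default value):
elastic.host	The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.	none
elastic.port	The port to connect to using TransportClient.	9300
elastic.cluster	The cluster name to discover. Either host and port must be defined or cluster.	none
elastic.index	Default index to send documents to.	nutch
elastic.max.bulk.docs	Maximum size of the bulk in number of documents.	250
elastic.max.bulk.size	Maximum size of the bulk in bytes.	2500500

Plugin folder: indexer-solr
Extension: SolrIndexWriter
Extension point: IndexWriter
Description: Supports CRUD operations on NutchDocuments against the Solr full-text search engine.
Configurable items (key, purpose, default value):
solr.mapping.file	Defines the name of the file that will be used in the mapping of internal nutch field names to solr index fields as specified in the target Solr schema.	solrindex-mapping.xml
solr.commit.size	Defines the number of documents to send to Solr in a single update batch. Decrease when handling very large documents to prevent Nutch from running out of memory. NOTE: It does not explicitly trigger a server side commit.	250
solr.commit.index	When closing the indexer, trigger a commit to the Solr server.	true
solr.auth	Whether to enable HTTP basic authentication for communicating with Solr. Use the solr.auth.username and solr.auth.password properties to configure your credentials.	false

Plugin folder: language-identifier
Extension: LanguageIndexingFilter
Extension point: IndexingFilter
Description: Determines the language from the parseData and adds a lang field to the NutchDocument.
Configurable items (key, purpose, default value):
lang.analyze.max.length	The maximum number of bytes of data used to identify the language (0 means full content analysis). The larger this value, the better the analysis, but the slower it is.	2048
lang.extraction.policy	Determines when the plugin uses detection and statistical identification mechanisms. The order in which detect and identify are written determines the extraction policy. The default (detect,identify) means the plugin first tries to extract language info from page headers and metadata and, if that fails, falls back to Tika language identification. Possible values: detect; identify; detect,identify; identify,detect.	detect,identify
lang.identification.only.certain	If set to true with lang.extraction.policy containing identify, the language code returned by Tika will be assigned to the document ONLY if it is deemed certain by Tika.	false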
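The index.static value format listed for the index-static plugin (comma-separated field:value pairs, with spaces separating the values of a multivalued field) is easy to misread, so here is a minimal parsing sketch. StaticFieldsSketch is a hypothetical class written for illustration from that description alone; it is not the StaticFieldIndexer source and its handling of malformed entries is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StaticFieldsSketch {

    /** Parses "field:value,field:v1 v2" into field -> list of values; malformed entries are skipped. */
    public static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            // Split on the first colon only, so the field name is everything before it.
            String[] kv = entry.split(":", 2);
            if (kv.length != 2 || kv[0].trim().isEmpty() || kv[1].trim().isEmpty()) {
                continue; // assumption: silently ignore entries without a name or a value
            }
            fields.put(kv[0].trim(), Arrays.asList(kv[1].trim().split("\\s+")));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical configuration value with one single-valued and one multivalued field.
        System.out.println(parse("source:web, tags:news sports"));
        // prints {source=[web], tags=[news, sports]}
    }
}
```

One consequence of this format is that a value cannot itself contain a comma or a space, since both characters act as separators.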

To be continued...

