
Implementing Categorized Search with Nutch (Part 1): Adding an Index Plugin

2011-04-02 15:06
When using Google you may notice that you can search by category, for example news, blogs, or shopping. This series of articles implements the same capability by adding plugins to Nutch. It assumes the reader already knows Nutch reasonably well: you can build it, do basic configuration, and fetch pages with the Crawl tool Nutch provides.

This article describes how to add our index-type plugin to a Nutch installation.

When inspecting the crawled data with Luke, you will see a dozen or so default fields, such as title, url, and content. We are going to add a type field that records the category of the site.

Create an index-type directory under src/plugin (the directory structure of index-basic is a good reference) and add the three Java files below. Note that the package name must match the directory structure you create.
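For reference, the layout ends up roughly like this (using the com.zju.repu.indexer.type package from this article):

src/plugin/index-type/
    build.xml
    plugin.xml
    src/java/com/zju/repu/indexer/type/
        TypeIndexingFilter.java
        TypeNameFactory.java
        TypeNameSelector.java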

TypeIndexingFilter implements the IndexingFilter interface; this is the entry point the Extension Manager uses.

//src/java/com/zju/repu/indexer/type/TypeIndexingFilter.java
package com.zju.repu.indexer.type;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.hadoop.conf.Configuration;

/** Adds a searchable web-page-type field to each document. */
public class TypeIndexingFilter implements IndexingFilter {

  public static final Log LOG = LogFactory.getLog(TypeIndexingFilter.class);

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Look up the page type by URL and store it in the "type" field.
    String urlString = url.toString();
    String type = TypeNameFactory.getFactory(conf).getWebPageType(urlString);
    doc.add("type", type);
    LOG.debug("Add the web type to the document: " + urlString
        + ", type: " + type);
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // The web page type is un-stored, indexed, and un-tokenized.
    LuceneWriter.addFieldOptions("type", LuceneWriter.STORE.NO,
        LuceneWriter.INDEX.UNTOKENIZED, conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}


TypeNameFactory reads the indexer.type.includes parameter to get the list of type names, then loads the corresponding rules file for each type and stores the resulting selectors in a HashMap.

//src/java/com/zju/repu/indexer/type/TypeNameFactory.java
package com.zju.repu.indexer.type;

import java.util.HashMap;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

public class TypeNameFactory {

  public static final Log LOG = LogFactory.getLog(TypeNameFactory.class);

  static TypeNameFactory nameFactory = null;
  static final String DEFAULT_WEB_TYPE = "others";

  HashMap<String, TypeNameSelector> nameSelectors = new HashMap<String, TypeNameSelector>();
  Configuration conf;

  protected TypeNameFactory(Configuration conf) {
    this.conf = conf;
    init();
  }

  /** Lazily creates the singleton; synchronized so concurrent indexing threads are safe. */
  public static synchronized TypeNameFactory getFactory(Configuration conf) {
    if (nameFactory == null) {
      nameFactory = new TypeNameFactory(conf);
    }
    return nameFactory;
  }

  protected String[] getRulesFile(Configuration conf) {
    String typeNames = conf.get("indexer.type.includes");
    LOG.debug("getRulesFile: " + typeNames);
    // "|" must be escaped: it is the alternation operator in regular expressions.
    return typeNames.split("\\|");
  }

  protected void init() {
    // Load one TypeNameSelector (i.e. one rules file) per configured type name.
    String[] files = getRulesFile(conf);
    for (int index = 0; index < files.length; index++) {
      try {
        nameSelectors.put(files[index], new TypeNameSelector(files[index], conf));
        LOG.debug("Add the web type selection rule to the hashmap: " + files[index]);
      } catch (Exception e) {
        LOG.error("Add the web type failed: " + files[index]);
        LOG.error(e.toString());
      }
    }
  }

  public String getWebPageType(String url) {
    // Return the first type whose rules accept the URL, or the default type.
    for (String typeName : nameSelectors.keySet()) {
      if (nameSelectors.get(typeName).filter(url)) {
        LOG.debug("Get the web type selection rule for: " + url
            + ", type: " + typeName);
        return typeName;
      }
    }
    return DEFAULT_WEB_TYPE;
  }
}
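To see the factory in action, a throwaway driver along these lines should work (TypeNameFactoryDemo is hypothetical, not part of the plugin, and assumes the crawl-urltype-news.txt rules file shown later is on the classpath):

// Illustrative driver only.
package com.zju.repu.indexer.type;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class TypeNameFactoryDemo {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    conf.set("indexer.type.includes", "news|blog");
    // Resolves a URL to its type; falls back to "others" when no rule matches.
    String type = TypeNameFactory.getFactory(conf).getWebPageType("http://news.163.com/world/");
    System.out.println(type); // "news" if the news rules accept this URL
  }
}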


TypeNameSelector corresponds to a single rules file and provides a filter method for matching URLs.

//src/java/com/zju/repu/indexer/type/TypeNameSelector.java
package com.zju.repu.indexer.type;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.PatternSyntaxException;

import org.apache.nutch.urlfilter.api.RegexRule;
import org.apache.nutch.urlfilter.api.Rule;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

public class TypeNameSelector {

  public static final Log LOG = LogFactory.getLog(TypeNameSelector.class);

  private static final String CONFIG_FILE_PREFIX = "crawl-urltype-";
  private static final String CONFIG_FILE_SUFFIX = ".txt";

  /** An array of applicable rules */
  private RegexRule[] rules;

  public TypeNameSelector(String filename, Configuration conf)
      throws IOException, PatternSyntaxException {
    // Rules files live on the classpath (the conf directory), e.g. crawl-urltype-news.txt.
    Reader reader = conf.getConfResourceAsReader(CONFIG_FILE_PREFIX + filename + CONFIG_FILE_SUFFIX);
    if (reader == null) {
      throw new IOException("Rules file not found: "
          + CONFIG_FILE_PREFIX + filename + CONFIG_FILE_SUFFIX);
    }
    rules = readRulesFile(reader);
  }

  /**
   * Read the specified file of rules.
   *
   * @param reader a reader of regular expression rules.
   * @return the corresponding {@link RegexRule} rules.
   */
  private RegexRule[] readRulesFile(Reader reader) throws IOException,
      IllegalArgumentException {
    BufferedReader in = new BufferedReader(reader);
    List<RegexRule> rules = new ArrayList<RegexRule>();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.length() == 0) {
        continue;
      }
      char first = line.charAt(0);
      boolean sign = false;
      switch (first) {
      case '+':
        sign = true;
        break;
      case '-':
        sign = false;
        break;
      case ' ':
      case '\n':
      case '#': // skip blank & comment lines
        continue;
      default:
        throw new IOException("Invalid first character: " + line);
      }
      String regex = line.substring(1);
      if (LOG.isTraceEnabled()) {
        LOG.trace("Adding rule [" + regex + "]");
      }
      RegexRule rule = createRule(sign, regex);
      rules.add(rule);
    }
    return rules.toArray(new RegexRule[rules.size()]);
  }

  protected RegexRule createRule(boolean sign, String regex) {
    return new Rule(sign, regex);
  }

  public synchronized boolean filter(String url) {
    // The first matching rule decides: accept ('+') or reject ('-').
    for (int i = 0; i < rules.length; i++) {
      if (rules[i].match(url)) {
        return rules[i].accept();
      }
    }
    return false;
  }
}


Add org.apache.nutch.urlfilter.api.Rule to lib-regex-filter. RegexRule in lib-regex-filter is abstract, so we supply a concrete subclass backed by java.util.regex:

package org.apache.nutch.urlfilter.api;

import java.util.regex.Pattern;

/** A concrete RegexRule backed by java.util.regex. */
public class Rule extends RegexRule {

  private Pattern pattern;

  public Rule(boolean sign, String regex) {
    super(sign, regex);
    pattern = Pattern.compile(regex);
  }

  public boolean match(String url) {
    return pattern.matcher(url).find();
  }
}
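Note that match() uses find(), so a pattern can match anywhere in the URL; anchor it with ^ when you mean a prefix. A standalone sketch of the semantics (RuleDemo is hypothetical):

package org.apache.nutch.urlfilter.api;

// Hypothetical demo of Rule's matching semantics, not part of the plugin.
public class RuleDemo {
  public static void main(String[] args) {
    Rule news = new Rule(true, "^http://news\\.163\\.com/");
    System.out.println(news.match("http://news.163.com/world/")); // true: anchored prefix found
    System.out.println(news.match("http://www.163.com/news/"));   // false: no such prefix
  }
}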


Add a build.xml. The plugin uses the lib-regex-filter library, so the build must reference it:

<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project name="index-type" default="jar-core">
<import file="../build-plugin.xml"/>

<!-- Build compilation dependencies -->
<target name="deps-jar">
<ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
<ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
</target>
<!-- Add compilation dependencies to classpath -->
<path id="plugin.deps">
<fileset dir="${nutch.root}/build">
<include name="**/lib-regex-filter/*.jar" />
</fileset>
<pathelement location="${nutch.root}/build/lib-regex-filter/test"/>
</path>
<!-- Deploy Unit test dependencies -->
<target name="deps-test">
<ant target="deploy" inheritall="false" dir="../lib-regex-filter"/>
</target>

</project>


Add a plugin.xml. Note that the extension point is "org.apache.nutch.indexer.IndexingFilter", the common interface for indexing filters:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="index-type"
name="Web page type Indexing Filter"
version="1.0.0"
provider-name="repu.zju">
<runtime>
<library name="index-type.jar">
<export name="*"/>
</library>
</runtime>
<requires>
<import plugin="nutch-extensionpoints"/>
<import plugin="lib-regex-filter"/>
</requires>
<extension id="com.zju.repu.indexer.type"
name="Web page type Indexing Filter"
point="org.apache.nutch.indexer.IndexingFilter">
<implementation id="BasicIndexingFilter"
class="com.zju.repu.indexer.type.TypeIndexingFilter"/>
</extension>
</plugin>


One more important step: hook index-type into the build. Modify src/plugin/build.xml, adding an <ant dir="index-type" .../> entry inside the existing deploy and clean targets, as shown:

<target name="deploy">
<ant dir="index-type" target="deploy"/>
</target>
<target name="clean">
<ant dir="index-type" target="clean"/>
</target>


From now on, ant package will compile our index-type plugin; check the build output to confirm that index-type.jar is produced correctly.
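For example (assuming a standard Nutch 1.x source tree):

ant package
ls build/index-type/index-type.jar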

Once the build succeeds, we can try crawling some pages, but first conf/nutch-site.xml needs two changes:

Add the indexer.type.includes parameter, whose value lists the page types separated by "|", and add index-type to the value of plugin.includes so the index-type plugin is loaded automatically at startup.

<property>
  <name>indexer.type.includes</name>
  <value>news|shop|blog|disc|qa</value>
  <description>A "|"-separated list of web page type names to
  include.
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-repu|parse-(text|html|js|tika)|index-(basic|anchor|type)|query-(basic|site|url|type)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library. Nutch now also includes integration with Tika
  to leverage Tika's parsing capabilities for multiple content types. The existing Nutch
  parser implementations will likely be phased out in the next release or so, as such, it is
  a good idea to begin migrating away from anything not provided by parse-tika.
  </description>
</property>


For each page type, provide a corresponding rules file. The crawl-urltype-news.txt below contains the rules for the news type:

# accept hosts in
+^http://news.sina.com.cn/
+^http://news.baidu.com/
+^http://news.163.com/
# skip everything else
-.


The other four types are similar and are not listed here one by one. Note the file naming convention: crawl-urltype-(YOURWEBTYPE).txt, where YOURWEBTYPE corresponds one-to-one with the values in the indexer.type.includes parameter. The files go in the conf directory.

Then set the seed urls for the crawl and run nutch crawl to fetch pages from the web (a typical invocation is sketched below). Inspecting the final result with Luke, you will find the new type field; if enough pages are crawled you will see five distinct values, matching those defined in indexer.type.includes.
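For example (a sketch; the urls directory, depth, and topN values are illustrative):

bin/nutch crawl urls -dir crawl -depth 3 -topN 1000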

At this point, the indexer successfully adds a type field to the resulting index according to each page's type.