
Big Data: Configuring Nutch 1.8 + Solr 4 with the IKAnalyzer 2012 Chinese Tokenizer

2015-01-10 16:30

Nutch 2.2.1 currently performs worse than Nutch 1.7 (see "NUTCHFIGHT! 1.7 vs 2.2.1"), so for now I am still using Nutch 1.8.

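1 Download the precompiled binary package and extract it

$ wget http://psg.mtu.edu/pub/apache/nutch/1.8/apache-nutch-1.8-bin.tar.gz
$ tar zxf apache-nutch-1.8-bin.tar.gz

Alternatively, download the tar.gz package from http://mirrors.cnnic.cn/apache/ and extract it. Then move the extracted directory to your install directory; here it goes into /usr as nutch-1.8:

$ sudo mv apache-nutch-1.8 /usr/nutch-1.8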

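2 Verify the installation

$ cd /usr/nutch-1.8
$ bin/nutch

If you get "Permission denied", run the following command:

$ chmod +x bin/nutch

Seeing the Nutch usage help means the installation works.

If a warning says that JAVA_HOME is not set, set JAVA_HOME; this is a JDK environment configuration issue.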

3 Add seed URLs

In the Nutch directory:

mkdir urls
sudo gedit urls/seed.txt

Add the URLs to crawl, one per line, for example http://www.tianya.cn/

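4 Set URL filter rules

If you only want to crawl certain kinds of URLs, put regular expressions in conf/regex-urlfilter.txt; only URLs that match them will be crawled. For example, to crawl only Douban movie data, configure it like this:

# comment out this line
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# accept anything else
# comment out this line
#+.
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$

A leading + marks an accept rule, so a bare +^ would accept every URL.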

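When crawling, you need to constrain the crawl scope, and nearly every crawler does this with regular expressions. The simplest regex is:

http://www.xinhuanet.com/.*

This means "http://www.xinhuanet.com/" followed by any number of arbitrary characters (possibly zero). It constrains the crawl scope, but it does not cover every Xinhuanet page: Xinhuanet is not limited to the www.xinhuanet.com domain and has many subdomains, such as news.xinhuanet.com. To cover them we need a regex like this:

http://([a-z0-9]*\.)*xinhuanet.com/

This restricts the crawl to all Xinhuanet pages. Each crawler's regex constraint system differs a little; to compare those of Nutch and WebCollector, see:

Nutch website: http://nutch.apache.org/
WebCollector website: http://crawlscript.github.io/WebCollector/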
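To check how such rules behave before starting a crawl, the filter logic can be approximated in a few lines of Python. This is a minimal sketch, not Nutch's implementation: it assumes rules are tried in order, the first matching rule decides, a URL matching no rule is dropped, and a pattern may match anywhere in the URL. The names parse_rules and accepts are made up for illustration, and Java and Python regex syntax differ in places, so treat the result as a rough check only.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first matching rule is an accept rule (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped

# The Douban rule from the configuration above.
DOUBAN_RULES = parse_rules(r"""
# accept only Douban movie subject pages
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$
""")

# The Xinhuanet subdomain pattern, anchored and marked as an accept rule.
XINHUA_RULES = parse_rules(r"""
+^http://([a-z0-9]*\.)*xinhuanet.com/
""")

if __name__ == "__main__":
    for url in ["http://movie.douban.com/subject/25811077/",
                "http://music.douban.com/subject/25811077/"]:
        print(url, accepts(DOUBAN_RULES, url))
```

Running a handful of known URLs through the rules this way makes it easier to spot a pattern that is too strict, for example one that rejects the seed URL itself.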

5 Set the agent name

conf/nutch-site.xml:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>

For the available settings, refer to nutch-default.xml; you only need to fill in the value.
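An empty or missing http.agent.name is an easy mistake to make, so it can help to read the value back before crawling. The sketch below is one way to do that with Python's standard XML parser. It assumes the file uses the usual configuration/property/name/value layout; get_property and the SAMPLE string are hypothetical, for illustration only.

```python
import xml.etree.ElementTree as ET

def get_property(config_xml, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Stand-in for the contents of conf/nutch-site.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    agent = get_property(SAMPLE, "http.agent.name")
    if agent:
        print("http.agent.name =", agent)
    else:
        print("http.agent.name is not set")
```

To check a real installation, read conf/nutch-site.xml into a string and pass it in place of SAMPLE.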

6 Install Solr

Indexing requires Solr, so we need to install and start a Solr server. See steps 4, 5 and 6 of the Nutch Tutorial, and the Solr Tutorial.

6.1 Download and extract

$ wget http://mirrors.cnnic.cn/apache/lucene/solr/4.8.1/solr-4.8.1.tgz

The package can also be downloaded from http://apache.fayea.com/lucene/solr/

$ tar -zxvf solr-4.8.1.tgz
$ sudo mv solr-4.8.1 /usr/solr4.8.1

6.2 Run Solr

cd /usr/solr4.8.1/example
java -jar start.jar

To verify that it started, open http://localhost:8983/solr/#/ in a browser. If the page loads, Solr is running.

6.3 Integrate Nutch with Solr

The Nutch install directory is /usr/nutch-1.8 and the Solr install directory is /usr/solr4.8.1. Copy /usr/nutch-1.8/conf/schema-solr4.xml to /usr/solr4.8.1/example/solr/collection1/conf/, rename it to schema.xml, and add the following line at the end of <fields>...</fields> (for an explanation, see "Solr 4.2 - what is _version_ field?"):

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

Restart Solr:

# Ctrl+C to stop Solr
java -jar start.jar
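The schema edit can also be scripted so that it is safe to repeat. The sketch below appends the _version_ field to the fields element only when it is missing. It is an illustration under assumptions: ensure_version_field and the SAMPLE schema are hypothetical, and Python's ElementTree discards XML comments when it parses, so on a real schema.xml a manual edit keeps the file's comments intact.

```python
import xml.etree.ElementTree as ET

# Attributes of the field line shown above.
VERSION_ATTRS = {"name": "_version_", "type": "long", "indexed": "true",
                 "stored": "true", "multiValued": "false"}

def ensure_version_field(schema_xml):
    """Return schema XML with a _version_ field appended to <fields> if missing."""
    root = ET.fromstring(schema_xml)
    fields = root.find("fields")
    if fields is None:
        raise ValueError("no <fields> element found")
    names = [f.get("name") for f in fields.findall("field")]
    if "_version_" not in names:
        ET.SubElement(fields, "field", VERSION_ATTRS)
    return ET.tostring(root, encoding="unicode")

# A tiny stand-in for schema.xml (the real file is much larger).
SAMPLE = ('<schema name="nutch" version="1.5"><fields>'
          '<field name="id" type="string"/>'
          '</fields></schema>')

if __name__ == "__main__":
    print(ensure_version_field(SAMPLE))
```

Calling the function twice leaves a single _version_ field, which is the point of checking the existing names first.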

7 Crawl in one step with the crawl script

Nutch ships with a script, ./bin/crawl, that combines the individual crawl steps into a single command. Its usage:

$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

Note: use bin/crawl, not bin/nutch crawl; the latter is deprecated.

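7.1 Crawl pages

$ ./bin/crawl ~/urls/ ./TestCrawl http://localhost:8983/solr/ 2

The arguments are:

~/urls is the directory that holds the seed URLs.
TestCrawl is the root directory for the crawl data. (In Nutch 2.x it is the crawlId instead, which creates an HBase table prefixed with the crawlId, for example TestCrawl_Webpage.)
http://localhost:8983/solr/ is the Solr server.
2 is numberOfRounds, the number of iterations.

After a while a long stream of URLs scrolls by, which shows the crawler at work: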

fetching http://music.douban.com/subject/25811077/ (queue crawl delay=5000ms)
fetching http://read.douban.com/ebook/1919781 (queue crawl delay=5000ms)
fetching http://www.douban.com/online/11670861/ (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/绘本 (queue crawl delay=5000ms)
fetching http://movie.douban.com/tag/科幻 (queue crawl delay=5000ms)
49/50 spinwaiting/active, 56 pages, 0 errors, 0.9 1 pages/s, 332 245 kb/s, 131 URLs in 5 queues
fetching http://music.douban.com/subject/25762454/ (queue crawl delay=5000ms)
fetching http://read.douban.com/reader/ebook/1951242/ (queue crawl delay=5000ms)
fetching http://www.douban.com/mobile/read-notes (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/诗歌 (queue crawl delay=5000ms)
50/50 spinwaiting/active, 61 pages, 0 errors, 0.9 1 pages/s, 334 366 kb/s, 127 URLs in 5 queues

7.2 View the results

$ bin/nutch readdb TestCrawl/crawldb/ -stats
14/02/14 16:35:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: TestCrawl/crawldb/
14/02/14 16:35:47 INFO crawl.CrawlDbReader: TOTAL urls:	70
14/02/14 16:35:47 INFO crawl.CrawlDbReader: retry 0:	70
14/02/14 16:35:47 INFO crawl.CrawlDbReader: min score:	0.005
14/02/14 16:35:47 INFO crawl.CrawlDbReader: avg score:	0.03877143
14/02/14 16:35:47 INFO crawl.CrawlDbReader: max score:	1.23
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched):	59
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 2 (db_fetched):	11
14/02/14 16:35:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
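When the statistics need to be compared across rounds, it is convenient to turn this log output into numbers. The following sketch parses lines in the format shown above into a dictionary. It assumes every statistic appears as "CrawlDbReader: label: number"; parse_stats and SAMPLE are illustrative names, and other Nutch versions may format the output differently.

```python
import re

# Matches "... CrawlDbReader: <label>: <number>" at the end of a line.
STAT_LINE = re.compile(r"CrawlDbReader:\s+(.+?):\s+([0-9.]+)\s*$")

def parse_stats(output):
    """Map each statistic label to a number (int when possible, else float)."""
    stats = {}
    for line in output.splitlines():
        m = STAT_LINE.search(line)
        if m:
            label, value = m.group(1), m.group(2)
            stats[label] = float(value) if "." in value else int(value)
    return stats

# A few lines copied from the output above.
SAMPLE = """\
14/02/14 16:35:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: TestCrawl/crawldb/
14/02/14 16:35:47 INFO crawl.CrawlDbReader: TOTAL urls: 70
14/02/14 16:35:47 INFO crawl.CrawlDbReader: min score: 0.005
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 59
14/02/14 16:35:47 INFO crawl.CrawlDbReader: status 2 (db_fetched): 11
14/02/14 16:35:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
"""

if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    fetched = stats["status 2 (db_fetched)"]
    print("fetched", fetched, "of", stats["TOTAL urls"], "known URLs")
```

In the run above, 11 of the 70 known URLs were fetched after two rounds; the remaining 59 are still waiting in the crawldb.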

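8 Crawl step by step with individual commands

The previous section used a single command for simplicity. This section follows the crawl steps strictly, one at a time, to show what the crawler actually does. Interested readers can also read the bin/crawl script, where each step is clearly visible. First delete the data produced in section 7: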

$ rm -rf TestCrawl/

8.1 Basic concepts

Nutch data is composed of:

The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.

The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.

A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:

    a crawl_generate names a set of URLs to be fetched
    a crawl_fetch contains the status of fetching each URL
    a content contains the raw content retrieved from each URL
    a parse_text contains the parsed text of each URL
    a parse_data contains outlinks and metadata parsed from each URL
    a crawl_parse contains the outlink URLs, used to update the crawldb
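Because each step writes its own subdirectory, listing which ones exist shows how far a segment has progressed. The sketch below checks a segment directory against the six names listed above. missing_parts is a hypothetical helper, not part of Nutch, and the demo builds a fake segment in a temporary directory instead of touching real crawl data.

```python
import os
import tempfile

# Subdirectory names taken from the list above.
SEGMENT_PARTS = ["crawl_generate", "crawl_fetch", "content",
                 "parse_text", "parse_data", "crawl_parse"]

def missing_parts(segment_dir):
    """List the expected subdirectories that do not exist yet in a segment."""
    return [p for p in SEGMENT_PARTS
            if not os.path.isdir(os.path.join(segment_dir, p))]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as segment:
        # Simulate a segment right after the generate step.
        os.mkdir(os.path.join(segment, "crawl_generate"))
        print("still missing:", missing_parts(segment))
```

A segment that has only been generated lacks everything except crawl_generate; after fetch and parse the list should be empty.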

8.2 inject: build the crawldb from the seed URL list

$ bin/nutch inject TestCrawl/crawldb ~/urls

This builds a URL database from the seed URLs under ~/urls and stores it in the crawldb directory.

8.3 generate

$ bin/nutch generate TestCrawl/crawldb TestCrawl/segments

This generates a fetch list and stores it in a timestamped directory under segments/. Save that directory's name in the shell variable s1:

$ s1=`ls -d TestCrawl/segments/2* | tail -1`
$ echo $s1
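The ls and tail pipeline works because segment directories are named by timestamp, so the name that sorts last is the newest. The same selection can be sketched in Python as follows; latest_segment is an illustrative name, and the sketch assumes every directory under segments/ is a segment with a timestamp-style name such as 20140203004348.

```python
import os
import tempfile

def latest_segment(segments_dir):
    """Return the path of the newest segment, relying on timestamp-style names."""
    names = sorted(n for n in os.listdir(segments_dir)
                   if os.path.isdir(os.path.join(segments_dir, n)))
    if not names:
        raise FileNotFoundError("no segments under " + segments_dir)
    # Timestamp names sort chronologically, so the last one is the newest.
    return os.path.join(segments_dir, names[-1])

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as segments:
        # Fake segment directories for demonstration.
        for name in ["20140203004348", "20140203010101"]:
            os.mkdir(os.path.join(segments, name))
        print(os.path.basename(latest_segment(segments)))
```

Raising an error when no segment exists mirrors a failure mode of the shell version: if generate produced nothing, s1 ends up empty and the later commands fail in confusing ways.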

8.4 fetch

$ bin/nutch fetch $s1

This creates two subdirectories under $s1: crawl_fetch and content.

8.5 parse

$ bin/nutch parse $s1

This creates three subdirectories under $s1: crawl_parse, parse_data and parse_text.

8.6 updatedb

$ bin/nutch updatedb TestCrawl/crawldb $s1

This renames crawldb/current to crawldb/old and generates a new crawldb/current.

8.7 View the results

$ bin/nutch readdb TestCrawl/crawldb/ -stats
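Steps 8.3 to 8.6 make up one crawl round, and repeating them lets the crawler fetch the URLs discovered in the previous round. The sketch below strings the four commands together in that order. It is only an outline under assumptions: crawl_round is a hypothetical function, the caller supplies both the function that executes a command and the one that finds the new segment, and the bundled bin/crawl script does more than this.

```python
def crawl_round(crawl_dir, run, pick_segment):
    """Issue generate, fetch, parse and updatedb for one round.

    run(argv) executes one command; pick_segment(segments_dir) returns the
    segment that generate just created. Both are supplied by the caller.
    """
    crawldb = crawl_dir + "/crawldb"
    segments = crawl_dir + "/segments"
    run(["bin/nutch", "generate", crawldb, segments])
    segment = pick_segment(segments)
    run(["bin/nutch", "fetch", segment])
    run(["bin/nutch", "parse", segment])
    run(["bin/nutch", "updatedb", crawldb, segment])
    return segment

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def show(argv):
        print("$", " ".join(argv))

    # The segment name here is a placeholder, not a real directory.
    crawl_round("TestCrawl", show, lambda d: d + "/20140203004348")
```

For a real run, run could wrap subprocess.run with check=True so that a failing step stops the round instead of passing a broken segment to the next command.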

8.8 invertlinks

Before building the index, we first invert all the links so that we have every anchor text pointing at a page and can index those anchor texts.

$ bin/nutch invertlinks TestCrawl/linkdb -dir TestCrawl/segments

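8.9 solrindex: submit the data to Solr and build the index

$ bin/nutch solrindex http://localhost:8983/solr TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/20140203004348/ -filter -normalize

Replace 20140203004348 with the name of your own segment directory (the value of $s1 from step 8.3).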

8.10 solrdedup: deduplicate the index

Sometimes data gets added more than once and the index ends up with duplicates, which we need to remove:

$ bin/nutch solrdedup http://localhost:8983/solr

8.11 solrclean: delete from the index

Stale data can also be removed from the index:

$ bin/nutch solrclean TestCrawl/crawldb/ http://localhost:8983/solr

9 Integrate Solr with Tomcat

9.1 Download the Tomcat package

Download it from http://tomcat.apache.org/download-70.cgi

$ tar -zxvf apache-tomcat-7.0.57.tar.gz
$ sudo mv apache-tomcat-7.0.57 /usr/tomcat

My install directory is /usr/tomcat.

9.2 Integrate Solr with Tomcat

Assume $SOLR_HOME is /usr/tomcat/solr.

Step 1: copy solr-4.8.1.war from solr-4.8.1/dist into the webapps directory (/usr/tomcat/webapps) and rename it to solr.war.

Step 2: copy solr-4.8.1/example/solr into /usr/tomcat.

Step 3: create a new solr.xml under tomcat/conf/Catalina/localhost.

Step 4: copy all the jars from solr-4.8.1/example/lib/ext into tomcat/lib, and copy solr-4.8.1/example/resources/log4j.properties into tomcat/lib as well (for logging details see http://wiki.apache.org/solr/SolrLogging). Note that solr-4.8.1.war does not bundle a logging implementation, so skipping this step may cause the exception "org.apache.catalina.core.StandardContext filterStart SEVERE: Exception starting filter SolrRequestFilter org.apache.solr.common.SolrException: Could not find necessary SLF4j logging jars."

Step 5: open solrconfig.xml in /tomcat/solr/collection1/conf/ and make two changes. First, comment out the following part of the file; this simple project does not need these settings:

<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />

Second, configure a directory for the index data; here it is set to /tomcat/solrindex. Create the solrindex directory if it does not exist.

<!-- before -->
<dataDir>${solr.data.dir:}</dataDir>
<!-- after -->
<dataDir>${solr.data.dir:/tomcat/solrindex}</dataDir>
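
The dataDir value uses a placeholder of the form ${name:default}: the part after the colon is used when the property solr.data.dir is not set. The sketch below imitates that substitution to show what each variant resolves to. It is an illustration of the idea, not Solr's code; resolve and PLACEHOLDER are made-up names, and the properties dictionary stands in for properties set outside the file.

```python
import re

# Matches ${name} or ${name:default}.
PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")

def resolve(text, properties):
    """Replace ${name:default} with properties[name], else the default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in properties:
            return properties[name]
        if default is None:
            raise KeyError("no value or default for " + name)
        return default
    return PLACEHOLDER.sub(substitute, text)

if __name__ == "__main__":
    after = "${solr.data.dir:/tomcat/solrindex}"
    # Property not set: the default after the colon is used.
    print(resolve(after, {}))
    # Property set: it takes precedence over the default.
    print(resolve(after, {"solr.data.dir": "/data/index"}))
```

With the unmodified line the default is empty, which is why the index location has to be supplied in some other way until the default path is filled in.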

Step 6: edit web.xml under /usr/tomcat/webapps/solr/WEB-INF. The correct configuration is:

<env-entry>
<env-entry-name>solr/home</env-entry-name>
<!-- solr under tomcat -->
<env-entry-value>/usr/tomcat/solr</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>

10 Configure IK

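a. Download IK Analyzer 2012 from http://code.google.com/p/ik-analyzer/downloads/list. This example uses IK Analyzer 2012-FF hotfix 1, which works with Solr 4.0; other versions may be incompatible.

b. After downloading, unzip the archive and copy the jar file to /usr/solr/example/solr-webapp/webapp/WEB-INF/lib. Under /usr/solr/example/solr-webapp/webapp/WEB-INF/ create a new directory named classes and copy stopword.dic and IKAnalyzer.cfg.xml into it. Additional extension dictionaries can be configured in that XML file.

c. Edit schema.xml, located at /usr/solr/example/solr/collection1/conf/schema.xml, and add a new entry among the existing fieldType definitions. To have the name field tokenized at index time, search schema.xml for the name field and change its type to the fieldType just defined, text_ik. The name field will then be tokenized by IK during indexing.

d. Restart.

e. Check the results.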
