
Implementing Web Page Snapshots with Nutch 1.3 and Solr (Part 1)

2011-11-23 13:33

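Starting with version 1.3, Nutch delegates indexing to Solr, which brings considerable improvements in indexing efficiency and clustering. Compared with Nutch 1.2, however, the Solr-based setup lacks the web page snapshot feature: after integrating the two as described in the official manual, each query result contains only the parsed body text of the HTML page, as shown in the figure below: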

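This is a serious inconvenience for anyone who needs snapshots of the original pages. Nutch 1.3 therefore has to be modified so that snapshots remain available after it is integrated with Solr.
Consider how Nutch 1.2 handled this: its built-in indexer effectively indexed the whole page. In 1.3, Nutch strips the HTML markup before calling the Solr service (the internal mechanism is not discussed here), so Solr receives only the "body text" of the page, that is, what appears in the Content tag in the figure above. The core of our task is to hand the cached copy of the whole page to Solr as well, and to return it as part of the result when Solr is queried.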
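To make the difference concrete, here is a minimal sketch in plain Java, not Nutch or Solr code. The class SnapshotDocSketch, its methods, and the field name "cache" are hypothetical names chosen for illustration, and the regex tag stripper is only a crude stand-in for Nutch's real parser. It shows a document that carries only the stripped text in "content", and the same document once the raw page is kept in an extra field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SnapshotDocSketch {

    // Crude tag stripper, for illustration only; Nutch's real parser is far more involved.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    // Builds the fields for one page. "cache" is a hypothetical field name
    // standing in for wherever the raw page would be stored.
    static Map<String, String> buildDoc(String url, String rawHtml, boolean keepSnapshot) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("url", url);
        doc.put("content", stripTags(rawHtml)); // what the search result carries today
        if (keepSnapshot) {
            doc.put("cache", rawHtml);          // the extra data a snapshot view needs
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello</p></body></html>";
        System.out.println(buildDoc("http://example.com/", html, false));
        System.out.println(buildDoc("http://example.com/", html, true));
    }
}
```

Without the second field the original markup cannot be recovered from the stripped text, which is why the raw page has to travel to Solr alongside it.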
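First, set up a Nutch 1.3 development environment. Download link: http://www.apache.org/dist//nutch/. Building the project is tedious, so you can also download the project I have already built: http://download.csdn.net/detail/Nightbreeze/3667744. JDK 1.6 is required.
In the project, locate the indexSolr method of the SolrIndexer class, shown here: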

public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  LOG.info("SolrIndexer: starting at " + sdf.format(start));

  final JobConf job = new NutchJob(getConf());
  job.setJobName("index-solr " + solrUrl);

  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

  job.set(SolrConstants.SERVER_URL, solrUrl);

  NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);

  job.setReduceSpeculativeExecution(false);

  final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
                       new Random().nextInt());

  FileOutputFormat.setOutputPath(job, tmp);
  try {
    JobClient.runJob(job);
    // do the commits once and for all the reducers in one go
    SolrServer solr = new CommonsHttpSolrServer(solrUrl);
    solr.commit();
    long end = System.currentTimeMillis();
    LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
  }
  catch (Exception e){
    LOG.error(e);
  } finally {
    FileSystem.get(job).delete(tmp, true);
  }
}

Nutch uses Hadoop's distributed computing mechanism here. Let's step into the IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job) method:

public static void initMRJob(Path crawlDb, Path linkDb,
                         Collection<Path> segments,
                         JobConf job) {

  LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
  LOG.info("IndexerMapReduce: linkdb: " + linkDb);

  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
  }

  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}

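As the FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME)) and FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME)) calls show, the only page content the job reads from each segment directory is what lies under "parse_data" and "parse_text". The raw page is not among the inputs.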
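The per-segment input selection can be pictured with a small stand-alone sketch. It is plain Java and does not use the Hadoop or Nutch APIs. The class SegmentInputs, the method inputDirs, and the includeRawContent flag are hypothetical. The directory names mirror the values of the Nutch constants used in initMRJob, and "content" is the segment subdirectory where Nutch stores fetched raw pages. Adding it as an input is an assumption about the direction of the fix, not something this part of the article confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SegmentInputs {

    // Subdirectories that initMRJob adds for every segment.
    static final List<String> INDEXED_DIRS =
        Arrays.asList("crawl_fetch", "crawl_parse", "parse_data", "parse_text");

    // Segment subdirectory holding the raw fetched pages; not read by initMRJob.
    static final String RAW_CONTENT_DIR = "content";

    // Returns the input directories for the given segments.
    // includeRawContent is a hypothetical switch showing what a snapshot-aware job would add.
    static List<String> inputDirs(List<String> segments, boolean includeRawContent) {
        List<String> inputs = new ArrayList<String>();
        for (String segment : segments) {
            for (String dir : INDEXED_DIRS) {
                inputs.add(segment + "/" + dir);
            }
            if (includeRawContent) {
                inputs.add(segment + "/" + RAW_CONTENT_DIR);
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        // The segment path is a made-up example.
        List<String> segments = Arrays.asList("crawl/segments/20111123133300");
        for (String dir : inputDirs(segments, true)) {
            System.out.println(dir);
        }
    }
}
```

With the flag off, the list matches what the stock job reads. With it on, each segment contributes one more directory, which is the data a snapshot feature would need.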
