
Nutch 1.0 Source Code Analysis: The Crawl Process

2009-11-19 17:48
This is a brief analysis of Nutch's crawl process; the MapReduce machinery involved is not discussed here. Time was short, so many parts lack detail and will be revised later. Work now requires an analysis of Nutch's configuration files, which will be published once organized. Please credit the source when reposting.

1.1 Crawl Directory Layout

Five directories are generated in total:

- crawldb: stores the downloaded URLs along with their download dates, used to decide when a page is due for an update check.
- linkdb: stores the link relationships between URLs, derived by analysis after downloading completes.
- segments: stores the fetched pages. The number of subdirectories depends on the crawl depth; normally each depth level gets its own subdirectory, named by timestamp for easy management. Since only one level was crawled here, only the directory 20090508173137 was generated. Each segment directory in turn contains six subfolders:
  - content: the content of each downloaded page.
  - crawl_fetch: the status of each downloaded URL.
  - crawl_generate: the set of URLs awaiting download.
  - crawl_parse: the outlink data used to update the crawldb.
  - parse_data: the outlinks and metadata parsed from each URL.
  - parse_text: the parsed text content of each URL.
- indexes: stores the independent index produced by each download round.
- index: a Lucene-format index directory, the complete index obtained by merging all the indexes under indexes.
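The timestamped segment names above follow the pattern yyyyMMddHHmmss. A minimal sketch of how such a name can be generated (the method name generateSegmentName is borrowed from Nutch's Generator; this standalone version is only an illustration):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class SegmentName {
    // Mirrors the naming convention seen above (e.g. 20090508173137):
    // one timestamped directory per generated segment.
    static String generateSegmentName(Date when) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(when);
    }

    public static void main(String[] args) {
        // Epoch time for a reproducible example; the exact digits depend on the JVM time zone.
        System.out.println(generateSegmentName(new Date(0L)));
    }
}
```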

1.2 Overview of the Crawl Process

Nine classes are primarily involved:

1. nutch.crawl.Inject: injector that adds URLs to the crawl database
2. nutch.crawl.Generator: generator that builds the list of pending download tasks (the fetchlist)
3. nutch.fetcher.Fetcher: fetcher that downloads the selected pages
4. nutch.parse.ParseSegment: parser that extracts page content and the outlinks for the next level of URLs
5. nutch.crawl.CrawlDb: tool that manages the crawl database
6. nutch.crawl.LinkDb: tool that manages the link database
7. nutch.indexer.Indexer: indexer that builds the index
8. nutch.indexer.DeleteDuplicates: removes duplicate documents
9. nutch.indexer.IndexMerger: merges the local index of the current download round with the historical indexes
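These nine tools run in a fixed order when a crawl is launched. The outline below is a self-contained simulation of that control flow, with stand-in methods instead of the real Nutch calls, just to make the sequence concrete:

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlOutline {
    static final List<String> steps = new ArrayList<>();

    // Stand-ins for the nine tools listed above; each just records its invocation.
    static void inject()   { steps.add("inject"); }
    static void generate() { steps.add("generate"); }
    static void fetch()    { steps.add("fetch"); }
    static void parse()    { steps.add("parse"); }
    static void update()   { steps.add("updatedb"); }
    static void invert()   { steps.add("invertlinks"); }
    static void index()    { steps.add("index"); }
    static void dedup()    { steps.add("dedup"); }
    static void merge()    { steps.add("merge"); }

    public static void main(String[] args) {
        int depth = 2; // one generate/fetch/parse/update round per crawl level
        inject();
        for (int i = 0; i < depth; i++) {
            generate(); fetch(); parse(); update();
        }
        invert();  // build linkdb
        index();   // per-segment indexes under indexes/
        dedup();   // DeleteDuplicates
        merge();   // IndexMerger -> index/
        System.out.println(steps);
    }
}
```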

1.3 Crawl Process Analysis

1.3.1 The inject Method

Description: initializes the crawldb, reads the seed URL files, and injects their contents into the crawl database.

Nutch first looks for the urls directory holding the seed URL files; under Nutch 1.0, a missing urls directory causes an error. It then determines Hadoop's temporary working directory:

/tmp/hadoop-Administrator/mapred/

The log reads:

2009-05-08 15:41:36,640 INFO Injector - Injector: starting
2009-05-08 15:41:37,031 INFO Injector - Injector: crawlDb: 20090508/crawldb
2009-05-08 15:41:37,781 INFO Injector - Injector: urlDir: urls
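During this first job the map phase pushes every seed URL through the configured URLNormalizers and URLFilters (hence the later RegexURLNormalizer message for scope 'inject'). A deliberately simplified stand-in for that check, where a null return drops the URL as in Nutch's URLFilter contract; the trimming and prefix rules here are hypothetical:

```java
public class InjectFilter {
    // Hypothetical normalizer: just trim surrounding whitespace.
    static String normalize(String url) {
        return url == null ? null : url.trim();
    }

    // Hypothetical filter: keep only http URLs; null means "drop",
    // matching the contract of Nutch's URLFilter.filter().
    static String filter(String url) {
        return (url != null && url.startsWith("http://")) ? url : null;
    }

    public static void main(String[] args) {
        System.out.println(filter(normalize("  http://www.163.com/ "))); // kept
        System.out.println(filter(normalize("ftp://example.org/")));     // null: dropped
    }
}
```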

After some initialization, JobClient.runJob from the Hadoop package is called. Stepping into JobClient's submitJob method shows how the whole job is submitted; the details belong to the Hadoop project and its MapReduce architecture, which is not analyzed here.

Inside submitJob, a job ID is obtained first. After configureCommandLineOptions runs, a system folder appears under the temporary directory above, containing a job_local_0001 folder. writeSplitsFile then creates job.split under job_local_0001, writeXml writes job.xml, and finally jobSubmitClient.submitJob formally submits the job. The log follows:

2009-05-08 15:41:36,640 INFO Injector - Injector: starting
2009-05-08 15:41:37,031 INFO Injector - Injector: crawlDb: 20090508/crawldb
2009-05-08 15:41:37,781 INFO Injector - Injector: urlDir: urls
2009-05-08 15:52:41,734 INFO Injector - Injector: Converting injected urls to crawl db entries.
2009-05-08 15:56:22,203 INFO JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2009-05-08 16:08:20,796 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 16:08:20,984 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 16:24:42,593 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 16:38:29,437 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 16:38:29,546 INFO MapTask - numReduceTasks: 1
2009-05-08 16:38:29,562 INFO MapTask - io.sort.mb = 100
2009-05-08 16:38:29,687 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 16:38:29,687 INFO MapTask - record buffer = 262144/327680
2009-05-08 16:38:29,718 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
2009-05-08 16:38:29,921 INFO PluginRepository - Plugin Auto-activation mode: [true]
2009-05-08 16:38:29,921 INFO PluginRepository - Registered Plugins:
2009-05-08 16:38:29,921 INFO PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-05-08 16:38:29,921 INFO PluginRepository -         Basic Query Filter (query-basic)
2009-05-08 16:38:29,921 INFO PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-05-08 16:38:29,921 INFO PluginRepository -         Basic Indexing Filter (index-basic)
2009-05-08 16:38:29,921 INFO PluginRepository -         Html Parse Plug-in (parse-html)
2009-05-08 16:38:29,921 INFO PluginRepository -         Site Query Filter (query-site)
2009-05-08 16:38:29,921 INFO PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-05-08 16:38:29,921 INFO PluginRepository -         HTTP Framework (lib-http)
2009-05-08 16:38:29,921 INFO PluginRepository -         Text Parse Plug-in (parse-text)
2009-05-08 16:38:29,921 INFO PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-05-08 16:38:29,921 INFO PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-05-08 16:38:29,921 INFO PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-05-08 16:38:29,921 INFO PluginRepository -         XML Response Writer Plug-in (response-xml)
2009-05-08 16:38:29,921 INFO PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-05-08 16:38:29,921 INFO PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-05-08 16:38:29,921 INFO PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-05-08 16:38:29,921 INFO PluginRepository -         Anchor Indexing Filter (index-anchor)
2009-05-08 16:38:29,921 INFO PluginRepository -         JavaScript Parser (parse-js)
2009-05-08 16:38:29,921 INFO PluginRepository -         URL Query Filter (query-url)
2009-05-08 16:38:29,921 INFO PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2009-05-08 16:38:29,921 INFO PluginRepository -         JSON Response Writer Plug-in (response-json)
2009-05-08 16:38:29,921 INFO PluginRepository - Registered Extension-Points:
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-05-08 16:38:29,921 INFO PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-05-08 16:38:29,921 INFO PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-05-08 16:38:29,968 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 16:38:29,984 WARN RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-05-08 16:38:29,984 INFO MapTask - Starting flush of map output
2009-05-08 16:38:30,203 INFO MapTask - Finished spill 0
2009-05-08 16:38:30,203 INFO TaskRunner - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2009-05-08 16:38:30,218 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/urls/site.txt:0+19
2009-05-08 16:38:30,218 INFO TaskRunner - Task 'attempt_local_0001_m_000000_0' done.
2009-05-08 16:38:30,234 INFO LocalJobRunner -
2009-05-08 16:38:30,250 INFO Merger - Merging 1 sorted segments
2009-05-08 16:38:30,265 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 16:38:30,265 INFO LocalJobRunner -
2009-05-08 16:38:30,390 INFO TaskRunner - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2009-05-08 16:38:30,390 INFO LocalJobRunner -
2009-05-08 16:38:30,390 INFO TaskRunner - Task attempt_local_0001_r_000000_0 is allowed to commit now
2009-05-08 16:38:30,406 INFO FileOutputCommitter - Saved output of task 'attempt_local_0001_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304
2009-05-08 16:38:30,406 INFO LocalJobRunner - reduce > reduce
2009-05-08 16:38:30,406 INFO TaskRunner - Task 'attempt_local_0001_r_000000_0' done.

When the call completes, the returned running job handle reports:

Job: job_local_0001
file: file:/tmp/hadoop-Administrator/mapred/system/job_local_0001/job.xml
tracking URL: http://localhost:8080/

2009-05-08 16:47:14,093 INFO JobClient - Running job: job_local_0001
2009-05-08 16:49:51,859 INFO JobClient - Job complete: job_local_0001
2009-05-08 16:51:36,062 INFO JobClient - Counters: 11
2009-05-08 16:51:36,062 INFO JobClient -   File Systems
2009-05-08 16:51:36,062 INFO JobClient -     Local bytes read=51591
2009-05-08 16:51:36,062 INFO JobClient -     Local bytes written=104337
2009-05-08 16:51:36,062 INFO JobClient -   Map-Reduce Framework
2009-05-08 16:51:36,062 INFO JobClient -     Reduce input groups=1
2009-05-08 16:51:36,062 INFO JobClient -     Combine output records=0
2009-05-08 16:51:36,062 INFO JobClient -     Map input records=1
2009-05-08 16:51:36,062 INFO JobClient -     Reduce output records=1
2009-05-08 16:51:36,062 INFO JobClient -     Map output bytes=49
2009-05-08 16:51:36,062 INFO JobClient -     Map input bytes=19
2009-05-08 16:51:36,062 INFO JobClient -     Combine input records=0
2009-05-08 16:51:36,062 INFO JobClient -     Map output records=1
2009-05-08 16:51:36,062 INFO JobClient -     Reduce input records=1

This concludes the first runJob call. (Summary: to be written.)

Next, the crawldb directory is generated and the injected urls are merged into it:

JobClient.runJob(mergeJob);

CrawlDb.install(mergeJob, crawlDb);

This step first creates a job_local_0002 directory under the temporary folder mentioned earlier, again producing job.split and job.xml; it then builds the crawldb and finally deletes the files under the temp directory.
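The merge job's reduce step combines the freshly injected entries with whatever is already in the crawldb; an existing entry for a URL should win over a newly injected one, so re-injecting does not reset fetch state. A minimal map-based sketch of that semantics (the status strings mirror Nutch's db_unfetched/db_fetched, but this is not the real CrawlDbReducer):

```java
import java.util.HashMap;
import java.util.Map;

public class InjectMerge {
    // Existing crawldb entries win over newly injected ones, so re-injecting
    // a URL does not reset its fetch state.
    static Map<String, String> merge(Map<String, String> existing, Map<String, String> injected) {
        Map<String, String> merged = new HashMap<>(existing);
        for (Map.Entry<String, String> e : injected.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://www.163.com/", "db_fetched");   // already fetched earlier
        Map<String, String> injected = new HashMap<>();
        injected.put("http://www.163.com/", "db_unfetched"); // re-injected seed
        injected.put("http://example.org/", "db_unfetched"); // brand-new seed
        System.out.println(merge(crawlDb, injected));
    }
}
```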

This completes the inject phase. The tail of the log:

2009-05-08 17:03:57,250 INFO Injector - Injector: Merging injected urls into crawl db.
2009-05-08 17:10:01,015 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:10:15,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 17:10:16,156 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 17:12:15,296 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:13:40,296 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:13:40,406 INFO MapTask - numReduceTasks: 1
2009-05-08 17:13:40,406 INFO MapTask - io.sort.mb = 100
2009-05-08 17:13:40,515 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 17:13:40,515 INFO MapTask - record buffer = 262144/327680
2009-05-08 17:13:40,546 INFO MapTask - Starting flush of map output
2009-05-08 17:13:40,765 INFO MapTask - Finished spill 0
2009-05-08 17:13:40,765 INFO TaskRunner - Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
2009-05-08 17:13:40,765 INFO LocalJobRunner - file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304/part-00000:0+143
2009-05-08 17:13:40,765 INFO TaskRunner - Task 'attempt_local_0002_m_000000_0' done.
2009-05-08 17:13:40,796 INFO LocalJobRunner -
2009-05-08 17:13:40,796 INFO Merger - Merging 1 sorted segments
2009-05-08 17:13:40,796 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 17:13:40,796 INFO LocalJobRunner -
2009-05-08 17:13:40,906 WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2009-05-08 17:13:40,906 INFO CodecPool - Got brand-new compressor
2009-05-08 17:13:40,906 INFO TaskRunner - Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
2009-05-08 17:13:40,906 INFO LocalJobRunner -
2009-05-08 17:13:40,906 INFO TaskRunner - Task attempt_local_0002_r_000000_0 is allowed to commit now
2009-05-08 17:13:40,921 INFO FileOutputCommitter - Saved output of task 'attempt_local_0002_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/1896567745
2009-05-08 17:13:40,921 INFO LocalJobRunner - reduce > reduce
2009-05-08 17:13:40,937 INFO TaskRunner - Task 'attempt_local_0002_r_000000_0' done.
2009-05-08 17:13:46,781 INFO JobClient - Running job: job_local_0002
2009-05-08 17:14:55,125 INFO JobClient - Job complete: job_local_0002
2009-05-08 17:14:59,328 INFO JobClient - Counters: 11
2009-05-08 17:14:59,328 INFO JobClient -   File Systems
2009-05-08 17:14:59,328 INFO JobClient -     Local bytes read=103875
2009-05-08 17:14:59,328 INFO JobClient -     Local bytes written=209385
2009-05-08 17:14:59,328 INFO JobClient -   Map-Reduce Framework
2009-05-08 17:14:59,328 INFO JobClient -     Reduce input groups=1
2009-05-08 17:14:59,328 INFO JobClient -     Combine output records=0
2009-05-08 17:14:59,328 INFO JobClient -     Map input records=1
2009-05-08 17:14:59,328 INFO JobClient -     Reduce output records=1
2009-05-08 17:14:59,328 INFO JobClient -     Map output bytes=49
2009-05-08 17:14:59,328 INFO JobClient -     Map input bytes=57
2009-05-08 17:14:59,328 INFO JobClient -     Combine input records=0
2009-05-08 17:14:59,328 INFO JobClient -     Map output records=1
2009-05-08 17:14:59,328 INFO JobClient -     Reduce input records=1
2009-05-08 17:17:30,984 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:20:02,390 INFO Injector - Injector: done

1.3.2 The generate Method

Description: creates a new segment from the crawl database, then generates the list of pending download tasks (the fetchlist) from it.

LockUtil.createLockFile(fs, lock, force);

Executing the line above creates a .locked file under the crawldb directory; presumably it guards the crawldb data against concurrent modification, though its exact role remains to be verified.
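At its core, such a lock is just an exclusively created marker file. A hypothetical standalone equivalent using java.io.File (the real LockUtil goes through Hadoop's FileSystem API and honours a force flag):

```java
import java.io.File;
import java.io.IOException;

public class CrawlDbLock {
    // Create dir/.locked exclusively; returns false if someone already holds the lock.
    static boolean tryLock(File dir) {
        try {
            return new File(dir, ".locked").createNewFile();
        } catch (IOException e) {
            return false;
        }
    }

    static void unlock(File dir) {
        new File(dir, ".locked").delete();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "crawldb-demo");
        dir.mkdirs();
        unlock(dir);                      // start from a clean state
        System.out.println(tryLock(dir)); // lock acquired
        System.out.println(tryLock(dir)); // second attempt fails: already locked
        unlock(dir);
    }
}
```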

The subsequent steps closely mirror the inject phase; the log follows:

2009-05-08 17:37:18,218 INFO Generator - Generator: Selecting best-scoring urls due for fetch.
2009-05-08 17:37:18,625 INFO Generator - Generator: starting
2009-05-08 17:37:18,937 INFO Generator - Generator: segment: 20090508/segments/20090508173137
2009-05-08 17:37:19,468 INFO Generator - Generator: filtering: true
2009-05-08 17:37:22,312 INFO Generator - Generator: topN: 50
2009-05-08 17:37:51,203 INFO Generator - Generator: jobtracker is 'local', generating exactly one partition.
2009-05-08 17:39:57,609 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:40:05,234 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 17:40:05,406 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 17:40:05,437 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:40:06,062 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:40:06,109 INFO MapTask - numReduceTasks: 1

[plugin-loading log omitted]

2009-05-08 17:40:06,312 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 17:40:06,343 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-08 17:40:06,343 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-08 17:40:06,343 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-08 17:40:06,343 INFO MapTask - io.sort.mb = 100
2009-05-08 17:40:06,437 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 17:40:06,437 INFO MapTask - record buffer = 262144/327680
2009-05-08 17:40:06,453 WARN RegexURLNormalizer - can't find rules for scope 'partition', using default
2009-05-08 17:40:06,453 INFO MapTask - Starting flush of map output
2009-05-08 17:40:06,625 INFO MapTask - Finished spill 0
2009-05-08 17:40:06,640 INFO TaskRunner - Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
2009-05-08 17:40:06,640 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143
2009-05-08 17:40:06,640 INFO TaskRunner - Task 'attempt_local_0003_m_000000_0' done.
2009-05-08 17:40:06,656 INFO LocalJobRunner -
2009-05-08 17:40:06,656 INFO Merger - Merging 1 sorted segments
2009-05-08 17:40:06,656 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 78 bytes
2009-05-08 17:40:06,656 INFO LocalJobRunner -

[plugin-loading log omitted]

2009-05-08 17:40:06,875 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 17:40:06,906 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-08 17:40:06,906 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-08 17:40:06,906 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-08 17:40:06,906 WARN RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2009-05-08 17:40:06,906 INFO TaskRunner - Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
2009-05-08 17:40:06,906 INFO LocalJobRunner -
2009-05-08 17:40:06,906 INFO TaskRunner - Task attempt_local_0003_r_000000_0 is allowed to commit now
2009-05-08 17:40:06,906 INFO FileOutputCommitter - Saved output of task 'attempt_local_0003_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/generate-temp-1241774893937
2009-05-08 17:40:06,921 INFO LocalJobRunner - reduce > reduce
2009-05-08 17:40:06,921 INFO TaskRunner - Task 'attempt_local_0003_r_000000_0' done.
2009-05-08 17:40:21,468 INFO JobClient - Running job: job_local_0003
2009-05-08 17:40:31,671 INFO JobClient - Job complete: job_local_0003
2009-05-08 17:40:34,046 INFO JobClient - Counters: 11
2009-05-08 17:40:34,046 INFO JobClient -   File Systems
2009-05-08 17:40:34,046 INFO JobClient -     Local bytes read=157400
2009-05-08 17:40:34,046 INFO JobClient -     Local bytes written=316982
2009-05-08 17:40:34,046 INFO JobClient -   Map-Reduce Framework
2009-05-08 17:40:34,046 INFO JobClient -     Reduce input groups=1
2009-05-08 17:40:34,046 INFO JobClient -     Combine output records=0
2009-05-08 17:40:34,046 INFO JobClient -     Map input records=1
2009-05-08 17:40:34,046 INFO JobClient -     Reduce output records=1
2009-05-08 17:40:34,046 INFO JobClient -     Map output bytes=74
2009-05-08 17:40:34,046 INFO JobClient -     Map input bytes=57
2009-05-08 17:40:34,046 INFO JobClient -     Combine input records=0
2009-05-08 17:40:34,046 INFO JobClient -     Map output records=1
2009-05-08 17:40:34,046 INFO JobClient -     Reduce input records=1

submitJob is then called again to submit the whole generate job, producing the segments directory and cleaning up the temporary and lock files. At this point the segment contains only the crawl_generate folder.
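The "Selecting best-scoring urls due for fetch" step with topN: 50 amounts to ranking the due URLs by score and keeping the best N. A self-contained sketch of that cut, with hard-coded scores for illustration:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopNSelect {
    // Rank candidate URLs by score, descending, and keep the best n,
    // in the spirit of Generator's topN cut.
    static List<String> selectTopN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("http://a/", 1.0);
        scores.put("http://b/", 0.5);
        scores.put("http://c/", 2.0);
        System.out.println(selectTopN(scores, 2)); // [http://c/, http://a/]
    }
}
```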

1.3.3 The fetch Method

Description: performs the actual page downloads.
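The fetch log shows the shape of Fetcher's threading model: a QueueFeeder fills a shared queue, a fixed number of FetcherThreads drain it, and an activeThreads counter tracks shutdown. A stripped-down simulation of just that coordination, with no actual HTTP:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MiniFetcher {
    // Drain a shared queue with N worker threads, counting fetched pages.
    static int run(List<String> urls, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls); // QueueFeeder's job
        AtomicInteger fetched = new AtomicInteger();
        AtomicInteger activeThreads = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                activeThreads.incrementAndGet();
                while (queue.poll() != null) {
                    fetched.incrementAndGet(); // the real Fetcher performs the HTTP GET here
                }
                System.out.println("-finishing thread FetcherThread, activeThreads="
                        + activeThreads.decrementAndGet());
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        System.out.println("pages fetched: "
                + run(java.util.Arrays.asList("http://www.163.com/"), 5));
    }
}
```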

2009-05-11 09:45:13,984 WARN Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2009-05-11 09:45:34,796 INFO Fetcher - Fetcher: starting
2009-05-11 09:45:35,375 INFO Fetcher - Fetcher: segment: 20090508/segments/20090511094102
2009-05-11 09:49:23,984 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 09:49:58,046 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 09:49:58,234 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 09:49:58,265 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 09:49:58,859 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 09:49:58,906 INFO MapTask - numReduceTasks: 1
2009-05-11 09:49:58,906 INFO MapTask - io.sort.mb = 100
2009-05-11 09:49:59,015 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 09:49:59,015 INFO MapTask - record buffer = 262144/327680
2009-05-11 09:49:59,140 INFO Fetcher - Fetcher: threads: 5
2009-05-11 09:49:59,140 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
2009-05-11 09:49:59,250 INFO Fetcher - QueueFeeder finished: total 1 records.

[plugin-loading log omitted]

2009-05-11 09:49:59,312 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 09:49:59,328 INFO Configuration - found resource parse-plugins.xml at file:/D:/work/workspace/nutch_crawl/bin/parse-plugins.xml
2009-05-11 09:49:59,359 INFO Fetcher - fetching http://www.163.com/
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=4
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=3
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=2
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=1
2009-05-11 09:49:59,421 INFO Http - http.proxy.host = null
2009-05-11 09:49:59,421 INFO Http - http.proxy.port = 8080
2009-05-11 09:49:59,421 INFO Http - http.timeout = 10000
2009-05-11 09:49:59,421 INFO Http - http.content.limit = 65536
2009-05-11 09:49:59,421 INFO Http - http.agent = nutch/Nutch-1.0 (chinahui; http://www.163.com; zhonghui@028wy.com)
2009-05-11 09:49:59,421 INFO Http - protocol.plugin.check.blocking = false
2009-05-11 09:49:59,421 INFO Http - protocol.plugin.check.robots = false
2009-05-11 09:50:00,109 INFO Configuration - found resource tika-mimetypes.xml at file:/D:/work/workspace/nutch_crawl/bin/tika-mimetypes.xml
2009-05-11 09:50:00,156 WARN ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2009-05-11 09:50:00,375 INFO Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2009-05-11 09:50:00,671 INFO SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2009-05-11 09:50:00,687 INFO Fetcher - -finishing thread FetcherThread, activeThreads=0
2009-05-11 09:50:01,375 INFO Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2009-05-11 09:50:01,375 INFO Fetcher - -activeThreads=0
2009-05-11 09:50:01,375 INFO MapTask - Starting flush of map output
2009-05-11 09:50:01,578 INFO MapTask - Finished spill 0
2009-05-11 09:50:01,578 INFO TaskRunner - Task:attempt_local_0005_m_000000_0 is done. And is in the process of commiting
2009-05-11 09:50:01,578 INFO LocalJobRunner - 0 threads, 1 pages, 0 errors, 0.5 pages/s, 256 kb/s,
2009-05-11 09:50:01,578 INFO TaskRunner - Task 'attempt_local_0005_m_000000_0' done.
2009-05-11 09:50:01,593 INFO LocalJobRunner -
2009-05-11 09:50:01,593 INFO Merger - Merging 1 sorted segments
2009-05-11 09:50:01,593 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 72558 bytes
2009-05-11 09:50:01,593 INFO LocalJobRunner -
2009-05-11 09:50:01,671 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,734 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,765 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,765 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

[plugin-loading log omitted]

2009-05-11 09:50:01,921 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 09:50:01,984 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,015 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,062 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,093 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,125 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,140 WARN RegexURLNormalizer - can't find rules for scope 'outlink', using default
2009-05-11 09:50:02,171 INFO TaskRunner - Task:attempt_local_0005_r_000000_0 is done. And is in the process of commiting
2009-05-11 09:50:02,171 INFO LocalJobRunner - reduce > reduce
2009-05-11 09:50:02,187 INFO TaskRunner - Task 'attempt_local_0005_r_000000_0' done.
2009-05-11 09:50:44,062 INFO JobClient - Running job: job_local_0005
2009-05-11 09:51:31,328 INFO JobClient - Job complete: job_local_0005
2009-05-11 09:51:32,984 INFO JobClient - Counters: 11
2009-05-11 09:51:33,000 INFO JobClient -   File Systems
2009-05-11 09:51:33,000 INFO JobClient -     Local bytes read=336424
2009-05-11 09:51:33,000 INFO JobClient -     Local bytes written=700394
2009-05-11 09:51:33,000 INFO JobClient -   Map-Reduce Framework
2009-05-11 09:51:33,000 INFO JobClient -     Reduce input groups=1
2009-05-11 09:51:33,000 INFO JobClient -     Combine output records=0
2009-05-11 09:51:33,000 INFO JobClient -     Map input records=1
2009-05-11 09:51:33,000 INFO JobClient -     Reduce output records=3
2009-05-11 09:51:33,000 INFO JobClient -     Map output bytes=72545
2009-05-11 09:51:33,000 INFO JobClient -     Map input bytes=78
2009-05-11 09:51:33,000 INFO JobClient -     Combine input records=0
2009-05-11 09:51:33,000 INFO JobClient -     Map output records=3
2009-05-11 09:51:33,000 INFO JobClient -     Reduce input records=3
2009-05-11 09:51:47,750 INFO Fetcher - Fetcher: done

1.3.4 The parse Method

Description: parses the content of the downloaded pages.
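In Nutch the parsing itself is delegated to plugins such as parse-html (backed by NekoHTML). As a rough illustration of what "extracting outlinks" means, here is a hypothetical regex-based extractor; it is adequate for a sketch but not for real-world HTML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    // Naive pattern: double-quoted absolute http links only.
    static final Pattern HREF = Pattern.compile("href=\"(http://[^\"]+)\"");

    static List<String> outlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.163.com/\">news</a> "
                    + "<a href=\"http://mail.163.com/\">mail</a>";
        System.out.println(outlinks(html)); // [http://news.163.com/, http://mail.163.com/]
    }
}
```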

1.3.5 The update Method

Description: adds the newly discovered outlinks to the crawl database.
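Conceptually, the update folds each segment's fetch and parse output back into the crawldb: the fetched URL gets its status refreshed, and newly discovered outlinks enter as unfetched entries, which is how a single seed grows into dozens of crawldb entries in the log below. A map-based sketch of that reduce (again, not the real CrawlDbReducer; the status strings echo Nutch's db_fetched/db_unfetched):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpdateDbSketch {
    static void update(Map<String, String> crawlDb, String fetchedUrl, List<String> outlinks) {
        crawlDb.put(fetchedUrl, "db_fetched");          // refresh status of the fetched page
        for (String link : outlinks) {
            crawlDb.putIfAbsent(link, "db_unfetched");  // new outlinks enter as unfetched
        }
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("http://www.163.com/", "db_unfetched");
        update(db, "http://www.163.com/",
               Arrays.asList("http://news.163.com/", "http://mail.163.com/"));
        System.out.println(db.size());                 // grew from 1 to 3 entries
        System.out.println(db.get("http://www.163.com/"));
    }
}
```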

2009-05-11 10:04:20,890 INFO CrawlDb - CrawlDb update: starting
2009-05-11 10:04:22,500 INFO CrawlDb - CrawlDb update: db: 20090508/crawldb
2009-05-11 10:05:53,593 INFO CrawlDb - CrawlDb update: segments: [20090508/segments/20090511094102]
2009-05-11 10:06:06,031 INFO CrawlDb - CrawlDb update: additions allowed: true
2009-05-11 10:06:07,296 INFO CrawlDb - CrawlDb update: URL normalizing: true
2009-05-11 10:06:09,031 INFO CrawlDb - CrawlDb update: URL filtering: true
2009-05-11 10:07:05,125 INFO CrawlDb - CrawlDb update: Merging segment data into db.
2009-05-11 10:08:11,031 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:09:00,187 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:09:00,375 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:10:03,531 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,125 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,203 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:25,203 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:25,343 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:25,343 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:25,343 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

[plugin-loading log omitted]

2009-05-11 10:16:25,750 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:25,796 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:25,796 INFO MapTask - Starting flush of map output
2009-05-11 10:16:25,984 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,000 INFO TaskRunner - Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:26,000 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143
2009-05-11 10:16:26,000 INFO TaskRunner - Task 'attempt_local_0006_m_000000_0' done.
2009-05-11 10:16:26,031 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:26,031 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:26,140 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:26,140 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:26,156 INFO CodecPool - Got brand-new decompressor
2009-05-11 10:16:26,171 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

[plugin-loading log omitted]

2009-05-11 10:16:26,687 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:26,718 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:26,734 INFO MapTask - Starting flush of map output
2009-05-11 10:16:26,750 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,750 INFO TaskRunner - Task:attempt_local_0006_m_000002_0 is done. And is in the process of commiting
2009-05-11 10:16:26,750 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026
2009-05-11 10:16:26,750 INFO TaskRunner - Task 'attempt_local_0006_m_000002_0' done.
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,781 INFO Merger - Merging 3 sorted segments
2009-05-11 10:16:26,781 INFO Merger - Down to the last merge-pass, with 3 segments left of total size: 3706 bytes
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,875 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

[plugin-loading log omitted]

2009-05-11 10:16:27,031 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-11 10:16:27,046 INFO TaskRunner - Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:27,046 INFO LocalJobRunner -
2009-05-11 10:16:27,046 INFO TaskRunner - Task attempt_local_0006_r_000000_0 is allowed to commit now
2009-05-11 10:16:27,062 INFO FileOutputCommitter - Saved output of task 'attempt_local_0006_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/132216774
2009-05-11 10:16:27,062 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:16:27,062 INFO TaskRunner - Task 'attempt_local_0006_r_000000_0' done.
2009-05-11 10:17:43,984 INFO JobClient - Running job: job_local_0006
2009-05-11 10:18:33,671 INFO JobClient - Job complete: job_local_0006
2009-05-11 10:18:35,906 INFO JobClient - Counters: 11
2009-05-11 10:18:35,906 INFO JobClient -   File Systems
2009-05-11 10:18:35,906 INFO JobClient -     Local bytes read=936164
2009-05-11 10:18:35,906 INFO JobClient -     Local bytes written=1678861
2009-05-11 10:18:35,906 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:18:35,906 INFO JobClient -     Reduce input groups=57
2009-05-11 10:18:35,906 INFO JobClient -     Combine output records=0
2009-05-11 10:18:35,906 INFO JobClient -     Map input records=63
2009-05-11 10:18:35,906 INFO JobClient -     Reduce output records=57
2009-05-11 10:18:35,906 INFO JobClient -     Map output bytes=3574
2009-05-11 10:18:35,906 INFO JobClient -     Map input bytes=4079
2009-05-11 10:18:35,906 INFO JobClient -     Combine input records=0
2009-05-11 10:18:35,906 INFO JobClient -     Map output records=63
2009-05-11 10:18:35,906 INFO JobClient -     Reduce input records=63
2009-05-11 10:19:48,078 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:22:51,437 INFO CrawlDb - CrawlDb update: done

1.3.6 The invert Method

Description: analyzes the link relationships and generates inverted links. When it finishes, the linkdb directory has been created.
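Inverting links turns the per-page "source → outlinks" relation into "target → inlinks", which is what the linkdb stores (the real LinkDb also keeps the anchor text of each inlink). A self-contained sketch of the inversion:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertLinks {
    // source -> outlink targets  becomes  target -> sources that link to it
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new HashMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("http://www.163.com/", Arrays.asList("http://news.163.com/"));
        out.put("http://sports.163.com/", Arrays.asList("http://news.163.com/"));
        // Two pages link to news.163.com, so it ends up with two inlinks.
        System.out.println(invert(out).get("http://news.163.com/"));
    }
}
```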

2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: starting
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: linkdb: 20090508/linkdb
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: URL normalize: true
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: URL filter: true
2009-05-11 10:26:31,281 INFO LinkDb - LinkDb: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102
2009-05-11 10:26:31,281 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:26:31,296 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:26:31,453 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:26:31,484 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:26:32,078 INFO JobClient - Running job: job_local_0007
2009-05-11 10:26:32,078 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:26:32,125 INFO MapTask - numReduceTasks: 1
2009-05-11 10:26:32,125 INFO MapTask - io.sort.mb = 100
2009-05-11 10:26:32,234 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:26:32,234 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:26:32,250 INFO MapTask - Starting flush of map output
2009-05-11 10:26:32,437 INFO MapTask - Finished spill 0
2009-05-11 10:26:32,453 INFO TaskRunner - Task:attempt_local_0007_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:26:32,453 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382
2009-05-11 10:26:32,453 INFO TaskRunner - Task 'attempt_local_0007_m_000000_0' done.
2009-05-11 10:26:32,468 INFO LocalJobRunner -
2009-05-11 10:26:32,468 INFO Merger - Merging 1 sorted segments
2009-05-11 10:26:32,468 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 3264 bytes
2009-05-11 10:26:32,468 INFO LocalJobRunner -
2009-05-11 10:26:32,562 INFO TaskRunner - Task:attempt_local_0007_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:26:32,562 INFO LocalJobRunner -
2009-05-11 10:26:32,562 INFO TaskRunner - Task attempt_local_0007_r_000000_0 is allowed to commit now
2009-05-11 10:26:32,578 INFO FileOutputCommitter - Saved output of task 'attempt_local_0007_r_000000_0' to file:/D:/work/workspace/nutch_crawl/linkdb-1900012851
2009-05-11 10:26:32,578 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:26:32,578 INFO TaskRunner - Task 'attempt_local_0007_r_000000_0' done.
2009-05-11 10:26:33,078 INFO JobClient - Job complete: job_local_0007
2009-05-11 10:26:33,078 INFO JobClient - Counters: 11
2009-05-11 10:26:33,078 INFO JobClient -   File Systems
2009-05-11 10:26:33,078 INFO JobClient -     Local bytes read=535968
2009-05-11 10:26:33,078 INFO JobClient -     Local bytes written=965231
2009-05-11 10:26:33,078 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:26:33,078 INFO JobClient -     Reduce input groups=56
2009-05-11 10:26:33,078 INFO JobClient -     Combine output records=56
2009-05-11 10:26:33,078 INFO JobClient -     Map input records=1
2009-05-11 10:26:33,078 INFO JobClient -     Reduce output records=56
2009-05-11 10:26:33,078 INFO JobClient -     Map output bytes=3384
2009-05-11 10:26:33,078 INFO JobClient -     Map input bytes=1254
2009-05-11 10:26:33,078 INFO JobClient -     Combine input records=60
2009-05-11 10:26:33,078 INFO JobClient -     Map output records=60
2009-05-11 10:26:33,078 INFO JobClient -     Reduce input records=56
2009-05-11 10:26:33,078 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:26:33,125 INFO LinkDb - LinkDb: done

1.3.7 The index method

Description: creates the index of page content.

This step generates the indexes directory.
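As the log below shows, the indexing job reads six inputs: crawl_fetch, crawl_parse, parse_data and parse_text from the segment, plus the crawldb and linkdb. They are all keyed by URL, so the reduce side receives every record for a URL together and builds one document from them. A toy sketch of that reduce-side join (illustrative names; the real implementation builds a Lucene document and runs it through the IndexingFilters shown in the log):

```java
import java.util.*;

public class IndexerReduceSketch {
    // Combine what the framework grouped under one URL: its crawldb status,
    // the anchor texts collected from linkdb, and the segment's parsed text.
    public static Map<String, String> buildDocument(String url, String dbStatus,
                                                    List<String> anchors, String parseText) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("status", dbStatus);
        // anchor text of incoming links is indexed alongside the page's own text
        doc.put("anchor", String.join(" ", anchors));
        doc.put("content", parseText);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = buildDocument("http://example.org/", "db_fetched",
                Arrays.asList("Example", "Home"), "Example Domain ...");
        System.out.println(doc);
    }
}
```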



2009-05-11 10:31:22,250 INFO Indexer - Indexer: starting
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduce: crawldb: 20090508/crawldb
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduce: linkdb: 20090508/linkdb
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduces: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102
2009-05-11 10:32:30,359 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:32:34,109 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:32:34,296 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:32:34,421 INFO FileInputFormat - Total input paths to process : 6
2009-05-11 10:32:35,078 INFO FileInputFormat - Total input paths to process : 6
2009-05-11 10:32:35,140 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:35,140 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:35,250 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:35,250 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:35,265 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:35,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:35,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:35,953 INFO MapTask - Starting flush of map output
2009-05-11 10:32:35,968 INFO MapTask - Finished spill 0
2009-05-11 10:32:35,968 INFO TaskRunner - Task:attempt_local_0008_m_000001_0 is done. And is in the process of commiting
2009-05-11 10:32:35,968 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026
2009-05-11 10:32:35,968 INFO TaskRunner - Task 'attempt_local_0008_m_000001_0' done.
2009-05-11 10:32:36,000 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,000 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,125 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,125 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,125 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:36,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,281 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,296 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,312 INFO TaskRunner - Task:attempt_local_0008_m_000002_0 is done. And is in the process of commiting
2009-05-11 10:32:36,312 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382
2009-05-11 10:32:36,312 INFO TaskRunner - Task 'attempt_local_0008_m_000002_0' done.
2009-05-11 10:32:36,343 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,343 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,453 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,453 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,453 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:36,609 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,609 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,625 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,625 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,640 INFO TaskRunner - Task:attempt_local_0008_m_000003_0 is done. And is in the process of commiting
2009-05-11 10:32:36,640 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_text/part-00000/data:0+738
2009-05-11 10:32:36,640 INFO TaskRunner - Task 'attempt_local_0008_m_000003_0' done.
2009-05-11 10:32:36,671 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,671 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,781 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,781 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,796 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:36,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,953 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,953 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,968 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,968 INFO TaskRunner - Task:attempt_local_0008_m_000004_0 is done. And is in the process of commiting
2009-05-11 10:32:36,968 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+3772
2009-05-11 10:32:36,968 INFO TaskRunner - Task 'attempt_local_0008_m_000004_0' done.
2009-05-11 10:32:37,000 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:37,000 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:37,109 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:37,109 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:37,125 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:37,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:37,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:37,296 INFO MapTask - Starting flush of map output
2009-05-11 10:32:37,296 INFO MapTask - Finished spill 0
2009-05-11 10:32:37,312 INFO TaskRunner - Task:attempt_local_0008_m_000005_0 is done. And is in the process of commiting
2009-05-11 10:32:37,312 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/linkdb/current/part-00000/data:0+4215
2009-05-11 10:32:37,312 INFO TaskRunner - Task 'attempt_local_0008_m_000005_0' done.
2009-05-11 10:32:37,343 INFO LocalJobRunner -
2009-05-11 10:32:37,359 INFO Merger - Merging 6 sorted segments
2009-05-11 10:32:37,359 INFO Merger - Down to the last merge-pass, with 6 segments left of total size: 13876 bytes
2009-05-11 10:32:37,359 INFO LocalJobRunner -
2009-05-11 10:32:37,359 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
... (plugin loading log omitted) ...
2009-05-11 10:32:37,515 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:37,515 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:37,546 INFO Configuration - found resource common-terms.utf8 at file:/D:/work/workspace/nutch_crawl/bin/common-terms.utf8
2009-05-11 10:32:38,500 INFO TaskRunner - Task:attempt_local_0008_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:32:38,500 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:32:38,500 INFO TaskRunner - Task 'attempt_local_0008_r_000000_0' done.
2009-05-11 10:33:19,703 INFO JobClient - Running job: job_local_0008
2009-05-11 10:33:50,156 INFO JobClient - Job complete: job_local_0008
2009-05-11 10:33:52,562 INFO JobClient - Counters: 11
2009-05-11 10:33:52,562 INFO JobClient -   File Systems
2009-05-11 10:33:52,562 INFO JobClient -     Local bytes read=2150441
2009-05-11 10:33:52,562 INFO JobClient -     Local bytes written=3845733
2009-05-11 10:33:52,562 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:33:52,562 INFO JobClient -     Reduce input groups=58
2009-05-11 10:33:52,562 INFO JobClient -     Combine output records=0
2009-05-11 10:33:52,562 INFO JobClient -     Map input records=177
2009-05-11 10:33:52,562 INFO JobClient -     Reduce output records=1
2009-05-11 10:33:52,562 INFO JobClient -     Map output bytes=13506
2009-05-11 10:33:52,562 INFO JobClient -     Map input bytes=13661
2009-05-11 10:33:52,562 INFO JobClient -     Combine input records=0
2009-05-11 10:33:52,562 INFO JobClient -     Map output records=177
2009-05-11 10:33:52,562 INFO JobClient -     Reduce input records=177
2009-05-11 10:33:57,656 INFO Indexer - Indexer: done

1.3.8 The dedup method

Description: deletes duplicate documents from the indexes.
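The log below shows that dedup runs as three consecutive local jobs (job_local_0009 through job_local_0011): it reads the index entries, groups them by content hash, and deletes every copy but one. A minimal sketch of that hash-grouping idea (illustrative class and method names, not the actual DeleteDuplicates code; the real job also compares score and fetch time to decide which copy survives):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class DedupSketch {
    // Content digest used as the grouping key.
    static String md5(String content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(content.getBytes());
            return new BigInteger(1, d).toString(16);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Group documents by content hash; every copy after the first is a
    // duplicate to be marked deleted in the index.
    public static Set<String> duplicates(Map<String, String> urlToContent) {
        Map<String, String> kept = new HashMap<>(); // hash -> first URL seen
        Set<String> dups = new TreeSet<>();
        for (Map.Entry<String, String> e : urlToContent.entrySet()) {
            String h = md5(e.getValue());
            if (kept.putIfAbsent(h, e.getKey()) != null) dups.add(e.getKey());
        }
        return dups;
    }
}
```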

2009-05-11 10:38:53,671 INFO DeleteDuplicates - Dedup: starting
2009-05-11 10:39:32,890 INFO DeleteDuplicates - Dedup: adding indexes in: 20090508/indexes
2009-05-11 10:39:57,265 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:40:09,015 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:40:09,218 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:40:51,890 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:42:56,203 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:42:56,265 INFO MapTask - numReduceTasks: 1
2009-05-11 10:42:56,265 INFO MapTask - io.sort.mb = 100
2009-05-11 10:42:56,390 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:42:56,390 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:42:56,515 INFO MapTask - Starting flush of map output
2009-05-11 10:42:56,718 INFO MapTask - Finished spill 0
2009-05-11 10:42:56,718 INFO TaskRunner - Task:attempt_local_0009_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:42:56,718 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000
2009-05-11 10:42:56,718 INFO TaskRunner - Task 'attempt_local_0009_m_000000_0' done.
2009-05-11 10:42:56,734 INFO LocalJobRunner -
2009-05-11 10:42:56,734 INFO Merger - Merging 1 sorted segments
2009-05-11 10:42:56,734 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 141 bytes
2009-05-11 10:42:56,734 INFO LocalJobRunner -
2009-05-11 10:42:56,781 INFO TaskRunner - Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:42:56,781 INFO LocalJobRunner -
2009-05-11 10:42:56,781 INFO TaskRunner - Task attempt_local_0009_r_000000_0 is allowed to commit now
2009-05-11 10:42:56,796 INFO FileOutputCommitter - Saved output of task 'attempt_local_0009_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809
2009-05-11 10:42:56,796 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:42:56,796 INFO TaskRunner - Task 'attempt_local_0009_r_000000_0' done.
2009-05-11 10:43:06,515 INFO JobClient - Running job: job_local_0009
2009-05-11 10:43:14,500 INFO JobClient - Job complete: job_local_0009
2009-05-11 10:43:16,296 INFO JobClient - Counters: 11
2009-05-11 10:43:16,296 INFO JobClient -   File Systems
2009-05-11 10:43:16,296 INFO JobClient -     Local bytes read=710951
2009-05-11 10:43:16,296 INFO JobClient -     Local bytes written=1220879
2009-05-11 10:43:16,296 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:43:16,296 INFO JobClient -     Reduce input groups=1
2009-05-11 10:43:16,296 INFO JobClient -     Combine output records=0
2009-05-11 10:43:16,296 INFO JobClient -     Map input records=1
2009-05-11 10:43:16,296 INFO JobClient -     Reduce output records=1
2009-05-11 10:43:16,296 INFO JobClient -     Map output bytes=137
2009-05-11 10:43:16,296 INFO JobClient -     Map input bytes=2147483647
2009-05-11 10:43:16,296 INFO JobClient -     Combine input records=0
2009-05-11 10:43:16,296 INFO JobClient -     Map output records=1
2009-05-11 10:43:16,296 INFO JobClient -     Reduce input records=1
2009-05-11 10:44:37,734 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:44:45,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:44:46,140 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:44:48,781 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:45:46,546 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:45:46,609 INFO MapTask - numReduceTasks: 1
2009-05-11 10:45:46,609 INFO MapTask - io.sort.mb = 100
2009-05-11 10:45:46,718 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:45:46,718 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:45:46,734 INFO MapTask - Starting flush of map output
2009-05-11 10:45:46,953 INFO MapTask - Finished spill 0
2009-05-11 10:45:46,953 INFO TaskRunner - Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:45:46,953 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809/part-00000:0+247
2009-05-11 10:45:46,953 INFO TaskRunner - Task 'attempt_local_0010_m_000000_0' done.
2009-05-11 10:45:46,968 INFO LocalJobRunner -
2009-05-11 10:45:46,968 INFO Merger - Merging 1 sorted segments
2009-05-11 10:45:46,968 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 137 bytes
2009-05-11 10:45:46,968 INFO LocalJobRunner -
2009-05-11 10:45:47,015 INFO TaskRunner - Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:45:47,015 INFO LocalJobRunner -
2009-05-11 10:45:47,015 INFO TaskRunner - Task attempt_local_0010_r_000000_0 is allowed to commit now
2009-05-11 10:45:47,015 INFO FileOutputCommitter - Saved output of task 'attempt_local_0010_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517
2009-05-11 10:45:47,015 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:45:47,015 INFO TaskRunner - Task 'attempt_local_0010_r_000000_0' done.
2009-05-11 10:45:52,187 INFO JobClient - Running job: job_local_0010
2009-05-11 10:46:03,984 INFO JobClient - Job complete: job_local_0010
2009-05-11 10:46:06,359 INFO JobClient - Counters: 11
2009-05-11 10:46:06,359 INFO JobClient -   File Systems
2009-05-11 10:46:06,359 INFO JobClient -     Local bytes read=764171
2009-05-11 10:46:06,359 INFO JobClient -     Local bytes written=1327019
2009-05-11 10:46:06,359 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:46:06,359 INFO JobClient -     Reduce input groups=1
2009-05-11 10:46:06,359 INFO JobClient -     Combine output records=0
2009-05-11 10:46:06,359 INFO JobClient -     Map input records=1
2009-05-11 10:46:06,359 INFO JobClient -     Reduce output records=0
2009-05-11 10:46:06,359 INFO JobClient -     Map output bytes=133
2009-05-11 10:46:06,359 INFO JobClient -     Map input bytes=141
2009-05-11 10:46:06,359 INFO JobClient -     Combine input records=0
2009-05-11 10:46:06,359 INFO JobClient -     Map output records=1
2009-05-11 10:46:06,359 INFO JobClient -     Reduce input records=1
2009-05-11 10:47:19,953 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:47:19,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:47:20,140 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:47:20,156 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:47:20,765 INFO JobClient - Running job: job_local_0011
2009-05-11 10:47:20,765 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:47:20,796 INFO MapTask - numReduceTasks: 1
2009-05-11 10:47:20,796 INFO MapTask - io.sort.mb = 100
2009-05-11 10:47:20,921 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:47:20,921 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:47:20,937 INFO MapTask - Starting flush of map output
2009-05-11 10:47:21,140 INFO MapTask - Index: (0, 2, 6)
2009-05-11 10:47:21,140 INFO TaskRunner - Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:47:21,140 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517/part-00000:0+103
2009-05-11 10:47:21,140 INFO TaskRunner - Task 'attempt_local_0011_m_000000_0' done.
2009-05-11 10:47:21,156 INFO LocalJobRunner -
2009-05-11 10:47:21,156 INFO Merger - Merging 1 sorted segments
2009-05-11 10:47:21,156 INFO Merger - Down to the last merge-pass, with 0 segments left of total size: 0 bytes
2009-05-11 10:47:21,156 INFO LocalJobRunner -
2009-05-11 10:47:21,171 INFO TaskRunner - Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:47:21,171 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:47:21,171 INFO TaskRunner - Task 'attempt_local_0011_r_000000_0' done.
2009-05-11 10:47:21,765 INFO JobClient - Job complete: job_local_0011
2009-05-11 10:47:21,765 INFO JobClient - Counters: 11
2009-05-11 10:47:21,765 INFO JobClient -   File Systems
2009-05-11 10:47:21,765 INFO JobClient -     Local bytes read=816128
2009-05-11 10:47:21,765 INFO JobClient -     Local bytes written=1430954
2009-05-11 10:47:21,765 INFO JobClient -   Map-Reduce Framework
2009-05-11 10:47:21,765 INFO JobClient -     Reduce input groups=0
2009-05-11 10:47:21,765 INFO JobClient -     Combine output records=0
2009-05-11 10:47:21,765 INFO JobClient -     Map input records=0
2009-05-11 10:47:21,765 INFO JobClient -     Reduce output records=0
2009-05-11 10:47:21,765 INFO JobClient -     Map output bytes=0
2009-05-11 10:47:21,765 INFO JobClient -     Map input bytes=0
2009-05-11 10:47:21,765 INFO JobClient -     Combine input records=0
2009-05-11 10:47:21,765 INFO JobClient -     Map output records=0
2009-05-11 10:47:21,765 INFO JobClient -     Reduce input records=0
2009-05-11 10:47:44,031 INFO DeleteDuplicates - Dedup: done

1.3.9 The merge method

Description: merges the index files.

A temporary folder, 20090511094057, is first created under tmp/hadoop-Administrator/mapred/local/crawl. The data in indexes is used to build an index, which is added to the merge-output directory beneath 20090511094057. Finally, the fs.completeLocalOutput method writes the index from the temporary directory into the newly created index directory.
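Conceptually, merging the partial indexes boils down to merging their sorted term dictionaries and concatenating the posting lists for each term; Lucene's IndexWriter#addIndexes performs the real, on-disk version of this for IndexMerger. A toy in-memory sketch of the operation (illustrative names, no Lucene dependency):

```java
import java.util.*;

public class IndexMergeSketch {
    // Merge several per-segment inverted indexes (term -> sorted doc ids)
    // into a single combined index with sorted posting lists.
    public static TreeMap<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> parts) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, List<Integer>> part : parts) {
            for (Map.Entry<String, List<Integer>> e : part.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        // keep each posting list sorted so lookups stay cheap
        for (List<Integer> postings : merged.values()) Collections.sort(postings);
        return merged;
    }
}
```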

2009-05-11 10:53:56,156 INFO IndexMerger - merging indexes to: 20090508/index
2009-05-11 10:58:50,906 INFO IndexMerger - Adding file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000
2009-05-11 11:04:36,562 INFO IndexMerger - done merging