
Hadoop Points of Confusion (continuously updated)

2015-10-30 11:13
This post records some common points of confusion.

0) Hadoop components:

[Figure: overview of the Hadoop component stack; image not preserved in this copy.]
1) A single machine runs both the MapReduce and HDFS frameworks, so the machines doing the computation are the same machines storing the data, avoiding network overhead.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

2) The client submits the job (essentially the client code plus the relevant configuration) to the ResourceManager, which then takes responsibility for scheduling the slaves and assigning tasks (essentially shipping the client code and configuration out to different machines); mappers run the map function and reducers run the reduce function.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Job is the primary interface by which the user job interacts with the ResourceManager. The job submission process involves, among other steps, copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
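
To make the submission path concrete, here is a minimal driver sketch. The class names WordCountMapper and WordCountReducer are hypothetical placeholders (a mapper sketch appears under point 3 below); the waitForCompletion() call is what hands the jar and configuration over to the ResourceManager:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(SubmitJob.class);           // this jar is shipped to the cluster
            job.setMapperClass(WordCountMapper.class);    // hypothetical mapper class
            job.setReducerClass(WordCountReducer.class);  // hypothetical reducer class
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // submits jar + configuration to the ResourceManager and polls for completion
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }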

3) Each input split is processed by one map task. Note that a single machine often handles several splits, which means the machine is actually running several processes at the same time, one process per split.

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
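
A minimal Mapper sketch (the hypothetical WordCountMapper assumed in the driver sketch above): each map task instantiates one such Mapper for its own split and calls map() once per record in that split.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE); // one intermediate (word, 1) pair per token
            }
        }
    }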

4) Every mapper's intermediate output gets sorted! Which raises the question: what if it doesn't fit in memory? The actual process: each intermediate record the mapper produces goes straight into a buffer; when the buffer fills, only the buffered data is sorted, hashed into R different partitions, and spilled to disk. Once the map task has processed all of its input, the spill files are read back in, merge-sorted, and written to disk again as R partitions.

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.

A record emitted from a map will be serialized into a buffer and metadata will be stored into accounting buffers. When either the serialization buffer or the metadata exceed a threshold, the contents of the buffers will be sorted and written to disk in the background while the map continues to output records. If either buffer fills completely while the spill is in progress, the map thread will block.
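
The buffer size and spill threshold referred to in the quote are configurable. A sketch, assuming Hadoop 2.x property names; the values here are purely illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;

    // ...inside the job driver, before creating the Job:
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 200);            // in-memory sort buffer size (MB)
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // start spilling at 80% full
    conf.setInt("mapreduce.task.io.sort.factor", 10);         // spill files merged per merge pass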

5) Users can specify which reducer handles each key.

Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
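
A minimal custom Partitioner sketch, assuming Text keys and IntWritable values; this hypothetical example routes all keys starting with the same character to the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;                // route empty keys to reducer 0
            }
            // Text.charAt returns the Unicode code point at the given offset
            return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class); the default, HashPartitioner, hashes the whole key instead.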

6) Before the mapper's results are handed to the reducer, the user can specify a combiner so the mapper's intermediate output is combined first, reducing the amount of data transferred to the reducer. (Note, however, that the framework does not guarantee the combiner will run, so the reducer's input types must match both the mapper's output types and the combiner's output types (which implies the combiner's output types equal the mapper's); a better approach is the in-mapper combining pattern, sketched after the quote below.)

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer. The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format.
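
The in-mapper combining pattern mentioned above can be sketched like this: aggregate counts locally in a HashMap and emit them once per split in cleanup(). The local aggregation then always runs, whereas a Combiner set via job.setCombinerClass(...) may run zero, one, or several times at the framework's discretion.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                counts.merge(tok.nextToken(), 1, Integer::sum); // aggregate locally
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit one (word, localCount) pair per distinct word in this split
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }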

7) Why does the reducer's pseudocode take <key, (list of values)> as its parameter? A list of values? Aren't the inputs individual key-value pairs? Because the reducer does not merely run the reduce function; it has three main phases: shuffle, sort, and reduce. The shuffle and sort phases group and merge the map outputs, and only then is the reduce function invoked.

Shuffle: the Reducer fetches the relevant sorted partition of the output of all the mappers, via HTTP.

Sort: the framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

Secondary sort: if equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via Job.setSortComparatorClass(Class). Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.

Reduce: in this phase the reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(WritableComparable, Writable). The output of the Reducer is not sorted.
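
A sketch of the grouping half of secondary sort, assuming a hypothetical composite Text key laid out as "naturalKey<TAB>secondaryValue": the grouping comparator compares only the natural part, so all values sharing a natural key arrive in a single reduce() call, already ordered by the full key.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(Text.class, true); // true: instantiate keys for comparison
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // hypothetical key layout: "naturalKey\tsecondaryValue"
            String left  = a.toString().split("\t", 2)[0];
            String right = b.toString().split("\t", 2)[0];
            return left.compareTo(right);
        }
    }

It would be wired in the driver with job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class). In this particular layout plain lexicographic Text ordering already sorts by natural key first, but a real job would usually pair this with an explicit Job.setSortComparatorClass(...).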

8) A job may have no reducer at all.

It is legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
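
A minimal sketch of the map-only case, continuing the hypothetical driver from point 2 above:

    job.setNumReduceTasks(0);                               // no reduce phase at all
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // map output lands here, unsorted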
Tags: hadoop mapreduce