
MapReduce: Simplified Data Processing on Large Clusters (Chinese Translation, Part 3)

2015-05-03 20:20
2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
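The flow described above (map each input pair, group intermediate values by key, call reduce once per key with an iterator) can be sketched in a few lines of Python. This is a minimal single-process sketch, not the library's implementation: the names `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical, and the grouping here holds everything in memory, so the iterator only illustrates the interface rather than the memory benefit.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce phase: one call per intermediate key; values arrive via an iterator.
    output = []
    for ikey in sorted(groups):
        for out in reduce_fn(ikey, iter(groups[ikey])):
            output.append((ikey, out))
    return output


def parity_map(key, value):
    # Illustrative map: classify each number as "even" or "odd".
    yield ("even" if value % 2 == 0 else "odd", value)


def max_reduce(key, values):
    # Illustrative reduce: emit a single output value per key.
    yield max(values)


result = run_mapreduce([("a", 1), ("b", 4), ("c", 7), ("d", 2)], parity_map, max_reduce)
```

Here the reduce function produces exactly one value per invocation, which matches the typical case mentioned above.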

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
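For readers who want to run the word-count example, here is a rough Python rendering of the pseudo-code. It is only a sketch: the real library is C++ and distributed, whereas the hypothetical `word_count` helper below stands in for the grouping step with an in-memory dictionary. Counts are kept as strings to mirror the pseudo-code.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word; values: an iterator over counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    yield str(result)


def word_count(documents):
    """Hypothetical in-memory stand-in for the library's grouping step."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return {word: next(reduce_fn(word, iter(counts))) for word, counts in groups.items()}


counts = word_count({"a.txt": "the quick fox", "b.txt": "the lazy dog the end"})
```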

2.2 Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values. Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
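The two signatures can be written down as generic type aliases. The sketch below uses Python's `typing` module; the alias names `MapFn` and `ReduceFn` and the function `typed_reduce` are illustrative assumptions, not part of the library. The reduce function also shows the string conversion at the boundary that the paragraph above leaves to user code.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterator[V2]], Iterable[V2]]


def typed_reduce(key: str, values: Iterator[str]) -> Iterable[str]:
    counts = (int(v) for v in values)  # string -> int on the way in
    yield str(sum(counts))             # int -> string on the way out


total = list(typed_reduce("word", iter(["1", "2", "3"])))
```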

2.3 More Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>.
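As a small illustration of the reverse web-link graph, the sketch below assumes each page has already been parsed into a list of outgoing links (the paper does not specify the input format). The `reverse_links` helper is a hypothetical in-memory stand-in for the library, and sorting the source list is an added assumption to make the output deterministic.

```python
from collections import defaultdict


def map_fn(source, links):
    # Emit one <target, source> pair per outgoing link.
    for target in links:
        yield target, source


def reduce_fn(target, sources):
    # Collect all sources for the target; sorting is an assumption for stable output.
    yield sorted(sources)


def reverse_links(pages):
    """Hypothetical in-memory driver that groups pairs by target URL."""
    groups = defaultdict(list)
    for source, links in pages.items():
        for target, src in map_fn(source, links):
            groups[target].append(src)
    return {t: next(reduce_fn(t, iter(s))) for t, s in groups.items()}


pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"], "c.html": ["a.html"]}
graph = reverse_links(pages)
```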

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
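The inverted index can be sketched the same way. In this version the grouping step is simulated by sorting the intermediate pairs and walking them with `itertools.groupby`. Two things are assumptions of the sketch rather than of the paper: words are lower-cased and split on whitespace, and the map function emits one pair per distinct word in a document. The name `build_index` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(doc_id, text):
    # Assumption: emit one <word, document ID> pair per distinct word.
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    # Sort the document IDs for the given word.
    yield sorted(doc_ids)


def build_index(docs):
    """Hypothetical driver: sort intermediate pairs, then reduce each key group."""
    pairs = sorted(p for doc_id, text in docs.items() for p in map_fn(doc_id, text))
    index = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        index[word] = next(reduce_fn(word, (doc_id for _, doc_id in group)))
    return index


docs = {1: "map reduce", 2: "reduce sort", 3: "map sort reduce"}
index = build_index(docs)
```

Tracking word positions would mean emitting (document ID, position) tuples from the map function instead of bare document IDs.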

Distributed Sort: The map function extracts the key from each record, and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
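Sections 4.1 and 4.2 are not part of this excerpt, so the following is only a sketch of why partitioning and ordering matter for sorting. It assumes a range partitioner (keys are routed to partitions by comparing against split points) and assumes that keys are processed in increasing order within each partition; concatenating the partitions in order then yields a globally sorted result. The names `partition`, `distributed_sort`, and `boundaries` are hypothetical, as is the choice of the first whitespace-separated field as the sort key.

```python
from bisect import bisect_right
from itertools import groupby
from operator import itemgetter


def map_fn(_, record):
    # Assumption: the sort key is the first whitespace-separated field.
    yield record.split()[0], record


def reduce_fn(key, records):
    # Identity reduce: emit every record unchanged.
    yield from records


def partition(key, boundaries):
    # Assumed range partitioner: index of the key range that contains the key.
    return bisect_right(boundaries, key)


def distributed_sort(records, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for i, record in enumerate(records):
        for key, rec in map_fn(i, record):
            partitions[partition(key, boundaries)].append((key, rec))
    output = []
    for part in partitions:
        part.sort(key=itemgetter(0))  # increasing key order within a partition
        for key, group in groupby(part, key=itemgetter(0)):
            output.extend(reduce_fn(key, (rec for _, rec in group)))
    return output


records = ["pear 3", "apple 1", "zebra 9", "mango 5"]
result = distributed_sort(records, boundaries=["m"])
```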

Chapter 2: Programming Model

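The computation takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce library express the computation with two functions: Map and Reduce.

The Map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups all values that share the same intermediate key and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key. It merges these values to form a possibly smaller set of values. Typically each Reduce invocation produces just zero or one output value. The intermediate values are passed to the user-defined Reduce function through an iterator, which makes it possible to handle lists of values that are too large to fit in memory.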

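2.1 Example

Consider the problem of counting how many times each word occurs in a large collection of documents (word frequency). The user would write pseudo-code similar to the following: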

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

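The map function emits each word together with its occurrence count (just 1 in this simple example). The reduce function sums all the counts emitted for a particular word. In addition, the user writes code that fills in a MapReduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function and passes it the specification object. The user's code is linked with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.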

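2.2 Types

Although the preceding pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types: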

map(k1,v1) ->list(k2,v2)

reduce(k2,list(v2)) ->list(v2)

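That is, the input keys and values are drawn from a different domain than the output keys and values, while the intermediate keys and values are from the same domain as the output keys and values. Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and the appropriate types.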

2.3 More Examples

Here are a few simple examples of interesting programs that can easily be expressed as MapReduce computations.

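Distributed Grep (grep stands for "globally search a regular expression and print"; it is a powerful text-search tool that searches text with regular expressions and prints the matching lines): The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that simply copies the supplied intermediate data to the output.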

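Count of URL Access Frequency

The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.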

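Reverse Web-Link Graph

The map function outputs a <target, source> pair for each link, where target is the linked URL and source is the page that contains the link. The reduce function concatenates the list of all source URLs associated with a given target URL and emits a <target, list(source)> pair.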

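Term-Vector per Host

A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the document's URL). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, discards infrequent terms, and then emits a final <hostname, term vector> pair.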

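Inverted Index

The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to extend this computation to keep track of word positions.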

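Distributed Sort

The map function extracts the key from each record and emits a <key, record> pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.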