
Cloud Computing(3)_Basic MapReduce Algorithm Design_Pairs&Stripes

2017-03-09 22:28
How do we aggregate partial counts efficiently?

Pairs

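The pairs algorithm counts word co-occurrences by emitting one key-value pair for each pair of co-occurring words.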

This algorithm illustrates the use of complex keys in order to coordinate distributed computations.

Each mapper takes a sentence and generates all co-occurring term pairs

Reducers sum up counts associated with these pairs

//"pairs" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ NEIGHBORS(w) do
                EMIT(pair (w, u), count 1)   //EMIT count for each co-occurrence

class REDUCER
    method REDUCE(pair p, counts [c1, c2, ...])
        s = 0
        for all count c ∈ counts [c1, c2, ...] do
            s = s + c
        EMIT(pair p, count s)


For each term, emit pairs: ((a, b), 1). The key is the pair (a, b).
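The pseudocode above can be tried out in memory. The following is a minimal Python sketch, not a Hadoop job: the function names (`neighbors`, `pairs_map`, `pairs_reduce`, `run_pairs`) are hypothetical, the shuffle phase is simulated with a dictionary, and NEIGHBORS(w) is assumed to mean every other word in the same sentence.

```python
from collections import defaultdict

def neighbors(words, i):
    # Assumption: the neighbors of a word are all other words in its sentence.
    return [u for j, u in enumerate(words) if j != i]

def pairs_map(sentence):
    # Emit ((w, u), 1) for every co-occurrence in one sentence.
    words = sentence.split()
    for i, w in enumerate(words):
        for u in neighbors(words, i):
            yield (w, u), 1

def pairs_reduce(pair, counts):
    # Sum all partial counts that arrived for one pair.
    return pair, sum(counts)

def run_pairs(sentences):
    # Simulated shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for sentence in sentences:
        for key, value in pairs_map(sentence):
            grouped[key].append(value)
    return dict(pairs_reduce(k, v) for k, v in grouped.items())

counts = run_pairs(["a b c", "a b"])
print(counts[("a", "b")])  # 2: the pair occurs in both sentences
print(counts[("a", "c")])  # 1: the pair occurs only in the first sentence
```

Note that the key is a composite object, so in a real job the pair type would need to be serializable and comparable for sorting.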

"Pairs Analysis" (short records, but many of them)

Advantages

Easy to implement, easy to understand: map finds the pairs, reduce counts them

Disadvantages

Lots of pairs to sort and shuffle around; upper bound = n² (a vocabulary of n words yields at most n² distinct pairs)

Not many opportunities for combiners to work, because the key space is large and a given pair rarely repeats within one mapper's output
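The limited benefit of combiners can be checked on a small input. The sketch below is an illustration under assumptions: `pairs_map` and `combine` are hypothetical helper names, one mapper handles one sentence, and the neighbors of a word are all other words in that sentence. The combiner only merges records that share a key, and few do.

```python
from collections import Counter

def pairs_map(sentence):
    # Emit ((w, u), 1) for every ordered pair of distinct positions.
    words = sentence.split()
    return [((w, u), 1)
            for i, w in enumerate(words)
            for j, u in enumerate(words) if i != j]

def combine(records):
    # Local combiner: sum the counts that share a key within one mapper's output.
    combined = Counter()
    for key, count in records:
        combined[key] += count
    return list(combined.items())

emitted = pairs_map("the quick brown fox jumps over the lazy dog")
after = combine(emitted)
print(len(emitted), len(after))  # 72 57
```

Only the pairs involving the repeated word "the" are merged; every other record passes through the combiner unchanged.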

Stripes

Co-occurrence information is first stored in an associative array, denoted H.

The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word.

Each mapper takes a sentence and, for each term, emits one associative array holding the counts of its co-occurring terms

Reducers perform element-wise sum of associative arrays

//"stripes" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w ∈ doc d do
            H = new ASSOCIATIVEARRAY
            for all term u ∈ NEIGHBORS(w) do
                H{u} = H{u} + 1   //Tally words co-occurring with w
            EMIT(term w, Stripe H)

class REDUCER
    method REDUCE(term w, Stripes [H1, H2, H3, ...])
        Hf = new ASSOCIATIVEARRAY
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            sum(Hf, H)   //Element-wise sum
        EMIT(term w, Stripe Hf)


For each term, emit stripes: a -> {b: 1, c: 2, d: 2, ...}. The key is the term "a".
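The stripes pseudocode can be sketched in Python in the same in-memory style. This is an illustration rather than a Hadoop implementation: `stripes_map`, `stripes_reduce`, and `run_stripes` are hypothetical names, `collections.Counter` stands in for the associative array H, and the neighbors of a word are assumed to be all other words in its sentence.

```python
from collections import Counter, defaultdict

def stripes_map(sentence):
    # Emit (w, H), where H tallies the words co-occurring with w.
    words = sentence.split()
    for i, w in enumerate(words):
        stripe = Counter(u for j, u in enumerate(words) if j != i)
        yield w, stripe

def stripes_reduce(word, stripes):
    # Element-wise sum of all stripes that arrived for one word.
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts key by key
    return word, total

def run_stripes(sentences):
    # Simulated shuffle: group the emitted stripes by word.
    grouped = defaultdict(list)
    for sentence in sentences:
        for word, stripe in stripes_map(sentence):
            grouped[word].append(stripe)
    return dict(stripes_reduce(w, s) for w, s in grouped.items())

result = run_stripes(["a b c", "a b"])
print(dict(result["a"]))  # {'b': 2, 'c': 1}
```

Each output record is one row of the co-occurrence matrix, so a single key carries all counts for that word.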

"Stripes Analysis" (long records, but few of them)

Advantages

Far less sorting and shuffling of key-value pairs

Can make better use of combiners
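A combiner for stripes can reuse the reducer's element-wise sum, since the keys are single words and repeat often within one mapper's output. The sketch below assumes a hypothetical helper `combine_stripes` and made-up mapper output; it is meant only to show the merge step.

```python
from collections import Counter

def combine_stripes(records):
    # Local combiner: element-wise sum of the stripes that share a key.
    merged = {}
    for word, stripe in records:
        merged.setdefault(word, Counter()).update(stripe)
    return merged

# Hypothetical output of one mapper before combining.
mapper_output = [
    ("a", Counter({"b": 1, "c": 1})),
    ("b", Counter({"a": 1, "c": 1})),
    ("a", Counter({"b": 1, "d": 1})),
    ("a", Counter({"c": 2})),
]

combined = combine_stripes(mapper_output)
print(len(mapper_output), len(combined))  # 4 2
print(dict(combined["a"]))                # {'b': 2, 'c': 3, 'd': 1}
```

Three records for "a" collapse into one before the shuffle, which is the saving the combiner provides.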

Disadvantages

More difficult to implement

Underlying object more heavyweight

Fundamental limitation in terms of size of event space: each associative array must be small enough to fit in memory

Pairs vs. Stripes

When the volume of data is small and few processing resources are available, use pairs; otherwise, stripes is the better choice.
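The trade-off can be made concrete by counting the intermediate records each approach emits for the same input. The sketch below is a rough illustration with hypothetical helper names and a toy input, assuming the neighbors of a word are all other words in its sentence; real behavior depends on the corpus and the cluster.

```python
from collections import Counter

def pairs_records(sentences):
    # One ((w, u), 1) record per co-occurrence.
    out = []
    for sentence in sentences:
        words = sentence.split()
        out.extend(((w, u), 1)
                   for i, w in enumerate(words)
                   for j, u in enumerate(words) if i != j)
    return out

def stripes_records(sentences):
    # One (w, stripe) record per word occurrence.
    out = []
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            out.append((w, Counter(u for j, u in enumerate(words) if j != i)))
    return out

sentences = ["a b c d", "a b c", "a b"]
pairs = pairs_records(sentences)
stripes = stripes_records(sentences)
print(len(pairs), len(stripes))  # 20 9

# Both approaches encode the same co-occurrence counts.
from_pairs = Counter()
for key, count in pairs:
    from_pairs[key] += count
from_stripes = Counter()
for w, stripe in stripes:
    for u, count in stripe.items():
        from_stripes[(w, u)] += count
print(from_pairs == from_stripes)  # True
```

Stripes sends fewer but larger records through the shuffle, while pairs sends many small ones.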