combineByKey in Spark



Date: 2015-01-23 21:35:00 逸思
Original: http://zhangyi.farbox.com/post/combinebykey-in-spark
Topic: Software Development

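In data analysis, working with key-value pair data is extremely common: we may group or aggregate such data, or join two pair RDDs by key. Viewed as function abstractions, these operations share a common trait: each turns data of type RDD[(K,V)] into RDD[(K,C)], where V and C may be the same type or different types. Such an operation is not simply a map over the value of each pair; it combines the original values that belong to each key. As a result, not only may the types differ, the number of elements may differ as well.
Spark provides a highly abstract operation for this, combineByKey. The method is defined as follows: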

/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 * - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  // implementation omitted
}

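What distinguishes the functional style from the imperative style is that it states what the code does (what to do) rather than how it does it (how to do). combineByKey mainly takes three functions as parameters: createCombiner, mergeValue, and mergeCombiners. These three functions are enough to describe what it does; understand them and you understand combineByKey.
combineByKey combines an RDD[(K,V)] into an RDD[(K,C)], so it first needs a function that turns a V into a C; the resulting C is called the combiner. This function is createCombiner. If V and C are the same type, the function is V => V. If C is a collection such as Iterable[V], createCombiner is V => Iterable[V].
mergeValue merges a value from a pair in the original RDD into data of the resulting type C. How the merge is implemented determines how the result is computed, so mergeValue reads more like a declaration of a merge strategy, driven by the result the whole combine operation is meant to produce. Its inputs are the C accumulated so far and a V from a pair in the original RDD; its output is the updated C.
Finally, mergeCombiners merges the multiple C values that belong to the same key, for example those built in different partitions.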
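
The way the three functions cooperate can be sketched without Spark at all. The following is only a local model under stated assumptions: `combineByKeyLocal` is a hypothetical helper (not part of Spark), each inner `Seq` stands in for one partition, and partitioning, shuffling, and serialization are ignored entirely.

```scala
object CombineByKeyModel {
  // Hypothetical local model of combineByKey; each inner Seq plays the role of a partition.
  def combineByKeyLocal[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc.updated(k, createCombiner(v)) // first value seen for this key
          case Some(c) => acc.updated(k, mergeValue(c, v))  // fold the value into the combiner
        }
      }
    }
    // Step 2: merge the per-partition combiners key by key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3),
      Seq("a" -> 4, "b" -> 5)
    )
    // V and C are the same type: sum the values of each key.
    println(combineByKeyLocal[String, Int, Int](partitions, v => v, _ + _, _ + _))
    // V and C differ: collect the values of each key into a List.
    println(combineByKeyLocal[String, Int, List[Int]](
      partitions, v => List(v), (c, v) => c :+ v, _ ++ _))
  }
}
```

In this model createCombiner runs once per key per partition, mergeValue handles every further value of that key within the same partition, and mergeCombiners only ever sees values of type C.
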
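Let us picture combineByKey as a super cool juicer. It accepts all kinds of fruit at once and cleverly squeezes a separate juice for each kind: apples become apple juice, oranges become orange juice, watermelons become watermelon juice. If we define the fruit type as Fruit and the juice type as Juice, then combineByKey combines an RDD[(String, Fruit)] into an RDD[(String, Juice)].
Note that before juicing there may be many fruits, and even fruits of the same kind are separate elements of the RDD: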

("apple", apple1), ("orange", orange1), ("apple", apple2)

After the combine there is just one glass of juice per kind of fruit (only the volume differs):

("apple", appleJuice), ("orange", orangeJuice)

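What parts make up this juicer? First, it needs a part that squeezes each kind of fruit into juice. Second, it needs a way to mix juice. Finally, to avoid mixing the wrong things, it needs a way to mix according to the kind of fruit. Note the difference between the second and third functions: the former only provides mixing, that is, pouring juice from different containers into one container, whereas the latter's input comes with a precondition, namely that everything has already been sorted into separate areas by kind of fruit, so the juicer never confuses juice from different areas when it mixes.
The juicer works much like groupByKey followed by foldByKey, and it can be built by calling combineByKey: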

case class Fruit(kind: String, weight: Int) {
  // Squeeze a single fruit into juice
  def makeJuice: Juice = Juice(weight * 100)
}
case class Juice(volume: Int) {
  // Mix two juices into one
  def add(j: Juice): Juice = Juice(volume + j.volume)
}
val apple1 = Fruit("apple", 5)
val apple2 = Fruit("apple", 8)
val orange1 = Fruit("orange", 10)
val fruit = sc.parallelize(List(("apple", apple1), ("orange", orange1), ("apple", apple2)))
val juice = fruit.combineByKey(
  f => f.makeJuice,                     // createCombiner: Fruit => Juice
  (j: Juice, f) => j.add(f.makeJuice),  // mergeValue: (Juice, Fruit) => Juice
  (j1: Juice, j2: Juice) => j1.add(j2)  // mergeCombiners: (Juice, Juice) => Juice
)

Running juice.collect gives:

Array[(String, Juice)] = Array((orange,Juice(1000)), (apple,Juice(1300)))

Many of the pair RDD operations are implemented internally by calling combineByKey. Take groupByKey, for example:

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging
  with SparkHadoopMapReduceUtil
  with Serializable {
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKey[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine=false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
}

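groupByKey groups the values of the RDD[(K, V)] wrapped by PairRDDFunctions by key. Internally it calls combineByKey, and the three functions passed in take on the following responsibilities:

createCombiner turns a value of type V from the original RDD into an Iterable[V], implemented as a CompactBuffer.
mergeValue simply appends an element of the original RDD to the CompactBuffer, treating the append operation (+=) as the merge operation.
mergeCombiners merges the Iterable[V] instances that belong to each key.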

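Depending on the functions passed in, combineByKey can do many different jobs, such as aggregate, fold, and average. It is a high-level abstraction, yet from a declarative point of view it does not require knowing much about the implementation. That is the charm of functional programming.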
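
As an illustration of the average case, here is a minimal sketch of three functions that could be handed to combineByKey. The names `AverageCombiners`, `SumCount`, and `average` are made up for this example, and the two sequences in `main` merely simulate two partitions holding values of one key, so the block runs without Spark.

```scala
object AverageCombiners {
  // Combined type C: a running (sum, count) pair. The alias name is an assumption.
  type SumCount = (Double, Int)

  // V => C: the first value of a key starts a (sum, count) pair.
  val createCombiner: Double => SumCount = v => (v, 1)
  // (C, V) => C: fold one more value into the running pair.
  val mergeValue: (SumCount, Double) => SumCount = (acc, v) => (acc._1 + v, acc._2 + 1)
  // (C, C) => C: add up the pairs built separately for the same key.
  val mergeCombiners: (SumCount, SumCount) => SumCount =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  // Final step: turn the combined (sum, count) into an average.
  def average(c: SumCount): Double = c._1 / c._2

  def main(args: Array[String]): Unit = {
    // Simulate two partitions that both hold values for the same key.
    val p1 = Seq(90.0, 80.0)
    val p2 = Seq(70.0)
    val c1 = p1.tail.foldLeft(createCombiner(p1.head))(mergeValue)
    val c2 = p2.tail.foldLeft(createCombiner(p2.head))(mergeValue)
    println(average(mergeCombiners(c1, c2)))
    // With Spark, assuming a hypothetical scores: RDD[(String, Double)]:
    //   scores.combineByKey(createCombiner, mergeValue, mergeCombiners).mapValues(average)
  }
}
```

Carrying the count alongside the sum is what makes the merge safe: averaging per partition and then averaging those averages would give the wrong answer whenever the partitions hold different numbers of values.
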
