[Spark Basics] Aggregation operations: reduceByKey, combineByKey, groupByKey and aggregateByKey
2017-12-20 13:34
1. What is the difference between reduceByKey and combineByKey in Spark?

groupByKey behaves very differently from combineByKey / reduceByKey; see the articles in the references at the end for a deeper look.
The only difference between reduceByKey and combineByKey is the API; internally they work exactly the same way.
aggregateByKey also calls combineByKey internally.
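To make that relationship concrete, here is a plain-Python sketch (not actual Spark code; the function names are illustrative) of how aggregateByKey can be expressed in terms of combineByKey's three-function contract: the zero value plus the seqOp become the createCombiner.

```python
from copy import deepcopy

def combine_by_key(pairs, create_combiner, merge_value, merge_combiners):
    # Single-partition model of combineByKey: fold each value into a
    # per-key combiner. merge_combiners would merge across partitions.
    acc = {}
    for k, v in pairs:
        acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
    return acc

def aggregate_by_key(pairs, zero, seq_op, comb_op):
    # aggregateByKey(zero)(seqOp, combOp) is combineByKey with
    # createCombiner = v -> seqOp(copy(zero), v).
    # deepcopy guards against a shared mutable zero value.
    return combine_by_key(
        pairs,
        create_combiner=lambda v: seq_op(deepcopy(zero), v),
        merge_value=seq_op,
        merge_combiners=comb_op,
    )

data = [("a", 1), ("b", 2), ("a", 3)]
print(aggregate_by_key(data, 0, lambda acc, v: acc + v, lambda a, b: a + b))
# {'a': 4, 'b': 2}
```

This mirrors how Spark itself implements aggregateByKey on top of combineByKey, though the real implementation also handles partitioning and serialization.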
| reduceByKey | combineByKey |
| --- | --- |
| Internally calls combineByKey. | The generic API; both reduceByKey and aggregateByKey are built on it. |
| The input value type and the output type must be the same. | More flexible: the desired output type can be specified, so it does not have to match the input type. |
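The type-flexibility point can be shown with a plain-Python sketch of the combineByKey contract (not real Spark code; the two-partition layout and function names are illustrative): integer values become `(sum, count)` combiners and finally a float average, a shape reduceByKey cannot express directly because its function must return the value type.

```python
def create_combiner(v):       # first value for a key: Int -> (sum, count)
    return (v, 1)

def merge_value(c, v):        # fold one more value into a combiner
    return (c[0] + v, c[1] + 1)

def merge_combiners(c1, c2):  # merge combiners built on different partitions
    return (c1[0] + c2[0], c1[1] + c2[1])

def combine_by_key(partitions):
    # Map side: build per-partition combiners.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
        per_partition.append(acc)
    # Reduce side: merge combiners across partitions.
    merged = {}
    for acc in per_partition:
        for k, c in acc.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

parts = [[("a", 1), ("b", 4)], [("a", 3)]]   # two simulated partitions
sums = combine_by_key(parts)                 # {'a': (4, 2), 'b': (4, 1)}
averages = {k: s / n for k, (s, n) in sums.items()}
print(averages)                              # {'a': 2.0, 'b': 4.0}
```

With reduceByKey you would first have to map each value to `(v, 1)` yourself; combineByKey folds that conversion into createCombiner.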
2. What is the difference between groupByKey and reduceByKey in Spark?

The differences between groupByKey and reduceByKey / combineByKey / aggregateByKey are as follows:
| groupByKey | reduceByKey / combineByKey / aggregateByKey |
| --- | --- |
| All records are sent from the map tasks to the reduce tasks. | A combiner runs on both the map tasks and the reduce tasks. |
| Network I/O is not optimized. | Network I/O is optimized. |
| Should be used only when the reduce task genuinely needs all values for a key. | Should be preferred; avoid groupByKey. Use for functions such as sum, average, median, mode, or top N. |
| Can cause GC problems and job failures. | Less data is shuffled, so failures are less likely. |
| A Spark partition can hold at most 2 GB of data. | A Spark partition can hold at most 2 GB of data. |
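The shuffle difference in the table can be modeled in plain Python (not Spark; the map-task layout is illustrative): groupByKey ships every record across the network, while reduceByKey's map-side combiner collapses each map task's records to at most one record per key before the shuffle.

```python
from collections import defaultdict

map_tasks = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],   # records on map task 0
    [("b", 1), ("a", 1), ("b", 1)],             # records on map task 1
]

# groupByKey: every record crosses the network to the reduce side.
group_by_key_shuffled = sum(len(task) for task in map_tasks)

# reduceByKey: combine within each map task first, then shuffle.
reduce_by_key_shuffled = 0
for task in map_tasks:
    combined = defaultdict(int)
    for k, v in task:
        combined[k] += v                     # map-side combine (here: sum)
    reduce_by_key_shuffled += len(combined)  # one record per key per task

print(group_by_key_shuffled, reduce_by_key_shuffled)  # 7 4
```

On real workloads with many duplicate keys per partition, this gap is what makes reduceByKey far cheaper than groupByKey followed by a local reduction.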
References:
- https://github.com/vaquarkhan/vk-wiki-notes/wiki/reduceByKey--vs-groupBykey-vs-aggregateByKey-vs-combineByKey
- http://www.cnblogs.com/LuisYao/p/6813228.html
- https://stackoverflow.com/questions/42632707/difference-between-reducebykey-and-combinebykey-in-spark
- http://bytepadding.com/big-data/spark/reducebykey-vs-combinebykey/
- http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
- http://bytepadding.com/big-data/spark/combine-by-key-to-find-max/