When reducing the number of partitions, coalesce is more efficient than repartition
2015-07-09 13:49
Test
repartition: 2.8 GB shuffled, took 10 min 39 s
df.rdd.repartition(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])
joe: start time: Tue Jul 07 12:43:06 CST 2015
joe: end time: Tue Jul 07 12:53:45 CST 2015
coalesce: no shuffle, took 6 min 22 s
df.rdd.coalesce(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])
joe: start time: Tue Jul 07 13:39:16 CST 2015
joe: end time: Tue Jul 07 13:45:38 CST 2015
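The post does not show how the start and end timestamps were logged. One possible way to produce lines of that shape is a small timing wrapper, sketched below. The `Timing` object and the `timed` helper are hypothetical names, and the workload in `main` is only a stand-in for the `saveAsTextFile` call.

```scala
import java.util.Date

// Hypothetical helper, not from the original post.
object Timing {
  // Runs `body`, prints wall-clock start and end timestamps,
  // and returns the result together with the elapsed milliseconds.
  def timed[A](label: String)(body: => A): (A, Long) = {
    val start = System.currentTimeMillis()
    println(s"$label: start time: ${new Date(start)}")
    val result = body
    val end = System.currentTimeMillis()
    println(s"$label: end time: ${new Date(end)}")
    (result, end - start)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in a real test this would be the Spark write.
    val (sum, elapsedMs) = timed("demo") { (1 to 1000000).map(_.toLong).sum }
    println(s"sum=$sum, elapsed=$elapsedMs ms")
  }
}
```

Wrapping each write in `timed("joe") { ... }` would give comparable start and end lines for the repartition and coalesce runs.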
Explanation
repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
This is equivalent to
coalesce(numPartitions, shuffle = true)
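
As a minimal sketch of that equivalence, the following compares the two calls on a small local RDD by checking the resulting partition count and whether the lineage contains a shuffle. The object name, the `local[2]` master, and the sample data are assumptions for illustration and are not part of the original test.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; names and data are illustrative only.
object RepartitionVsCoalesce {
  // Summarises an RDD as (partition count, lineage contains a shuffle).
  private def summary(r: RDD[Int]): (Int, Boolean) =
    (r.partitions.length, r.toDebugString.contains("ShuffledRDD"))

  // Applies repartition(n) and coalesce(n, shuffle = true) to the same input.
  def compare(sc: SparkContext, n: Int): ((Int, Boolean), (Int, Boolean)) = {
    val rdd = sc.parallelize(1 to 1000, 8) // 8 input partitions
    (summary(rdd.repartition(n)), summary(rdd.coalesce(n, shuffle = true)))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-vs-coalesce").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaRepartition, viaCoalesce) = compare(sc, 2)
      println(s"repartition(2):              $viaRepartition")
      println(s"coalesce(2, shuffle = true): $viaCoalesce")
    } finally {
      sc.stop()
    }
  }
}
```

Both calls should report the requested number of partitions and a shuffle stage in the lineage.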
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
which can avoid performing a shuffle.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
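
The trade-off described above can be inspected through the RDD lineage. The sketch below collapses an 8-partition RDD to a single partition with and without a shuffle and prints `toDebugString` for each. The object name, the upstream `map`, and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; names and data are illustrative only.
object DrasticCoalesce {
  // Returns (coalesce without shuffle, coalesce with shuffle) over the same upstream RDD.
  def lineages(sc: SparkContext): (RDD[Int], RDD[Int]) = {
    val upstream = sc.parallelize(1 to 1000, 8).map(_ * 2)
    // No shuffle: the upstream map is pulled into the single output task.
    val narrow = upstream.coalesce(1)
    // With shuffle: the upstream map keeps its 8 partitions, then the data moves to 1.
    val shuffled = upstream.coalesce(1, shuffle = true)
    (narrow, shuffled)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("drastic-coalesce").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (narrow, shuffled) = lineages(sc)
      println("coalesce(1):")
      println(narrow.toDebugString)
      println("coalesce(1, shuffle = true):")
      println(shuffled.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Only the second lineage should contain a `ShuffledRDD`. Which variant is faster depends on how expensive the upstream computation is relative to moving the data, so the timings in the test above should not be assumed to hold for every job.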