When reducing the number of partitions, coalesce is more efficient

2015-07-09 13:49 | 429 views

Test

repartition: shuffles 2.8 GB, takes 10 min 39 s

df.rdd.repartition(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])

joe: start time: Tue Jul 07 12:43:06 CST 2015

joe: end time: Tue Jul 07 12:53:45 CST 2015

coalesce: no shuffle, takes 6 min 22 s

df.rdd.coalesce(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])

joe: start time: Tue Jul 07 13:39:16 CST 2015

joe: end time: Tue Jul 07 13:45:38 CST 2015

Explanation

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
This is equivalent to:
coalesce(numPartitions, shuffle = true)
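
The equivalence can be checked on a small local job. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name, the sample data, and the partition counts are made up for illustration. Looking for "ShuffledRDD" in the lineage dump is only a heuristic for "this plan contains a shuffle".

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of the original post.
object RepartitionVsCoalesce {
  // For each variant, returns (resulting partition count, whether the lineage mentions a shuffle).
  def compare(sc: SparkContext): Map[String, (Int, Boolean)] = {
    val rdd = sc.parallelize(1 to 1000, 8) // made-up data in 8 partitions

    val variants = Map(
      "repartition(2)"              -> rdd.repartition(2),
      "coalesce(2, shuffle = true)" -> rdd.coalesce(2, shuffle = true),
      "coalesce(2)"                 -> rdd.coalesce(2)
    )

    variants.map { case (name, r) =>
      // toDebugString prints the RDD lineage; a ShuffledRDD entry indicates a shuffle step.
      name -> ((r.partitions.length, r.toDebugString.contains("ShuffledRDD")))
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-vs-coalesce").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      compare(sc).foreach { case (name, (n, shuffles)) =>
        println(s"$name -> partitions=$n, shuffle in lineage=$shuffles")
      }
    } finally {
      sc.stop()
    }
  }
}
```

All three variants end up with 2 partitions; only the plain coalesce(2) should show no shuffle in its lineage, which matches the timings measured above.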

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
which can avoid performing a shuffle.

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
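
The trade-off described above can be made visible by recording which task ran the upstream step. This is a rough sketch under the same assumptions as before (Spark on the classpath, local master); the object name, data size, and partition count are illustrative only. It tags every record with the partition ID of the task that executed the map, then counts the distinct IDs after coalescing to one partition.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Hypothetical demo object, not part of the original post.
object DrasticCoalesce {
  // Returns how many distinct tasks executed the upstream map step
  // when the result is coalesced down to a single partition.
  def upstreamTasks(sc: SparkContext, shuffle: Boolean): Int = {
    val tagged = sc.parallelize(1 to 80, 8) // 8 upstream partitions, 10 records each
      .map(_ => TaskContext.get().partitionId()) // record the ID of the task running this map

    tagged.coalesce(1, shuffle = shuffle).collect().distinct.length
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("drastic-coalesce").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Without a shuffle the map is pulled into the single coalesced task.
      println(s"coalesce(1): upstream tasks = ${upstreamTasks(sc, shuffle = false)}")
      // With a shuffle the map keeps its original 8 tasks, followed by a shuffle step.
      println(s"coalesce(1, shuffle = true): upstream tasks = ${upstreamTasks(sc, shuffle = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

If the upstream computation is cheap, as in the save-to-one-gzip-file test above, the single-task plan wins because it skips the shuffle; if the upstream step is expensive, keeping it parallel with shuffle = true may be worth the extra network cost.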
