
When reducing the number of partitions, coalesce is more efficient

2015-07-09 13:49

Test

repartition: 2.8 GB shuffled, took 10 min 39 sec

df.rdd.repartition(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])




joe: start time: Tue Jul 07 12:43:06 CST 2015

joe: end time: Tue Jul 07 12:53:45 CST 2015

coalesce: no shuffle, took 6 min 22 sec

df.rdd.coalesce(1).saveAsTextFile("/gx/gziptest", classOf[org.apache.hadoop.io.compress.GzipCodec])




joe: start time: Tue Jul 07 13:39:16 CST 2015

joe: end time: Tue Jul 07 13:45:38 CST 2015
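
The two elapsed times quoted above can be re-derived from the logged start and end timestamps. The following is a small sketch that does the arithmetic with `java.time`; the object and method names are hypothetical and only the time-of-day part of each log line is used (both runs start and end on the same day).

```scala
import java.time.{Duration, LocalTime}

object ElapsedTimes {
  // Seconds between two same-day wall-clock times given as HH:mm:ss.
  def elapsedSeconds(start: String, end: String): Long =
    Duration.between(LocalTime.parse(start), LocalTime.parse(end)).getSeconds

  def main(args: Array[String]): Unit = {
    val repartitionSecs = elapsedSeconds("12:43:06", "12:53:45")
    val coalesceSecs    = elapsedSeconds("13:39:16", "13:45:38")
    println(s"repartition: $repartitionSecs s") // 639 s = 10 min 39 sec
    println(s"coalesce:    $coalesceSecs s")    // 382 s = 6 min 22 sec
    // Ratio of the two run times for this single measurement.
    println(f"coalesce / repartition = ${coalesceSecs.toDouble / repartitionSecs}%.2f")
  }
}
```

In this single run, coalesce finished in roughly 60% of the time repartition needed. It is one measurement on one dataset, so the ratio should be read as indicative only.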

Explanation

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
This is equivalent to
coalesce(numPartitions, shuffle = true)
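
As a minimal sketch of this equivalence, the following calls both forms on the same RDD and compares the resulting partition counts. It assumes a local Spark installation; the object name, the sample data, and the `local[4]` master are illustrative choices, not part of the original test.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionEquivalence {
  // Returns the partition counts produced by the two equivalent calls.
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val rdd = sc.parallelize(1 to 100000, 8)           // hypothetical data in 8 partitions
    val viaRepartition = rdd.repartition(2)            // always shuffles
    val viaCoalesce = rdd.coalesce(2, shuffle = true)  // same effect, spelled out
    (viaRepartition.partitions.length, viaCoalesce.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("repartition-equivalence").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) // prints (2,2)
    finally sc.stop()
  }
}
```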

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
which can avoid performing a shuffle.
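
The "after filtering down a large dataset" case can be sketched as follows. This is an illustration under assumed data sizes, not the original job: a selective filter leaves many nearly empty partitions, and `coalesce` merges them without a shuffle. The object name and the numbers are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceAfterFilter {
  // Returns (partitions after filter, partitions after coalesce, surviving row count).
  def run(sc: SparkContext): (Int, Int, Long) = {
    val big = sc.parallelize(1 to 1000000, 100)   // hypothetical dataset, 100 partitions
    val filtered = big.filter(_ % 1000 == 0)      // keeps 1000 elements, still 100 partitions
    val compact = filtered.coalesce(4)            // merges partitions, no shuffle
    (filtered.partitions.length, compact.partitions.length, compact.count())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-after-filter").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try println(run(sc)) // prints (100,4,1000)
    finally sc.stop()
  }
}
```

Note that `coalesce` without a shuffle can only reduce the partition count; asking for more partitions than the RDD already has leaves the count unchanged.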

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
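
The trade-off described above can be inspected through the RDD lineage. The sketch below builds both variants of a drastic coalesce and prints `toDebugString` for each; the object name and the sample transformation are hypothetical. Without a shuffle, the upstream `map` is pulled into the single output task; with `shuffle = true`, the lineage should show a shuffle boundary, so the `map` still runs across the original 8 partitions before the data is collected into one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DrasticCoalesce {
  // Returns the lineage descriptions of the narrow and the shuffled variant.
  def lineages(sc: SparkContext): (String, String) = {
    // Hypothetical upstream work spread over 8 partitions.
    val upstream = sc.parallelize(1 to 100000, 8).map(x => x.toLong * x)
    val narrow = upstream.coalesce(1)                    // map runs inside the one output task
    val shuffled = upstream.coalesce(1, shuffle = true)  // map runs in 8 tasks, then a shuffle
    (narrow.toDebugString, shuffled.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("drastic-coalesce").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (narrow, shuffled) = lineages(sc)
      println("coalesce(1):\n" + narrow)
      println("coalesce(1, shuffle = true):\n" + shuffled)
    } finally sc.stop()
  }
}
```

Which variant is faster depends on the job: in the test above the upstream work was cheap relative to moving 2.8 GB over the network, so skipping the shuffle was the better choice.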
