
Controlling the Number of Reduce Tasks in Spark

2015-10-17 20:56
Every shuffle-producing key/value RDD operation takes an optional argument that sets the number of reduce tasks (the shuffle parallelism).
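For example, reduceByKey (like the groupByKey and distinct calls below) accepts the partition count as a trailing argument. A minimal sketch, assuming a spark-shell session where sc is already defined:

val pairs = sc.parallelize(List(("spark", 1), ("hadoop", 1), ("hadoop", 1)))
pairs.reduceByKey(_ + _).collect     // runs with the default number of reduce tasks
pairs.reduceByKey(_ + _, 3).collect  // runs with exactly 3 reduce tasks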

1. Verify the number of tasks through the Spark web UI at http://192.168.80.20:4040/jobs/

val words = sc.parallelize(List(("spark",1),("hadoop",1),("hadoop",1),("hadoop",1)))

words: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21


val wordsGroup = words.groupByKey()

wordsGroup: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[1] at groupByKey at <console>:23

wordsGroup.collect

res0: Array[(String, Iterable[Int])] = Array((spark,CompactBuffer(1)), (hadoop,CompactBuffer(1, 1, 1)))


words.groupByKey(7).collect
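The web UI will now show 7 tasks for the shuffle stage. You can also read the partition count off the RDD itself, without opening the UI, via the standard partitions field (the default count depends on spark.default.parallelism):

words.groupByKey().partitions.length   // default parallelism of the shell/cluster
words.groupByKey(7).partitions.length  // 7 -- one partition per reduce task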

val nums = sc.parallelize(List(1,2,3,2,4,5))

nums.distinct().collect // distinct removes duplicate elements

res3: Array[Int] = Array(4, 1, 3, 5, 2)


nums.distinct(6).collect
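distinct accepts a partition count because it is itself built on a shuffle. A sketch of the roughly equivalent pipeline that distinct(6) expands to internally:

// key each element, reduce duplicates into 6 partitions, then drop the dummy value
nums.map(x => (x, null)).reduceByKey((a, _) => a, 6).map(_._1).collect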

nums.coalesce(4).collect

nums.coalesce(4, true).collect // shuffle = true forces a shuffle, equivalent to repartition(4)
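The distinction matters because coalesce without a shuffle can only merge partitions, never split them. A sketch contrasting the variants, assuming a hypothetical 8-partition input:

val nums8 = sc.parallelize(1 to 8, 8)
nums8.coalesce(4).partitions.length         // 4: partitions merged, no shuffle
nums8.coalesce(16).partitions.length        // still 8: cannot grow without a shuffle
nums8.coalesce(16, true).partitions.length  // 16: shuffled
nums8.repartition(16).partitions.length     // 16: repartition(n) = coalesce(n, shuffle = true)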