Controlling the Number of Reduce Tasks in Spark
2015-10-17 20:56
All key/value RDD operations take an optional argument that sets the number of reduce tasks (the parallelism of the shuffle).
1. The resulting parallelism can be checked in the Spark web UI at http://192.168.80.20:4040/jobs/
val words = sc.parallelize(List(("spark",1),("hadoop",1),("hadoop",1),("hadoop",1)))
words: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:21
val wordsGroup = words.groupByKey()
wordsGroup: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[1] at groupByKey at <console>:23
wordsGroup.collect
res0: Array[(String, Iterable[Int])] = Array((spark,CompactBuffer(1)), (hadoop,CompactBuffer(1, 1, 1)))
words.groupByKey(7).collect   // the same grouping, but with 7 reduce tasks
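Besides the web UI, the resulting parallelism can also be read straight from the shell. A minimal check, using the standard partitions.length property (the default count for the plain groupByKey() above depends on spark.default.parallelism or the parent RDD, so it will vary by setup):
words.groupByKey(7).partitions.length   // 7 -- one partition per reduce task
wordsGroup.partitions.length            // default, taken from spark.default.parallelism or the parent RDD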
val nums = sc.parallelize(List(1,2,3,2,4,5))
nums.distinct().collect   // distinct() removes duplicate elements
res3: Array[Int] = Array(4, 1, 3, 5, 2)
nums.distinct(6).collect        // distinct with an explicit parallelism of 6
nums.coalesce(4).collect        // no shuffle; without a shuffle, coalesce can only lower the partition count
nums.coalesce(4, true).collect  // with a shuffle; equivalent to repartition(4)
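The same optional argument exists on the other key/value operations (reduceByKey, join, and so on). A minimal sketch, continuing the session above and assuming the words and nums RDDs are still defined:
val counts = words.reduceByKey(_ + _, 3)   // sum the values per key, using 3 reduce tasks
counts.partitions.length                   // 3
counts.collect                             // Array((hadoop,3), (spark,1)) -- order may vary
nums.repartition(4).partitions.length      // 4 -- the same effect as coalesce(4, true)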