Spark Programming: Basic RDD Operators: count, countApproxDistinct, countByValue, and More
2017-08-14 10:36
1 count
count returns the number of elements stored in an RDD.
def count(): Long
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4
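Conceptually, count tallies the elements in each partition and sums the per-partition totals. A minimal Python sketch of that logic on a local list of partitions (the partition contents mirror the Scala example above):

```python
# Local stand-in for an RDD split into two partitions.
partitions = [["Gnu", "Cat"], ["Rat", "Dog"]]

def count(parts):
    # Tally each partition, then sum the per-partition totals,
    # mirroring how Spark aggregates counts across executors.
    return sum(len(p) for p in parts)

print(count(partitions))  # 4
```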
2 countApproxDistinct
Computes an approximate count of the distinct elements in an RDD. For a very large RDD distributed across many nodes, this is typically much faster than computing the exact distinct count. The relativeSD parameter controls the accuracy of the estimate: smaller values give higher accuracy.
def countApproxDistinct(relativeSD: Double = 0.05): Long
val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224
b.countApproxDistinct(0.05)
res15: Long = 9750
b.countApproxDistinct(0.01)
res16: Long = 9947
b.countApproxDistinct(0.001)
res0: Long = 10000
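To see what the approximation targets, here is the exact computation on an equivalent local collection. Spark approximates this with a HyperLogLog sketch rather than materializing a set, trading a bounded relative error (relativeSD) for constant memory; the data below mirrors b = a++a++a++a++a:

```python
# Five copies of 1..10000, so 50000 elements but only 10000 distinct values.
data = list(range(1, 10001)) * 5

# The exact answer that countApproxDistinct estimates.
exact = len(set(data))
print(exact)  # 10000

# With relativeSD = 0.05, Spark's estimate should typically land
# within roughly 5% of this exact count, as the REPL output shows.
low, high = exact * 0.95, exact * 1.05
```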
3 countApproxDistinctByKey [Pair]
This operates on key-value pair RDDs. It is similar to countApproxDistinct, but it computes an approximate count of the distinct values for each key, so the RDD's elements must be tuples. As before, the relativeSD parameter controls the accuracy of the estimate: smaller values give higher accuracy.
def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]
val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))
d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))
d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
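The exact computation being approximated is "distinct values per key". A minimal Python sketch on a hypothetical local list of pairs (the pairs here are made up for illustration, not taken from the Spark example):

```python
from collections import defaultdict

# Hypothetical (key, value) pairs; note ("Cat", 2) appears twice.
pairs = [("Gnu", 1), ("Cat", 2), ("Gnu", 3), ("Cat", 2)]

# Collect the set of distinct values seen under each key...
distinct_per_key = defaultdict(set)
for k, v in pairs:
    distinct_per_key[k].add(v)

# ...then count the set sizes, which is what
# countApproxDistinctByKey estimates per key.
result = {k: len(vs) for k, vs in distinct_per_key.items()}
print(result)  # {'Gnu': 2, 'Cat': 1}
```

The duplicate ("Cat", 2) pair contributes only once, which is the key difference from countByKey below.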
4 countByKey
Operates on key-value pair elements, counting how many times each key occurs (i.e., how many values are associated with each key).
def countByKey(): Map[K, Long]
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
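On a local collection, the same result comes from tallying just the keys and ignoring the values; a sketch using the data from the example above:

```python
from collections import Counter

pairs = [(3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")]

# countByKey tallies occurrences of each key, discarding the values.
counts = Counter(k for k, _ in pairs)
print(dict(counts))  # {3: 3, 5: 1}
```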
5 countByValue
Counts how many times each element occurs in an RDD, returning a Map from each distinct element to its number of occurrences.
def countByValue(): Map[T, Long]
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
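Locally, this is a frequency count of whole elements; a sketch using the same data as the example above:

```python
from collections import Counter

data = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1]

# countByValue maps each distinct element to its occurrence count,
# matching the Map(... 1 -> 6, 2 -> 3, 4 -> 2 ...) result above.
counts = Counter(data)
print(dict(counts))
```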