
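Deduplicating RDD data (time keeps changing; for consecutive records with the same section and unchanged passenger flow, keep only the first record)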

2018-03-13 09:28
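First group the RDD by section, that is, by the (station_fore, station_back) pair, then clean each group to produce a new RDD: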
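import scala.collection.mutable.ArrayBuffer

// Group by section, clean each group, and flatten the resulting arrays so every kept record becomes one RDD element
rdd.groupBy(s => (s.station_fore, s.station_back)).flatMap(s => cleandata(s))

// For one section, keep only the first record of each run of consecutive equal flow values
def cleandata(data: ((String, String), Iterable[sample])): Array[sample] = {
  val nt = ArrayBuffer[sample]()
  // Sort by time; deal_time is a String, so this assumes a fixed-width format that sorts lexicographically
  val t2 = data._2.toArray.sortBy(_.deal_time)
  var temp = t2(0) // groupBy never produces an empty group, so t2(0) exists
  nt += temp
  for (i <- 1 until t2.length) { // start at 1: element 0 has already been kept
    if (temp.flow != t2(i).flow) { // flow changed, so this record starts a new run
      temp = t2(i)
      nt += temp
    }
  }
  nt.toArray
}

// One record: upstream station, downstream station, passenger flow, timestamp
case class sample(station_fore: String, station_back: String, flow: Long, deal_time: String) {
  override def toString = station_fore + "," + station_back + "," + flow + "," + deal_time
}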
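
To see which records survive, the same rule can be tried on a plain Scala collection without Spark. This is only a minimal sketch: `Record`, `DedupSketch`, `dedupRuns` and the sample data are hypothetical names and values made up for illustration, and it uses `sliding(2)` instead of the mutable loop above.

```scala
// Hypothetical record type mirroring the post's `sample` case class.
case class Record(stationFore: String, stationBack: String, flow: Long, dealTime: String)

object DedupSketch {
  // Sort one section's records by time, then keep the first record of every
  // run of consecutive equal `flow` values.
  def dedupRuns(records: Seq[Record]): Seq[Record] = {
    val sorted = records.sortBy(_.dealTime)
    // Compare each record with its predecessor; keep it only when flow changed.
    val changed = sorted.sliding(2).collect {
      case Seq(prev, cur) if prev.flow != cur.flow => cur
    }.toSeq
    // The earliest record is always kept (if there is one).
    sorted.headOption.toSeq ++ changed
  }

  def main(args: Array[String]): Unit = {
    // Made-up data for a single section A -> B, deliberately out of time order.
    val input = Seq(
      Record("A", "B", 12L, "2018-03-13 08:10:00"),
      Record("A", "B", 10L, "2018-03-13 08:00:00"),
      Record("A", "B", 10L, "2018-03-13 08:05:00"),
      Record("A", "B", 12L, "2018-03-13 08:15:00"),
      Record("A", "B", 10L, "2018-03-13 08:20:00")
    )
    dedupRuns(input).foreach(println)
  }
}
```

With this input the records at 08:00, 08:10 and 08:20 are kept. The flow value 10 appears twice in the result because only consecutive repeats are dropped, not every repeat of a value.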
