
Finding the most frequent numbers, and their counts, in a billion records

2013-10-13 19:16
package org.example.bigdata;

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class TopTimes {
    public static void main(String[] args) {
        final int[] input = {
            2389, 8922, 3382, 6982, 5231, 8934, 8923, 7593,
            4322, 7922, 6892, 5224, 4829, 3829, 8934, 8922,
            6892, 6872, 4682, 6723, 8923, 3492, 9527, 8923,
            7593, 7698, 7593, 7593, 7593, 8922, 9527, 4322,
            8934, 4322, 3382, 5231, 5231, 4682, 9527, 9527};

        // Every number takes 2 bits, so one int packs 16 counters and
        // 1000 ints cover the values 0..15999. Java zero-initializes the array.
        final int mask = 1 | 1 << 1; // binary 11, also the largest 2-bit count
        int[] sort = new int[1000];
        // Exact counts for the numbers that have occurred at least 3 times.
        Map<Integer, Integer> numCountMap = new HashMap<Integer, Integer>();
        for (int number : input) {
            int shift = 2 * (number % 16);
            // Read this number's 2-bit counter.
            int existTimes = (sort[number >>> 4] >>> shift) & mask;
            if (existTimes < mask) {
                existTimes++;
                // Clear the two bits, then write the incremented value back.
                sort[number >>> 4] &= ~(mask << shift);
                sort[number >>> 4] |= existTimes << shift;
                // The 2-bit counter just saturated: start tracking the number in the map.
                if (existTimes == mask) {
                    numCountMap.put(number, existTimes);
                }
            } else {
                // Fourth or later occurrence: keep counting in the map.
                numCountMap.put(number, numCountMap.get(number) + 1);
            }
        }

        // Sort by count, descending, and print "count----number".
        List<CounterNumber> counterList = new LinkedList<CounterNumber>();
        for (Map.Entry<Integer, Integer> entry : numCountMap.entrySet()) {
            counterList.add(new CounterNumber(entry.getValue(), entry.getKey()));
        }
        Collections.sort(counterList);
        for (CounterNumber counterNumber : counterList) {
            System.out.println(counterNumber.getCounter() + "----" + counterNumber.getNumber());
        }
    }
}

class CounterNumber implements Comparable<CounterNumber> {
    private Integer counter;
    private Integer number;

    public CounterNumber(Integer counter, Integer number) {
        this.counter = counter;
        this.number = number;
    }

    public Integer getCounter() {
        return this.counter;
    }

    public Integer getNumber() {
        return this.number;
    }

    // Reversed comparison, so that sorting puts the largest counter first.
    @Override
    public int compareTo(CounterNumber counterNumber) {
        return counterNumber.getCounter().compareTo(this.getCounter());
    }
}


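The overall idea is much the same as in the previous post, 《从十亿数据找出最大的一百个数》 (finding the hundred largest numbers in a billion records). There a single bit represented each number; here that is widened to N bits per number, which store how many times the number has occurred. N bits can count up to 2^N-1, and once a number's count reaches that limit the number is put into a Map and counted there from then on. The code above uses N = 2, so only numbers that occur at least 3 times ever reach the Map.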
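The N-bit generalization described above can be sketched as follows. This is only an illustration, not the author's code: the class name `PackedCounter`, its methods, and the restriction that N divides 32 (so no counter straddles two ints) are assumptions made for the sketch, and values are assumed to be non-negative and below `capacity`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: N-bit saturating counters packed into ints, plus a Map for large counts. */
class PackedCounter {
    private final int bits;     // N, bits per counter (assumed to divide 32)
    private final int perWord;  // counters stored in one int
    private final int max;      // largest value a counter can hold, 2^N - 1
    private final int[] words;
    private final Map<Integer, Long> overflow = new HashMap<>();

    PackedCounter(int bits, int capacity) {
        if (bits < 1 || bits > 16 || 32 % bits != 0) {
            throw new IllegalArgumentException("bits must be 1, 2, 4, 8 or 16");
        }
        this.bits = bits;
        this.perWord = 32 / bits;
        this.max = (1 << bits) - 1;
        this.words = new int[(capacity + perWord - 1) / perWord];
    }

    /** Records one occurrence of value (0 <= value < capacity). */
    void add(int value) {
        int word = value / perWord;
        int shift = bits * (value % perWord);
        int count = (words[word] >>> shift) & max;
        if (count < max) {
            count++;
            // Clear the old N bits and write the incremented counter back.
            words[word] = (words[word] & ~(max << shift)) | (count << shift);
            if (count == max) {
                // Counter is full: hand the number over to the Map.
                overflow.put(value, (long) max);
            }
        } else {
            overflow.merge(value, 1L, Long::sum);
        }
    }

    /** Exact number of times value has been added. */
    long count(int value) {
        Long big = overflow.get(value);
        if (big != null) {
            return big;
        }
        int shift = bits * (value % perWord);
        return (words[value / perWord] >>> shift) & max;
    }
}
```

A larger N keeps more numbers out of the Map at the cost of a larger array, so the choice depends on how heavy the tail of the data is.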

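Because of the long-tail effect, the top-ranked numbers occur a very large number of times, and the counts fall off sharply after that. In the production environment of one service, where the top 10,000 records were taken from 100 million rows, the first-ranked record occurred 4 million times, while the record ranked 10,000th occurred fewer than 5,000 times.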

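The algorithm counts numbers, but it is just as useful for text records: compute the HashCode of each text record and run the same algorithm on it. The Map/CounterNumber then need to be adapted accordingly, by adding a field for the original text.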
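A minimal sketch of that adaptation for text records is shown below. The names `TextTopTimes`, `CounterText` and the `slots` parameter are hypothetical; folding `hashCode()` into a fixed number of slots with `Math.floorMod` is an assumption made here to keep the array small. Two different texts that land in the same slot are counted together under the first text seen, so the results are only exact when there are no collisions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TextTopTimes {
    /** Counterpart of CounterNumber that also carries the original text. */
    static final class CounterText implements Comparable<CounterText> {
        final String text; // the text that first saturated this slot
        int counter;

        CounterText(String text, int counter) {
            this.text = text;
            this.counter = counter;
        }

        @Override
        public int compareTo(CounterText other) {
            return Integer.compare(other.counter, this.counter); // descending
        }
    }

    /** Returns the texts seen at least 3 times, most frequent first. */
    static List<CounterText> count(Iterable<String> records, int slots) {
        int[] sort = new int[(slots + 15) / 16]; // 2 bits per slot, 16 slots per int
        Map<Integer, CounterText> heavy = new HashMap<>();
        for (String record : records) {
            // Assumption: fold the hash code into [0, slots); collisions are possible.
            int slot = Math.floorMod(record.hashCode(), slots);
            int shift = 2 * (slot % 16);
            int seen = (sort[slot >>> 4] >>> shift) & 3;
            if (seen < 3) {
                seen++;
                sort[slot >>> 4] = (sort[slot >>> 4] & ~(3 << shift)) | (seen << shift);
                if (seen == 3) {
                    heavy.put(slot, new CounterText(record, 3));
                }
            } else {
                heavy.get(slot).counter++;
            }
        }
        List<CounterText> result = new ArrayList<>(heavy.values());
        Collections.sort(result);
        return result;
    }
}
```

If collisions matter for the data at hand, one option is to compare the stored text against the incoming record before incrementing, at the cost of extra bookkeeping.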