Hadoop learning - streaming - Top K records
2014-04-07 23:44
Finding the K largest records in a massive dataset
Taken from the streaming exercise in Section 4.7 of Hadoop in Action (Chuck Lam).
Data source: apat63_99.txt, a patent description dataset containing the patent number, application year, and other attributes. It is available from the National Bureau of Economic Research at http://www.nber.org/patents
It holds roughly 2.9 million records.
The script here is written in Python.
apat63_99.txt stores many attributes per patent; this example uses column 9 (CLAIMS, the number of claims the patent makes) as the sort key and outputs the K complete records with the largest values.
apat63_99.txt format:
"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD" 3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,, 3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,, 3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,, 3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,, 3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,, 3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,, 3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,, 3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,, 3070809,1963,1096,,"US","AZ",,1,,4,6,65,,0,,,,,,,,,
Top_K.py:
#!/usr/bin/env python
# encoding: utf-8
import sys

# number of records to keep, taken from the command line
Top_K = int(sys.argv[1])
# column 9 of apat63_99.txt (CLAIMS, the number of claims) is the sort key
index = 8
# top holds the K largest keys in ascending order; topresult holds the matching records
top = [0] * Top_K
topresult = [""] * Top_K

# insert a record if its key is large enough
def add(temp, record):
    if temp > top[0]:
        # shift the smaller kept entries down to make room, then place the new one
        pos = -1
        for i in range(Top_K - 1):
            if temp > top[i + 1]:
                top[i] = top[i + 1]
                topresult[i] = topresult[i + 1]
                pos = i
            else:
                break
        top[pos + 1] = temp
        topresult[pos + 1] = record

for line in sys.stdin:
    fields = line.strip().split(",")
    # skip records whose CLAIMS field is empty or non-numeric
    if len(fields) > index and fields[index].isdigit():
        add(int(fields[index]), line.strip())

# print the Top-K records, largest first
for i in range(Top_K - 1, -1, -1):
    print(topresult[i])
For each input record, the add function decides whether to insert it into the topresult list.
The first argument of add, temp, is the sort key; here it is the value of column 9 of the source data, i.e. index=8.
The second argument is the complete patent record that the key came from.
The same script is invoked in both the map phase and the reduce phase: each mapper emits its local Top-K records, and the reducer then selects the global Top-K from those candidates, which completes the search.
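The shift-and-insert logic in add is the classic fixed-size Top-K selection; the same result can be obtained with Python's heapq module. This is an alternative sketch, not the article's script:

```python
import heapq

def top_k(lines, k, index=8):
    """Return the k records with the largest integer in column `index`, largest first."""
    heap = []  # min-heap of (key, record); heap[0] holds the smallest key kept so far
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) > index and fields[index].isdigit():
            item = (int(fields[index]), line.strip())
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item[0] > heap[0][0]:
                heapq.heapreplace(heap, item)  # evict the current minimum
    return [record for _, record in sorted(heap, reverse=True)]
```

heapq.heapreplace pops the minimum and pushes the new item in one O(log k) step, so one pass over n records costs O(n log k), the same as the article's insertion loop.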
Command to run:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input Top_K/apat63_99.txt -output output -mapper 'Top_K.py 8' -reducer 'Top_K.py 8' -file Top_K.py
The argument 8 means we want the 8 largest records.
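Why the same script works as both mapper and reducer: any record in the global Top-K must also be in the local Top-K of the input split it came from, so the reducer only needs to re-select over the mappers' candidates. A toy check of that composition, using heapq.nlargest over hypothetical (key, record) pairs rather than the real CSV lines:

```python
import heapq

def top_k(pairs, k):
    """The k pairs with the largest keys, standing in for one run of the streaming script."""
    return heapq.nlargest(k, pairs)

# Two input splits, i.e. two mapper tasks (toy keys, not real patent data)
split_a = [(5, "rec5"), (42, "rec42"), (7, "rec7"), (19, "rec19")]
split_b = [(88, "rec88"), (3, "rec3"), (27, "rec27"), (11, "rec11")]

k = 3
map_out = top_k(split_a, k) + top_k(split_b, k)  # map phase: each task keeps its local top k
result = top_k(map_out, k)                       # reduce phase: global top k from the candidates

print([key for key, _ in result])  # [88, 42, 27], same as top_k(split_a + split_b, k)
```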
Output:
4533693,1985,9349,1983,"US","CA",538840,2,346,524,1,15,13,80,1,0.6444,0.6272,7.8875,6.7692,0.0909,0.0769,0.0127,0.0125
5812453,1998,14144,1994,"JP","",581270,3,348,365,2,24,22,0,1,,0.1694,,13.1818,0.1818,0.1818,,
4068298,1978,6584,1975,"US","OH",557440,2,393,707,2,22,12,24,1,0.7292,0.8194,12.7083,5.6667,0,0,0,0
4624109,1986,9825,1983,"US","CA",,1,394,60,4,45,3,2,1,0.5,0.4444,9,8.6667,,,,
4081478,1978,6661,1976,"US","MI",600480,2,472,564,1,14,7,1,0.8571,0,0.8333,4,9.5714,0.6,0.4286,1,1
4373527,1983,8446,1979,"US","MD",295920,2,642,604,3,32,6,49,1,0.414,0.4444,10.6939,6.8333,0,0,0.1087,0.102
4085139,1978,6682,1976,"US","MI",600480,2,706,564,1,14,7,1,0.8571,0,0.8333,14,9.5714,0.6,0.4286,0,0
5095054,1992,11757,1990,"DE","",618615,2,868,524,1,15,40,111,0.975,0.8442,0.8586,4.1622,14.45,0.0571,0.05,0.0093,0.009
References:
http://blog.csdn.net/jtlyuan/article/details/7560905
Hadoop in Action, Chuck Lam