
Big Data, MapReduce, Hadoop, and Spark with Python

2016-11-10 09:49 · 639 views
This book is quite good and short, and it aims to tie Python together with big-data architecture.

I'll read through it once first; I plan to translate this document.

To start, here is a little simulation of MapReduce...



mapper.py

class Mapper:
    def map(self, data):
        # Count word occurrences in this input split and
        # emit them as (word, count) tuples.
        returnval = []
        counts = {}
        for line in data:
            words = line.split()
            for w in words:
                counts[w] = counts.get(w, 0) + 1
        for w, c in counts.items():
            returnval.append((w, c))
        print("Mapper result:")
        print(returnval)
        return returnval
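The mapper's word-counting loop does the same job as `collections.Counter` from the standard library; a minimal cross-check sketch (the `data` lines here are made-up sample input, not from the book):

```python
from collections import Counter

# Hypothetical input split; the real mapper reads lines from a file
data = ["the quick brown fox", "the lazy dog"]

counts = Counter()
for line in data:
    counts.update(line.split())  # count each whitespace-separated word

# Same (word, count) tuples the mapper emits, sorted for readability
pairs = sorted(counts.items())
print(pairs)  # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```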


reducer.py

class Reducer:
    def reduce(self, d):
        # Sum the grouped counts for each word and format
        # each result as "word<TAB>total".
        returnval = []
        for k, v in d.items():
            returnval.append("%s\t%s" % (k, sum(v)))
        print("Reducer result:")
        print(returnval)
        return returnval


main.py

from mapper import Mapper
from reducer import Reducer

class JobRunner:
    def run(self, mapper_cls, reducer_cls, data):
        # map
        mapper = mapper_cls()
        tuples = mapper.map(data)

        # combine: group the mapper's (word, count) pairs by word
        combined = {}
        for k, v in tuples:
            if k not in combined:
                combined[k] = []
            combined[k].append(v)
        print("combined result:")
        print(combined)

        # reduce
        reducer = reducer_cls()
        output = reducer.reduce(combined)

        # do something with output
        for line in output:
            print(line)

runner = JobRunner()
with open("input.txt") as f:
    runner.run(Mapper, Reducer, f)
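To see the three phases end to end without an input.txt on disk, the same map -> combine -> reduce pipeline can be sketched as plain functions over an in-memory list of lines (the bodies are inlined here so the snippet runs on its own; function names are my own, not from the book):

```python
def map_phase(data):
    # Emit (word, count) tuples for one input split.
    counts = {}
    for line in data:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return list(counts.items())

def combine_phase(tuples):
    # Group counts by word: word -> [count, count, ...]
    combined = {}
    for k, v in tuples:
        combined.setdefault(k, []).append(v)
    return combined

def reduce_phase(combined):
    # Sum the grouped counts and format as "word<TAB>total".
    return ["%s\t%s" % (k, sum(v)) for k, v in combined.items()]

lines = ["to be or not to be"]
output = reduce_phase(combine_phase(map_phase(lines)))
for row in sorted(output):
    print(row)
# be	2
# not	1
# or	1
# to	2
```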

