
Machine Learning: Hadoop + Python

2013-08-07 23:19
Processing with Hadoop Streaming.

Reference: /article/8787228.html

If HDFS is stuck in safe mode, turn it off with: hadoop dfsadmin -safemode leave

Or check the filesystem with: hadoop fsck /

Mapper.py

#!/usr/bin/python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # emit a (word, 1) pair for each word; Hadoop's shuffle phase
    # will group these by key before they reach the reducer
    for word in words:
        print '%s\t%s' % (word, 1)
Reducer.py

#!/usr/bin/python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the tab-separated input we got from Mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)

test.txt

This is a book!
That is a ruler.
I'm a boy.
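Before submitting the job, the map and reduce logic can be sanity-checked locally with no Hadoop at all. The sketch below simulates the map → shuffle/sort → reduce pipeline in-process on the three test.txt lines; the function names `map_phase` and `reduce_phase` are illustrative, not part of any Hadoop API.

```python
from operator import itemgetter

# Local simulation of the streaming job: map, then group-and-sum (the
# work Hadoop's shuffle + our reducer do), mirroring Mapper.py/Reducer.py.

def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the counts per word and return lexicographically sorted items."""
    word2count = {}
    for word, count in pairs:
        word2count[word] = word2count.get(word, 0) + count
    return sorted(word2count.items(), key=itemgetter(0))

lines = ["This is a book!", "That is a ruler.", "I'm a boy."]
result = reduce_phase(map_phase(lines))
for word, count in result:
    print('%s\t%s' % (word, count))
```

Because the mapper splits on whitespace, punctuation stays attached to tokens ("book!" counts separately from "book"), which matches what the real job produces on this input.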


hadoop fs -mkdir py_input

hadoop fs -put test.txt py_input

hadoop jar hadoop-streaming.jar -input py_input -output py_output -mapper ./py/Mapper.py -reducer ./py/Reducer.py -file ./py/Mapper.py -file ./py/Reducer.py

When the job finishes, view the result with: hadoop fs -cat py_output/part-00000