
Machine Learning: Hadoop + Python

2013-08-07 23:19
Processing with Hadoop Streaming.

Reference: /article/8787228.html

If HDFS is stuck in safe mode, turn it off with: hadoop dfsadmin -safemode leave

Or check the filesystem with: hadoop fsck /

Mapper.py

#!/usr/bin/python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # emit a (word, 1) pair for each word; Hadoop's shuffle phase
    # will group these by key before they reach the reducer
    for word in words:
        print '%s\t%s' % (word, 1)
Reducer.py

#!/usr/bin/python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the tab-separated input we got from Mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)

test.txt

This is a book!
That is a ruler.
I'm a boy.
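Before submitting the job, the map and reduce logic can be sanity-checked locally with no Hadoop at all. The sketch below simulates the map → shuffle/sort → reduce pipeline in-process on the three test.txt lines; the function names `map_phase` and `reduce_phase` are illustrative, not part of any Hadoop API.

```python
from operator import itemgetter

# Local simulation of the streaming job: map, then group-and-sum (the
# work Hadoop's shuffle + our reducer do), mirroring Mapper.py/Reducer.py.

def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the counts per word and return lexicographically sorted items."""
    word2count = {}
    for word, count in pairs:
        word2count[word] = word2count.get(word, 0) + count
    return sorted(word2count.items(), key=itemgetter(0))

lines = ["This is a book!", "That is a ruler.", "I'm a boy."]
result = reduce_phase(map_phase(lines))
for word, count in result:
    print('%s\t%s' % (word, count))
```

Because the mapper splits on whitespace, punctuation stays attached to tokens ("book!" counts separately from "book"), which matches what the real job produces on this input.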


hadoop fs -mkdir py_input

hadoop fs -put test.txt py_input

hadoop jar hadoop-streaming.jar -input py_input -output py_output -mapper ./py/Mapper.py -reducer ./py/Reducer.py -file ./py/Mapper.py -file ./py/Reducer.py

When the job finishes, view the result with: hadoop fs -cat py_output/part-00000