Machine Learning: Hadoop + Python
2013-08-07 23:19
Processing with Hadoop Streaming
Reference: /article/8787228.html
If HDFS is in safe mode, turn it off with: hadoop dfsadmin -safemode leave
or check the file system for problems with: hadoop fsck /
Mapper.py
#!/usr/bin/python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words; split() with no arguments never yields empty strings
    words = line.split()
    # emit one tab-separated "word<TAB>1" pair per word
    for word in words:
        print('%s\t%s' % (word, 1))
Reducer.py
#!/usr/bin/python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the tab-separated "word<TAB>count" pair we got from Mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line was blank or malformed, or count was not a number,
        # so silently ignore/discard this line
        continue
    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
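Reducer.py keeps every distinct word in a dictionary until the input ends. Because Hadoop Streaming hands the reducer its input already sorted by key, the counts can also be summed one key at a time. The following is only a sketch of that alternative, assuming Python 3 and tab-separated "word<TAB>count" lines; the names parse_pairs and reduce_sorted are hypothetical and not part of the original scripts.

```python
from itertools import groupby


def parse_pairs(lines):
    """Yield (word, count) from 'word<TAB>count' lines, skipping malformed ones."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # blank line or missing tab
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum counts of consecutive identical words; input must be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical sample of sorted mapper output, for illustration only.
sample_lines = ["a\t1\n", "a\t1\n", "is\t1\n", "not-a-pair\n", "is\t1\n"]
for word, total in reduce_sorted(sample_lines):
    print("%s\t%d" % (word, total))
```

In an actual reducer script, sys.stdin would be passed to reduce_sorted in place of sample_lines. Only one key's running total is held in memory at a time, at the cost of relying on the sorted order.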
test.txt
This is a book! That is a ruler. I'm a boy.
hadoop fs -mkdir py_input
hadoop fs -put test.txt py_input
hadoop jar hadoop-streaming.jar -input py_input -output py_output -mapper ./py/Mapper.py -reducer ./py/Reducer.py -file ./py/Mapper.py -file ./py/Reducer.py
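Before submitting the job, the map, sort, and reduce flow can be dry-run in memory. This is a rough sketch, assuming Python 3; map_line and run_local are hypothetical helper names that mirror the logic of Mapper.py and Reducer.py rather than importing them, and the in-memory sort only stands in for Hadoop's shuffle.

```python
def map_line(line):
    """Emit (word, 1) for every whitespace-separated token, like Mapper.py."""
    return [(word, 1) for word in line.split()]


def run_local(lines):
    """Simulate map -> sort -> reduce for word count without a cluster."""
    # Map phase: one (word, 1) pair per token.
    pairs = [pair for line in lines for pair in map_line(line)]
    # Shuffle phase stand-in: Hadoop sorts mapper output by key.
    pairs.sort(key=lambda pair: pair[0])
    # Reduce phase: accumulate counts in a dictionary, like Reducer.py.
    word2count = {}
    for word, count in pairs:
        word2count[word] = word2count.get(word, 0) + count
    return sorted(word2count.items())


# Sample line in the style of test.txt.
sample = ["This is a book! That is a ruler. I'm a boy."]
for word, count in run_local(sample):
    print("%s\t%s" % (word, count))
```

Because the mapper splits on whitespace only, punctuation stays attached to the words, so "book!" and "boy." are counted as they appear.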