Counting the number of words in a text with Hadoop
2011-10-09 19:22
The Files
You need 3 files to run the wordCount example:
- a C++ file containing the map and reduce functions,
- a data file containing some text, such as Ulysses, and
- a Makefile to compile the C++ file.

wordcount.cpp
The wordcount program is shown below. It contains two classes, one for the map and one for the reduce. It makes use of several Hadoop classes, one of which, StringUtils, contains useful methods for converting tuples to other types.

#include <algorithm>
#include <limits>
#include <string>
#include <vector>
#include "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word, "1")
  // tuples to the reducer
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words = HadoopUtils::splitString( line, " " );

    //--- emit each tuple ( word, "1" ) ---
    for ( unsigned int i = 0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer( HadoopPipes::TaskContext& context ) {
  }

  // reduce function: sums the "1" values emitted for each word
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count them ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit ( word, count ) ---
    context.emit( context.getInputKey(), HadoopUtils::toString( count ) );
  }
};

int main( int argc, char *argv[] ) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}
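If you want to sanity-check the (word, count) output before involving Hadoop at all, a small standalone C++ program can compute the same result for a file piped to standard input. This is only an illustrative sketch and uses nothing from the Hadoop Pipes API; like the mapper above, it splits on whitespace only, so punctuation stays attached to words.

// Standalone sanity check: counts whitespace-separated words read from stdin
// and prints "word count" pairs, mimicking the output format of the job.
// Does not use the Hadoop Pipes API.
#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, int> counts;
  std::string word;
  while ( std::cin >> word ) {   // operator>> splits on whitespace
    ++counts[word];
  }
  for ( std::map<std::string, int>::const_iterator it = counts.begin();
        it != counts.end(); ++it ) {
    std::cout << it->first << " " << it->second << "\n";
  }
  return 0;
}

Compile it with g++ (the file name, say localcount.cpp, is just an example) and run it as ./localcount < yourtext.txt; for the same input file, the per-word counts should match what the Pipes job emits.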
Makefile
Before you create the Makefile, you need to figure out whether your computer hosts a 32-bit processor or a 64-bit processor, and pick the right library. To find this out, run the following command:

uname -a

To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux

The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.
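If the uname output leaves you unsure, another quick check (not part of the original recipe) is to look at the pointer size your default g++ toolchain targets: 4 bytes means a 32-bit build, 8 bytes a 64-bit build. This only reports what g++ compiles for by default, which is normally the same as the OS architecture.

// Prints the pointer size of the default g++ target:
// 4 -> 32-bit (Linux-i386-32), 8 -> 64-bit (Linux-amd64-64).
#include <iostream>

int main() {
  std::cout << sizeof(void*) << std::endl;
  return 0;
}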
Once you have this information, create the Makefile (make sure to spell it with an uppercase M):

CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@

Note: Users have reported that in some cases the command above returns errors, and that adding -lssl will help get rid of the error. Thanks for the tip! --D.Thiebaut 08:16, 25 February 2011 (EST)
Data File
We'll assume that you have some large text files already in HDFS, in a directory called dft1.

Compiling and Running
You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it:

sudo apt-get install g++

Compile the code:
make wordcount

and fix any errors you're getting. Copy the executable file (wordcount) to the bin directory in HDFS:
hadoop dfs -mkdir bin     (Note: it should already exist!)
hadoop dfs -put wordcount bin/wordcount

Run the program!
hadoop pipes -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input dft1 -output dft1-out \
    -program bin/wordcount

Verify that you have gotten the right output:
hadoop dfs -text dft1-out/part-00000

"Come 1
"Defects," 1
"I 1
"Information 1
"J" 1
"Plain 2
...
zodiacal 2
zoe)_ 1
zones: 1
zoo. 1
zoological 1
zouave's 1
zrads, 2
zrads. 1