
Counting the Number of Words in a Text with Hadoop

2011-10-09 19:22

The Files

You need three files to run the wordCount example: a C++ file containing the map and reduce functions, a data file containing some text (such as Ulysses), and a Makefile to compile the C++ file.

wordcount.cpp

The wordcount program is shown below. It contains two classes, one for the map and one for the reduce. It makes use of several Hadoop classes, one of which contains useful methods for converting from tuples to other types: StringUtils.
#include <algorithm>
#include <limits>
#include <string>

#include  "stdint.h"  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer( HadoopPipes::TaskContext& context ) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- get all tuples with the same key, and count their numbers ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit( context.getInputKey(), HadoopUtils::toString( count ) );
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory<
                                 WordCountMapper,
                                 WordCountReducer >() );
}

Makefile

Before you create the Makefile, you need to figure out whether your computer hosts a 32-bit processor or a 64-bit processor, and pick the right library. To find this out, run the following command:
  uname -a
To which the OS responds:
  Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library. Once you have this information, create the Makefile (make sure to spell it with an uppercase M):
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	-lhadooputils -lpthread -g -O2 -o $@
Note: Users have reported that in some cases the command above returns errors, and that adding -lssl will help get rid of the error. Thanks for the tip! --D.Thiebaut 08:16, 25 February 2011 (EST)
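
If you would rather not hard-code the platform, GNU make can choose it from the processor type reported by uname -m. The following is only a sketch, assuming the same 32-/64-bit library directory names described above:

# pick the library directory from the processor type reported by uname -m
ARCH := $(shell uname -m)
ifeq ($(ARCH),x86_64)
PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
else
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
endif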

Data File

We'll assume that you have some large text files already in HDFS, in a directory called dft1.
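
If the directory is not there yet, it can be created and populated with any plain-text file; the file name ulysses.txt below is just a placeholder:
  hadoop dfs -mkdir dft1
  hadoop dfs -put  ulysses.txt  dft1/ulysses.txt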

Compiling and Running

You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
  sudo apt-get install g++
Compile the code:
  make  wordcount
and fix any errors you're getting.

Copy the executable file (wordcount) to the bin directory in HDFS:
  hadoop dfs -mkdir bin          (Note: it should already exist!)
  hadoop dfs -put  wordcount  bin/wordcount
Run the program!
  hadoop pipes -D hadoop.pipes.java.recordreader=true  \
      -D hadoop.pipes.java.recordwriter=true \
      -input dft1  -output dft1-out  \
      -program bin/wordcount
Verify that you have gotten the right output:
  hadoop dfs -text dft1-out/part-00000
  "Come   1
  "Defects,"      1
  "I      1
  "Information    1
  "J"     1
  "Plain  2
  ...
  zodiacal        2
  zoe)_   1
  zones:  1
  zoo.    1
  zoological      1
  zouave's        1
  zrads,  2
  zrads.  1
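
If you want a local copy of the result, or need to re-run the job (Hadoop will refuse to write into an existing output directory), commands along these lines work; the local file name is a placeholder:
  hadoop dfs -get dft1-out/part-00000  wordcount-output.txt
  hadoop dfs -rmr dft1-out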