Hadoop Study Notes (1): Development Environment

2012-02-28 15:32
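I have just started learning Hadoop, and the first step was to set up a development environment. At first I wrote each Map-Reduce program on its own, compiled it on the command line, and then ran the packaged jar with the hadoop command. That works, but it always felt cumbersome, so tonight I tried editing and running Map-Reduce programs directly from Eclipse. After some fumbling it actually worked, though not without a few detours. To avoid repeating those detours later, I am recording the setup process here.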

1. Preparation

1.1 Software

redhat 6

hadoop-0.20.2

java 1.6

1.2 Java environment

Set the environment variables. Here I edited the .bash_profile file in my home directory and added the following:

JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0
CLASSPATH=$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH

Then run the following command so that the change takes effect in the current environment:

% source /root/.bash_profile

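This change only affects the current user; to apply it to all users, edit /etc/profile instead. The same commands can also be typed directly into a terminal, but then the change only applies to the current shell session.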
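
A quick way to confirm that the variables took effect is to inspect them from Java itself. The following is a minimal sketch and not part of the original setup; the class name `EnvCheck` and the helper `pathContains` are made up for illustration. It reports whether the bin directory under JAVA_HOME really appears as an entry in PATH, which catches mistakes such as a missing `$` in the PATH line.

```java
import java.io.File;

// Hypothetical helper (not from the original notes): checks that PATH
// contains the bin directory under JAVA_HOME.
class EnvCheck {

    // Returns true if 'dir' is one of the entries of a PATH-style string.
    static boolean pathContains(String path, String dir) {
        if (path == null || dir == null) {
            return false;
        }
        for (String entry : path.split(File.pathSeparator)) {
            if (entry.equals(dir)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        String path = System.getenv("PATH");
        System.out.println("JAVA_HOME = " + javaHome);
        if (javaHome == null) {
            System.out.println("JAVA_HOME is not set");
            return;
        }
        String bin = javaHome + File.separator + "bin";
        System.out.println(bin + " on PATH: " + pathContains(path, bin));
    }
}
```

Compile and run it in a fresh shell after sourcing the profile; the comparison is an exact string match, so a trailing slash in JAVA_HOME would show up as a mismatch.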

2. Installing Hadoop

hadoop-0.20.2 can be downloaded from http://hadoop.apache.org/common/releases.html#Download and unpacked onto the local file system. I unpacked it into the /root/hadoop/ directory:

% tar xzf hadoop-0.20.2.tar.gz

Create an environment variable (HADOOP_HOME) that points to the Hadoop installation directory, and add the Hadoop bin directory to the command-line path (again in /root/.bash_profile):

HADOOP_HOME=/root/hadoop/hadoop-0.20.2
CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-0.20.2-core.jar
PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_HOME
export CLASSPATH
export PATH

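At this point the basic Hadoop environment is in place, which is all that is needed to run in standalone (local) mode. To verify it, enter the following command in a terminal: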

% hadoop version

It prints the following:

Hadoop 0.20.2

Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707

Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

3. Running from the command line

The following example comes from Hadoop: The Definitive Guide.

Create the Java file NewMaxTemperature.java in the /root/hadoop directory:

// Application to find the maximum temperature in the weather dataset
// using the new context objects MapReduce API
package ch2;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  // Emits (year, temperature) for every valid reading in the input.
  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  // Emits (year, maximum temperature) for each year.
  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {

      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
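
To make the fixed-width offsets in the mapper easier to follow, the sketch below applies the same substring calls to a single record in plain Java, without any Hadoop classes. The class name `RecordFields` and its method names are hypothetical and exist only for this illustration; the offsets are the ones used in NewMaxTemperatureMapper, and the record is the third line of the sample.txt dataset that appears later in this post.

```java
// Hypothetical illustration class: extracts the same fields as the mapper.
class RecordFields {

    // Characters 15..18 hold the four-digit year.
    static String year(String line) {
        return line.substring(15, 19);
    }

    // Characters 87..91 hold the signed temperature. A leading '+' is skipped
    // because Integer.parseInt on Java 1.6 rejects it.
    static int airTemperature(String line) {
        if (line.charAt(87) == '+') {
            return Integer.parseInt(line.substring(88, 92));
        }
        return Integer.parseInt(line.substring(87, 92));
    }

    // Character 92 is the quality code that the mapper checks against [01459].
    static String quality(String line) {
        return line.substring(92, 93);
    }

    public static void main(String[] args) {
        String line = "0043011990999991950051518004+68750+023550FM-12+038299999"
                + "V0203201N00261220001CN9999999N9-00111+99999999999";
        System.out.println(year(line) + " " + airTemperature(line) + " " + quality(line));
    }
}
```

For this record the three fields are 1950, -11 and 1, so the mapper would emit the pair (1950, -11).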

Then compile the file with the following command:
% javac -d ./ NewMaxTemperature.java

Compilation creates a folder named ch2 under /root/hadoop/, matching the package name of the compiled classes. The folder contains the following three .class files:

NewMaxTemperature.class

NewMaxTemperature$NewMaxTemperatureMapper.class

NewMaxTemperature$NewMaxTemperatureReducer.class

Next, package the ch2 folder into an executable jar so that Hadoop can run it. To build the jar, first create a text file named manifest.mf in /root/hadoop/ with the following content:

Main-Class: ch2.NewMaxTemperature

Note: the colon must be followed by a space; otherwise the jar tool rejects the manifest.

Then run the following command:

% jar cvfm ch2.jar manifest.mf ch2
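
The manifest syntax rule can be tried out without building a jar, because the JDK class java.util.jar.Manifest applies the same header syntax. The sketch below is only an illustration under that assumption; the class name `ManifestCheck` and the method `mainClass` are hypothetical. It parses manifest text from a string, returns the Main-Class value, and shows that a header without the space after the colon is rejected with an IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.jar.Manifest;

// Hypothetical helper: parses manifest text and returns its Main-Class value.
class ManifestCheck {

    // Throws IOException if the text is not a well-formed manifest,
    // for example when the space after the colon is missing.
    static String mainClass(String manifestText) throws IOException {
        Manifest manifest = new Manifest(
                new ByteArrayInputStream(manifestText.getBytes("UTF-8")));
        return manifest.getMainAttributes().getValue("Main-Class");
    }

    public static void main(String[] args) throws IOException {
        // Well-formed header: prints the class name.
        System.out.println(mainClass("Main-Class: ch2.NewMaxTemperature\n"));
        try {
            // Missing space after the colon: the parser refuses it.
            mainClass("Main-Class:ch2.NewMaxTemperature\n");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The exact wording of the exception message differs between JDK versions, so the sketch prints it rather than relying on it.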

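If all went well, ch2.jar now exists in /root/hadoop/, which is exactly what we want. The next step is to run our first Map-Reduce program with the hadoop command. But wait: Map-Reduce programs do offline data processing, and we have no data yet. Since Tom White (the author of Hadoop: The Definitive Guide) provided the program above, he naturally provides a dataset too. Both the dataset and the source code are available from http://www.hadoopbook.com. I used the sample.txt dataset from the source package. Its content is simple, just the following five lines (see Hadoop: The Definitive Guide for what the fields mean):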

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999

Place the sample.txt dataset in /root/hadoop/. Now everything really is ready; just run the following command to verify:
% hadoop jar ch2.jar sample.txt output

You should see a stream of output scroll by. These are progress messages from the running job (for brevity, only part of the output is shown here):

11/12/14 23:23:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/12/14 23:23:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/14 23:23:14 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 23:23:14 INFO mapred.JobClient: Running job: job_local_0001
11/12/14 23:23:14 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 23:23:14 INFO mapred.MapTask: io.sort.mb = 100
11/12/14 23:23:14 INFO mapred.MapTask: data buffer = 79691776/99614720
11/12/14 23:23:14 INFO mapred.MapTask: record buffer = 262144/327680
11/12/14 23:23:14 INFO mapred.MapTask: Starting flush of map output

Note: before running the hadoop command above, make sure there is no output folder under /root/hadoop/; otherwise the job fails with:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory output already exists

This is a built-in Hadoop safeguard against losing data by overwriting it. Imagine the result of a job that took several hours being overwritten by accident.
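
When experimenting in local mode, a stale output folder has to be removed before every rerun. The following is a minimal sketch of a helper for that, assuming the output lives on the local file system as in this post; the class name `CleanOutput` and the method `deleteRecursively` are hypothetical. For jobs that write to HDFS the directory would have to be removed through Hadoop's own file system tools instead. Since the safeguard exists for a reason, the sketch only deletes a path that is passed explicitly.

```java
import java.io.File;

// Hypothetical helper for local (standalone) mode only: removes a leftover
// output directory on the local file system before a job is rerun.
class CleanOutput {

    // Recursively deletes 'target'; returns true if it no longer exists.
    static boolean deleteRecursively(File target) {
        File[] children = target.listFiles(); // null unless target is a directory
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return target.delete() || !target.exists();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: CleanOutput <output directory>");
            System.exit(-1);
        }
        System.out.println("removed: " + deleteRecursively(new File(args[0])));
    }
}
```

It uses only java.io.File so that it also compiles on the Java 1.6 setup described here.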

Now look at what has changed under /root/hadoop/: there is a new output folder containing a file named part-r-00000, which holds the result computed by Hadoop.
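
To check the content of part-r-00000 by hand, the reduce step can be imitated in memory. The sketch below is not the Hadoop job itself, just a plain-Java stand-in: the class name `MaxByYear` is hypothetical, and the year and temperature pairs are the values the mapper extracts from the five sample.txt records. It keeps the largest temperature per year and prints the years in sorted order, separated from the value by a tab.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the reduce step: keeps the maximum
// temperature seen for each year, with years in sorted order.
class MaxByYear {

    // 'pairs' holds {year, temperature} rows as the mapper would emit them.
    static Map<String, Integer> maxByYear(String[][] pairs) {
        Map<String, Integer> max = new TreeMap<String, Integer>();
        for (String[] pair : pairs) {
            int temperature = Integer.parseInt(pair[1]);
            Integer current = max.get(pair[0]);
            if (current == null || temperature > current) {
                max.put(pair[0], temperature);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Year and temperature of the five sample.txt records
        String[][] pairs = {
            {"1950", "0"}, {"1950", "22"}, {"1950", "-11"},
            {"1949", "111"}, {"1949", "78"}
        };
        for (Map.Entry<String, Integer> e : maxByYear(pairs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

It prints 1949 with 111 and 1950 with 22, which should match the two lines in part-r-00000.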

4. Setting up an Eclipse development environment for Hadoop

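Editing and running from the command line, as above, is a basic skill, but it really is rather tedious. It would be much nicer to click Run in Eclipse, as with an ordinary Java program, and watch the results appear. So I looked around online and tried a few things. It turns out that, as with many other Eclipse features, all it takes is a plugin. Thanks to the people who develop these plugins for free.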

The Hadoop package we downloaded earlier already contains the plugin: contrib/eclipse-plugin/hadoop-0.20.2-eclipse-plugin.jar. Copy this jar into the plugins folder of the Eclipse installation and restart Eclipse.

Then go to 'Window' -> 'Preferences' -> 'Hadoop Map/Reduce' and set the Hadoop installation directory, which here is /root/hadoop/hadoop-0.20.2. Eclipse is now ready for Hadoop development. Choose New -> Project and see what appears:



A Map/Reduce project can now be created directly, which is exactly what we wanted. Run the example above again, just as with an ordinary Java project.