
Hadoop Beginner's Notes

2016-07-03 23:53

Environment:
Ubuntu
JDK 8
hadoop-2.6.4

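I. Introduction to Hadoop

Hadoop has two main parts: HDFS and MapReduce.
HDFS is a distributed file system modeled on Google's GFS. It is highly fault-tolerant and designed to run on low-cost hardware. It also provides high-throughput access to application data, which suits applications with very large data sets.
MapReduce is a programming model for processing large collections of semi-structured data. A programming model is a way of structuring and solving a particular class of problem (loosely comparable to how SQL relates to a database such as Oracle).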

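II. Installation
Deploying Hadoop requires the following software:
JDK (Hadoop 2.6.x needs Java 6 or later; JDK 8 is used here), Hadoop, and SSH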

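1. Install the JDK
After installing the JDK, set the following environment variables (for example in ~/.bashrc or /etc/profile), adjusting JAVA_HOME to the actual install path: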
export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

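2. Install Hadoop
Download the release from the official Hadoop site (http://hadoop.apache.org/), unpack it, and edit four files under etc/hadoop: yarn-site.xml, core-site.xml, hdfs-site.xml, and mapred-site.xml. (If mapred-site.xml does not exist yet, copy it from mapred-site.xml.template.) The settings used for core-site.xml, hdfs-site.xml, and mapred-site.xml follow.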

core-site.xml

<configuration>
  <property>
    <!-- fs.defaultFS is the current name of the deprecated fs.default.name -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop-2.6.4/tmp</value>
  </property>
</configuration>

hdfs-site.xml

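<configuration>
  <property>
    <!-- a single node can hold only one replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- current name of the deprecated dfs.name.dir -->
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop-2.6.4/name</value>
  </property>
  <property>
    <!-- current name of the deprecated dfs.data.dir -->
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop-2.6.4/data</value>
  </property>
  <property>
    <!-- current name of the deprecated dfs.permissions; disabling permission checks is only reasonable on a test machine -->
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>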

mapred-site.xml

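<configuration>
  <property>
    <!-- Hadoop 1.x-style JobTracker address, kept as in the original notes.
         Hadoop 2.x has no JobTracker; to run jobs on YARN,
         mapreduce.framework.name is set to yarn instead. -->
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>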

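3. Format the HDFS file system
Once the configuration is done, format HDFS with the following command:
bin/hdfs namenode -format
(Before formatting, make sure the hadoop.tmp.dir directory set in core-site.xml, /home/hadoop-2.6.4/tmp here, contains no files. The older form bin/hadoop namenode -format still works in 2.6.4 but is deprecated.)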

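4. Start Hadoop:
sbin/start-all.sh
(In Hadoop 2.x this script is deprecated and simply runs sbin/start-dfs.sh and sbin/start-yarn.sh, which can also be called directly.)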

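5. Check that Hadoop started successfully
Run jps to list the running Java processes. Startup succeeded if the five Hadoop daemons below appear (six processes in all, counting Jps itself):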

$ jps
4258 ResourceManager
3749 NameNode
5944 Jps
4088 SecondaryNameNode
4378 NodeManager
3869 DataNode
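
The check above can also be automated. The following is a minimal sketch, not part of the original notes: the class name JpsCheck and the method missingDaemons are hypothetical, and the list of expected daemons is simply taken from the jps listing above. It parses jps-style text and reports which daemons are missing.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: checks jps-style output for the five daemons listed above.
final class JpsCheck {

    // Daemon names taken from the jps listing in these notes.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager");

    // Returns the expected daemons that do NOT appear in the given jps output.
    static Set<String> missingDaemons(String jpsOutput) {
        Set<String> missing = new LinkedHashSet<>(EXPECTED);
        for (String line : jpsOutput.split("\\r?\\n")) {
            // Each jps line has the form "<pid> <main class name>".
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                missing.remove(parts[1]);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample text copied from the listing above; real jps output could be passed in instead.
        String sample = String.join("\n",
                "4258 ResourceManager", "3749 NameNode", "5944 Jps",
                "4088 SecondaryNameNode", "4378 NodeManager", "3869 DataNode");
        System.out.println("Missing daemons: " + missingDaemons(sample)); // prints "Missing daemons: []"
    }
}
```

An empty result means all five daemons are running; any name in the result points to the daemon whose log is worth reading first.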

6. Manage Hadoop through the web UI: the NameNode interface is at http://localhost:50070

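III. Basic Hadoop commands

List a directory: bin/hdfs dfs -ls /

Create a directory: bin/hdfs dfs -mkdir /input

(The older form bin/hadoop dfs still works in 2.6.4 but prints a deprecation warning.)

... (more commands can be found with bin/hdfs dfs -help or a web search)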

IV. Writing a word-count program with Hadoop

1. How a Hadoop computation works

For example, take a file asiainfo.txt with the following content:

hello asiainfo
asiainfo is big
hello big

Hadoop computes the count of each word in the four steps below. Only the Map and Reduce steps have to be implemented by the user:

Step	Output
Map (user-implemented)	<hello,1> <asiainfo,1> <asiainfo,1> <is,1> <big,1> <hello,1> <big,1>
Sort	<asiainfo,1> <asiainfo,1> <big,1> <big,1> <hello,1> <hello,1> <is,1>
Group	<asiainfo,[1,1]> <big,[1,1]> <hello,[1,1]> <is,[1]>
Reduce (user-implemented)	<asiainfo,2> <big,2> <hello,2> <is,1>
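
To make the four steps concrete, here is a small plain-Java walk-through of the same sample file. It is only an illustration and uses no Hadoop classes: the class name WordCountSteps and its methods are hypothetical, and the sort and group steps are folded into one sorted map, whereas Hadoop performs them inside the framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of Map -> Sort -> Group -> Reduce (no Hadoop involved).
final class WordCountSteps {

    // Map: emit one <word, 1> pair per word, in input order.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Sort + group: a TreeMap keeps keys sorted and collects the values of equal keys.
    static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello asiainfo", "asiainfo is big", "hello big");
        List<Map.Entry<String, Integer>> pairs = map(lines);
        TreeMap<String, List<Integer>> grouped = sortAndGroup(pairs);
        System.out.println(pairs);           // [hello=1, asiainfo=1, asiainfo=1, is=1, big=1, hello=1, big=1]
        System.out.println(grouped);         // {asiainfo=[1, 1], big=[1, 1], hello=[1, 1], is=[1]}
        System.out.println(reduce(grouped)); // {asiainfo=2, big=2, hello=2, is=1}
    }
}
```

The three printed lines correspond to the Map, Group, and Reduce rows of the table. In a real job only the bodies of map and reduce are written by hand.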

[Figure: official diagram of the steps above]

2. Implementation
Following the four steps of the previous section, the program itself has to implement the Map and Reduce steps:

package com.hadooptest;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Map step: split each input line into words and emit <word, 1> for each word.
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        // Reused across calls to avoid allocating two objects per word.
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The words in the sample file are separated by spaces, so split on whitespace.
            for (String s : value.toString().split("\\s+")) {
                if (s.isEmpty()) {
                    continue; // skip the empty token produced by blank lines or leading whitespace
                }
                word.set(s);
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: add up all the counts received for one word.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] arguments = new GenericOptionsParser(conf, args).getRemainingArgs();
        // The job takes two arguments: the input directory and the output directory.
        // The output directory must not exist yet (delete it first if it does).
        if (arguments.length != 2) {
            System.err.println("Usage: WordCount <input dir> <output dir>");
            System.exit(2);
        }

        // Create the job (Job.getInstance replaces the deprecated Job constructor)
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);    // Mapper
        job.setReducerClass(WordCountReducer.class);  // Reducer
        job.setCombinerClass(WordCountReducer.class); // Combiner: map-side pre-aggregation, same logic as the Reducer
        job.setOutputKeyClass(Text.class);            // output key type
        job.setOutputValueClass(IntWritable.class);   // output value type

        FileInputFormat.addInputPath(job, new Path(arguments[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(arguments[1])); // output directory

        // Run the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
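
The job above registers the reducer class as its combiner as well. That is safe here because summing is associative and commutative: adding up partial sums gives the same total as adding up the raw 1s, however many times the framework chooses to run the combiner. The sketch below illustrates this in plain Java. The class name CombinerSketch and the split of counts across two map tasks are made up for the illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of why a summing reducer can double as a combiner (no Hadoop involved).
final class CombinerSketch {

    // Same logic as the reducer: add up the counts seen for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: the word "hello" was emitted 3 times by one map task and 2 times by another.
        List<Integer> fromTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromTask2 = Arrays.asList(1, 1);

        // Without a combiner the reducer receives all five 1s.
        int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner each map task pre-sums its own output, and the reducer receives [3, 2].
        int withCombiner = sum(Arrays.asList(sum(fromTask1), sum(fromTask2)));

        System.out.println(withoutCombiner + " " + withCombiner); // prints "5 5"
    }
}
```

An operation such as averaging would not survive this treatment, since the average of partial averages is generally not the overall average. A job of that kind needs a separate combiner or none at all.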

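Run the jar. This assumes the input file has already been uploaded to /input (for example with bin/hdfs dfs -put asiainfo.txt /input) and that the jar's manifest names com.hadooptest.WordCount as its main class; otherwise the class name goes right after the jar path.

$ bin/hadoop jar ./wordCount.jar /input /output

The results are stored under /output: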

$ bin/hadoop dfs -ls /output
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2016-07-03 16:52 /output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 14683 2016-07-03 16:52 /output/part-r-00000

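View the output. Each line is a key followed by its count. The bare 1 on the first line is the count for an empty key, which a blank line in the input produces when the mapper does not skip empty tokens.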

$ bin/hadoop dfs -cat /output/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

1
asiainfo 2
big 2
hello 2
is 1
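
If the result has to be processed further, the part file is easy to read back. The sketch below assumes the default text output format, where each line is a key, a tab character, and the value (the spaces in the listing above are most likely an artifact of how the page was rendered). The class name PartFileParser and the in-memory sample lines are hypothetical; in practice the lines would come from a copy of part-r-00000.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "key<TAB>count" lines into a map, keeping file order.
final class PartFileParser {

    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a key<TAB>value line, ignore it
            }
            String key = line.substring(0, tab);               // may be empty
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(key, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines mirroring the output above, with tabs as separators (an assumption).
        List<String> lines = Arrays.asList("\t1", "asiainfo\t2", "big\t2", "hello\t2", "is\t1");
        System.out.println(parse(lines)); // {=1, asiainfo=2, big=2, hello=2, is=1}
    }
}
```

Splitting on the last tab keeps keys that themselves contain whitespace intact, and the empty key from the first line shows up as the entry "=1".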
