Submitting Jobs to a Hadoop Cluster from IntelliJ IDEA
2017-10-02 12:58
1. Overview
Local environment: IntelliJ IDEA 15.0.2, jdk-7u65-windows-x64.exe, hadoop-2.6.1.tar.gz. For the cluster environment and its configuration, see: http://blog.csdn.net/qq_28039433/article/details/78147172
This article originally followed the setup described at http://blog.csdn.net/uq_jin/article/details/52235121, but that configuration can only submit jobs to Hadoop running on the local machine. I later combined it with http://blog.csdn.net/u011654631/article/details/70037219 to get IDEA submitting jobs to a remote Hadoop cluster.
2. Configuring the Local Hadoop Environment
2.1 Extract hadoop-2.6.1.tar.gz to any directory
I extracted it to E:\java\hadoop-2.6.1.
2.2 Set the Hadoop environment variables
Note that HADOOP_USER_NAME must be set to the username used on the Hadoop cluster; otherwise you will get an org.apache.hadoop.security.AccessControlException. My cluster's username is hadoop.

```
HADOOP_HOME=E:\java\hadoop-2.6.1
HADOOP_BIN_PATH=%HADOOP_HOME%\bin
HADOOP_PREFIX=%HADOOP_HOME%
HADOOP_USER_NAME=hadoop
```

Also append %HADOOP_HOME%\bin;%HADOOP_HOME%\sbin; to Path.
2.3 Configure hostname mappings
Append these three lines to C:\Windows\System32\drivers\etc\hosts, matching the /etc/hosts configuration on the CentOS 6.5 machines:

```
192.168.48.101 hdp-node-01
192.168.48.102 hdp-node-02
192.168.48.103 hdp-node-03
```
3. Setting Up the Project
JDK installation is not covered in detail here; just keep the local JDK version as close as possible to the cluster's.
3.1 Create a new Maven project
3.2 Add the dependencies to pom.xml
```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.1</version>
    </dependency>
</dependencies>
```
Afterwards, if the dependencies do not appear under External Libraries, the Event Log in the bottom-right corner will show "Maven projects need to be imported: Import Changes Enable Auto-Import"; click Import Changes.
3.3 Set up the configuration files
Copy core-site.xml, mapred-site.xml, and yarn-site.xml from the Hadoop cluster, unchanged, into the resources directory. My configuration files are listed below.
core-site.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdp-node-01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/apps/hadoop-2.6.1/tmp</value>
    </property>
</configuration>
```
mapred-site.xml
```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
yarn-site.xml
```xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hdp-node-01</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Site specific YARN configuration properties -->
</configuration>
```
log4j.properties
```
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
```
3.4 Write the program
WordCountMapper.java

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
```
WordCountReducer.java
```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
```
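The mapper/reducer pair above computes, in distributed form, nothing more than a per-word sum. Before submitting to the cluster, it can help to sanity-check the logic with a plain-Java, single-process equivalent (a sketch that does not use the Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalWordCount {
    // Mimics the job: split each line on single spaces (the mapper's rule),
    // treat each token as a key with value 1, then sum per key (the reducer).
    public static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : line.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Expected: {hello=2, hadoop=1, world=1}
        System.out.println(count("hello hadoop hello world"));
    }
}
```

If the local output for a sample line matches what the cluster job writes, the MapReduce logic itself is correct and any remaining problems are environmental.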
WordCountRunner.java
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.text.SimpleDateFormat;
import java.util.Date;

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Run on the cluster rather than locally.
        config.set("mapreduce.framework.name", "yarn");
        // Cross-platform submission. Without this line, submitting from Windows
        // fails with "/bin/bash: line 0: fg: no job control". Many answers online
        // blame the Linux/Windows difference and suggest patching YarnRunner.java,
        // but setting this property is enough.
        config.set("mapreduce.app-submission.cross-platform", "true");
        config.set("mapreduce.job.jar", "D:\\wordcount\\out\\artifacts\\wordcount_jar\\wordcount.jar");

        Job job = Job.getInstance(config);
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations for the data to process.
        FileInputFormat.setInputPaths(job, "hdfs://hdp-node-01:9000/wordcount/input/somewords.txt");
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy_MM_dd_HH_mm_ss");
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hdp-node-01:9000/wordcount/output/"
                + simpleDateFormat.format(new Date(System.currentTimeMillis()))));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
```
Note that mapreduce.job.jar must be set to the path of the exported jar.
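The runner appends a timestamp to the output path because FileOutputFormat refuses to write into a directory that already exists; a fresh path per run avoids an "output directory already exists" failure on repeated submissions. The pattern can be isolated like this (hostname and path taken from the configuration above):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class OutputPathDemo {
    // Build a unique HDFS output path per run by suffixing the submit time.
    public static String outputPath(Date when) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy_MM_dd_HH_mm_ss");
        return "hdfs://hdp-node-01:9000/wordcount/output/" + fmt.format(when);
    }

    public static void main(String[] args) {
        // e.g. hdfs://hdp-node-01:9000/wordcount/output/2017_10_02_16_44_07
        System.out.println(outputPath(new Date()));
    }
}
```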
3.5 Export the jar
Click File -> Project Structure, and make sure the Build on make option is checked. The mapreduce.job.jar path from 3.4 must share its prefix with the Output directory shown here.
Finally, click Build -> Build Artifacts -> Build; an out directory will be generated under the project root.
3.6 Run the program
Start the Hadoop cluster before running the program. Download winutils.exe from http://download.csdn.net/detail/u010435203/9606355 and place it in hadoop/bin.
On success, the console shows:
```
16:44:07,037 | WARN | main | NativeCodeLoader | che.hadoop.util.NativeCodeLoader 62 | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:44:11,203 | INFO | main | RMProxy | pache.hadoop.yarn.client.RMProxy 98 | Connecting to ResourceManager at hdp-node-01/192.168.48.101:8032
16:44:13,785 | WARN | main | JobResourceUploader | op.mapreduce.JobResourceUploader 64 | Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16:44:17,581 | INFO | main | FileInputFormat | reduce.lib.input.FileInputFormat 281 | Total input paths to process : 1
16:44:18,055 | INFO | main | JobSubmitter | he.hadoop.mapreduce.JobSubmitter 199 | number of splits:1
16:44:18,780 | INFO | main | JobSubmitter | he.hadoop.mapreduce.JobSubmitter 288 | Submitting tokens for job: job_1506933793385_0001
16:44:20,138 | INFO | main | YarnClientImpl | n.client.api.impl.YarnClientImpl 251 | Submitted application application_1506933793385_0001
16:44:20,307 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1301 | The url to track the job: http://hdp-node-01:8088/proxy/application_1506933793385_0001/
16:44:20,309 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1346 | Running job: job_1506933793385_0001
16:45:03,829 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1367 | Job job_1506933793385_0001 running in uber mode : false
16:45:03,852 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 0% reduce 0%
16:45:40,267 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 100% reduce 0%
16:46:08,081 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 100% reduce 100%
16:46:09,121 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1385 | Job job_1506933793385_0001 completed successfully
16:46:09,562 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1392 | Counters: 49
	File System Counters
		FILE: Number of bytes read=256
		FILE: Number of bytes written=212341
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=259
		HDFS: Number of bytes written=152
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=30792
		Total time spent by all reduces in occupied slots (ms)=24300
		Total time spent by all map tasks (ms)=30792
		Total time spent by all reduce tasks (ms)=24300
		Total vcore-seconds taken by all map tasks=30792
		Total vcore-seconds taken by all reduce tasks=24300
		Total megabyte-seconds taken by all map tasks=31531008
		Total megabyte-seconds taken by all reduce tasks=24883200
	Map-Reduce Framework
		Map input records=1
		Map output records=18
		Map output bytes=214
		Map output materialized bytes=256
		Input split bytes=118
		Combine input records=0
		Combine output records=0
		Reduce input groups=15
		Reduce shuffle bytes=256
		Reduce input records=18
		Reduce output records=15
		Spilled Records=36
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=533
		CPU time spent (ms)=5430
		Physical memory (bytes) snapshot=311525376
		Virtual memory (bytes) snapshot=1680896000
		Total committed heap usage (bytes)=136122368
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=141
	File Output Format Counters
		Bytes Written=152

Process finished with exit code 0
```
4. FAQ
4.1 Permission errors
```
Exception in thread "main" org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=dvqfq6prcjdsh4p\hadoop, access=WRITE, inode="hadoop":hadoop:supergroup:rwxr-xr-x
```
Add the following to hdfs-site.xml:
```xml
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
```
Or add HADOOP_USER_NAME=hadoop to the environment variables; see 2.2 for details.
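If setting the environment variable is inconvenient, the username can, as far as I know, also be supplied as a JVM system property before the Job is created: Hadoop's UserGroupInformation falls back to the HADOOP_USER_NAME system property when the environment variable is absent (a sketch worth verifying against your Hadoop version):

```java
public class UserNameDemo {
    public static void main(String[] args) {
        // Set the property before any Hadoop class resolves the login user;
        // UserGroupInformation reads it as a fallback to the env variable.
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        System.out.println(System.getProperty("HADOOP_USER_NAME"));
    }
}
```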
4.2 Clock synchronization problems
```
Container launch failed for container_1506950816832_0005_01_000002 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container. This token is expired. current time is 1506954189368 found 1506953252362
```
The datanodes and the namenode need their clocks synchronized. On each server run ntpdate time.nist.gov and confirm that the time was synchronized successfully.
It is best to add the following line to /etc/crontab on every server:
```
0 2 * * * root ntpdate time.nist.gov && hwclock -w
```
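The two timestamps in the exception above are epoch milliseconds, so the clock skew that expired the token can be read directly off the message:

```java
public class TokenSkewDemo {
    public static void main(String[] args) {
        // Timestamps copied from the error message (epoch milliseconds).
        long current = 1506954189368L;  // "current time is ..."
        long found   = 1506953252362L;  // "found ..."
        long skewMs  = current - found;
        // 937 seconds (about 15.6 minutes) of drift between nodes is far
        // more than YARN tolerates, so the container token is rejected.
        System.out.println(skewMs / 1000 + " seconds of clock skew");
    }
}
```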
4.3 /bin/bash: line 0: fg: no job control
```
Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control
```
The jar path is wrong; check the mapreduce.job.jar setting from 3.4.