Big Data Enterprise Learning, Part 06: Oozie Explained
2017-12-20 19:55
1. What Is Oozie and What Is It For?
Oozie is a workflow scheduling framework.
* Workflow: a sequence of steps such as import -> hive -> export, arranged according to the business logic
* Scheduling: jobs/tasks can run
  * on a timed schedule
  * triggered by events, based on either a point in time or the availability of a data set
2. Hadoop Scheduling Frameworks
*Linux crontab
*Azkaban
https://azkaban.github.io
*Oozie
http://oozie.apache.org/
*Zeus
https://github.com/michael8335/zeus2
3. Oozie's Functional Architecture
Note: an Oozie job is itself a MapReduce program, consisting of map tasks only.
Note: workflow templates are provided for the different task types.
4. Installing and Deploying Oozie
<1> Extract the tarball:
tar -zxvf oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/cdh5.3.6/
<2> Configure the Hadoop cluster with a proxyuser for the Oozie process.
The following two properties are required in Hadoop core-site.xml:
<!-- OOZIE -->
<property>
    <name>hadoop.proxyuser.[OOZIE_SERVER_USER].hosts</name>
    <value>[OOZIE_SERVER_HOSTNAME]</value>
</property>
<property>
    <name>hadoop.proxyuser.[OOZIE_SERVER_USER].groups</name>
    <value>[USER_GROUPS_THAT_ALLOW_IMPERSONATION]</value>
</property>
Replace the bracketed, capitalized placeholders with site-specific values and then restart Hadoop.
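For instance, if the Oozie server runs as user beifeng on host hadoop-senior.ibeifeng.com (the user and hostname here are assumptions inferred from the examples later in this article; substitute your own), the core-site.xml entries might look like:

```xml
<!-- Illustrative proxyuser values; adjust the user, host, and groups to your cluster -->
<property>
    <name>hadoop.proxyuser.beifeng.hosts</name>
    <value>hadoop-senior.ibeifeng.com</value>
</property>
<property>
    <name>hadoop.proxyuser.beifeng.groups</name>
    <value>*</value>
</property>
```

Using `*` for groups allows impersonation of users in any group; a production cluster would typically restrict this to specific groups.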
<3>Expand the Oozie hadooplibs tar.gz in the same location Oozie distribution tar.gz was expanded. A hadooplibs/ directory will be created containing the Hadoop JARs for the versions of Hadoop that the Oozie distribution supports.
tar -zxf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz
Note:
hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6      // for Hadoop 2.x
hadooplib-2.5.0-mr1-cdh5.3.6.oozie-4.0.0-cdh5.3.6  // for Hadoop 1.x (MR1)
<4>Create a libext/ directory in the directory where Oozie was expanded
mkdir libext
<5>If using a version of Hadoop bundled in Oozie hadooplibs/ , copy the corresponding Hadoop JARs from hadooplibs/ to the libext/ directory. If using a different version of Hadoop, copy the required Hadoop JARs from that version into the libext/ directory.
cp -r oozie-4.0.0-cdh5.3.6/hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* libext/
<6>If using the ExtJS library, copy the ZIP file to the libext/ directory:
cp /opt/softwares/cdh/ext-2.2.zip libext/
<7>Create the WAR file for Oozie (an alternative directory to libext/ can optionally be specified):
bin/oozie-setup.sh prepare-war
<8>A "sharelib create|upgrade -fs fs_default_name [-locallib sharelib]" command is available when running oozie-setup.sh, for uploading a new sharelib or upgrading an existing sharelib in HDFS. The first argument is the default fs name and the second argument is the Oozie sharelib to install; it can be a tarball or the expanded version of it. If the second argument is omitted, the Oozie sharelib tarball from the Oozie installation directory will be used.
bin/oozie-setup.sh sharelib create -fs hdfs://hadoop-senior.ibeifeng.com:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
<9>The db create|upgrade|postupgrade -run [-sqlfile <file>] command creates, upgrades, or post-upgrades the Oozie database, optionally writing the SQL to a file:
bin/ooziedb.sh create -sqlfile oozie.sql -run
<10>Edit the configuration file oozie-site.xml:
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
</property>
<11>To start Oozie as a daemon process, run:
$ bin/oozied.sh start
<12>Check the Oozie log file logs/oozie.log to ensure Oozie started properly.
cd logs
more oozie.log
<13>Using the Oozie command line tool check the status of Oozie:
bin/oozie admin -oozie http://localhost:11000/oozie -status
5. Oozie Command Line Examples
<1>Setting Up the Examples
*Expanding this file will create an examples/ directory in the local file system:
tar -zxvf oozie-examples.tar.gz
*Inspect the examples:
cd examples/
### apps        -- example applications for the different action types
### input_data  -- sample data
### src         -- example source code
*The examples/ directory must be copied to the user HOME directory in HDFS:
hadoop fs -put examples examples
NOTE: If an examples directory already exists in HDFS, it must be deleted before copying it again. Otherwise files may not be copied.
<2>Running the Examples
*For the Streaming and Pig example, the Oozie Share Library must be installed in HDFS.
Add Oozie bin/ to the environment PATH.
The examples assume the JobTracker is localhost:8021 and the NameNode is hdfs://localhost:8020 . If the actual values are different, the job properties files in the examples directory must be edited to the correct values.
apps/map-reduce/job.properties
The example applications are under the examples/apps directory, one directory per example. Each directory contains the application XML file (workflow, or workflow and coordinator), the job.properties file to submit the job, and any JAR files the example may need.
The inputs for all examples are in the examples/input-data/ directory.
The examples create output under the examples/output-data/${EXAMPLE_NAME} directory.
Note : The job.properties file needs to be a local file during submissions, and not a HDFS path.
*How to run an example application
bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
Check the workflow job status:
oozie job -oozie http://localhost:11000/oozie -info 0000000-171221101425718-oozie-beif-W
To avoid having to provide the -oozie option with the Oozie URL with every oozie command, set OOZIE_URL env variable to the Oozie URL in the shell environment. For example:
export OOZIE_URL="http://localhost:11000/oozie"
oozie job -info 0000000-171221101425718-oozie-beif-W
6. Workflow Scheduling with Oozie
<1>Definitions
Action: An execution/computation task (a Map-Reduce job, a Pig job, a shell command). It can also be referred to as a task or 'action node'.
Workflow: A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition: A programmatic description of a workflow that can be executed.
Workflow Definition Language: The language used to define a Workflow Definition.
Workflow Job: An executable instance of a workflow definition.
Workflow Engine: A system that executes workflow jobs. It can also be referred to as a DAG engine.
<2>Workflow Definition
A workflow definition is a DAG with control flow nodes (start, end, decision, fork, join, kill) and action nodes (map-reduce, pig, etc.); nodes are connected by transition arrows.
The workflow definition language is XML based and it is called hPDL (Hadoop Process Definition Language)
<3>Workflow Nodes
Workflow nodes are classified into control flow nodes and action nodes:
Control flow nodes: nodes that control the start and end of the workflow and workflow job execution path.
Action nodes: nodes that trigger the execution of a computation/processing task.
Node names and transitions must conform to the pattern [a-zA-Z][-_a-zA-Z0-9]*, of up to 20 characters long.
* Control Flow Nodes
Control flow nodes define the beginning and the end of a workflow (the start , end and kill nodes) and provide a mechanism to control the workflow execution path (the decision , fork and join nodes).
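A minimal hPDL sketch showing how start, decision, kill, and end fit around an action node (the node names, the `run` property, and the fs-action body are illustrative, not taken from this article; fork and join would slot in the same way):

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="control-flow-sketch">
    <start to="check"/>
    <!-- decision node: routes execution based on an EL predicate;
         "run" is assumed to be defined in job.properties -->
    <decision name="check">
        <switch>
            <case to="work">${run eq 'yes'}</case>
            <default to="end"/>
        </switch>
    </decision>
    <!-- an action node: every action declares ok/error transitions -->
    <action name="work">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-sketch"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```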
7. MapReduce Action
<1>Key points
A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map-reduce job.
The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.
The counters of the Hadoop job and the job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop job ends.
The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job
<2>Example
1) Preparation
mkdir oozie-apps
cd oozie-apps/
cp -r ../examples/apps/map-reduce .
mv map-reduce mr-wordcount-wf
2) Edit workflow.xml (switching from the old to the new MapReduce API)
<workflow-app xmlns="uri:oozie:workflow:0.5" name="mr-wordcount-wf">
    <start to="mr-node-wordcount"/>
    <action name="mr-node-wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountReducer</value>
                </property>
                <property>
                    <name>mapreduce.map.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.map.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${nameNode}/${oozieDataRoot}/${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
<3>Edit job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/mr-wordcount-wf/workflow.xml
inputDir=mr-wordcount-wf/input
outputDir=mr-wordcount-wf/output
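As a quick sanity check of how these variables compose (a plain-shell sketch of the substitution, not something Oozie itself runs), the application path resolves as follows:

```shell
# Mimic Oozie's variable substitution for oozie.wf.application.path,
# using the values from the job.properties above.
nameNode="hdfs://hadoop-senior.ibeifeng.com:8020"
oozieAppsRoot="user/beifeng/oozie-apps"
appPath="${nameNode}/${oozieAppsRoot}/mr-wordcount-wf/workflow.xml"
echo "${appPath}"
```

Note that oozieAppsRoot has no leading slash; the slash comes from the `${nameNode}/${oozieAppsRoot}` concatenation.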
<4>Copy the MapReduce jar into the oozie-apps/mr-wordcount-wf/lib directory
<5>Upload oozie-apps to HDFS
hdfs dfs -put oozie-apps/ oozie-apps
<6>Create the data directory and upload the input data
hdfs dfs -mkdir -p oozie/datas/mr-wordcount-wf/input
hdfs dfs -put /opt/datas/wc.txt oozie/datas/mr-wordcount-wf/input
<7>Run
export OOZIE_URL="http://localhost:11000/oozie"
bin/oozie job -config oozie-apps/mr-wordcount-wf/job.properties -run
3) How to define a workflow
* job.properties
Key point: it points to the HDFS location of the workflow.xml file
* workflow.xml
the definition file
an XML file
containing the following:
* start
* action
MapReduce、Hive、Sqoop、Shell
* ok
* error
* kill
* end
* lib directory
the dependent jar files
Writing workflow.xml:
* control flow nodes
* action nodes
MapReduce Action
How to schedule a MapReduce program with Oozie
Key point:
take the [Driver] part of the old Java MapReduce program and express it as <configuration> properties
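As a hedged illustration of this Driver-to-configuration translation (the Java calls in the comments are typical Driver lines, not quoted from this article; the class names are the WordCount ones used above):

```xml
<!-- job.setMapperClass(WordCountMapper.class) becomes: -->
<property>
    <name>mapreduce.job.map.class</name>
    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountMapper</value>
</property>
<!-- job.setMapOutputKeyClass(Text.class) becomes: -->
<property>
    <name>mapreduce.map.output.key.class</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<!-- FileInputFormat.addInputPath(job, new Path(args[0])) becomes: -->
<property>
    <name>mapreduce.input.fileinputformat.inputdir</name>
    <value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
</property>
```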
8. Hive Action Example
<1>Prepare the example
cp -r ../examples/apps/hive .
mv hive hive-select
hdfs dfs -put hive-select/ oozie-apps/
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/hive-select
outputDir=hive-select/output
Note: preparation
cp /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/conf/hive-site.xml .    // bring in the Hive configuration file
mkdir lib
cd lib
cp /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/lib/mysql-connector-java-5.1.27-bin.jar .    // bring in the JDBC driver jar
hdfs dfs -put lib/ /user/beifeng/oozie-apps/hive-select/
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive-select-wf">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <job-xml>${nameNode}/${oozieAppsRoot}/hive-select/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>select.sql</script>
            <param>OUTPUT=${nameNode}/${oozieDataRoot}/${outputDir}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*select.sql
insert overwrite directory '${OUTPUT}' select count(*) from default.student;
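A plain-shell sketch of what the `<param>` element in the workflow above does: Oozie substitutes OUTPUT into the script before Hive runs it (the HDFS path below is composed from the job.properties values; this is an illustration, not Oozie's actual implementation):

```shell
# Value that ${nameNode}/${oozieDataRoot}/${outputDir} resolves to
OUTPUT="hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/oozie/datas/hive-select/output"
# select.sql after parameter substitution:
sql="insert overwrite directory '${OUTPUT}' select count(*) from default.student;"
echo "$sql"
```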
9. Sqoop Action Example
<1>Prepare the example
cp -r /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/examples/apps/sqoop .
mv sqoop sqoop-imp-user
cp /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/oozie-apps/hive-select/lib/mysql-connector-java-5.1.27-bin.jar lib/
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/sqoop-imp-user
outputDir=sqoop-imp-user/output
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.3" name="sqoop-user-wf">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>sqoop --options-file sqoop-imp-user.sqoop</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*sqoop-imp-user.sqoop
import
--connect
jdbc:mysql://hadoop-senior.ibeifeng.com:3306/test
--username
root
--password
123456
--table
my_user
--target-dir
${nameNode}/${oozieDataRoot}/${outputDir}
--num-mappers
1
Notes:
In a Sqoop options file, each option and each value goes on its own line.
Under the hood, Sqoop also runs MapReduce, using the new API.
Inside Sqoop options, only double quotes may be used.
10. Shell Action
<1>Prepare the example
cp -r /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/examples/apps/shell .
mv shell shell-select
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/shell-select
exec=student-select.sh
script=student-select.sql
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${exec}</exec>
            <file>${nameNode}/${oozieAppsRoot}/shell-select/${exec}#${exec}</file>
            <file>${nameNode}/${oozieAppsRoot}/shell-select/${script}#${script}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*student-select.sh
#!/bin/sh
/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/bin/hive -f student-select.sql
*student-select.sql
insert overwrite directory '/user/beifeng/oozie/datas/shell-select/output' select * from default.student;
11. Oozie Coordinator Scheduling