Big Data Enterprise Learning, Part 06: Oozie Explained
2017-12-20 19:55
1. What Is Oozie and What Is It For?
Oozie is a workflow scheduling framework.
* Workflow: a sequence of steps such as import -> hive -> export, arranged according to the business logic
* Scheduling: jobs/tasks can run
  * on a timed schedule
  * triggered by events, based on either a point in time or the availability of a data set
2. Hadoop Scheduling Frameworks
*Linux crontab
*Azkaban
https://azkaban.github.io
*Oozie
http://oozie.apache.org/
*Zeus
https://github.com/michael8335/zeus2
3. Oozie's Functional Architecture
Note: an Oozie job is itself a MapReduce program, consisting of map tasks only.
Note: workflow templates are provided for the different task types.
4. Installing and Deploying Oozie
<1> Extract the tarball:
tar -zxvf oozie-4.0.0-cdh5.3.6.tar.gz -C /opt/cdh5.3.6/
<2> Configure the Hadoop cluster with a proxyuser for the Oozie process.
The following two properties are required in Hadoop core-site.xml:
<!-- OOZIE -->
<property>
    <name>hadoop.proxyuser.[OOZIE_SERVER_USER].hosts</name>
    <value>[OOZIE_SERVER_HOSTNAME]</value>
</property>
<property>
    <name>hadoop.proxyuser.[OOZIE_SERVER_USER].groups</name>
    <value>[USER_GROUPS_THAT_ALLOW_IMPERSONATION]</value>
</property>
Replace the bracketed, capitalized placeholders with site-specific values and then restart Hadoop.
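For instance, if the Oozie server runs as user beifeng on host hadoop-senior.ibeifeng.com (the user and hostname here are assumptions inferred from the examples later in this article; substitute your own), the core-site.xml entries might look like:

```xml
<!-- Illustrative proxyuser values; adjust the user, host, and groups to your cluster -->
<property>
    <name>hadoop.proxyuser.beifeng.hosts</name>
    <value>hadoop-senior.ibeifeng.com</value>
</property>
<property>
    <name>hadoop.proxyuser.beifeng.groups</name>
    <value>*</value>
</property>
```

Using `*` for groups allows impersonation of users in any group; a production cluster would typically restrict this to specific groups.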
<3>Expand the Oozie hadooplibs tar.gz in the same location Oozie distribution tar.gz was expanded. A hadooplibs/ directory will be created containing the Hadoop JARs for the versions of Hadoop that the Oozie distribution supports.
tar -zxf oozie-hadooplibs-4.0.0-cdh5.3.6.tar.gz
Note:
hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6      // for Hadoop 2.x
hadooplib-2.5.0-mr1-cdh5.3.6.oozie-4.0.0-cdh5.3.6  // for Hadoop 1.x (MR1)
<4>Create a libext/ directory in the directory where Oozie was expanded
mkdir libext
<5>If using a version of Hadoop bundled in Oozie hadooplibs/ , copy the corresponding Hadoop JARs from hadooplibs/ to the libext/ directory. If using a different version of Hadoop, copy the required Hadoop JARs from that version into the libext/ directory.
cp -r oozie-4.0.0-cdh5.3.6/hadooplibs/hadooplib-2.5.0-cdh5.3.6.oozie-4.0.0-cdh5.3.6/* libext/
<6>If using the ExtJS library, copy the ZIP file to the libext/ directory:
cp /opt/softwares/cdh/ext-2.2.zip libext/
<7>Create the WAR file for Oozie (an alternative directory to libext/ can optionally be specified):
bin/oozie-setup.sh prepare-war
<8>A "sharelib create|upgrade -fs fs_default_name [-locallib sharelib]" command is available when running oozie-setup.sh, for uploading a new sharelib or upgrading an existing sharelib in HDFS. The first argument is the default fs name and the second argument is the Oozie sharelib to install; it can be a tarball or the expanded version of it. If the second argument is omitted, the Oozie sharelib tarball from the Oozie installation directory will be used.
bin/oozie-setup.sh sharelib create -fs hdfs://hadoop-senior.ibeifeng.com:8020 -locallib oozie-sharelib-4.0.0-cdh5.3.6-yarn.tar.gz
<9>The db create|upgrade|postupgrade -run [-sqlfile <file>] command creates, upgrades, or post-upgrades the Oozie database, optionally writing the SQL to a file:
bin/ooziedb.sh create -sqlfile oozie.sql -run
<10>Edit the configuration file oozie-site.xml:
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/etc/hadoop</value>
</property>
<11>To start Oozie as a daemon process, run:
$ bin/oozied.sh start
<12>Check the Oozie log file logs/oozie.log to ensure Oozie started properly.
cd logs
more oozie.log
<13>Using the Oozie command line tool check the status of Oozie:
bin/oozie admin -oozie http://localhost:11000/oozie -status
5. Oozie Command Line Examples
<1>Setting Up the Examples
*Expanding this file will create an examples/ directory in the local file system:
tar -zxvf oozie-examples.tar.gz
*Inspect the examples:
cd examples/
### apps        -- example applications for the different action types
### input_data  -- sample data
### src         -- example source code
*The examples/ directory must be copied to the user HOME directory in HDFS:
hadoop fs -put examples examples
NOTE: If an examples directory already exists in HDFS, it must be deleted before copying it again. Otherwise files may not be copied.
<2>Running the Examples
*For the Streaming and Pig example, the Oozie Share Library must be installed in HDFS.
Add Oozie bin/ to the environment PATH.
The examples assume the JobTracker is localhost:8021 and the NameNode is hdfs://localhost:8020 . If the actual values are different, the job properties files in the examples directory must be edited to the correct values.
apps/map-reduce/job.properties
The example applications are under the examples/apps directory, one directory per example. Each directory contains the application XML file (workflow, or workflow and coordinator), the job.properties file to submit the job, and any JAR files the example may need.
The inputs for all examples are in the examples/input-data/ directory.
The examples create output under the examples/output-data/${EXAMPLE_NAME} directory.
Note : The job.properties file needs to be a local file during submissions, and not a HDFS path.
*How to run an example application
bin/oozie job -oozie http://localhost:11000/oozie -config examples/apps/map-reduce/job.properties -run
Check the workflow job status:
oozie job -oozie http://localhost:11000/oozie -info 0000000-171221101425718-oozie-beif-W
To avoid having to provide the -oozie option with the Oozie URL with every oozie command, set OOZIE_URL env variable to the Oozie URL in the shell environment. For example:
export OOZIE_URL="http://localhost:11000/oozie"
oozie job -info 0000000-171221101425718-oozie-beif-W
6. Workflow Scheduling with Oozie
<1>Definitions
Action: An execution/computation task (a Map-Reduce job, a Pig job, a shell command). It can also be referred to as a task or 'action node'.
Workflow: A collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Workflow Definition: A programmatic description of a workflow that can be executed.
Workflow Definition Language: The language used to define a Workflow Definition.
Workflow Job: An executable instance of a workflow definition.
Workflow Engine: A system that executes workflow jobs. It can also be referred to as a DAG engine.
<2>Workflow Definition
A workflow definition is a DAG with control flow nodes (start, end, decision, fork, join, kill) and action nodes (map-reduce, pig, etc.); nodes are connected by transition arrows.
The workflow definition language is XML based and it is called hPDL (Hadoop Process Definition Language)
<3>Workflow Nodes
Workflow nodes are classified into control flow nodes and action nodes:
Control flow nodes: nodes that control the start and end of the workflow and workflow job execution path.
Action nodes: nodes that trigger the execution of a computation/processing task.
Node names and transitions must conform to the pattern [a-zA-Z][-_a-zA-Z0-9]*, of up to 20 characters long.
* Control Flow Nodes
Control flow nodes define the beginning and the end of a workflow (the start , end and kill nodes) and provide a mechanism to control the workflow execution path (the decision , fork and join nodes).
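A minimal hPDL sketch showing how start, decision, kill, and end fit around an action node (the node names, the `run` property, and the fs-action body are illustrative, not taken from this article; fork and join would slot in the same way):

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="control-flow-sketch">
    <start to="check"/>
    <!-- decision node: routes execution based on an EL predicate;
         "run" is assumed to be defined in job.properties -->
    <decision name="check">
        <switch>
            <case to="work">${run eq 'yes'}</case>
            <default to="end"/>
        </switch>
    </decision>
    <!-- an action node: every action declares ok/error transitions -->
    <action name="work">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-sketch"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```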
7. MapReduce Action
<1>Key points
A map-reduce action can be configured to perform file system cleanup and directory creation before starting the map-reduce job.
The workflow job will wait until the Hadoop map/reduce job completes before continuing to the next action in the workflow execution path.
The counters of the Hadoop job and the job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop job ends.
The map-reduce action has to be configured with all the necessary Hadoop JobConf properties to run the Hadoop map/reduce job
<2>Example
1) Preparation
mkdir oozie-apps
cd oozie-apps/
cp -r ../examples/apps/map-reduce .
mv map-reduce mr-wordcount-wf
2) Edit workflow.xml (switching from the old to the new MapReduce API)
<workflow-app xmlns="uri:oozie:workflow:0.5" name="mr-wordcount-wf">
    <start to="mr-node-wordcount"/>
    <action name="mr-node-wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountReducer</value>
                </property>
                <property>
                    <name>mapreduce.map.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.map.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${nameNode}/${oozieDataRoot}/${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
<3>Edit job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/mr-wordcount-wf/workflow.xml
inputDir=mr-wordcount-wf/input
outputDir=mr-wordcount-wf/output
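As a quick sanity check of how these variables compose (a plain-shell sketch of the substitution, not something Oozie itself runs), the application path resolves as follows:

```shell
# Mimic Oozie's variable substitution for oozie.wf.application.path,
# using the values from the job.properties above.
nameNode="hdfs://hadoop-senior.ibeifeng.com:8020"
oozieAppsRoot="user/beifeng/oozie-apps"
appPath="${nameNode}/${oozieAppsRoot}/mr-wordcount-wf/workflow.xml"
echo "${appPath}"
```

Note that oozieAppsRoot has no leading slash; the slash comes from the `${nameNode}/${oozieAppsRoot}` concatenation.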
<4>Copy the MapReduce jar into the oozie-apps/mr-wordcount-wf/lib directory
<5>Upload oozie-apps to HDFS
hdfs dfs -put oozie-apps/ oozie-apps
<6>Create the data directory and upload the input data
hdfs dfs -mkdir -p oozie/datas/mr-wordcount-wf/input
hdfs dfs -put /opt/datas/wc.txt oozie/datas/mr-wordcount-wf/input
<7>Run
export OOZIE_URL="http://localhost:11000/oozie"
bin/oozie job -config oozie-apps/mr-wordcount-wf/job.properties -run
3) How to define a workflow
* job.properties
Key point: it points to the HDFS location of the workflow.xml file
* workflow.xml
the definition file
an XML file
containing the following:
* start
* action
MapReduce、Hive、Sqoop、Shell
* ok
* error
* kill
* end
* lib directory
the dependent jar files
Writing workflow.xml:
* control flow nodes
* action nodes
MapReduce Action
How to schedule a MapReduce program with Oozie
Key point:
take the [Driver] part of the old Java MapReduce program and express it as <configuration> properties
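As a hedged illustration of this Driver-to-configuration translation (the Java calls in the comments are typical Driver lines, not quoted from this article; the class names are the WordCount ones used above):

```xml
<!-- job.setMapperClass(WordCountMapper.class) becomes: -->
<property>
    <name>mapreduce.job.map.class</name>
    <value>com.ibeifeng.hadoop.senior.mapreduce.WordCount$WordCountMapper</value>
</property>
<!-- job.setMapOutputKeyClass(Text.class) becomes: -->
<property>
    <name>mapreduce.map.output.key.class</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<!-- FileInputFormat.addInputPath(job, new Path(args[0])) becomes: -->
<property>
    <name>mapreduce.input.fileinputformat.inputdir</name>
    <value>${nameNode}/${oozieDataRoot}/${inputDir}</value>
</property>
```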
8. Hive Action Example
<1>Prepare the example
cp -r ../examples/apps/hive .
mv hive hive-select
hdfs dfs -put hive-select/ oozie-apps/
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/hive-select
outputDir=hive-select/output
Note: preparation
cp /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/conf/hive-site.xml .    // bring in the Hive configuration file
mkdir lib
cd lib
cp /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/lib/mysql-connector-java-5.1.27-bin.jar .    // bring in the JDBC driver jar
hdfs dfs -put lib/ /user/beifeng/oozie-apps/hive-select/
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive-select-wf">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <job-xml>${nameNode}/${oozieAppsRoot}/hive-select/hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>select.sql</script>
            <param>OUTPUT=${nameNode}/${oozieDataRoot}/${outputDir}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*select.sql
insert overwrite directory '${OUTPUT}' select count(*) from default.student;
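A plain-shell sketch of what the `<param>` element in the workflow above does: Oozie substitutes OUTPUT into the script before Hive runs it (the HDFS path below is composed from the job.properties values; this is an illustration, not Oozie's actual implementation):

```shell
# Value that ${nameNode}/${oozieDataRoot}/${outputDir} resolves to
OUTPUT="hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/oozie/datas/hive-select/output"
# select.sql after parameter substitution:
sql="insert overwrite directory '${OUTPUT}' select count(*) from default.student;"
echo "$sql"
```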
9. Sqoop Action Example
<1>Prepare the example
cp -r /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/examples/apps/sqoop .
mv sqoop sqoop-imp-user
cp /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/oozie-apps/hive-select/lib/mysql-connector-java-5.1.27-bin.jar lib/
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/sqoop-imp-user
outputDir=sqoop-imp-user/output
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.3" name="sqoop-user-wf">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${oozieDataRoot}/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>sqoop --options-file sqoop-imp-user.sqoop</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*sqoop-imp-user.sqoop
import
--connect
jdbc:mysql://hadoop-senior.ibeifeng.com:3306/test
--username
root
--password
123456
--table
my_user
--target-dir
${nameNode}/${oozieDataRoot}/${outputDir}
--num-mappers
1
Notes:
In a Sqoop options file, each option and each value goes on its own line.
Under the hood, Sqoop also runs MapReduce, using the new API.
Inside Sqoop options, only double quotes may be used.
10. Shell Action
<1>Prepare the example
cp -r /opt/cdh-5.3.6/oozie-4.0.0-cdh5.3.6/examples/apps/shell .
mv shell shell-select
<2>Edit the configuration files
*job.properties
nameNode=hdfs://hadoop-senior.ibeifeng.com:8020
jobTracker=hadoop-senior.ibeifeng.com:8032
queueName=default
oozieAppsRoot=user/beifeng/oozie-apps
oozieDataRoot=user/beifeng/oozie/datas
oozie.wf.application.path=${nameNode}/${oozieAppsRoot}/shell-select
exec=student-select.sh
script=student-select.sql
*workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${exec}</exec>
            <file>${nameNode}/${oozieAppsRoot}/shell-select/${exec}#${exec}</file>
            <file>${nameNode}/${oozieAppsRoot}/shell-select/${script}#${script}</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
*student-select.sh
#!/bin/sh
/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/bin/hive -f student-select.sql
*student-select.sql
insert overwrite directory '/user/beifeng/oozie/datas/shell-select/output' select * from default.student;
11. Oozie Coordinator Scheduling