Running Oozie with the bundled oozie-examples.tar.gz.
Extract oozie-examples.tar.gz on the local filesystem, then copy the examples directory (and the sharelib) to HDFS:
$ hdfs dfs -put examples examples
$ hdfs dfs -put share share
$ bin/oozie job -oozie http://hadoop.com:11000/oozie -config examples/apps/map-reduce/job.properties -run
job: 0000001-170331194520545-oozie-huan-W
$ oozie job -oozie http://hadoop.com:11000/oozie -info 0000001-170331194520545-oozie-huan-W
Workflow Name : map-reduce-wf
App Path      : hdfs://hadoop.com:8020/user/huangxgc/examples/apps/map-reduce/workflow.xml
Status        : RUNNING
Run           : 0
User          : huangxgc
Group         : -
Created       : 2017-03-31 13:02 GMT
Started       : 2017-03-31 13:02 GMT
Last Modified : 2017-03-31 13:26 GMT
Ended         : -
CoordAction ID: -
Setting the OOZIE_URL environment variable lets you omit the -oozie option from later commands:
$ export OOZIE_URL="http://hadoop.com:11000/oozie"
Local Oozie example: Oozie also provides a Java client API.
import org.apache.oozie.local.LocalOozie;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
import java.util.Properties;
...
// Start local Oozie
LocalOozie.start();
// Get an OozieClient for the local Oozie
OozieClient wc = LocalOozie.getClient();
// Create a workflow job configuration and set the workflow application path
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://hadoop.com:8020/user/huangxg/my-wf-app");
// Set workflow parameters
conf.setProperty("ResourceManager", "hadoop.com:8032");
conf.setProperty("inputDir", "/user/huangxgc/indir");
conf.setProperty("outputDir", "/user/huangxgc/outdir");
...
// Submit and start the workflow job
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");
// Poll the workflow job every 10 seconds until it completes
while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
}
// Print the final status of the workflow job
System.out.println("Workflow job completed ...");
System.out.println(wc.getJobInfo(jobId));
// Stop local Oozie
LocalOozie.stop();
...
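To compile the snippet above you need the Oozie client API and the LocalOozie class (which ships in the oozie-core artifact) on the classpath. A minimal Maven sketch; the coordinates and the 4.3.0 version are assumptions to verify against your Oozie distribution:
<!-- Assumed Maven coordinates; verify against your Oozie distribution -->
<dependency>
    <groupId>org.apache.oozie</groupId>
    <artifactId>oozie-client</artifactId>
    <version>4.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.oozie</groupId>
    <artifactId>oozie-core</artifactId>
    <version>4.3.0</version>
</dependency>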
Oozie action types by version:
Available in Oozie 4.0.0:
Email Action
Shell Action
Hive Action
Sqoop Action
Ssh Action
DistCp Action
Writing a Custom Action Executor
Oozie 4.3.0 adds support for:
Hive 2 Action
Spark Action
Oozie Specification, a Hadoop Workflow System
- 3.2.1 Action Basis
- 3.2.2 Map-Reduce Action
- 3.2.3 Pig Action
- 3.2.4 Fs (HDFS) action
- 3.2.5 Ssh Action
- 3.2.6 Sub-workflow Action
- The Workflow Definition
A workflow definition is a DAG with control flow nodes (start, end, decision, fork, join, kill) and action nodes (map-reduce, pig, etc.); nodes are connected by transition arrows.
The workflow definition language is XML based and is called hPDL (Hadoop Process Definition Language).
Refer to Appendix A for the Oozie Workflow Definition XML Schema; Appendix B has Workflow Definition Examples.
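To make the DAG concrete, here is a minimal but complete hPDL sketch: one start, one action, and the two exits every action needs. The fs action and all node names are illustrative choices, not taken from the examples archive:
<workflow-app name="minimal-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="make-dir"/>
    <action name="make-dir">
        <!-- An fs action needs no cluster settings; it runs inside the Oozie server -->
        <fs>
            <mkdir path="${nameNode}/user/${wf:user()}/demo"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>fs action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>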
- Workflow Nodes
- Workflow nodes fall into two categories, control flow nodes and action nodes:
- Control flow nodes: nodes that control the start and end of the workflow and the workflow job's execution path.
- Action nodes: nodes that trigger the execution of a computation/processing task.
Node names and transitions must conform to the pattern [a-zA-Z][\-_a-zA-Z0-9]* and be at most 20 characters long.
- Control Flow Nodes (6 node types)
- Start/end nodes: start, end, and kill
- Execution path nodes: decision, fork, and join
- The start node is the entry point of a workflow job; every workflow must have one start node.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <start to="[NODE-NAME]"/>
    ...
</workflow-app>
- The end node signals successful completion: when a workflow job reaches the end node, it finishes successfully.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <end name="[NODE-NAME]"/>
    ...
</workflow-app>
- The kill node allows a workflow job to kill itself; when a job reaches a kill node it finishes in error (KILLED), and the message element is logged as the reason.
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <kill name="[NODE-NAME]">
        <message>[MESSAGE-TO-LOG]</message>
    </kill>
    ...
</workflow-app>
- A decision node selects one execution path to follow. It behaves like a switch-case statement: the first case whose predicate evaluates to true is taken, and if none does, the default path is taken.
- Predicates are EL expressions that resolve to a boolean, for example:
${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <decision name="[NODE-NAME]">
        <switch>
            <case to="[NODE_NAME]">[PREDICATE]</case>
            ...
            <case to="[NODE_NAME]">[PREDICATE]</case>
            <default to="[NODE_NAME]"/>
        </switch>
    </decision>
    ...
</workflow-app>
- A fork node splits one execution path into several concurrent paths; the matching join node waits until every path started by the fork has completed. Syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <fork name="[FORK-NODE-NAME]">
        <path start="[NODE-NAME]"/>
        ...
        <path start="[NODE-NAME]"/>
    </fork>
    ...
    <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]"/>
    ...
</workflow-app>
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
...
<fork name="forking">
<path start="firstparalleljob"/>
<path start="secondparalleljob"/>
</fork>
<action name="firstparallejob">
<map-reduce>
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<job-xml>job1.xml</job-xml>
</map-reduce>
<ok to="joining"/>
<error to="kill"/>
</action>
<action name="secondparalleljob">
<map-reduce>
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<job-xml>job2.xml</job-xml>
</map-reduce>
<ok to="joining"/>
<error to="kill"/>
</action>
<join name="joining" to="nextaction"/>
...
</workflow-app>
- Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.
- 3.2.1 Action Basis
- 3.2.2 Map-Reduce Action
- 3.2.3 Pig Action
- 3.2.4 Fs (HDFS) action
- 3.2.5 Ssh Action
- 3.2.6 Sub-workflow Action
- 3.2.7 Java Action
- Map-Reduce Action
Example: a WordCount workflow implemented as a map-reduce action. The XML comments below are explanatory only and should be removed from a real workflow definition; a matching job.properties sketch follows the workflow.
<workflow-app xmlns="uri:oozie:workflow:0.5" name="myoozie-wordcount-wf">
    <start to="mr-node-wordcount"/>
    <action name="mr-node-wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Delete the MapReduce output directory before the job runs -->
            <prepare>
                <!-- The directory variables can be reconfigured in job.properties -->
                <delete path="${nameNode}/${wf:user()}/${examplesRoot}/output-data/${outputDir}"/>
            </prepare>
            <configuration>
                <!-- Enable the new MapReduce API; Oozie uses the old API by default -->
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>org.apache.oozie.example.SampleMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>org.apache.oozie.example.SampleReducer</value>
                </property>
                <!-- Input and output paths -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}</value>
                </property>
                <!-- Key/value classes for the map output and the job output; -->
                <!-- change to match your job, and configure shuffle settings as needed -->
                <property>
                    <name>mapreduce.job.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.job.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapreduce.map.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapreduce.map.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
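The ${...} variables above are supplied by the job.properties file passed with -config. A minimal sketch, assuming the hadoop.com hostnames used earlier in this article; adjust the values to your cluster:
# Assumed hostnames; match them to your NameNode and ResourceManager
nameNode=hdfs://hadoop.com:8020
jobTracker=hadoop.com:8032
queueName=default
examplesRoot=examples
outputDir=map-reduce
# HDFS directory that contains workflow.xml
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce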
- Hive Action
Syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <!-- HDFS location of the hive-site.xml settings file -->
            <job-xml>[HIVE SETTINGS FILE]</job-xml>
            <configuration>
                <!-- Hadoop configuration properties -->
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <!-- HDFS location of the Hive script -->
            <script>[HIVE-SCRIPT]</script>
            <!-- Parameters passed to the script -->
            <param>[PARAM-VALUE]</param>
            ...
            <param>[PARAM-VALUE]</param>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </hive>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
- Notes:
- The Hadoop mapred.job.tracker and fs.default.name properties must not appear in the inline configuration.
- Hive uses the old Hadoop API by default (mapred.mapper.new-api defaults to false); check whether your job-xml or inline configuration needs to change it.
- A running Hive job can be monitored through the ResourceManager web UI (port 8088) and the JobHistory Server.
<workflow-appname="sample-wf"xmlns="uri:oozie:workflow:0.5">
...
<actionname="myfirsthivejob">
<hivexmlns="uri:oozie:hive-action:0.5">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<deletepath="${nameNode}/${wf:user()}/${examplesRoot}/${outputDir}"/>
</prepare>
<job-XMl>${{nameNode}${nameNode}/${wf:user()}/${examplesRoot}/job-xml.xml/${}
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>myscript.q</script>
<param>${nameNode}/${wf:user()}/${examplesRoot}/${inputDir}</param>
<param>${nameNode}/${wf:user()}/${examplesRoot}/${outputDir}</param>
</hive>
<okto="myotherjob"/>
<errorto="errorcleanup"/>
</action>
...
</workflow-app>
- Sqoop Action
Syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <command>[SQOOP-COMMAND]</command>
            <arg>[SQOOP-ARGUMENT]</arg>
            ...
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </sqoop>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Example: a Sqoop import specified through multiple arg elements:
<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:hsqldb:file:db.hsqldb</arg>
<arg>--table</arg>
<arg>TT</arg>
<arg>--target-dir</arg>
<arg>hdfs://localhost:8020/user/tucu/foo</arg>
<arg>-m</arg>
<arg>1</arg>
Sqoop command
- The Sqoop invocation can be specified either with a single command element or through multiple arg elements, as in the example above; a command-element sketch follows below.
- Options can also be read from a file with bin/sqoop --options-file xxx.txt.
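For comparison with the arg list above, the same import written with a single command element, a form the Sqoop action schema also accepts. Oozie splits the command on whitespace, so the arg form is safer when an argument contains spaces:
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- The whole Sqoop invocation on one line -->
    <command>import --connect jdbc:hsqldb:file:db.hsqldb --table TT --target-dir hdfs://localhost:8020/user/tucu/foo -m 1</command>
</sqoop>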
- Shell Action
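A minimal shell action sketch, assuming the uri:oozie:shell-action:0.2 schema and the ${jobTracker}/${nameNode}/${queueName} parameters used in the earlier examples; the echo command and node names are illustrative:
<action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <!-- Command to run and its arguments -->
        <exec>echo</exec>
        <argument>Hello Oozie</argument>
        <!-- Expose the command's stdout to later nodes via wf:actionData() -->
        <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>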
Oozie Coordinator (scheduling)
- Make sure time zones are consistent.
- Set the oozie.processing.timezone property to GMT+0800; once set, every time value must carry the +0800 suffix (see the oozie-site.xml sketch below).
- The Oozie web console must be set to the same time zone, otherwise the times shown in the console will not match; edit the getTimeZone function in ${OOZIE_HOME}/oozie-server/webapps/oozie/oozie-console.js.
Example of the resulting time format: Sun, 02 Apr 2017 19:23:39 +0800
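A sketch of the corresponding oozie-site.xml entry; the property name comes from the note above, and the assumption that a server restart is needed for it to take effect is worth verifying for your distribution:
<property>
    <name>oozie.processing.timezone</name>
    <value>GMT+0800</value>
</property>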
- Frequency and Time-Period Representation
- The coordinator supports two kinds of frequency expressions, EL constants and cron syntax (a coordinator sketch follows the tables below):

| EL Constant | Value | Example |
|---|---|---|
| ${coord:minutes(int n)} | n | ${coord:minutes(45)} --> 45 |
| ${coord:hours(int n)} | n * 60 | ${coord:hours(3)} --> 180 |
| ${coord:days(int n)} | variable | ${coord:days(2)} --> minutes in 2 full days from the current date |
| ${coord:months(int n)} | variable | ${coord:months(1)} --> minutes in 1 full month from the current date |
| cron syntax | variable | 0,10 15 * * 2-6 --> a job that runs every weekday at 3:00pm and 3:10pm UTC |
| Cron Expression | Meaning |
|---|---|
| 10 9 * * * | Runs every day at 9:10am |
| 10,30,45 9 * * * | Runs every day at 9:10am, 9:30am, and 9:45am |
| 0 * 30 JAN 2-6 | Runs at minute 0 of every hour on weekdays and on the 30th of January |
| 0/20 9-17 * * 2-5 | Runs every Mon, Tue, Wed, and Thu at minutes 0, 20, 40 from 9am to 5pm |
| 1 2 L-3 * * | Runs on the third-to-last day of every month at 2:01am |
| 1 2 6W 3 ? | Runs on the weekday nearest to March 6th every year at 2:01am |
| 1 2 * 3 3#2 | Runs on the second Tuesday of March at 2:01am every year |
| 0 10,13 * * MON-FRI | Runs every weekday at 10am and 1pm |
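Tying the pieces together, a minimal coordinator sketch that runs a workflow once a day. It assumes the GMT+0800 processing timezone configured above (hence the +0800 offsets) and a hypothetical workflowAppPath parameter defined in the coordinator's job.properties; names and dates are illustrative:
<coordinator-app name="daily-wordcount-coord" frequency="${coord:days(1)}"
                 start="2017-04-02T19:00+0800" end="2017-12-31T19:00+0800"
                 timezone="GMT+0800" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- workflowAppPath is a hypothetical property: the HDFS directory holding workflow.xml -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>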