An action node triggers a workflow task, which can be a Pig job, a Java program, and so on.
Tasks triggered by an action all run remotely, not inside Oozie itself.
Triggered tasks are executed asynchronously: the workflow only moves to the next node once all of the tasks have finished (synchronously executed actions are the exception).
Oozie learns that a task has finished either via a callback (each task gets a unique callback URL) or by polling. The task should notify Oozie through the callback URL, but since the callback can be lost for network or other reasons, Oozie also polls to detect task completion.
Actions Have 2 Transitions, =ok= and =error=
If a computation/processing task -triggered by a workflow- completes successfully, it transitions to =ok=.
If a computation/processing task -triggered by a workflow- fails to complete successfully, it transitions to =error=.
If a computation/processing task exits in error, the task must provide error-code and error-message information to Oozie. This information can be used by decision nodes to implement fine-grained error handling at the workflow application level.
Each action type must clearly define all the error codes it can produce.
On success the action transitions to =ok=; on failure it transitions to =error=.
Every error must define error-code and error-message information; decision nodes can use this information to decide where the workflow goes next.
Each action must define all the error codes it can produce.
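As a hedged sketch (the node names, the action body, and the 'JA018' code below are hypothetical illustrations, not taken from the text above), an =error= transition can point to a decision node that inspects the failed action's error code through the wf:errorCode() EL function:

<action name="my-mr-action">
    <map-reduce>
        ...
    </map-reduce>
    <ok to="next-step"/>
    <error to="check-error"/>
</action>
<decision name="check-error">
    <switch>
        <!-- 'JA018' is only an example; each action type defines its own error codes -->
        <case to="cleanup-and-retry">${wf:errorCode('my-mr-action') eq 'JA018'}</case>
        <default to="fail"/>
    </switch>
</decision>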
Map-Reduce Action
Triggers a map-reduce job.
It can be configured to clean up files before execution.
The workflow blocks until the map-reduce job finishes, and only then transitions to the next action.
The exit status (=FAILED=, =KILLED= or =SUCCEEDED=) determines which node the workflow transitions to.
The Hadoop JobConf properties that the map-reduce job needs can be specified as part of:
- the config-default.xml or
- JobConf XML file bundled with the workflow application or
- the <global> tag in the workflow definition, or
- Inline map-reduce action configuration, or
- An implementation of OozieActionConfigurator specified by the <config-class> tag in the workflow definition.
For example:
package com.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.oozie.action.hadoop.OozieActionConfigurator;
import org.apache.oozie.action.hadoop.OozieActionConfiguratorException;
import org.apache.oozie.example.SampleMapper;
import org.apache.oozie.example.SampleReducer;

public class MyConfigClass implements OozieActionConfigurator {

    @Override
    public void configure(JobConf actionConf) throws OozieActionConfiguratorException {
        if (actionConf.getUser() == null) {
            throw new OozieActionConfiguratorException("No user set");
        }
        actionConf.setMapperClass(SampleMapper.class);
        actionConf.setReducerClass(SampleReducer.class);
        FileInputFormat.setInputPaths(actionConf, new Path("/user/" + actionConf.getUser() + "/input-data"));
        FileOutputFormat.setOutputPath(actionConf, new Path("/user/" + actionConf.getUser() + "/output"));
        ...
    }
}
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <action name="[NODE-NAME]">
        <map-reduce>
            ...
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <config-class>com.example.MyConfigClass</config-class>
            ...
        </map-reduce>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Adding Files and Archives for the Job
The file and archive elements make files and archives available to map-reduce jobs. If the specified path is relative, it is assumed the file or archive is within the application directory, in the corresponding sub-path. If the path is absolute, the file or archive is expected at the given absolute path.
Files specified with the file element will be symbolic links in the home directory of the task.
If a file is a native library (an '.so' or a '.so.#' file), it will be symlinked as an '.so' file in the task running directory, and thus available to the task JVM.
To force a symlink for a file on the task running directory, use a '#' followed by the symlink name. For example 'mycat.sh#cat'.
Refer to the Hadoop distributed cache documentation for more details on files and archives.
These elements specify files or archives for the job.
The files can be shell scripts, native libraries, and so on.
Use '#' followed by a symlink name to make the file available under that name, for example 'mycat.sh#cat'.
If a relative path is given, it is resolved against the application directory by default.
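A minimal sketch of the two elements inside a map-reduce action (the paths and symlink names are hypothetical; only the '#' convention comes from the text above):

<map-reduce>
    ...
    <!-- relative path, resolved against the workflow application directory; available to the task as 'cat' -->
    <file>scripts/mycat.sh#cat</file>
    <!-- absolute path; the archive is made available to the task under the 'mylib' symlink -->
    <archive>/user/someuser/lib/mylib.tgz#mylib</archive>
</map-reduce>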
Streaming
Streaming information can be specified in the streaming element.
The mapper and reducer elements are used to specify the executable/script to be used as mapper and reducer.
User defined scripts must be bundled with the workflow application and they must be declared in the files element of the streaming configuration. If they are not declared in the files element of the configuration, it is assumed they will be available (and in the command PATH) on the Hadoop slave machines.
Some streaming jobs require files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.
The mapper/reducer can be overridden by the mapred.mapper.class or mapred.reducer.class properties in the job-xml file or configuration elements.
The mapper and reducer elements specify the executable/script to use as mapper or reducer.
These files must be declared in the file element and bundled into the workflow application.
Streaming Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="firstjob">
        <map-reduce>
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <streaming>
                <mapper>/bin/bash testarchive/bin/mapper.sh testfile</mapper>
                <reducer>/bin/bash testarchive/bin/reducer.sh</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${input}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${output}</value>
                </property>
                <property>
                    <name>stream.num.map.output.key.fields</name>
                    <value>3</value>
                </property>
            </configuration>
            <file>/users/blabla/testfile.sh#testfile</file>
            <archive>/users/blabla/testarchive.jar#testarchive</archive>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    ...
</workflow-app>
Pig Action (notes)
The prepare element can be used to clean up files/directories before the job runs.
The configuration element is used to configure the job.
Pipes
Pipes information can be specified in the pipes element.
A subset of the command line options which can be used while using the Hadoop Pipes Submitter can be specified via elements - map, reduce, inputformat, partitioner, writer, program.
The program element is used to specify the executable/script to be used.
User defined program must be bundled with the workflow application.
Some pipe jobs require Files found on HDFS to be available to the mapper/reducer scripts. This is done using the file and archive elements described in the previous section.
Pipe properties can be overridden by specifying them in the job-xml file or configuration element.
This is for Hadoop Pipes; needs further investigation.
Syntax
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.5">
    ...
    <action name="[NODE-NAME]">
        <map-reduce>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <streaming>
                <mapper>[MAPPER-PROCESS]</mapper>
                <reducer>[REDUCER-PROCESS]</reducer>
                <record-reader>[RECORD-READER-CLASS]</record-reader>
                <record-reader-mapping>[NAME=VALUE]</record-reader-mapping>
                ...
                <env>[NAME=VALUE]</env>
                ...
            </streaming>
            <!-- Either streaming or pipes can be specified for an action, not both -->
            <pipes>
                <map>[MAPPER]</map>
                <reduce>[REDUCER]</reduce>
                <inputformat>[INPUTFORMAT]</inputformat>
                <partitioner>[PARTITIONER]</partitioner>
                <writer>[OUTPUTFORMAT]</writer>
                <program>[EXECUTABLE]</program>
            </pipes>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <config-class>com.example.MyConfigClass</config-class>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </map-reduce>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Pig Action
The pig action has to be configured with the job-tracker, name-node, pig script and the necessary parameters and configuration to run the Pig job.
The job-tracker, name-node, script, and the necessary parameters must be configured.
The configuration properties are loaded in the order job-xml then configuration, and values loaded later override values loaded earlier.
Parameters given in param elements are passed to the Pig script when it runs; the substitution is actually performed before execution.
The pig action starts a Pig job.
The workflow job will wait until the pig job completes before continuing to the next action.
A pig action can be configured to perform HDFS files/directories cleanup or HCatalog partitions cleanup before starting the Pig job. This capability enables Oozie to retry a Pig job in the situation of a transient failure (Pig creates temporary directories for intermediate data, thus a retry without cleanup would fail).
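A hedged sketch of such a cleanup (the output path and the ${nameNode} parameter are hypothetical conventions, not mandated by the text above):

<pig>
    ...
    <prepare>
        <!-- hypothetical path: delete stale output so a retried run does not fail on existing data -->
        <delete path="${nameNode}/user/${wf:user()}/pig-output"/>
    </prepare>
    ...
</pig>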
Hadoop JobConf properties can be specified as part of
- the config-default.xml or
- JobConf XML file bundled with the workflow application or
- the <global> tag in the workflow definition, or
- Inline pig action configuration.
The configuration properties are loaded in the above order, i.e. job-xml and then configuration, and the precedence order is that later values override earlier values.
Inline property values can be parameterized (templatized) using EL expressions.
The Hadoop mapred.job.tracker and fs.default.name properties must not be present in the job-xml and inline configuration.
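As a hedged example of such EL parameterization (mapred.job.queue.name is a standard Hadoop property, while ${queueName} is a hypothetical workflow parameter that would be supplied through job.properties or config-default.xml):

<configuration>
    <property>
        <name>mapred.job.queue.name</name>
        <!-- resolved from the hypothetical ${queueName} workflow parameter at submission time -->
        <value>${queueName}</value>
    </property>
</configuration>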
As with Hadoop map-reduce jobs, it is possible to add files and archives to be available to the Pig job; refer to the section 'Adding Files and Archives for the Job' above.
Syntax for Pig actions in Oozie schema 0.2:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.2">
    ...
    <action name="[NODE-NAME]">
        <pig>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <script>[PIG-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
            ...
            <param>[PARAM-VALUE]</param>
            <argument>[ARGUMENT-VALUE]</argument>
            ...
            <argument>[ARGUMENT-VALUE]</argument>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </pig>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
Syntax for Pig actions in Oozie schema 0.1:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <pig>
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <job-xml>[JOB-XML-FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <script>[PIG-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
            ...
            <param>[PARAM-VALUE]</param>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </pig>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
The prepare element, if present, indicates a list of paths to delete before starting the job. This should be used exclusively for directory cleanup or dropping of hcatalog table partitions for the job to be executed. The format to specify a hcatalog table partition URI is hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value]. In case of a hcatalog URI, the hive-site.xml needs to be shipped using file tag and the hcatalog and hive jars need to be placed in workflow lib directory or specified using archive tag.
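Following that URI format, a prepare block that drops a partition might look like the sketch below (the metastore host, port, database, table, and partition values are all hypothetical):

<prepare>
    <!-- drops partition ds=2014-05-01 of mydb.mytable before the Pig job starts -->
    <delete path="hcat://hcat.example.com:11002/mydb/mytable/ds=2014-05-01"/>
</prepare>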
The job-xml element, if present, must refer to a Hadoop JobConf job.xml file bundled in the workflow application. The job-xml element is optional and as of schema 0.4, multiple job-xml elements are allowed in order to specify multiple Hadoop JobConf job.xml files.
The configuration element, if present, contains JobConf properties for the underlying Hadoop jobs.
Properties specified in the configuration element override properties specified in the file specified in the job-xml element.
External Stats can be turned on/off by specifying the property oozie.action.external.stats.write as true or false in the configuration element of workflow.xml. The default value for this property is false.
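For example, turning external stats on amounts to the following property in the action's configuration element (true is the non-default value described above):

<configuration>
    <property>
        <name>oozie.action.external.stats.write</name>
        <value>true</value>
    </property>
</configuration>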
The inline and job-xml configuration properties are passed to the Hadoop jobs submitted by Pig runtime.
The script element contains the pig script to execute. The pig script can be templatized with variables of the form ${VARIABLE}. The values of these variables can then be specified using the params element.
NOTE: Oozie will perform the parameter substitution before firing the pig job. This is different from the parameter substitution mechanism provided by Pig, which has a few limitations.
The params element, if present, contains parameters to be passed to the pig script.
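A hedged sketch of this substitution (the script name, variable names, and paths are hypothetical): the script would reference ${INPUT} and ${OUTPUT}, and the action supplies their values through param elements, which Oozie substitutes before submitting the Pig job.

<pig>
    ...
    <script>wordcount.pig</script>
    <!-- substituted into ${INPUT} and ${OUTPUT} inside wordcount.pig before the job is fired -->
    <param>INPUT=/user/${wf:user()}/input-data</param>
    <param>OUTPUT=/user/${wf:user()}/output-data</param>
    ...
</pig>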