Executing an Oozie workflow with Pig, Hive & Sqoop actions

Earlier blog entries covered how to install Oozie and how to do Click Stream analysis using Hive and Pig. This entry is about executing a simple workflow which imports the user data from a MySQL database using Sqoop, pre-processes the Click Stream data using Pig, and finally does some basic analytics on the user and the Click Stream data using Hive.
Before running a Hive query, the tables, columns, and column types have to be defined, so the data for Hive needs to have some structure. Pig, on the other hand, is better suited than Hive for processing semi-structured data. That is Pig vs Hive at a very high level. Because of this, a common use case is Pig being used for pre-processing the data (filtering out invalid records, massaging the data, etc.) to make it ready for Hive to consume.
The DAG below was generated by Oozie. The fork spawns a Pig action (which cleans the Click Stream data) and a Sqoop action (which imports the user data from a MySQL database) in parallel. Once the Pig and Sqoop actions are done, the Hive action starts the final analytics, combining the Click Stream and the user data.

[Figure: Oozie-generated DAG showing the fork into the pig-node and sqoop-node actions, the join, and the hive-node action]

Here are the steps to define the workflow and then execute it, with the assumption that MySQL, Oozie, and Hadoop have been installed, configured, and are working properly (the instructions for installing and configuring Oozie are in the earlier entry).
- The workflow requires more than two map slots in the cluster, because Oozie runs each action through a launcher map task and the fork runs the Pig and Sqoop actions (plus the MapReduce jobs they launch) in parallel. So, if the workflow is executed on a single-node cluster, the following has to be included in `mapred-site.xml`.

```xml
<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
</property>
```

- Create the file `oozie-clickstream-examples/input-data/clickstream/clickstream.txt` in HDFS with the content below (an upload command follows the listing). Note that the last record is an invalid record, which is filtered out by Pig when the workflow is executed. The first field is the userId and the second field is the site visited by the user.

```
1,www.bbc.com
1,www.abc.com
1,www.gmail.com
2,www.cnn.com
2,www.eenadu.net
2,www.stackoverflow.com
2,www.businessweek.com
3,www.eenadu.net
3,www.stackoverflow.com
3,www.businessweek.com
A,www.thecloudavenue.com
```
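A quick way to get this file into HDFS, assuming it was first saved locally as `/home/vm4learning/Code/clickstream.txt` (a hypothetical local path, consistent with the paths used elsewhere in this post):

```sh
# Create the input directory and upload the sample click stream data
bin/hadoop fs -mkdir /user/vm4learning/oozie-clickstream-examples/input-data/clickstream
bin/hadoop fs -put /home/vm4learning/Code/clickstream.txt /user/vm4learning/oozie-clickstream-examples/input-data/clickstream/clickstream.txt
```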

- Create a user table in MySQL (a note on the database it belongs to follows the listing).

```sql
CREATE TABLE user (
    user_id INTEGER NOT NULL PRIMARY KEY,
    name CHAR(32) NOT NULL,
    age INTEGER,
    country VARCHAR(32),
    gender CHAR(1)
);
```
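The Sqoop command in the workflow connects to `jdbc:mysql://localhost/clickstream`, so the table is expected to live in a database named `clickstream`. A minimal sketch, in case that database does not exist yet:

```sql
-- The database name comes from the Sqoop connect string in workflow.xml
CREATE DATABASE IF NOT EXISTS clickstream;
USE clickstream;
-- ...then run the CREATE TABLE statement above
```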

- And insert some data into it

```sql
insert into user values (1,"Tom",20,"India","M");
insert into user values (2,"Rick",5,"India","M");
insert into user values (3,"Rachel",15,"India","F");
```

- Extract the `oozie-4.0.0/oozie-sharelib-4.0.0.tar.gz` file from the Oozie installation folder and copy the mysql-connector-java-*.jar into the extracted `share/lib/sqoop` folder. This jar is required for Sqoop to connect to the MySQL database and fetch the user data.
- Copy the above-mentioned `share` folder into HDFS (a listing to verify the upload follows the command). This is the significance of the sharelib in Oozie: it holds the common libraries which are used across the different actions in Oozie.

```sh
bin/hadoop fs -put /home/vm4learning/Code/share/ /user/vm4learning/share/
```
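To confirm that the sharelib, including the MySQL connector jar, landed where the Sqoop action will look for it:

```sh
# The sqoop sub-folder should list mysql-connector-java-*.jar among the jars
bin/hadoop fs -ls /user/vm4learning/share/lib/sqoop
```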

- Create the workflow file in HDFS (oozie-clickstream-examples/apps/cs/workflow.xml). Note that the MySQL connect string in the Sqoop command has to be modified appropriately. The finished definition can be checked with the Oozie CLI, as shown after the listing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="cs-wf-fork-join">
    <start to="fork-node"/>
    <fork name="fork-node">
        <path start="sqoop-node" />
        <path start="pig-node" />
    </fork>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${examplesRootDir}/input-data/user"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <command>import --connect jdbc:mysql://localhost/clickstream --table user --target-dir ${examplesRootDir}/input-data/user -m 1</command>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}${examplesRootDir}/intermediate"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>filter.pig</script>
            <param>INPUT=${examplesRootDir}/input-data/clickstream</param>
            <param>OUTPUT=${examplesRootDir}/intermediate</param>
        </pig>
        <ok to="joining"/>
        <error to="fail"/>
    </action>
    <join name="joining" to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/${examplesRootDir}/finaloutput"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.sql</script>
            <param>CLICKSTREAM=${examplesRootDir}/intermediate/</param>
            <param>USER=${examplesRootDir}/input-data/user/</param>
            <param>OUTPUT=${examplesRootDir}/finaloutput</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```
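Before submitting, the workflow definition can be checked against the Oozie schemas with the `validate` sub-command, run against a local copy of the file (the path below is illustrative):

```sh
# Reports "Valid workflow-app" or the schema error and its location
bin/oozie validate /home/vm4learning/Code/oozie-clickstream-examples/apps/cs/workflow.xml
```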

- Create the job.properties file (oozie-clickstream-examples/apps/cs/job.properties). As noted in the execution step below, Oozie reads this file from the local file system, not from HDFS. A sketch for staging the application folder in HDFS follows the listing.

```
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
queueName=default
examplesRoot=oozie-clickstream-examples
examplesRootDir=/user/${user.name}/${examplesRoot}
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/cs
```
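The `oozie.wf.application.path` property points Oozie at the `apps/cs` folder in HDFS. Assuming the workflow file and the Hive and Pig scripts from the neighbouring steps are authored under a matching local folder (a hypothetical layout), the whole application can be staged in one command:

```sh
# Uploads workflow.xml, script.sql and filter.pig under apps/cs in HDFS
bin/hadoop fs -put /home/vm4learning/Code/oozie-clickstream-examples/apps /user/vm4learning/oozie-clickstream-examples/apps
```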

- Create the Hive script file in HDFS (oozie-clickstream-examples/apps/cs/script.sql). The query below finds the top three URLs visited by users whose age is less than 16.

```sql
DROP TABLE clickstream;
CREATE EXTERNAL TABLE clickstream (userid INT, url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '${CLICKSTREAM}';
DROP TABLE user;
CREATE EXTERNAL TABLE user (user_id INT, name STRING, age INT, country STRING, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${USER}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT url, count(url) c FROM user u JOIN clickstream c ON (u.user_id = c.userid) WHERE u.age < 16 GROUP BY url ORDER BY c DESC LIMIT 3;
```

- Create the Pig script file in HDFS (oozie-clickstream-examples/apps/cs/filter.pig). Note that `STORE` with the default PigStorage writes tab-delimited output, which is why the Hive `clickstream` table above uses `'\t'` as its field terminator. A local-mode test run follows the listing.

```pig
clickstream = load '$INPUT' using PigStorage(',') as (userid:int, url:chararray);
SPLIT clickstream INTO good_records IF userid is not null, bad_records IF userid is null;
STORE good_records into '$OUTPUT';
```
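In line with the final thoughts below about testing each action outside Oozie first, the script can be dry-run in Pig's local mode against a local copy of the input (the paths are illustrative):

```sh
# Local mode needs no cluster; the invalid "A,..." record ends up filtered out
pig -x local -param INPUT=/home/vm4learning/Code/clickstream.txt -param OUTPUT=/tmp/intermediate filter.pig
```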

- Execute the Oozie workflow as below; a way to check the job status from the CLI follows the command. Note that the `job.properties` file should be present in the local file system and not in HDFS.

```sh
bin/oozie job -oozie http://localhost:11000/oozie -config /home/vm4learning/Code/oozie-clickstream-examples/apps/cs/job.properties -run
```
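The `-run` option prints a workflow job ID. Besides the web console mentioned below, the same CLI can poll the job and action states (the job ID here is a placeholder):

```sh
# Shows the overall workflow state and the status of each action node
bin/oozie job -oozie http://localhost:11000/oozie -info 0000001-140101000000000-oozie-oozi-W
```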

- Initially the job will be in the `RUNNING` state and finally it will reach the `SUCCEEDED` state. The progress of the workflow can be monitored from the Oozie console at http://localhost:11000/oozie/.

[Screenshot: Oozie web console showing the workflow job progressing through the RUNNING and SUCCEEDED states]

- The output should appear as below in the `oozie-clickstream-examples/finaloutput/000000_0` file in HDFS. Only users 2 and 3 are younger than 16, and each of these three sites was visited by both of them, hence the count of 2.

```
www.businessweek.com 2
www.eenadu.net 2
www.stackoverflow.com 2
```

Here are some final thoughts on Oozie.
- It's better to test the individual actions like Hive, Pig, and Sqoop independent of Oozie and only later integrate them into the Oozie workflow.
- The Oozie error messages are very cryptic, and the MapReduce log files have to be examined to figure out the actual error.
- The web UI which comes with Oozie is very rudimentary and clumsy; some alternatives need to be looked into.
- The XML for creating the workflows is verbose and error-prone. Any UI for creating Oozie workflows would be very helpful.
A future blog entry will look into alternatives for some of the problems mentioned above.

Posted by Praveen Sripati at 12:15 PM


Labels: oozie, workflow
