Pig 的flatten, cogroup

本文介绍了SQL中分组查询的使用方法,并演示了如何统计每个班级不及格和优秀学生人数。此外,还探讨了flatten操作对tuple和bag的影响,以及cogroup和union操作的应用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

原文:http://blog.youkuaiyun.com/zythy/article/details/18426347

2个数据集:
student. 结构为(classNo,stuNo,score) 

C01,	N0101,	82
C01, 	N0102,	59
C01, 	N0103,	65
C02, 	N0201,	81
C02, 	N0202,	82
C02, 	N0203,	79
C03, 	N0301,	56
C03, 	N0302,	92
C03, 	N0306,	72

teacher 结构为(classNo,name) 

C01,	Zhang
C02, 	Sun
C03, 	Wang
C04, 	Dong


GROUP
执行以下命令:

grouped_student = group student by classNo parallel 2;
dump grouped_student;


结果如下:

(C02, {(C02,N0203,79), (C02,N0202,82), (C02,N0201,81)} )
(C01, {(C01,N0103,65), (C01,N0102,59), (C01,N0101,82)} )
(C03, {(C03,N0306,72), (C03,N0302,92), (C03,N0301,56)} )



其中的Paraller 2表示启用2个Reduce操作。
如何统计每个班级及格和优秀的学生人数呢?执行以下两个命令:

result = foreach grouped_student {
         fail =filter records by score < 60;
         excellent =filter records by score >=90;
         generate group, COUNT(fail) as fail, COUNT(excellent) as excellent;
};
dump result;

结果如下:
(C01, 1, 0)
(C02, 0, 0)
(C03, 1, 1)
 

FLATTEN
flatten操作,可以将数据格式扁平化。我们分别通过tuple和bag来看看flatten的作用:
1)  Flatten对tuple的作用
执行以下命令:

a= foreach student generate $0, ($1,$2);
dump a;

输出结果如下:
(C01, (N0101,82))
(C01, (N0102,59))
(C01, (N0103,65))
(C02, (N0201,81))
(C02, (N0202,82))
(C02, (N0203,79))
(C03, (N0301,56))
(C03, (N0302,92))
(C03, (N0306,72))


然后,执行:
b = foreach a generate $0, flatten($1);
dump b;


结果如下:
(C01, N0101,82)
(C01, N0102,59)
(C01, N0103,65)
(C02, N0201,81)
(C02, N0202,82)
(C02, N0203,79)
(C03, N0301,56)
(C03, N0302,92)
(C03, N0306,72)


由此看见,flatten作用于tuple时,将flatten对应的字段(tuple)中的字段扁平化为关系中的字段。 


2)  Flatten对bag的作用
执行以下命令
c = foreach records generate $0, <strong>{($1), ($1,$2)}</strong>;
dump c;


结果如下:
(C01, {(N0101),  (N0101,82)})
(C01, {(N0102),  (N0102,59)})
(C01, {(N0103),  (N0103,65)})
(C02, {(N0201),  (N0201,81)})
(C02, {(N0202),  (N0202,82)})
(C02, {(N0203),  (N0203,79)})
(C03, {(N0301),  (N0301,56)})
(C03, {(N0302),  (N0302,92)})
(C03, {(N0306),  (N0306,72)})


接下来执行:
d = foreach c generate $0, flatten($1);
dump d;


结果如下:
(C01, N0101)
(C01, N0101,82)
(C01, N0102)
(C01, N0102,59)
(C01, N0103)
(C01, N0103,65)
(C02, N0201)
(C02, N0201,81)
(C02, N0202)
(C02, N0202,82)
(C02, N0203)
(C02, N0203,79)
(C03, N0301)
(C03, N0301,56)
(C03, N0302)
(C03, N0302,92)
(C03, N0306)
(C03, N0306,72)


可以看出,flatten作用于bag时,会消除嵌套关系,生成类似于笛卡尔乘积的结果。 
 
COGROUP
Join的操作结果是平面的(一组元组),而COGROUP的结果是有嵌套结构的。
运行以下命令:
r1 = cogroup  student by classNo,  teacher by classNo;
dump r1;


结果如下:
(C01, {(C01,N0103,65),(C01,N0102,59), (C01,N0101,82)},{(C01,Zhang)})
(C02, {(C02,N0203,79),(C02,N0202,82), (C02,N0201,81)},{(C02,Sun)})
(C03, {(C03,N0306,72),(C03,N0302,92), (C03,N0301,56)},{(C03,Wang)})
(C04, {},                             {(C04,Dong)})


由结果可以看出:
1)  cogroup和join操作类似。
2)  生成的关系有3个字段。第一个字段为连接字段;第二个字段是一个包,值为关系1中的满足匹配关系的所有元组;第三个字段也是一个包,值为关系2中的满足匹配关系的所有元组。
3)  类似于Join的外连接。比如结果中的第四个记录,第二个字段值为空包,因为关系1中没有满足条件的记录。实际上第一条语句和以下语句等同:
r1= cogroup  student by classNo outer,   teacher by classNo outer;
 
如果你希望关系1或2中没有匹配记录时不在结果中出现,则可以分别在关系中使用inner而关键字进行排除。
执行以下语句:
r1 = cogroup  student by classNo inner,  teacher by classNo outer;
dump r1;

结果为:
(C01, {(C01,N0103,65),  (C01,N0102,59),  (C01,N0101,82)},{(C01,Zhang)})
(C02, {(C02,N0203,79),  (C02,N0202,82),  (C02,N0201,81)},{(C02,Sun)})
(C03, {(C03,N0306,72),  (C03,N0302,92),  (C03,N0301,56)},{(C03,Wang)})
 
如先前我们讲到的flatten,执行以下命令:
r2 = foreach r1 generate flatten($1), flatten($2);
dump r2;

结果如下:
(C01, N0103,65, C01,Zhang)
(C01, N0102,59, C01,Zhang)
(C01, N0101,82, C01,Zhang)
(C02, N0203,79, C02,Sun)
(C02, N0202,82, C02,Sun)
(C02, N0201,81, C02,Sun)
(C03, N0306,72, C03,Wang)
(C03, N0302,92, C03,Wang)
(C03, N0301,56, C03,Wang)
 
UNION
执行以下语句:
r_union = union  student,  teacher;
dump r_union;

结果如下:
(C01, N0101,82)
(C01, N0102,59)
(C01, N0103,65)
(C02, N0201,81)
(C02, N0202,82)
(C02, N0203,79)
(C03, N0301,56)
(C03, N0302,92)
(C03, N0306,72)
(C01, Zhang)
(C02, Sun)
(C03, Wang)
(C04, Dong)

可以看出:
1)  union是取两个记录集的并集。
2)  关系r_union的schema为未知(unknown),这是因为被union的两个关系的schema是不一样的。如果两个关系的schema是一致的,则union后的关系将和被union的关系的schema一致。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值