Scalable Machine Learning and Related Technologies
1. Data Processing with Pig
1.1 Flattening the Tokens
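The command
A3 = foreach A2 generate flatten(tokens) as words;
un-nests the bag of tokens produced for each line, so that every word becomes a tuple of its own. Running dump A3
produces the following output: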
(Hi)
(All)
(Welcome)
(to)
(Hadoop)
(Hadoop)
(class)
(integrating)
(with)
(R)
(Hadoop)
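As an analogy outside of Pig, the effect of flatten can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample lines below are a guess reconstructed from the dump A3 output above (the original input file and its line boundaries are not shown here), and the name flatten_tokens is hypothetical.

```python
from itertools import chain

# Assumed stand-in for relation A2: one bag (list) of tokens per input line.
# The line boundaries are a guess; only the flattened words come from the text.
a2 = [
    ["Hi", "All", "Welcome", "to", "Hadoop"],
    ["Hadoop", "class", "integrating", "with", "R", "Hadoop"],
]


def flatten_tokens(bags):
    """Mimic flatten(tokens): emit one single-field tuple per token."""
    return [(word,) for word in chain.from_iterable(bags)]


a3 = flatten_tokens(a2)
for row in a3:
    # Print each tuple in the same style as Pig's dump output, e.g. (Hi)
    print("(" + row[0] + ")")
```

The nesting level is what changes: A2 holds one row per line with a bag of tokens inside it, while A3 holds one row per token.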
1.2 Grouping the Words
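The command
A4 = group A3 by words;
creates one tuple per distinct word, pairing the word (the group key) with a bag that holds every occurrence of that word in A3. Running dump A4
produces the following output: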
(R,{(R)})
(Hi,{(Hi)})
(to,{(to)})
(All,{(All)})
(with,{(with)})
(class,{(class)})
(Hadoop,{(Hadoop),(Hadoop),(Hadoop)})
(Welcome,{(Welcome)})
(integrating,{(integrating)})
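The grouping step can be pictured in the same way. The following Python sketch is an analogy rather than Pig itself: it assumes A3 contains exactly the words listed in the dump A3 output, and the name group_by_word is hypothetical.

```python
from collections import defaultdict

# Stand-in for relation A3: the words listed in the dump A3 output.
a3 = [("Hi",), ("All",), ("Welcome",), ("to",), ("Hadoop",), ("Hadoop",),
      ("class",), ("integrating",), ("with",), ("R",), ("Hadoop",)]


def group_by_word(rows):
    """Mimic `group A3 by words`: map each word to the bag of its tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


a4 = group_by_word(a3)
for key, bag in a4.items():
    print(key, bag)
```

The length of each bag is the word's frequency; for example, len(a4["Hadoop"]) is 3. Unlike this sketch, Pig does not guarantee any particular order for the groups it emits.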
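1.3 Counting and Sorting
The following two commands produce key-value pairs of each word and its number of occurrences in the document, sorted by count: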
A5 = foreach A4 generate group,COUNT(A3);