HAWQ 技术解析（十） —— 过程语言

wzy0623

已于 2025-05-16 10:03:20 修改

阅读量5.3k

点赞数

CC 4.0 BY-SA版权

分类专栏：更好的 Hadoop 数据仓库解决方案——HAWQ 技术解析文章标签： HAWQ

于 2017-03-24 19:48:57 首次发布

本文链接：https://blog.youkuaiyun.com/wzy0623/article/details/65636569

更好的 Hadoop 数据仓库解决方案——HAWQ 技术解析专栏收录该内容

19 篇文章

订阅专栏

本文深入探讨HAWQ中的用户自定义函数（UDF），包括SQL函数、PL/pgSQL函数、表函数、参数个数可变的函数及多态类型函数等。同时介绍了如何递归遍历树形结构。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

HAWQ 支持用户自定义函数（user-defined functions，UDF），还支持给 HAWQ 内部的函数起别名。编写 UDF 的语言可以是 SQL、C、Java、Perl、Python、R 和 pgSQL，其中除 SQL 和 C 是 HAWQ 的内建语言，其他语言通常被称为过程语言（PLs），支持过程语言编程是对 HAWQ 核心的功能性扩展。HAWQ 我所使用过的 SQL-on-Hadoop 解决方案中唯一支持过程化编程的，Hive、SparkSQL、Impala 都没有此功能。对于习惯了编写存储过程的 DBA 来说，这无疑大大提高了 HAWQ 的易用性，冲这点也得给 HAWQ 点个赞。本篇主要讨论 HAWQ 内建的 SQL 语言函数和 PL/pgSQL 函数编程。为了便于说明，执行下面的 SQL 语句创建一个名为 channel 的示例表，并生成一些数据，后面定义的函数大都以操作 channel 表为例。

create table channel (
    id int not null,
    cname varchar(200) not null,
    parent_id int not null);

insert  into channel values (13,'首页',-1);
insert  into channel values (14,'tv580',-1);
insert  into channel values (15,'生活580',-1);
insert  into channel values (16,'左上幻灯片',13);
insert  into channel values (17,'帮忙',14);
insert  into channel values (18,'栏目简介',17);

select * from channel;

analyze channel;

一、HAWQ 内建 SQL 语言

缺省时，在 HAWQ 的所有数据库中都可以使用 SQL 和 C 语言编写用户自定义函数，SQL 函数中可执行任意条数的 SQL 语句，每条 SQL 语句必须以分号（;）分隔。SQL 函数可以返回 void 或返回 return 语句指定类型的数据，由于 HAWQ 只有函数而没有存储过程的概念，returns void 可用来模拟没有返回值的存储过程。所有非 returns void 函数的最后一句 SQL 必须是返回指定类型的 select 语句，函数返回最后一条查询语句的结果，可以是单行或多行结果集。下面是 SQL 函数的几个例子。

create function fn_count_channel() returns bigint as $$
    select count(*) from channel;
$$ language sql;

该函数没有参数，并返回 channel 表的记录数，函数的调用结果如图1 所示。

图1

修改上面定义的函数：

create or replace function fn_count_channel() returns bigint as $$
    select count(*) from channel;
    select count(*) from channel where parent_id=-1;
$$ language sql;

该函数体内执行了两条查询语句。在函数参数和返回值的定义没有变化时，可以使用 create or replace 重新定义函数体，该语法与 Oracle 类似。如果函数参数或返回值的定义发生变化，必须先删除再重建函数。函数返回最后一条查询语句的结果，即 parent_id=-1 的记录数，调用结果如图2 所示。

图2

再次修改 fn_count_channel() 函数：

create or replace function fn_count_channel() returns bigint as $$
    select count(*) from channel;
    create table t1 (a int);
    drop table t1;
    select count(*) from channel where parent_id=-1;
$$ language sql;

函数体中也能执行 DDL 语句，调用结果如图2 相同。

改变 fn_count_channel() 函数的返回值类型，必须先删除再重建，不能使用 create or replace 语法。

db1=# create or replace function fn_count_channel() returns void as $$
db1$# $$ language sql;
ERROR:  cannot change return type of existing function
HINT:  Use DROP FUNCTION first.
db1=# select fn_count_channel();
 fn_count_channel 
------------------
                3
(1 row)

db1=# drop function fn_count_channel();
DROP FUNCTION
db1=# create or replace function fn_count_channel() returns void as $$
db1$# $$ language sql;
CREATE FUNCTION
db1=# select fn_count_channel();
 fn_count_channel 
------------------
 
(1 row)

该函数没有返回值，而且函数体内没有任何 SQL 语句。

二、PL/pgSQL 函数

SQL 是关系数据库使用的查询语言，其最大的特点是简单易学，但主要问题是每条 SQL 语句必须由数据库服务器独立执行，而且缺少必要的变量定义、流程控制等编程手段，过程语言解决的就是这个问题。顾名思义，PL/pgSQL 以 PostgreSQL 作为编程语言，它能实现以下功能：

建立 plpgsql 函数。
为 SQL 语言增加控制结构。
执行复杂计算。
继承所有 PostgreSQL 的数据类型（包括用户自定义类型）、函数和操作符。

每条 SQL 语句由数据库服务器独立执行模式下，客户端应用向数据库服务器发送一个查询请求后，必须等待处理完毕，接收处理结果，做相应的计算，然后再向服务器发送后面的查询。通常客户端与数据库服务器不在同一物理主机上，这种频繁地进程间通信增加了网络开销。使用 PL/pgSQL 函数，可以将一系列查询和计算作为一组保存在数据库服务器中，它结合了过程语言的强大功能与 SQL 语言的易用性，并且显著降低了客户端/服务器的通信开销。正因如此，UDF 的性能比不使用存储函数的情况会有很大提高。

消除了客户端与服务器之间的额外往复，只需要一次调用并接收结果即可。
客户端不需要中间处理结果，从而避免了它和服务器之间的数据传输或转换。
避免多次查询解析。

PL/pgSQL 自动在所有 HAWQ 数据库中安装。PL/pgSQL 函数参数接收任何 HAWQ 服务器所支持的标量数据类型或数组类型，也可以返回这些数据类型。除此之外，PL/pgSQL 还可以接收或返回任何自定义的复合数据类型，也支持返回单行记录（record 类型）或多行结果集（setof record 或 table 类型）。返回结果集的函数通过执行 RETURN NEXT 语句生成一条返回的记录，与 PostgreSQL 不同，HAWQ 函数不支持 RETURN QUERY 语法。

PL/pgSQL 可以声明输出参数，这种方式可代替用 returns 语句显式指定返回数据类型的写法，当返回值是单行多列时，用输出参数的方式更方便。

三、给 HAWQ 内部函数起别名

许多 HAWQ 的内部函数是用 C 语言编写的，这些函数是在 HAWQ 集群初始化时声明的，并静态连接到 HAWQ 服务器。用户不能自己定义新的内部函数，但可以给已存在的内部函数起别名。下面的例子创建了一个新的函数 fn_all_caps，它是 HAWQ 的内部函数 upper 的别名。

create function fn_all_caps (text) returns text as 'upper' language internal strict;

该函数的调用结果如图3 所示。

图3

四、表函数

表函数返回多行结果集，调用方法就像查询一个 from 子句中的表、视图或子查询。如果表函数返回单列，那么返回的列名就是函数名。下面是一个表函数的例子，该函数返回 channel 表中给定 ID 值的数据。

create function fn_getchannel(int) returns setof channel as $$
    select * from channel where id = $1;
$$ language sql;

可以使用以下语句调用该函数：

select * from fn_getchannel(-1) as t1;
select * from fn_getchannel(13) as t1;

调用结果如图4 所示。

图4

与 PostgreSQL 不同，HAWQ 的表函数不能用于表连接。在 PostgreSQL 中以下查询可以正常执行，如图5 所示。

create table t1 (a int);
insert into t1 values (1);
select * from t1,fn_getchannel(13);

图5

但是在 HAWQ 中，同样的查询会报如图6 所示的错误。

图6

单独查询表函数是可以的。

create view vw_getchannel as select * from fn_getchannel(13);
select * from vw_getchannel;

查询结果如图7 所示。

图7

在某些场景下，函数返回的结果依赖于调用它的参数，为了支持这种情况，表函数可以被声明为返回伪类型（pseudotype）的记录。当这种函数用于查询中时，必须由查询本身指定返回的行结构。下面的例子使用动态 SQL，返回结果集依赖于作为入参的查询语句。

create or replace function fn_return_pseudotype ( str_sql text)
 returns setof record as
$$
declare
    v_rec record;
begin
   for v_rec in execute str_sql loop
      return next v_rec;
   end loop;
   return;
end;
$$
language plpgsql;

调用函数时必须显式指定返回的字段名及其数据类型。

select * from fn_return_pseudotype('select 1') t (id int);
select * from fn_return_pseudotype('select * from channel') t (id int,cname varchar(200),parent_id int);

查询结果如图8 所示。

图8

“PostgreSQL: Documentation: 8.2: Pseudo-Types”显示了 PostgreSQL 8.2 支持的伪类型。伪类型不能作为表列或变量的数据类型，但可以被用于函数的参数或返回值类型。

五、参数个数可变的函数

HAWQ 从 PostgreSQL 继承了一个非常好的特性，即函数参数的个数可变，原来做 Oracle 的时候，想实现这个功能是很麻烦的。参数个数可变是通过一个动态数组实现的，因此所有参数都应该具有相同的数据类型。这种函数将最后一个参数标识为 VARIADIC，并且参数必须声明为数组类型。下面是一个例子，实现类似原生函数 greatest 的功能。

create or replace function fn_mgreatest(variadic numeric[]) returns numeric as $$
declare 
    l_i numeric:=-99999999999999;
    l_x numeric;
    array1 alias for $1;
begin
  for i in array_lower(array1, 1) .. array_upper(array1, 1)
  loop  
    l_x:=array1[i];
    if l_x > l_i then
       l_i := l_x;
	end if;
  end loop;  
  return l_i;  
end;
$$ language 'plpgsql';

可以使用如下语句执行该函数。

select fn_mgreatest(array[10, -1, 5, 4.4]);
select fn_mgreatest(array[10, -1, 5, 4.4, 100]);

执行结果如图9 所示。

图9

六、多态类型

PostgreSQL 中的 anyelement、anyarray、anynonarray 和 anyenum 四种伪类型被称为多态类型，使用这些类型声明的函数叫做多态函数。多态函数的同一参数在每次调用函数时可以有不同数据类型，实际使用的数据类型由调用函数时传入的参数所确定。

多态参数和返回值是相互绑定的，当一个查询调用多态函数时，特定的数据类型在运行时解析。每个声明为 anyelement 的位置（参数或返回值）允许是任何实际的数据类型，但是在任何一次给定的调用中，anyelement 必须具有相同的实际数据类型。同样，每个声明为 anyarray 的位置允许是任何实际的数组数据类型，但是在任何一次给定的调用中，anyarray 也必须具有相同类型。如果某些位置声明为 anyarray，而另外一些位置声明为 anyelement，那么实际的数组元素类型必须与 anyelement 的实际数据类型相同。

anynonarray 在操作上与 anyelement 完全相同，它只是在 anyelement 的基础上增加了一个额外约束，即实际类型不能是数组。anyenum 在操作上也与 anyelement 完全相同，它只是在 anyelement 的基础上增加了一个额外约束，即实际类型必须是枚举（enum）类型。anynonarray 和 anyenum 并不是独立的多态类型，它们只是在 anyelement上增加了约束而已。例如，f(anyelement, anyenum) 与 f(anyenum, anyenum) 是等价的，实际参数都必须是同样的枚举类型。

如果一个函数的返回值被声明为多态类型，那么它的参数中至少应该有一个是多态的，并且参数与返回结果的实际数据类型必须匹配。例如，函数声明为 assubscript(anyarray, integer) returns anyelement。此函数的的第一个参数为数组类型，而且返回值必须是实际数组元素的数据类型。再比如一个函数的声明为 asf(anyarray) returns anyenum，那么参数只能是枚举类型的数组。

参数个数可变的函数也可以使用多态类型，实现方式是声明函数的最后一个参数为 VARIADIC anyarray。

例1：判断两个入参是否相等，每次调用的参数类型可以不同，但两个入参的类型必须相同。

create or replace function fn_equal (anyelement,anyelement)
    returns boolean as
    $$
begin
    if $1 = $2 then
        return true;
    else
        return false;
    end if;
end;
  $$
language 'plpgsql';

下列语句调用函数返回情况如图10 所示。

select fn_equal(1,1);
select fn_equal(1,'a');
select fn_equal('a','A');
select fn_equal(text 'a',text 'A');
select fn_equal(text 'a',text 'a');

图10

例2：遍历任意类型的数组，数组元素以行的形式返回。

create or replace function fn_unnest(anyarray)
returns setof anyelement
language 'sql' as
$$
   select $1[i] from generate_series(array_lower($1,1),array_upper($1,1)) i;
$$;

下列语句调用函数返回情况如图11 所示。

select fn_unnest(array[1,2,3,4]);
select fn_unnest(array['a','b','c']);

图11

例3：新建 fn_mgreatest1 函数，使它能返回任意数组类型中的最大元素。

create or replace function fn_mgreatest1(v anyelement, variadic anyarray) returns anyelement as $$
declare 
    l_i v%type;
    l_x v%type;
    array1 alias for $2;
begin
    l_i := array1[1];
    for i in array_lower(array1, 1) .. array_upper(array1, 1) loop  
        l_x:=array1[i];
        if l_x > l_i then
           l_i := l_x;
	    end if;
    end loop;  
    return l_i;  
end;
$$ language 'plpgsql';

说明：

变量不能定义成伪类型，但可以通过参数进行引用，如上面函数中的 l_i v%type。
动态数组必须是函数的最后一个参数。
第一个参数的作用仅是为变量定义数据类型，所以在调用函数时传空即可。

下列语句调用函数返回情况如图12 所示。

select fn_mgreatest1(null, array[10, -1, 5, 4.4]);
select fn_mgreatest1(null, array['a', 'b', 'c']);

图12

七、查看 UDF 定义

psql 的元命令 \df 可以查看 UDF 的定义，返回函数的参数与返回值的类型。用命令行的 -E 参数，还能够看到元命令对应的对系统表的查询语句。

[gpadmin@hdp3 ~]$ psql -d db1 -E
psql (8.2.15)
Type "help" for help.

db1=# \df
********* QUERY **********
SELECT n.nspname as "Schema",
  p.proname as "Name",
  CASE WHEN p.proretset THEN 'SETOF ' ELSE '' END ||
  pg_catalog.format_type(p.prorettype, NULL) as "Result data type",
  CASE WHEN proallargtypes IS NOT NULL THEN
    pg_catalog.array_to_string(ARRAY(
      SELECT
        CASE
          WHEN p.proargmodes[s.i] = 'i' THEN ''
          WHEN p.proargmodes[s.i] = 'o' THEN 'OUT '
          WHEN p.proargmodes[s.i] = 'b' THEN 'INOUT '
          WHEN p.proargmodes[s.i] = 'v' THEN 'VARIADIC '
        END ||
        CASE
          WHEN COALESCE(p.proargnames[s.i], '') = '' THEN ''
          ELSE p.proargnames[s.i] || ' ' 
        END ||
        pg_catalog.format_type(p.proallargtypes[s.i], NULL)
      FROM
        pg_catalog.generate_series(1, pg_catalog.array_upper(p.proallargtypes, 1)) AS s(i)
    ), ', ')
  ELSE
    pg_catalog.array_to_string(ARRAY(
      SELECT
        CASE
          WHEN COALESCE(p.proargnames[s.i+1], '') = '' THEN ''
          ELSE p.proargnames[s.i+1] || ' '
          END ||
        pg_catalog.format_type(p.proargtypes[s.i], NULL)
      FROM
        pg_catalog.generate_series(0, pg_catalog.array_upper(p.proargtypes, 1)) AS s(i)
    ), ', ')
  END AS "Argument data types",
  CASE
    WHEN p.proisagg THEN 'agg'
    WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'
    ELSE 'normal'
  END AS "Type"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
ORDER BY 1, 2, 4;
**************************

                                      List of functions
 Schema |         Name         | Result data type |       Argument data types       |  Type  
--------+----------------------+------------------+---------------------------------+--------
 public | fn_all_caps          | text             | text                            | normal
 public | fn_count_channel     | void             |                                 | normal
 public | fn_equal             | boolean          | anyelement, anyelement          | normal
 public | fn_getchannel        | SETOF channel    | integer                         | normal
 public | fn_mgreatest         | numeric          | variadic numeric[]              | normal
 public | fn_mgreatest1        | anyelement       | v anyelement, variadic anyarray | normal
 public | fn_return_pseudotype | SETOF record     | str_sql text                    | normal
 public | fn_unnest            | SETOF anyelement | anyarray                        | normal
(8 rows)

可以看到，用户自定义函数包含在 pg_proc 系统表中，以下语句查看函数体，查询结果如图13 所示。

select prosrc from pg_proc where proname='fn_return_pseudotype';

图13

八、删除 UDF

使用 drop function <function_name> 命令删除函数。注意，在该命令需要加上函数定义的参数类型列表，但不须带参数名。

db1=# drop function fn_mgreatest1;
ERROR:  syntax error at or near ";"
LINE 1: drop function fn_mgreatest1;
                                   ^
db1=# drop function fn_mgreatest1();
ERROR:  function fn_mgreatest1() does not exist
db1=# drop function fn_mgreatest1(anyelement, variadic anyarray);
DROP FUNCTION

九、UDF 实例 —— 递归树形遍历

经常在一个表中有父子关系的两个字段，比如 empno 与 manager，开篇建立的示例表 channel 也属于这种结构。在 Oracle 中可以使用 connect by 简单解决此类树的遍历问题，PostgreSQL 9 也有相似功能的 with recursive 语法。

with recursive t (id, cname, parent_id, path, depth)  as (
     select id, cname, parent_id, array[id] as path, 1 as depth
     from channel
     where parent_id = -1
     union all
     select  c.id, c.cname, c.parent_id, t.path || c.id, t.depth + 1 as depth
     from channel c
     join t on c.parent_id = t.id
     )
select id, cname, parent_id, path, depth from t
order by path;

上面的查询在 PostgreSQL 中的执行结果如图14 所示。

图14

但是，HAWQ 不支持 with recursive 语法，同样的查询，会返回如图15 所示的错误。

图15

我们可以使用 HAWQ 的递归函数功能，自己编写 UDF 来实现树的遍历。

建立函数从某节点向下遍历子节点，递归生成节点信息，函数返回以‘|’作为字段分隔符的字符串：

create or replace function fn_ChildLst(int, int)
returns setof character varying
as
$$
declare
    v_rec character varying;
begin       
    for v_rec in (select case when node = 1 then 
	                          q.id||'|'||q.cname||'|'||q.parent_id||'|'||$2
                         else fn_ChildLst(q.id, $2 + 1)
                         end
                    from (select id, cname, parent_id, node
                            from (select 1 as node
                                  union all
                                  select 2) nodes, channel
                           where parent_id = $1
                           order by id, node) q) loop
        return next v_rec;
    end loop;
    return;
end;
$$
language 'plpgsql';

建立节点复合数据类型：

create type tp_depth as (rn int, id int, cname varchar(200), parent_id int, depth int);

将 fn_ChildLst 函数的返回值转换为 tp_depth 类型：

create or replace function fn_ChildLst_split(int, int)
returns setof tp_depth
as
$$
    select cast(rownum as int) rn,
           cast(a[1] as int) id, 
           a[2] cname, 
           cast(a[3] as int) parent_id, 
           cast(a[4] as int) depth 
      from (select rownum,string_to_array(fn_ChildLst,'|') a 
              from (select row_number() over() as rownum,* 
                      from fn_ChildLst($1, $2)
                    union all 
                    select 0,id||'|'||cname||'|'||parent_id||'|'||($2 -1) from channel where id = $1) 
            t) t;
$$
language 'sql';

建立查询结果复合数据类型：

create type tp_result as 
(id int, 
 name1 varchar(1000), 
 parent_id int, 
 depth int,path varchar(200),
 pathname varchar(1000));

实现类似 Oracle SYS_CONNECT_BY_PATH 的功能，递归输出某节点 id 路径：

create or replace function fn_path(a_id integer) 
returns character varying as $$
declare
    v_result character varying;
    v_parent_id int;
begin 
    select t.parent_id into v_parent_id
      from channel as t where t.id = a_id; 
    if found then
        v_result := fn_path(v_parent_id) || '/' || a_id;
    else
        return '';
    end if;
    return v_result; 
end;
$$ language 'plpgsql';

递归输出某节点的 name 路径：

create or replace function fn_pathname(a_id integer) 
returns character varying as $$
declare
    v_result character varying;
    v_parent_id int;
 v_cname varchar(200);
 begin 
    select t.cname,t.parent_id into v_cname,v_parent_id
      from channel as t where t.id = a_id; 
    if found then
        v_result := fn_pathname(v_parent_id) || '/' || v_cname;
    else
        return '';
    end if;
    return v_result; 
end;
$$ language 'plpgsql';

建立输出子节点的函数：

create or replace function fn_showChildLst(int)
returns setof tp_result
as
$$
    select t1.id,
           repeat('  ', t1.depth)||'--'||t1.cname name1,
           t1.parent_id,
           t1.depth,
           fn_path(t1.id) path,
           fn_pathname(t1.id) pathname  
     from fn_ChildLst_split($1,1) t1
    order by t1.rn;
$$
language 'sql';

使用下面的语句调用函数，结果如图16 至图20 所示。

select * from fn_showChildLst(-1);
select * from fn_showChildLst(13);
select * from fn_showChildLst(14);
select * from fn_showChildLst(17);
select * from fn_showChildLst(18);

图16

图17

图18

图19

图20

从某节点向上追溯根节点，递归生成节点信息，函数返回以‘|’作为字段分隔符的字符串：

create or replace function fn_ParentLst(int, int)
returns setof character varying
as
$$
declare
    v_rec character varying;
begin       
    for v_rec in (select case when node = 1 then
                              q.id||'|'||q.cname||'|'||q.parent_id||'|'||$2
                         else fn_ParentLst(q.parent_id, $2 + 1)
                         end
                    from (select id, cname, parent_id, node
                            from (select 1 as node
                                  union all
                                  select 2) nodes, channel
                           where id = $1
                           order by id, node) q) loop
        return next v_rec;
    end loop;
    return;
end;
$$
language 'plpgsql';

将 fn_ParentLst 函数的返回值转换为 tp_depth 类型：

create or replace function fn_ParentLst_split(int, int)
returns setof tp_depth
as
$$
    select cast(rownum as int) rn,
           cast(a[1] as int) id, 
           a[2] cname, 
           cast(a[3] as int) parent_id, 
           cast(a[4] as int) depth 
      from (select rownum,string_to_array(fn_ParentLst,'|') a 
              from (select row_number() over() as rownum,* from fn_ParentLst($1, $2)) t) t;
$$
language 'sql';

建立输出父节点的函数：

create or replace function fn_showParentLst(int)
returns setof tp_result
as
$$
    select t1.id,
           repeat('  ', t1.depth)||'--'||t1.cname name1,
           t1.parent_id,
           t1.depth,
           fn_path(t1.id) path,
           fn_pathname(t1.id) pathname  
  from fn_ParentLst_split($1,0) t1
 order by t1.rn;
$$
language 'sql';

使用下面的语句调用函数，结果如图21 至图25 所示。

select * from fn_showParentLst(-1);
select * from fn_showParentLst(13);
select * from fn_showParentLst(14);
select * from fn_showParentLst(17);
select * from fn_showParentLst(18);

图21

图22

图23

图24

图25