First, you need to know how the array_join and array_sort functions are used. For details, see the following page:
https://www.iteblog.com/archives/2459.html
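As a rough illustration of why the two functions are combined: sorting a deduplicated list of ids before joining it gives a string that does not depend on the order in which the ids arrived, so hashing that string yields a stable key for the same group of ids. The sketch below mimics that idea in plain Python rather than Spark. The function names `association_key` and `association_id` and the sample ids are hypothetical, and it assumes the ids are strings (Spark's array_sort on numeric ids would sort numerically instead).

```python
import hashlib


def association_key(holder_ids):
    """Deduplicate, sort, and join ids with '|'.

    Plain-Python stand-in for collect_set + array_sort + array_join.
    Assumes the ids are strings (an assumption of this sketch).
    """
    unique_sorted = sorted(set(holder_ids))
    return "|".join(unique_sorted)


def association_id(holder_ids):
    """Hex MD5 digest of the joined key (32 hex characters)."""
    key = association_key(holder_ids)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Hypothetical sample data: same ids, different order, one duplicate.
a = ["h3", "h1", "h2", "h1"]
b = ["h2", "h3", "h1"]

print(association_key(a))                       # h1|h2|h3
print(association_id(a) == association_id(b))   # True
```

Without the sort step, the two lists above would produce different joined strings and therefore different hashes, even though they describe the same set of ids.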
Below is demo code for Spark 2.4:
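select
  -- sequential id; the constant PARTITION BY / ORDER BY puts all rows in one window, so the numbering order is arbitrary
  row_number() OVER (PARTITION BY 1 ORDER BY 1) id,
  -- deduplicate the holder ids, sort them, join them with '|', then hash the result
  md5(array_join(array_sort(collect_set(f.holder_id)),'|')) association_id,
  current_timestamp() date_modified,
  first(f.date_id) date_id,
  -- the same sorted, '|'-joined id list, kept unhashed
  array_join(array_sort(collect_set(f.holder_id)),'|') horder_ids_string,
  size(collect_set(f.holder_id)) holder_count,
  first(h.type) holder_type,
  first(h.type_name) holder_type_name,
  first(f.date_id) dt
from XXXXX f, XXXX h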
where f.holde