Apache Druid 数组展开(Unnest)操作实战指南
概述
在数据分析领域,处理嵌套数据结构是常见的需求。Apache Druid 作为一款高性能的实时分析数据库,提供了强大的数组展开(Unnest)功能,能够将嵌套的数组数据展开为多行记录。本文将详细介绍如何在Druid中使用Unnest操作处理数组类型数据。
数组展开的基本概念
数组展开(Unnest)是指将包含数组的列拆分为多行,每行包含数组中的一个元素。例如,对于包含值["a","b"]
的列,展开后会生成两行,分别包含值"a"
和"b"
。
使用注意事项
- 数据量膨胀:展开操作会显著增加行数,需谨慎使用
- 嵌套限制:Druid不支持数组嵌套数组的展开
- 性能影响:复杂展开操作可能影响查询性能
实战准备
数据准备
我们使用以下示例数据,其中包含多个数组类型的列:
{
"t":"2000-01-01",
"m1":1.0,
"m2":1.0,
"dim1":"",
"dim2":["a"],
"dim3":["a","b"],
"dim4":["x","y"],
"dim5":["a","b"]
}
数据加载
可以通过SQL或原生JSON两种方式加载数据:
SQL方式
REPLACE INTO nested_data OVERWRITE ALL
SELECT
TIME_PARSE("t") as __time,
dim1, dim2, dim3, dim4, dim5,
m1, m2
FROM TABLE(
EXTERN(
'{"type":"inline","data":"..."}',
'{"type":"json"}',
'[{"name":"t","type":"string"},...]'
)
)
PARTITIONED BY YEAR
JSON方式
{
"type": "index_parallel",
"spec": {
"ioConfig": {
"type": "index_parallel",
"inputSource": {
"type": "inline",
"data":"..."
},
"inputFormat": {
"type": "json"
}
},
"dataSchema": {
"dataSource": "nested_data",
"granularitySpec": {
"segmentGranularity": "YEAR"
},
"timestampSpec": {
"column": "t",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [
"dim1", "dim2", "dim3", "dim4", "dim5"
]
},
"metricsSpec": [
{
"name": "m1",
"type": "floatSum",
"fieldName": "m1"
}
]
}
}
}
SQL方式展开数组
基本语法
SELECT column_alias_name
FROM datasource
CROSS JOIN UNNEST(source_expression) AS table_alias_name(column_alias_name)
单列展开
SELECT d3
FROM "nested_data"
CROSS JOIN UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
注意:对于多值字符串维度,需要使用MV_TO_ARRAY
函数转换。
虚拟列展开
SELECT dim4, dim5, d45
FROM nested_data
CROSS JOIN UNNEST(ARRAY[dim4,dim5]) AS example_table(d45)
或者使用数组连接:
SELECT dim4, dim5, d45
FROM nested_data
CROSS JOIN UNNEST(ARRAY_CONCAT(dim4,dim5)) AS example_table(d45)
多列展开
SELECT dim3, dim4, dim5, d3, d45
FROM "nested_data"
CROSS JOIN UNNEST(MV_TO_ARRAY("dim3")) AS foo1(d3)
CROSS JOIN UNNEST(ARRAY[dim4,dim5]) AS foo2(d45)
带过滤的展开
SELECT d3
FROM (SELECT * FROM nested_data WHERE dim2 IN ('abc'))
CROSS JOIN UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
分组统计
SELECT d3, COUNT(*)
FROM nested_data
CROSS JOIN UNNEST(MV_TO_ARRAY(dim3)) AS example_table(d3)
GROUP BY d3
ORDER BY d3 DESC
原生查询方式展开数组
Scan查询
{
"queryType": "scan",
"dataSource": {
"type": "unnest",
"base": {
"type": "table",
"name": "nested_data"
},
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"columns": ["__time", "dim1", "dim2", "dim3", "m1", "m2", "unnest-dim3"]
}
GroupBy查询
{
"queryType": "groupBy",
"dataSource": {
"type": "unnest",
"base": "nested_data",
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"dimensions": ["unnest-dim3"],
"limitSpec": {
"type": "default",
"columns": [{
"dimension": "unnest-dim3",
"direction": "descending"
}]
}
}
TopN查询
{
"queryType": "topN",
"dataSource": {
"type": "unnest",
"base": {
"type": "table",
"name": "nested_data"
},
"virtualColumn": {
"type": "expression",
"name": "unnest-dim3",
"expression": "\"dim3\""
}
},
"dimension": {
"type": "default",
"dimension": "unnest-dim3",
"outputName": "topN-unnest-d3",
"outputType": "STRING"
},
"metric": {
"type": "inverted",
"metric": {
"type": "numeric",
"metric": "a0"
}
},
"aggregations": [
{
"type": "floatMin",
"name": "a0",
"fieldName": "m1"
}
]
}
虚拟列展开
{
"queryType": "scan",
"dataSource":{
"type": "unnest",
"base": {
"type": "table",
"name": "nested_data"
},
"virtualColumn": {
"type": "expression",
"name": "dim45",
"expression": "array_concat(\"dim4\",\"dim5\")",
"outputType": "ARRAY<STRING>"
}
},
"columns": ["dim45", "m1"]
}
最佳实践
- 性能优化:只展开必要的列,避免不必要的数据膨胀
- 过滤优先:在展开前先过滤数据,减少处理的数据量
- 合理使用:评估是否真的需要展开操作,有时其他查询方式可能更高效
- 监控资源:展开操作可能消耗大量资源,需监控集群状态
通过本文的介绍,您应该已经掌握了在Apache Druid中使用Unnest操作处理数组数据的方法。无论是通过SQL还是原生查询,都能灵活应对各种数组展开场景。
创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考