文章目录
前言
- SEQUENCE函数 与 FILTER函数 与 SIZE函数 组合比较字符串的差异
验证SQL代码
- 为了验证SIZE(FILTER(SEQUENCE(…)))这个字符差异分析函数的正确性,我编写了以下测试SQL:
可以直接运行
-- 验证字符差异分析函数的测试SQL
WITH test_data AS (
SELECT
'VIN001' as vin,
'ABC123' as t1_engine,
'ABC124' as t2_engine,
'1个字符差异' as expected_result
UNION ALL
SELECT
'VIN002' as vin,
'CA6DM2-42E52' as t1_engine,
'CA6DM2-46E52' as t2_engine,
'1个字符差异' as expected_result
UNION ALL
SELECT
'VIN003' as vin,
'XYZ789' as t1_engine,
'XWZ789' as t2_engine,
'2个字符差异' as expected_result
UNION ALL
SELECT
'VIN004' as vin,
'ENGINE001' as t1_engine,
'ENGINE002' as t2_engine,
'1个字符差异' as expected_result
UNION ALL
SELECT
'VIN005' as vin,
'TEST1234' as t1_engine,
'TEST5678' as t2_engine,
'4个字符差异' as expected_result
),
diff_analysis AS (
SELECT
vin,
t1_engine,
t2_engine,
expected_result,
-- 核心分析函数
SIZE(
FILTER(
SEQUENCE(1, LENGTH(t1_engine)),
i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
)
) as actual_diff_count,
-- 分解步骤验证
SEQUENCE(1, LENGTH(t1_engine)) as char_positions,
FILTER(
SEQUENCE(1, LENGTH(t1_engine)),
i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
) as diff_positions,
-- 逐个字符对比
TRANSFORM(
SEQUENCE(1, LENGTH(t1_engine)),
i -> STRUCT(
i as position,
SUBSTRING(UPPER(t1_engine), i, 1) as t1_char,
SUBSTRING(UPPER(t2_engine), i, 1) as t2_char,
CASE WHEN SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
THEN 'DIFF'
ELSE 'SAME' END as status
)
) as detailed_comparison
FROM test_data
)
SELECT
vin,
t1_engine,
t2_engine,
expected_result,
actual_diff_count,
char_positions,
diff_positions,
detailed_comparison,
CASE
WHEN actual_diff_count = 1 AND expected_result LIKE '%1个字符差异%' THEN '✅ 正确'
WHEN actual_diff_count = 2 AND expected_result LIKE '%2个字符差异%' THEN '✅ 正确'
WHEN actual_diff_count = 4 AND expected_result LIKE '%4个字符差异%' THEN '✅ 正确'
ELSE '❌ 异常'
END as verification_result
FROM diff_analysis
ORDER BY vin;
函数分解详解
1. SEQUENCE函数
SEQUENCE(1, LENGTH(t1_engine_no))
-
作用: 生成从1到字符串长度的整数序列
-
示例: 对于’ABC123’ (长度6),生成[1, 2, 3, 4, 5, 6]
-
目的: 为每个字符位置创建索引
2. FILTER函数
FILTER(
sequence_result,
i -> SUBSTRING(UPPER(t1_engine_no), i, 1) != SUBSTRING(UPPER(t2_engine_no), i, 1)
)
-
作用: 过滤出字符不同的位置索引
-
Lambda函数: i -> condition 对每个位置i检查字符是否不同
-
SUBSTRING函数: 提取指定位置的单个字符
-
UPPER函数: 统一转为大写,忽略大小写差异
3. SIZE函数
SIZE(filtered_sequence)
-
作用: 计算过滤后序列的长度
-
结果: 即字符差异的数量
实际应用示例
场景1: 单个字符差异
-- 输入: 'ABC123' vs 'ABC124'
-- 处理过程:
SEQUENCE(1, 6) → [1,2,3,4,5,6]
FILTER(...) → [6] -- 只有第6个字符不同 ('3' vs '4')
SIZE([6]) → 1
-- 结果: 1个字符差异 ✅
场景2: 多个字符差异
-- 输入: 'TEST1234' vs 'TEST5678'
-- 处理过程:
SEQUENCE(1, 8) → [1,2,3,4,5,6,7,8]
FILTER(...) → [5,6,7,8] -- 第5-8个字符不同 ('1234' vs '5678')
SIZE([5,6,7,8]) → 4
-- 结果: 4个字符差异 ✅
实际应用_[使用序列函数(Spark 3.0+)]
-- 使用sequence函数生成位置序列(Spark 3.0+)
WITH diff_analysis AS (
SELECT
t1.vin,
t1.engine_no as t1_engine,
t2.engine_no as t2_engine,
SIZE(
FILTER(
SEQUENCE(1, LENGTH(t1.engine_no)),
i -> SUBSTRING(UPPER(t1.engine_no), i, 1) != SUBSTRING(UPPER(t2.engine_no), i, 1)
)
) as diff_char_count
FROM (
SELECT vin, vehicle_model, engine_no
FROM dwd.dwd_001
WHERE `date` = 20251119
AND engine_no NOT IN ('', 'N/A')
) t1
INNER JOIN (
SELECT vin, vehicle_model, engine_no
FROM ods.ods_001
WHERE `date` = 20251119
AND engine_no != 'NA'
) t2 ON t1.vin = t2.vin
WHERE LENGTH(t1.engine_no) = LENGTH(t2.engine_no)
AND UPPER(t1.engine_no) != UPPER(t2.engine_no)
)
SELECT
diff_char_count,
COUNT(*) as record_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 5
GROUP BY diff_char_count
ORDER BY diff_char_count;
验证数据sql
SELECT diff_char_count, vin, t1_engine, t2_engine FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 1 limit 100;
字段解释
| 字段名 | 含义说明 |
|---|---|
| diff_char_count | 字符差异数量,表示两个表中相同VIN记录的engine_no字段有多少个字符不同 |
| record_count | 记录数量,表示对应字符差异数量的数据有多少条 |
| percentage | 百分比,表示该差异级别在所有差异记录中的占比 |
总结
-
这个SIZE(FILTER(SEQUENCE(…)))组合函数是一个强大的字符串差异分析工具,通过函数式编程的方式优雅地解决了字符级比较的问题。在实际数据质量分析项目中,它帮助我们精确识别了不同级别的数据不一致问题,为数据治理提供了有力的技术支持。
-
通过验证SQL的测试,确认该函数能够准确计算字符串间的字符差异数量,为后续的数据清洗和映射规则制定提供了可靠依据。
如果此篇文章有帮助到您, 希望打大佬们能
关注、点赞、收藏、评论支持一波,非常感谢大家!
如果有不对的地方请指正!!!
Spark SQL字符差异分析函数详解


被折叠的 条评论
为什么被折叠?



