Spark SQL字符差异分析函数验证与详解

Spark SQL字符差异分析函数详解

前言

  • SEQUENCE函数 与 FILTER函数 与 SIZE函数 组合比较字符串的差异

验证SQL代码

  • 为了验证SIZE(FILTER(SEQUENCE(…)))这个字符差异分析函数的正确性,我编写了以下测试SQL:
  • 可以直接运行
-- 验证字符差异分析函数的测试SQL
WITH test_data AS (
  SELECT 
    'VIN001' as vin,
    'ABC123' as t1_engine,
    'ABC124' as t2_engine,
    '1个字符差异' as expected_result
  UNION ALL
  SELECT 
    'VIN002' as vin,
    'CA6DM2-42E52' as t1_engine,
    'CA6DM2-46E52' as t2_engine,
    '1个字符差异' as expected_result
  UNION ALL
  SELECT 
    'VIN003' as vin,
    'XYZ789' as t1_engine,
    'XWZ789' as t2_engine,
    '2个字符差异' as expected_result
  UNION ALL
  SELECT 
    'VIN004' as vin,
    'ENGINE001' as t1_engine,
    'ENGINE002' as t2_engine,
    '1个字符差异' as expected_result
  UNION ALL
  SELECT 
    'VIN005' as vin,
    'TEST1234' as t1_engine,
    'TEST5678' as t2_engine,
    '4个字符差异' as expected_result
),
diff_analysis AS (
  SELECT 
    vin,
    t1_engine,
    t2_engine,
    expected_result,
    -- 核心分析函数
    SIZE(
      FILTER(
        SEQUENCE(1, LENGTH(t1_engine)),
        i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
      )
    ) as actual_diff_count,
    -- 分解步骤验证
    SEQUENCE(1, LENGTH(t1_engine)) as char_positions,
    FILTER(
      SEQUENCE(1, LENGTH(t1_engine)),
      i -> SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1)
    ) as diff_positions,
    -- 逐个字符对比
    TRANSFORM(
      SEQUENCE(1, LENGTH(t1_engine)),
      i -> STRUCT(
        i as position,
        SUBSTRING(UPPER(t1_engine), i, 1) as t1_char,
        SUBSTRING(UPPER(t2_engine), i, 1) as t2_char,
        CASE WHEN SUBSTRING(UPPER(t1_engine), i, 1) != SUBSTRING(UPPER(t2_engine), i, 1) 
             THEN 'DIFF' 
             ELSE 'SAME' END as status
      )
    ) as detailed_comparison
  FROM test_data
)
SELECT 
  vin,
  t1_engine,
  t2_engine,
  expected_result,
  actual_diff_count,
  char_positions,
  diff_positions,
  detailed_comparison,
  CASE 
    WHEN actual_diff_count = 1 AND expected_result LIKE '%1个字符差异%' THEN '✅ 正确'
    WHEN actual_diff_count = 2 AND expected_result LIKE '%2个字符差异%' THEN '✅ 正确'
    WHEN actual_diff_count = 4 AND expected_result LIKE '%4个字符差异%' THEN '✅ 正确'
    ELSE '❌ 异常'
  END as verification_result
FROM diff_analysis
ORDER BY vin;

函数分解详解

1. SEQUENCE函数

SEQUENCE(1, LENGTH(t1_engine_no))
  • 作用: 生成从1到字符串长度的整数序列

  • 示例: 对于’ABC123’ (长度6),生成[1, 2, 3, 4, 5, 6]

  • 目的: 为每个字符位置创建索引

2. FILTER函数

FILTER(
    sequence_result,
    i -> SUBSTRING(UPPER(t1_engine_no), i, 1) != SUBSTRING(UPPER(t2_engine_no), i, 1)
)
  • 作用: 过滤出字符不同的位置索引

  • Lambda函数: i -> condition 对每个位置i检查字符是否不同

  • SUBSTRING函数: 提取指定位置的单个字符

  • UPPER函数: 统一转为大写,忽略大小写差异

3. SIZE函数

SIZE(filtered_sequence)
  • 作用: 计算过滤后序列的长度

  • 结果: 即字符差异的数量

实际应用示例

场景1: 单个字符差异

-- 输入: 'ABC123' vs 'ABC124'
-- 处理过程:
SEQUENCE(1, 6)[1,2,3,4,5,6]
FILTER(...)[6]  -- 只有第6个字符不同 ('3' vs '4')
SIZE([6])1
-- 结果: 1个字符差异 ✅

场景2: 多个字符差异

-- 输入: 'TEST1234' vs 'TEST5678'
-- 处理过程:
SEQUENCE(1, 8)[1,2,3,4,5,6,7,8]
FILTER(...)[5,6,7,8]  -- 第5-8个字符不同 ('1234' vs '5678')
SIZE([5,6,7,8])4
-- 结果: 4个字符差异 ✅

实际应用_[使用序列函数(Spark 3.0+)]

-- 使用sequence函数生成位置序列(Spark 3.0+)
WITH diff_analysis AS (
    SELECT 
        t1.vin,
        t1.engine_no as t1_engine,
        t2.engine_no as t2_engine,
        SIZE(
            FILTER(
                SEQUENCE(1, LENGTH(t1.engine_no)),
                i -> SUBSTRING(UPPER(t1.engine_no), i, 1) != SUBSTRING(UPPER(t2.engine_no), i, 1)
            )
        ) as diff_char_count
    FROM (
        SELECT vin, vehicle_model, engine_no 
        FROM dwd.dwd_001
        WHERE `date` = 20251119
        AND engine_no NOT IN ('', 'N/A')
    ) t1
    INNER JOIN (
        SELECT vin, vehicle_model, engine_no 
        FROM ods.ods_001 
        WHERE `date` = 20251119
        AND engine_no != 'NA'
    ) t2 ON t1.vin = t2.vin
    WHERE LENGTH(t1.engine_no) = LENGTH(t2.engine_no)
        AND UPPER(t1.engine_no) != UPPER(t2.engine_no)
)
SELECT 
    diff_char_count,
    COUNT(*) as record_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 5
GROUP BY diff_char_count
ORDER BY diff_char_count;

验证数据sql

SELECT diff_char_count, vin, t1_engine, t2_engine FROM diff_analysis
WHERE diff_char_count BETWEEN 1 AND 1 limit 100;

字段解释

字段名含义说明
diff_char_count字符差异数量,表示两个表中相同VIN记录的engine_no字段有多少个字符不同
record_count记录数量,表示对应字符差异数量的数据有多少条
percentage百分比,表示该差异级别在所有差异记录中的占比

总结

  • 这个SIZE(FILTER(SEQUENCE(…)))组合函数是一个强大的字符串差异分析工具,通过函数式编程的方式优雅地解决了字符级比较的问题。在实际数据质量分析项目中,它帮助我们精确识别了不同级别的数据不一致问题,为数据治理提供了有力的技术支持。

  • 通过验证SQL的测试,确认该函数能够准确计算字符串间的字符差异数量,为后续的数据清洗和映射规则制定提供了可靠依据。

如果此篇文章有帮助到您, 希望打大佬们能关注点赞收藏评论支持一波,非常感谢大家!
如果有不对的地方请指正!!!

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

lfwh

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值