Data Governance Component: Apache Griffin

Original article. First published 2025-04-01 10:31:50; last modified 2025-04-01 11:20:44.

License: CC 4.0 BY-SA

Tags: #apache #big-data

Introduction to Griffin

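Apache Griffin provides a well-defined data quality domain model that covers most common data quality problems. It also defines a data quality DSL that helps users specify their quality criteria. It offers a unified process for measuring data quality from different perspectives, helping you build trusted data assets and increasing your confidence in your business.
Advantages: Griffin runs on the Spark compute engine and can connect to almost any database on the market. Because the checks run in Spark rather than inside the database, audit jobs place little load on the database beyond reading the data, and they produce an audit report.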

Griffin architecture

[Figure: Griffin architecture diagram]

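Data Quality Definitions
The data quality definition module. It covers the following dimensions:
1. Accuracy
2. Completeness
3. Timeliness
4. Uniqueness
5. Validity
6. Consistency
Scheduler
Launches audit jobs on the Spark compute engine.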
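
As a rough illustration of one of these dimensions, the sketch below computes a uniqueness figure over a toy list of records in plain Python. It is not Griffin code: the function name `uniqueness`, the result keys, and the sample rows are all assumptions made for the example.

```python
from collections import Counter


def uniqueness(records, key):
    """Summarize how often each value of `key` occurs in `records`."""
    counts = Counter(r[key] for r in records)
    return {
        "total": len(records),                # number of records checked
        "distinct": len(counts),              # number of different values
        "duplicated_values": sum(1 for c in counts.values() if c > 1),
    }


rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
result = uniqueness(rows, "id")
print(result)
```

A real job would run an equivalent aggregation in Spark; the point here is only what the dimension measures.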

Griffin data flow

1. Define data quality
2. Compute data quality
3. Obtain the data audit results
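
The three stages can be sketched in a few lines of plain Python. This is a simplified model, not Griffin's implementation: the `measure` dictionary, the `evaluate` function, and the age check are hypothetical names chosen for illustration.

```python
# Stage 1: define the quality rule (here, a validity check on one field).
measure = {"name": "age_valid", "check": lambda row: 0 <= row["age"] <= 150}


# Stage 2: compute the rule over the data.
def evaluate(measure, rows):
    passed = sum(1 for row in rows if measure["check"](row))
    return {"name": measure["name"], "total": len(rows), "passed": passed}


# Stage 3: obtain the audit result.
rows = [{"age": 30}, {"age": -1}, {"age": 72}]
report = evaluate(measure, rows)
print(report)
```

In Griffin the definition lives in a JSON file, the computation runs as a Spark job, and the result goes to the configured sinks.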

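Usage example
a. Define the environment configuration file
It specifies the Spark engine parameters and where audit results are written (printed to the console, written to HDFS, or written to Elasticsearch).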

{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "console"
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/persist"
      }
    },
    {
      "type": "elasticsearch",
      "config": {
        "method": "post",
        "api": "http://es:9200/griffin/accuracy"
      }
    }
  ]
}
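
To make the role of the `sinks` list concrete, the following sketch parses the configuration above and reports where each sink would send results. It only reads the JSON; it writes nothing to HDFS or Elasticsearch. The `describe_sink` helper is a hypothetical name, not part of Griffin.

```python
import json

env = json.loads("""
{
  "spark": {"log.level": "WARN"},
  "sinks": [
    {"type": "console"},
    {"type": "hdfs", "config": {"path": "hdfs:///griffin/persist"}},
    {"type": "elasticsearch",
     "config": {"method": "post", "api": "http://es:9200/griffin/accuracy"}}
  ]
}
""")


def describe_sink(sink):
    """Return a one-line description of where results would be written."""
    kind = sink["type"].lower()
    config = sink.get("config", {})
    if kind == "console":
        return "console: print to stdout"
    if kind == "hdfs":
        return "hdfs: write under " + config["path"]
    if kind == "elasticsearch":
        return "elasticsearch: " + config["method"].upper() + " to " + config["api"]
    raise ValueError("unknown sink type: " + kind)


destinations = [describe_sink(s) for s in env["sinks"]]
for line in destinations:
    print(line)
```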

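b. Define the audit rules
Specify the connection details for each data source, including the data source name, then configure an accuracy rule to obtain the audit result.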

{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connector": {
        "type": "jdbc",
        "config": {
          "user": "xxx",
          "password": "xxx",
          "tablename":  "stu",
          "where": "id < 3",
          "url":"jdbc:mysql://localhost:3306/test",
          "database": "test",
          "driver": "com.mysql.jdbc.Driver"
        }
      }
    },
    {
      "name": "target",
      "connector": {
        "type": "jdbc",
        "config": {
          "user": "xxx",
          "password": "xxx",
          "tablename":  "stu2",
          "where": "id < 3",
          "url":"jdbc:mysql://localhost:3306/test",
          "database": "test",
          "driver": "com.mysql.jdbc.Driver"
        }
      }
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "out.dataframe.name": "accu",
        "rule": "source.id = target.id AND upper(source.name) = upper(target.name) ",
        "details": {
          "source": "source",
          "target": "target",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "out": [
          {
            "type": "record",
            "name": "missRecords"
          }
        ]
      }
    ]
  },
  "sinks": [
    "consoleSink"
  ]
}
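
What the accuracy rule measures can be shown with a small plain-Python sketch: a source row counts as matched when some target row satisfies the rule, and as a miss otherwise. This assumes that reading of the rule and uses made-up sample rows; Griffin itself evaluates the rule as Spark SQL, and the `accuracy` function below is hypothetical.

```python
def accuracy(source, target):
    """Count source rows that have a matching target row.

    Mirrors the rule:
    source.id = target.id AND upper(source.name) = upper(target.name)
    """
    target_keys = {(t["id"], t["name"].upper()) for t in target}
    missing = [s for s in source
               if (s["id"], s["name"].upper()) not in target_keys]
    total = len(source)
    metrics = {
        "total_count": total,
        "miss_count": len(missing),
        "matched_count": total - len(missing),
    }
    return metrics, missing


# Toy stand-ins for tables stu and stu2 after the "id < 3" filter.
source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "bob"}]
target = [{"id": 1, "name": "ANN"}, {"id": 2, "name": "Rob"}]

metrics, miss_records = accuracy(source, target)
print(metrics)
print(miss_records)
```

The metric keys match the names given under `details`, and `miss_records` corresponds to the `missRecords` record output.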

Griffin code walkthrough

How a DataSource is created

  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connector": {
        "type": "jdbc",
        "config": {
          "user": "xxx",
          "password": "xxx",
          "tablename":  "stu",
          "where": "id < 3",
          "url":"jdbc:mysql://localhost:3306/test",
          "database": "test",
          "driver": "com.mysql.jdbc.Driver"
        }
      }
    }]

// In the run method of BatchDQApp, load the data sources
val dataSources = DataSourceFactory.getDataSources(sparkSession, null, dqParam.getDataSources)
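
The general idea of turning each `data.sources` entry into a connector object can be sketched as a factory keyed on the connector `type`. This is a plain-Python model under stated assumptions, not Griffin's Scala code: `JdbcConnector`, `CONNECTORS`, `get_data_sources`, and the generated query string are all hypothetical.

```python
class JdbcConnector:
    """Hypothetical stand-in for a JDBC connector; it only builds a query string."""

    def __init__(self, config):
        self.table = config["tablename"]
        self.where = config.get("where")

    def query(self):
        sql = "SELECT * FROM " + self.table
        if self.where:
            sql += " WHERE " + self.where
        return sql


# Registry from the "type" field to a connector class (illustrative only).
CONNECTORS = {"jdbc": JdbcConnector}


def get_data_sources(params):
    """Build one connector per entry in "data.sources", keyed by source name."""
    sources = {}
    for param in params:
        connector = param["connector"]
        kind = connector["type"].lower()
        if kind not in CONNECTORS:
            raise ValueError("unsupported connector type: " + kind)
        sources[param["name"]] = CONNECTORS[kind](connector["config"])
    return sources


params = [{
    "name": "source",
    "baseline": True,
    "connector": {
        "type": "jdbc",
        "config": {"tablename": "stu", "where": "id < 3"},
    },
}]
data_sources = get_data_sources(params)
print(data_sources["source"].query())
```

Supporting another storage system in this model means registering one more class in `CONNECTORS`.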

[Figure: code screenshot]
The method that obtains a DataConnector is shown below.

The Hive DataConnector is implemented as follows.
So how is it invoked in BatchDQApp?

Building the job

  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "completeness",
        "out.dataframe.name": "comp",
        "rule": "email, post_code, first_name",
        "out": [
          {
            "type": "metric",
            "name": "comp"
          }
        ]
      }
    ]
  }
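
The completeness rule lists the columns that must be populated. A plain-Python sketch of the computation follows, assuming a record counts as incomplete when any listed column is null. The `completeness` function, its result keys, and the sample rows are assumptions for illustration; Griffin performs the equivalent check in Spark.

```python
def completeness(rows, columns):
    """Count rows in which every listed column has a non-null value."""
    total = len(rows)
    incomplete = sum(
        1 for row in rows if any(row.get(c) is None for c in columns)
    )
    return {"total": total, "incomplete": incomplete, "complete": total - incomplete}


rows = [
    {"email": "a@example.com", "post_code": "10001", "first_name": "Ann"},
    {"email": None, "post_code": "10002", "first_name": "Bob"},
    {"email": "c@example.com", "post_code": None, "first_name": None},
]
comp = completeness(rows, ["email", "post_code", "first_name"])
print(comp)
```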

[Figure: code screenshot]
The code turns out to build three kinds of DQStep.