DolphinScheduler 3.0.1 (Data Quality)

References

  • As always, thanks to everyone who has shared their write-ups online!

dolphinscheduler 3.0.1 HIVE data source
dolphinscheduler 3.0.1 data quality

1. Modifying the data-quality JAR

1. Configuration file changes

  • To run data quality tasks, you first need to modify the configuration file.
  • File location: /dolphinscheduler/tools/conf/common.properties
  • Change: set data-quality.jar.name to the name of the data-quality JAR under the worker-server libs directory (/dolphinscheduler/worker-server/libs/dolphinscheduler-data-quality-3.0.1.jar); see the snippet after this list.
  • Also, if you build the data-quality module yourself, be sure to change the MySQL 8 connector's scope to compile in the corresponding pom (a sketch follows below); otherwise the task fails with ClassNotFoundException.
  • Do NOT change the scope of the Spark dependencies in the source code!!!
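
A minimal sketch of the relevant common.properties line, assuming the stock 3.0.1 jar name (it must match the file actually present under worker-server/libs):

	# common.properties
	data-quality.jar.name=dolphinscheduler-data-quality-3.0.1.jar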
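
And a sketch of the scope change in the data-quality module's pom.xml. The coordinates below are an assumption (the stock pom may use a different artifact id or manage the version in the parent); the point is only the scope element:

	<dependency>
	    <groupId>mysql</groupId>
	    <artifactId>mysql-connector-java</artifactId>
	    <!-- the stock pom does not bundle the driver; compile packs it into the jar
	         so Spark can load com.mysql.cj.jdbc.Driver at runtime -->
	    <scope>compile</scope>
	</dependency>

Without that change, the submitted Spark job dies immediately: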
[INFO] 2023-05-16 14:30:08.590 +0800 -  -> Exception in thread "main" java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
		at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
		at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
		at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
		at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
		at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:101)
		at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:101)
		at scala.Option.foreach(Option.scala:407)
		at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:101)
		at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:39)
		at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:33)
		at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
		at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
		at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
		at scala.Option.getOrElse(Option.scala:189)
		at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
		at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
		at org.apache.dolphinscheduler.data.quality.flow.batch.reader.JdbcReader.read(JdbcReader.java:72)
		at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.registerInputTempView(SparkBatchExecution.java:77)
		at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.lambda$execute$0(SparkBatchExecution.java:48)
		at java.util.ArrayList.forEach(ArrayList.java:1259)
		at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.execute(SparkBatchExecution.java:48)
		at org.apache.dolphinscheduler.data.quality.context.DataQualityContext.execute(DataQualityContext.java:62)
		at org.apache.dolphinscheduler.data.quality.DataQualityApplication.main(DataQualityApplication.java:70)
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
		at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
		at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
		at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
		at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
		at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
		at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
		at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[INFO] 2023-05-16 14:30:08.854 +0800 - process has exited, execute path:/opt/software/dolphinscheduler/data/exec/process/8741536538048/8741552919232_24/16267/77676, processId:713393 ,exitStatusCode:1 ,processWaitForStatus:true ,processExitValue:1

(screenshot)

(screenshot)

  • Without these changes, the task is guaranteed to fail with the error shown above.
    (screenshot)

2. Ambiguous data source dependency

If the data-quality JAR ends up bundling its own Spark SQL classes alongside the cluster's (see the scope problem in the next subsection), Spark finds two CSV data source implementations on the classpath and aborts:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
	Multiple sources found for csv (org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat),
	 please specify the fully qualified class name.

		at org.apache.spark.sql.errors.QueryCompilationErrors$.findMultipleDataSourceError(QueryCompilationErrors.scala:1045)
		at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:697)
		at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
		at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
		at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
		at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
		at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:839)
		at org.apache.dolphinscheduler.data.quality.flow.batch.writer.file.BaseFileWriter.outputImpl(BaseFileWriter.java:113)
		at org.apache.dolphinscheduler.data.quality.flow.batch.writer.file.HdfsFileWriter.write(HdfsFileWriter.java:40)
		at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.executeWriter(SparkBatchExecution.java:130)
		at org.apache.dolphinscheduler.data.quality.execution.SparkBatchExecution.execute(SparkBatchExecution.java:58)
		at org.apache.dolphinscheduler.data.quality.context.DataQualityContext.execute(DataQualityContext.java:62)
		at org.apache.dolphinscheduler.data.quality.DataQualityApplication.main(DataQualityApplication.java:70)
		at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
		at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
		at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
		at java.lang.reflect.Method.invoke(Method.java:498)
		at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
		at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
		at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
		at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
		at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
		at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
		at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
		at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

3. Changing too many scopes

  • Do NOT change the scope of the Spark dependencies in the source code!!! If you do, the bundled Spark classes conflict with the version already installed on the cluster. It took me quite a while to figure this out... avoid this pitfall!
Doing so fails with exactly the same "Multiple sources found for csv" AnalysisException shown in the previous subsection.
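
For clarity, the Spark entries in the stock data-quality pom are shaped roughly like this (illustrative; the exact artifacts and version properties may differ) and their provided scope must stay untouched:

	<dependency>
	    <groupId>org.apache.spark</groupId>
	    <artifactId>spark-sql_${scala.binary.version}</artifactId>
	    <!-- leave provided: the cluster supplies Spark at runtime;
	         bundling it causes the duplicate-CSV-source conflict above -->
	    <scope>provided</scope>
	</dependency>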

2. Implementation

1. Spark submission

  • DolphinScheduler submits the job through the installed Spark, using the resource settings you configure on the task: the data-quality JAR is launched with a generated JSON definition that drives the quality rules, the data reads, and the result writes. A sketch of the generated command follows.
    (screenshot)
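
Roughly, the worker generates a command of the following shape (the resource flags and paths are illustrative assumptions, not copied from a real log; the main class is the one visible in the stack traces above):

	${SPARK_HOME2}/bin/spark-submit \
	    --master yarn \
	    --deploy-mode cluster \
	    --class org.apache.dolphinscheduler.data.quality.DataQualityApplication \
	    --driver-memory 512M \
	    --executor-memory 2G \
	    --num-executors 2 \
	    /dolphinscheduler/worker-server/libs/dolphinscheduler-data-quality-3.0.1.jar \
	    '{"name":"$t(uniqueness_check)", ...}'   # the JSON definition parsed in the next subsection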

2. Structure of the generated JSON definition

{
	// Source file: dolphinscheduler_dml.sql
	// Table: t_ds_dq_rule
	// This table configures the basic behaviour of the built-in rules; see Figure 1 below
    "name":"$t(uniqueness_check)",
    "env":{
        "type":"batch",
        "config":null
    },
    "readers":[
        {
            "type":"HIVE",
            "config":{
                "database":"${hive_database_name}",
                "password":"${password}",
                "driver":"org.apache.hive.jdbc.HiveDriver",
                "user":"root",
                "output_table":"${hive_database_name}_${hive_table_name}",
                "table":"${hive_table_name}",
                "url":"jdbc:hive2://${ip}:${port}/${hive_database_name}"
            }
        },
        {
            "type":"JDBC",
            "config":{
                "database":"${mysql_database_name}",
                "password":"${password}",
                "driver":"com.mysql.cj.jdbc.Driver",
                "user":"root",
                "output_table":"t_ds_dq_task_statistics_value",
                "table":"t_ds_dq_task_statistics_value",
                "url":"jdbc:mysql://${ip}:${port}/${mysql_database_name}?useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false&amp;allowPublicKeyRetrieval=true&amp;allowLoadLocalInfile=false&amp;autoDeserialize=false&amp;allowLocalInfile=false&amp;allowUrlInLocalInfile=false"
            }
        }
    ],
    "transformers":[
        {
        	// Source file: dolphinscheduler_dml.sql
			// Table: t_ds_dq_comparison_type
			// This table configures the comparison types used by the rules; see Figure 2 below
            "type":"sql",
            "config":{
                "index":1,
                "output_table":"day_range",
                "sql":"select round(avg(statistics_value),2) as day_avg from t_ds_dq_task_statistics_value where data_time &gt;=date_trunc('DAY', '2023-05-15 17:28:53') and data_time &lt; date_add(date_trunc('day', '2023-05-15 17:28:53'),1) and unique_code = 'JRS5R5DSESMCYT3LQ5JQ6GV2I/R7J/KYV2/KTF1N9V4=' and statistics_name = 'cnt'"
            }
        },
        {
            "type":"sql",
            "config":{
                "index":2,
                "output_table":"duplicate_items",
                "sql":"SELECT type FROM ${hive_database_name}_${hive_table_name} group by type having count(*) &gt; 1"
            }
        },
        {
        	// Source file: dolphinscheduler_dml.sql
			// Table: t_ds_dq_rule_execute_sql
			// This table configures the SQL statements each rule executes; see Figure 3 below
            "type":"sql",
            "config":{
                "index":3,
                "output_table":"duplicate_count",
                "sql":"SELECT COUNT(*) AS duplicates FROM duplicate_items"
            }
        }
    ],
    "writers":[
        {
            "type":"JDBC",
            "config":{
                "database":"${mysql_database_name}",
                "password":"${password}",
                "driver":"com.mysql.cj.jdbc.Driver",
                "user":"root",
                "table":"t_ds_dq_execute_result",
                "url":"jdbc:mysql://${ip}:${port}/${mysql_database_name}?useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false&amp;allowPublicKeyRetrieval=true&amp;allowLoadLocalInfile=false&amp;autoDeserialize=false&amp;allowLocalInfile=false&amp;allowUrlInLocalInfile=false",
                "sql":"select 0 as rule_type,'$t(uniqueness_check)' as rule_name,0 as process_definition_id,16235 as process_instance_id,77488 as task_instance_id,cnt AS statistics_value,day_range.day_avg AS comparison_value,2 AS comparison_type,0 as check_type,0 as threshold,0 as operator,0 as failure_strategy,'hdfs://${HDFS_NAME}/data_quality_error_data/0_16235_data_quality_test' as error_output_path,'2023-05-15 17:28:53' as create_time,'2023-05-15 17:28:53' as update_time from duplicate_count full join day_range"
            }
        },
        {
            "type":"JDBC",
            "config":{
                "database":"${mysql_database_name}",
                "password":"${password}",
                "driver":"com.mysql.cj.jdbc.Driver",
                "user":"root",
                "table":"t_ds_dq_task_statistics_value",
                "url":"jdbc:mysql://${ip}:${port}/${mysql_database_name}?useUnicode=true&amp;characterEncoding=UTF-8&amp;useSSL=false&amp;allowPublicKeyRetrieval=true&amp;allowLoadLocalInfile=false&amp;autoDeserialize=false&amp;allowLocalInfile=false&amp;allowUrlInLocalInfile=false",
                "sql":"select 0 as process_definition_id,77488 as task_instance_id,6 as rule_id,'JRS5R5DSESMCYT3LQ5JQ6GV2I/R7J/KYV2/KTF1N9V4=' as unique_code,'cnt'AS statistics_name,cnt AS statistics_value,'2023-05-15 17:28:53' as data_time,'2023-05-15 17:28:53' as create_time,'2023-05-15 17:28:53' as update_time from duplicate_count"
            }
        },
        {
            "type":"hdfs_file",
            "config":{
                "path":"hdfs://${HDFS_NAME}/data_quality_error_data/0_16235_data_quality_test",
                "input_table":"duplicate_items"
            }
        }
    ]
}

1. Figure 1

(screenshot: t_ds_dq_rule)

2. Figure 2

(screenshot: t_ds_dq_comparison_type)

3. Figure 3, and a problem

(screenshot: t_ds_dq_rule_execute_sql)

1. Unknown cnt column
  • As Figure 3 shows, the SQL statement actually stored is:
    • Original SQL: SELECT COUNT(*) AS duplicates FROM duplicate_items
  • But the downstream writer SQL references that output table as duplicate_count.cnt,
  • which leads to the error below.
    • The fix is to change the stored statement in the t_ds_dq_rule_execute_sql table to:
    • Updated SQL: SELECT COUNT(*) AS cnt FROM duplicate_items (an UPDATE sketch follows the stack trace)
[INFO] 2023-05-15 17:29:04.335 +0800 -  -> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'cnt' given input columns: [day_range.day_avg, duplicate_count.duplicates]; line 1 pos 138;
	'Project [0 AS rule_type#69, (uniqueness_check) AS rule_name#70, 0 AS process_definition_id#71, 16235 AS process_instance_id#72, 77488 AS task_instance_id#73, 'cnt AS statistics_value#74, day_avg#60 AS comparison_value#75, 2 AS comparison_type#76, 0 AS check_type#77, 0 AS threshold#78, 0 AS operator#79, 0 AS failure_strategy#80, hdfs://HDFS8001702/user/yarn/data_quality_error_data/0_16235_data_quality_test AS error_output_path#81, 2023-05-15 17:28:53 AS create_time#82, 2023-05-15 17:28:53 AS update_time#83]
	+- Join FullOuter
	……………………
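
A minimal sketch of the fix, assuming the statement is stored in a column named `sql` (verify the actual column name and affected row in your own t_ds_dq_rule_execute_sql before running anything):

	-- hypothetical fix; check your own table contents first
	UPDATE t_ds_dq_rule_execute_sql
	SET `sql` = 'SELECT COUNT(*) AS cnt FROM duplicate_items'
	WHERE `sql` = 'SELECT COUNT(*) AS duplicates FROM duplicate_items';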

3. Creating a data source

  • Monitoring data quality presupposes source data to monitor, so the first step is to create the target data source in DolphinScheduler's Datasource Center.
  • DolphinScheduler supports these data source types: MYSQL, POSTGRESQL, HIVE/IMPALA, SPARK, CLICKHOUSE, ORACLE, SQLSERVER, DB2, PRESTO, REDSHIFT. All the commonly used databases seem to be covered.
  • Creation is straightforward: pick the database type and fill in the IP and port. For HIVE my port is 7001 because I am testing on Tencent Cloud; the usual default is 10000, which is the port on which HiveServer2 exposes its Thrift service. Whether a password is needed depends on your setup. Since the connection goes directly to HiveServer2, the database address takes no HDFS prefix. Once created, make sure the connection test passes; a sample URL is sketched after this list.
    (screenshot)
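
For reference, the JDBC URL the form ultimately produces has this shape (host and database are placeholders):

	jdbc:hive2://<hiveserver2-host>:10000/<database>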

4. The data quality check node

At first I spent quite a while hunting inside the Data Quality module for a way to create a monitoring task, until I noticed DATA_QUALITY among the task types. I...
Judging from the failure strategies in the official docs, two data quality scenarios are foreseeable:
(screenshot)

  1. A dedicated module that handles quality monitoring and alerting for all core data
  2. Monitoring upstream task data, so downstream tasks do not run on empty input and problems are caught early, avoiding wasted resources

The core configuration items are shown below; the available monitoring rules can be browsed on the data quality Rule Management page.
(screenshot)

  • Data quality tasks run as Spark jobs by default. The docs note that "the runtime environment for data quality tasks is Spark 2.4.0; other versions have not been verified and users may verify them themselves". My cluster currently runs Spark 3.2.2, so for now I compromised and set up a separate Spark 2.4, which works (see the env sketch after this bullet).
    (screenshot)
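
A minimal sketch of pointing the workers at a second Spark installation, assuming the stock bin/env/dolphinscheduler_env.sh layout of 3.0.x (the /opt paths are placeholders; select the matching Spark version on the task afterwards):

	# bin/env/dolphinscheduler_env.sh on the worker nodes
	export SPARK_HOME1=${SPARK_HOME1:-/opt/soft/spark-3.2.2}   # existing cluster Spark
	export SPARK_HOME2=${SPARK_HOME2:-/opt/soft/spark-2.4.0}   # Spark 2.4 for data quality tasks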

5. Data quality results
