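Data quality checks can be run against Hive's metadata. For example, when Hive's table metadata is stored in MySQL, it can be queried directly in MySQL: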
mysql> desc TBLS;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| TBL_ID             | bigint(20)   | NO   | PRI | NULL    |       |
| CREATE_TIME        | int(11)      | NO   |     | NULL    |       |
| DB_ID              | bigint(20)   | YES  | MUL | NULL    |       |
| LAST_ACCESS_TIME   | int(11)      | NO   |     | NULL    |       |
| OWNER              | varchar(767) | YES  |     | NULL    |       |
| RETENTION          | int(11)      | NO   |     | NULL    |       |
| SD_ID              | bigint(20)   | YES  | MUL | NULL    |       |
| TBL_NAME           | varchar(128) | YES  | MUL | NULL    |       |
| TBL_TYPE           | varchar(128) | YES  |     | NULL    |       |
| VIEW_EXPANDED_TEXT | mediumtext   | YES  |     | NULL    |       |
| VIEW_ORIGINAL_TEXT | mediumtext   | YES  |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
mysql> desc TABLE_PARAMS;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| TBL_ID      | bigint(20)    | NO   | PRI | NULL    |       |
| PARAM_KEY   | varchar(256)  | NO   | PRI | NULL    |       |
| PARAM_VALUE | varchar(4000) | YES  |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
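Take a zero-row check as an example: it can be written as a script and run on a schedule, so that you always know which tables contain 0 rows.
A zero-row check covers tables that should never be empty. If such a table does have 0 rows, an alert should be raised promptly and the cause investigated. When the empty table would also affect downstream jobs, consider blocking those jobs from running: this cuts down on redundant alerts from downstream failures and saves the resources those jobs would otherwise consume.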
mysql> select a.TBL_ID, a.TBL_NAME, b.PARAM_KEY, b.PARAM_VALUE from TBLS as a join TABLE_PARAMS as b where a.TBL_ID = b.TBL_ID and TBL_NAME="score" and PARAM_KEY="numRows";
+--------+----------+-----------+-------------+
| TBL_ID | TBL_NAME | PARAM_KEY | PARAM_VALUE |
+--------+----------+-----------+-------------+
|      7 | score    | numRows   | 0           |
|     33 | score    | numRows   | 0           |
|    151 | score    | numRows   | 0           |
|    242 | score    | numRows   | 0           |
+--------+----------+-----------+-------------+
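The zero-row check described above could be scripted roughly as follows. This is only a minimal sketch under several assumptions: an in-memory SQLite database stands in for the MySQL metastore so the example runs on its own, the demo tables contain only the columns the query needs, and the names `find_zero_row_tables`, `run_check`, and `build_demo_metastore` are hypothetical. Note also that `numRows` is a statistic recorded by Hive, so the check is only as reliable as those statistics are current.

```python
import sqlite3

# Same join as the MySQL query above, written with an explicit ON clause
# and without the table-name filter, so every table is checked.
ZERO_ROW_QUERY = """
SELECT a.TBL_ID, a.TBL_NAME
FROM TBLS AS a
JOIN TABLE_PARAMS AS b ON a.TBL_ID = b.TBL_ID
WHERE b.PARAM_KEY = 'numRows' AND b.PARAM_VALUE = '0'
ORDER BY a.TBL_ID
"""


def find_zero_row_tables(conn):
    """Return (TBL_ID, TBL_NAME) pairs whose numRows statistic is 0."""
    cur = conn.cursor()
    cur.execute(ZERO_ROW_QUERY)
    return cur.fetchall()


def run_check(conn):
    """Print an alert per empty table; return 1 if any were found, else 0.

    A scheduler could use the return value as an exit code to decide
    whether downstream jobs should be blocked.
    """
    empty_tables = find_zero_row_tables(conn)
    for tbl_id, tbl_name in empty_tables:
        print(f"ALERT: table {tbl_name} (TBL_ID={tbl_id}) has 0 rows")
    return 1 if empty_tables else 0


def build_demo_metastore():
    """Hypothetical stand-in for the metastore, reduced to the needed columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT)")
    conn.execute(
        "CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT)"
    )
    conn.executemany("INSERT INTO TBLS VALUES (?, ?)", [(7, "score"), (8, "emp")])
    conn.executemany(
        "INSERT INTO TABLE_PARAMS VALUES (?, ?, ?)",
        [(7, "numRows", "0"), (8, "numRows", "14"), (8, "totalSize", "0")],
    )
    return conn
```

In a real deployment the connection would come from a MySQL driver pointed at the metastore database, and the script would be triggered by cron or a workflow scheduler.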
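Threshold checks can be expressed through a SQL-like syntax and, like the zero-row check, run as offline scheduled jobs. A threshold check does have to settle the question of scope: sampling is cheaper than a full scan, but a full scan is more reliable, so the choice has to be weighed against available resources and business requirements.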
hive> set checkMode = SAMPLING;
hive> select * from emp where empno>100;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 NULL 10
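The trade-off between sampling and a full scan can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function `check_threshold`, its parameters, and the `[1000, 5000]` bounds are hypothetical, and the input values are simply the sixth column of the four `emp` rows shown above.

```python
import random


def check_threshold(values, low, high, mode="FULL", sample_size=100, seed=None):
    """Return the checked values that fall outside the range [low, high].

    mode="FULL" inspects every value; mode="SAMPLING" inspects at most
    sample_size randomly chosen values, so it is cheaper but may miss
    violations that are not in the sample.
    """
    if mode not in ("FULL", "SAMPLING"):
        raise ValueError(f"unknown check mode: {mode}")
    values = list(values)
    if mode == "SAMPLING" and len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    return [v for v in values if not (low <= v <= high)]


# Sixth column of the emp rows above (hypothetical bounds of 1000 to 5000).
salaries = [800.0, 1600.0, 1250.0, 2975.0]

# A full scan always finds the out-of-range value 800.0.
full_violations = check_threshold(salaries, 1000, 5000)

# A sample of two rows may or may not include 800.0.
sampled_violations = check_threshold(
    salaries, 1000, 5000, mode="SAMPLING", sample_size=2, seed=42
)
```

The sampled result is always a subset of the full result, which is the sense in which the full scan is the safer option.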
Summary
Data quality checks can be implemented as scripts that run on a schedule.