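At work I often have to process JSON-format logs sent over by other departments or other companies. Problems such as malformed JSON or missing fields tend to surface only during parsing. The developers then have to fix them and I have to check again. This back-and-forth wastes time and leaves me reacting to problems after the fact, so I wrote a Python script that can be handed to the developers for self-checking, which saves trouble later on.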
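JSON format check
The first step is to parse the JSON string in Python. Nested JSON is often concatenated sloppily, for example with extra double quotes around the inner object. Problems like this can be caught with the following code: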
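import json
# Runs inside the loop over input lines (see the complete script further down)
try:
    data = json.loads(line.strip())  # line is the JSON string to check
except ValueError:  # raised by json.loads on malformed input
    print "json格式有误,请检查===>",line  # message: "malformed JSON, please check"
    continue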
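To make the failure mode concrete, here is a small sketch of what the parser reports for a nested object that was wrapped in extra double quotes. It is written for Python 3 (the code in this post is Python 2), where `json.loads` raises `json.JSONDecodeError`, a subclass of `ValueError` that carries the error message and the column. The helper name `describe_json_error` is made up for this illustration.

```python
import json


def describe_json_error(line):
    """Return None if the line parses as JSON, else a short error description."""
    try:
        json.loads(line.strip())
    except json.JSONDecodeError as exc:
        # msg and colno are attributes of JSONDecodeError
        return "%s at column %d" % (exc.msg, exc.colno)
    return None


good = '{"a": 1, "js": {"d": 4}}'
bad = '{"a": 1, "js": "{"d": 4}"}'  # inner object wrapped in stray quotes

print(describe_json_error(good))
print(describe_json_error(bad))
```

Reporting the column can help developers find the stray quote faster than the raw line alone.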
Date format check
Sometimes the date format is inconsistent. It can be checked as follows:
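import time
try:
    dt = data['dt']
    time.strptime(dt,"%Y-%m-%d %H:%M:%S")
except (KeyError, TypeError, ValueError):  # dt missing, not a string, or in the wrong format
    print "日期时间字段(dt)格式有误,请检查===>",line  # message: "bad datetime field (dt), please check"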
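One thing worth knowing: `strptime` is lenient about zero padding, so a value like `2017-1-2 3:4:5` passes the check above. If the log format requires the exact zero-padded form, one option is to format the parsed value back and compare it with the input. The following is a minimal Python 3 sketch of that idea. `is_strict_dt` and `DT_FORMAT` are hypothetical names, and whether padding should be enforced is an assumption about the log specification.

```python
from datetime import datetime

DT_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_strict_dt(value):
    """True only if value is a string in exactly the zero-padded DT_FORMAT."""
    if not isinstance(value, str):
        return False
    try:
        parsed = datetime.strptime(value, DT_FORMAT)
    except ValueError:
        return False
    # Round trip: formatting the parsed value must reproduce the input
    return parsed.strftime(DT_FORMAT) == value


print(is_strict_dt("2017-11-02 11:11:11"))
print(is_strict_dt("2017-11-02"))
print(is_strict_dt("2017-1-2 3:4:5"))
```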
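Check for forbidden Chinese characters
Sometimes certain fields must not contain Chinese characters. This can be checked as follows: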
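import re
zh_pattern = re.compile(u'[\u4e00-\u9fa5]+')  # common CJK ideographs
z = data.get('z')  # .get avoids a KeyError when z is missing
if isinstance(z, basestring) and zh_pattern.search(z):
    print "z字段中,含有中文,请使用英文===>",line  # message: "field z contains Chinese, please use English"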
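Note that the range `\u4e00-\u9fa5` covers the common CJK ideographs but not full-width punctuation such as `，` (U+FF0C), which often sneaks in along with Chinese text. If the real requirement is "ASCII only", which is an assumption on my part and not something stated above, a broader check may fit better. A Python 3 sketch with made-up helper names:

```python
import re

ZH_PATTERN = re.compile("[\u4e00-\u9fa5]+")  # same range as in the post


def contains_chinese(value):
    """True if value contains at least one character in the CJK range above."""
    return bool(ZH_PATTERN.search(value))


def non_ascii_chars(value):
    """Return every character outside ASCII, in order of appearance."""
    return [ch for ch in value if ord(ch) > 127]


text = "hello\uff0cworld"  # contains a full-width comma, no ideographs
print(contains_chinese(text))
print(["U+%04X" % ord(ch) for ch in non_ascii_chars(text)])
```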
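Required field check
The logs are required to contain certain fields, and those fields must not be empty. This can be checked as follows: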
check = ['a','b','c']
# collect the fields that are absent, null, or an empty string
missing = [tag for tag in check if data.get(tag) in (None, '')]
if missing:
    print ",".join(missing),"字段缺失或值为空,请检查===>",line  # message: "fields missing or empty, please check"
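The same check can be wrapped in a small function that returns the offending field names, which makes it easy to try on a few dictionaries. This is a Python 3 sketch, and `missing_fields` is a hypothetical name. Like the code above, it treats only absent keys, null, and the empty string as missing, so a value of 0 still counts as present.

```python
def missing_fields(data, required):
    """Return the required field names that are absent, None, or ''."""
    missing = []
    for tag in required:
        value = data.get(tag)  # None when the key is absent
        if value is None or value == "":
            missing.append(tag)
    return missing


required = ["a", "b", "c"]
print(missing_fields({"a": 1, "js": {"d": 4}}, required))
print(missing_fields({"a": 0, "b": "", "c": None}, required))
```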
Putting all of the above together, the complete script (check.py) is as follows:
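#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Python 2 script: reads JSON log lines from stdin and reports format problems
import sys
import re
import json
import time
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 only: implicit str/unicode conversions use UTF-8

def checklog():
    check = ['a','b','c']  # required fields
    zh_pattern = re.compile(u'[\u4e00-\u9fa5]+')  # compiled once, outside the loop
    cnt = json_error = dt_error = zh_error = miss_error = 0
    for line in sys.stdin:
        if not line or not line.strip():
            continue
        line = "".join(i for i in line if ord(i)>31)  # strip control characters
        cnt += 1
        # JSON format
        try:
            data = json.loads(line.strip())
            if not isinstance(data, dict):
                raise ValueError("top-level value is not a JSON object")
        except ValueError:
            print "json格式有误,请检查===>",line  # "malformed JSON, please check"
            json_error += 1
            continue
        # dt field
        try:
            dt = data['dt']
            time.strptime(dt,"%Y-%m-%d %H:%M:%S")
        except (KeyError, TypeError, ValueError):
            print "日期时间字段(dt)格式有误,请检查===>",line  # "bad datetime field (dt), please check"
            dt_error += 1
        # field that must not contain Chinese; a missing z is counted too instead of crashing
        z = data.get('z')
        if not isinstance(z, basestring):
            print "z字段缺失或不是字符串,请检查===>",line  # "field z missing or not a string, please check"
            zh_error += 1
        elif zh_pattern.search(z):
            print "z字段中含有中文,请使用英文===>",line  # "field z contains Chinese, please use English"
            zh_error += 1
        # other required fields
        missing = [tag for tag in check if data.get(tag) in (None, '')]
        if missing:
            print ",".join(missing),"字段缺失或值为空,请检查===>",line  # "fields missing or empty, please check"
            miss_error += 1
    print '===================完成==================='  # "done"
    # summary: total lines, JSON errors, dt errors, z errors or missing, other required fields missing
    print '本次检查共%d条日志, json格式错误%d条,dt字段错误%d条,z字段错误或缺失%d条,其他必要字段缺失%d条'%(cnt,json_error,dt_error,zh_error,miss_error)

if __name__=='__main__':
    checklog()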
Test it with the following file (t.txt):
{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"b":2,"c":3,"js":{"d":4}}
{"dt":"2017-11-02","z":"hello","a":1,"b":2,"c":3,"js":{"d":4}}
{"dt":"2017-11-02 11:11:11","z":"中","a":1,"b":2,"c":3,"js":{"d":4}}
{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"b":2,"c":3,"js":"{"d":4}"}
{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"js":{"d":4}}
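Running the script on it produces the following output. The messages are printed in Chinese: lines 2 to 5 of the file are reported for a bad dt field, Chinese text in z, malformed JSON, and missing b and c fields respectively, followed by a summary of the counts: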
cat t.txt | python check.py
日期时间字段(dt)格式有误,请检查===> {"dt":"2017-11-02","z":"hello","a":1,"b":2,"c":3,"js":{"d":4}}
z字段中含有中文,请使用英文===> {"dt":"2017-11-02 11:11:11","z":"中","a":1,"b":2,"c":3,"js":{"d":4}}
json格式有误,请检查===> {"dt":"2017-11-02 11:11:11","z":"hello","a":1,"b":2,"c":3,"js":"{"d":4}"}
b,c 字段缺失或值为空,请检查===> {"dt":"2017-11-02 11:11:11","z":"hello","a":1,"js":{"d":4}}
===================完成===================
本次检查共5条日志, json格式错误1条,dt字段错误1条,z字段错误或缺失1条,其他必要字段缺失1条
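The script above targets Python 2 (`print` statements, `reload(sys)`). For readers on Python 3, the same checks could look roughly like the sketch below. It is an illustrative port, not the original script: the names `check_line` and `check_stream` are invented, the messages are reduced to English category names, and a missing or non-string `z` is counted as a z error, which is my reading of the summary line.

```python
import json
import re
from datetime import datetime

REQUIRED = ["a", "b", "c"]
ZH_PATTERN = re.compile("[\u4e00-\u9fa5]+")


def check_line(line):
    """Return the list of problem categories found in one log line."""
    try:
        data = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return ["json"]
    if not isinstance(data, dict):
        return ["json"]

    problems = []
    try:
        datetime.strptime(data["dt"], "%Y-%m-%d %H:%M:%S")
    except (KeyError, TypeError, ValueError):
        problems.append("dt")

    z = data.get("z")
    if not isinstance(z, str) or ZH_PATTERN.search(z):
        problems.append("z")

    if any(data.get(tag) in (None, "") for tag in REQUIRED):
        problems.append("missing")
    return problems


def check_stream(lines):
    """Check every non-blank line and return the counts per category."""
    counts = {"total": 0, "json": 0, "dt": 0, "z": 0, "missing": 0}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        counts["total"] += 1
        for problem in check_line(line):
            counts[problem] += 1
            print("line %d: %s check failed" % (counts["total"], problem))
    return counts


# The five lines of t.txt from the post
sample = [
    '{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"b":2,"c":3,"js":{"d":4}}',
    '{"dt":"2017-11-02","z":"hello","a":1,"b":2,"c":3,"js":{"d":4}}',
    '{"dt":"2017-11-02 11:11:11","z":"中","a":1,"b":2,"c":3,"js":{"d":4}}',
    '{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"b":2,"c":3,"js":"{"d":4}"}',
    '{"dt":"2017-11-02 11:11:11","z":"hello","a":1,"js":{"d":4}}',
]
result = check_stream(sample)
print(result)
```

To check a real file, `check_stream` can be given `sys.stdin` or an open file object instead of the list.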