Preface
Our production nginx currently generates roughly one billion log lines per day, and the logs are already pushed into a Kafka queue in real time. We want to monitor and analyze them in real time, so first let's lay out the requirements:
1. Monitor response codes in real time
2. Monitor traffic changes per request path in real time
3. Monitor request time in real time
4. Parse selected parameters out of the request and monitor how they change
The hardest part of the above is parsing specific fields out of the request for monitoring; each log line has to go through one round of parsing before any statistics can be computed on those fields.
We considered several options:
- The first idea was to build an Elasticsearch cluster and use it for real-time aggregation and search
- Use one of the popular stream-processing frameworks such as Flink, Spark, or Storm
- Use our existing Presto cluster, i.e. Presto on Kafka
一、Choosing a solution
Given our actual constraints, it is not convenient to add more machines and nobody is available to write dedicated code, so we went with the third option:
kafka --> logstash --> kafka --> presto
二、Configuration
1. Logstash configuration
The configuration file is as follows:
# Our nginx logs are pipe-delimited: xx|xx|x|x|
input {
    kafka {
        topics => "nginx"
        group_id => "logstash"
        bootstrap_servers => "192.168.x.xx:9092"
        codec => "plain"
    }
}
filter {
    # Extract the fields we need; not every field is used
    mutate {
        split => { "message" => "|" }
        add_field => {
            "time" => "%{[message][3]}"
            "request" => "%{[message][7]}"
            "response" => "%{[message][9]}"
            "req_len" => "%{[message][5]}"
            "bytes" => "%{[message][10]}"
            "up_addr" => "%{[message][17]}"
        }
        remove_field => [ "message" ]
    }
    # path?a=xx&b=xx&uuid=xxx&c=xx
    # Parse the request; here we only extract uuid
    kv {
        source => "request"
        field_split => "&?"
        target => "kv"
        include_keys => ["uuid"]
    }
    # Extract the path without the query string; we want to know which path gets the most traffic
    mutate {
        split => { "request" => "?" }
        add_field => { "req_url" => "%{[request][0]}" }
        remove_field => ["request"]
    }
    # Replace @timestamp with the time taken from the nginx log
    date {
        match => [ "time", "[dd/MMM/YYYY:HH:mm:ss Z]" ]
        timezone => "+08:00"
        target => "@timestamp"
        locale => "en"
    }
    # Add a field with @timestamp converted to a long (epoch milliseconds)
    ruby {
        code => "event.set('unix_ms_time',(event.get('@timestamp').to_f.round(3)*1000).to_i)"
    }
}
output {
    #stdout{}
    kafka {
        bootstrap_servers => "192.168.1.xx:9092"
        codec => json
        topic_id => "nginxlog"
    }
}
The processed message looks like this:
{
    "req_len": "1096",
    "response": "200",
    "req_url": "/get/api",
    "unix_ms_time": 1618912479000,
    "@timestamp": "2021-04-20T09:54:39.000Z",
    "bytes": "46",
    "time": "[20/Apr/2021:17:54:39 +0800]",
    "up_addr": "192.168.16.62:8080",
    "kv": {
        "uuid": "898767826552698"
    },
    "@version": "1"
}
After the Logstash pass we have extracted only the fields we need and discarded most of the useless payload, so the data can now be queried directly with Presto on Kafka.
2. Adding the Kafka connector to Presto
Reference: https://prestodb.io/docs/current/connector.html
Add a kafka.properties file under Presto's catalog directory:
connector.name=kafka
kafka.nodes=192.168.xx.xx:9092
kafka.table-names=nginxlog
kafka.hide-internal-columns=false
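Even with just this properties file (once the Presto server has been restarted so the new catalog is loaded), the topic can already be queried through the connector's built-in internal columns. A minimal sanity check, assuming the catalog is named kafka after the properties file above, might look like this:
-- sanity check: read a few raw messages through the internal _message column
select _message from kafka.default.nginxlog limit 5;
To expose the parsed fields as real columns, a topic description file is needed.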
Create a kafka directory under Presto's etc directory, and create nginxlog.json inside it:
{
    "tableName": "nginxlog",
    "schemaName": "default",
    "topicName": "nginxlog",
    "message": {
        "dataFormat": "json",
        "fields": [
            {
                "name": "req_len",
                "mapping": "req_len",
                "type": "VARCHAR"
            },
            {
                "name": "bytes",
                "mapping": "bytes",
                "type": "VARCHAR"
            },
            {
                "name": "response",
                "mapping": "response",
                "type": "VARCHAR"
            },
            {
                "name": "req_url",
                "mapping": "req_url",
                "type": "VARCHAR"
            },
            {
                "name": "time_ms",
                "mapping": "unix_ms_time",
                "type": "BIGINT"
            },
            {
                "name": "up_addr",
                "mapping": "up_addr",
                "type": "VARCHAR"
            },
            {
                "name": "kv",
                "mapping": "kv",
                "type": "VARCHAR"
            }
        ]
    }
}
Restart the Presto server.
3. Analysis and monitoring
bin/presto --server localhost:8080 --catalog kafka --schema default
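The column listing below is simply the result of describing the table from inside the CLI; two quick checks, using the catalog and schema passed on the command line above:
show tables;
describe nginxlog;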
Column | Type | Extra | Comment
-------------------+---------+-------+---------------------------------------------
req_len | varchar | |
bytes | varchar | |
response | varchar | |
req_url | varchar | |
time_ms | bigint | |
up_addr | varchar | |
kv | varchar | |
_partition_id | bigint | | Partition Id
_partition_offset | bigint | | Offset for the message within the partition
_message_corrupt | boolean | | Message data is corrupt
_message | varchar | | Message text
_message_length | bigint | | Total number of message bytes
_key_corrupt | boolean | | Key data is corrupt
_key | varchar | | Key text
_key_length | bigint | | Total number of key bytes
_timestamp | bigint | | Offset Timestamp
(16 rows)
The columns starting with an underscore are the connector's built-in internal columns:
_timestamp is the time the message was written to Kafka
_partition_id is the partition the message sits in
With these in place, we can do the statistical analysis directly in SQL.
-- Response code distribution
select count(*) as total, response from nginxlog where _timestamp > xx and _timestamp < xx and _partition_id = x group by response order by total desc;
-- Request path distribution over a time window
select count(*) as total, req_url from nginxlog where _timestamp > xx and _timestamp < xx and _partition_id = x group by req_url order by total desc;
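Two more sketches in the same style: a per-minute view of traffic for a single path (the path here is just the one from the sample message), and the distribution of the uuid parameter parsed out of the request (requirement 4). The uuid query assumes the kv column surfaces the nested object as JSON text such as {"uuid":"..."}; how a nested object is rendered depends on the JSON decoder, so verify with a plain select kv from nginxlog limit 5 first.
-- Requests per minute for one path over the last 10 minutes (window and path chosen only for illustration)
select date_trunc('minute', from_unixtime(time_ms / 1000)) as minute, count(*) as total from nginxlog where _timestamp > to_unixtime(now() - interval '10' minute) * 1000 and req_url = '/get/api' group by 1 order by 1;
-- Distribution of the uuid parameter (assumes kv contains the JSON text)
select json_extract_scalar(kv, '$.uuid') as uuid, count(*) as total from nginxlog where _timestamp > xx and _timestamp < xx group by 1 order by total desc;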