Troubleshooting delayed ELK logs and 100% Logstash CPU utilization

The story:

Business A at our company runs on two Logstash nodes. Their CPU utilization had always hovered around 40~50%, and another colleague, Z, had been looking after them. Then one day the CPU alert started firing, he had no idea how to fix it, and he pulled me in to work on it together.

Symptoms: Logstash CPU utilization suddenly shot up and stayed pinned at 100%, and logs were arriving late. Adding CPU resources didn't help much; after we added two more Logstash nodes the lag went away, but CPU utilization was still very high...

Troubleshooting:

1. We looked at the per-index log volume. It had indeed grown: two indices were up 30%~50% while the rest looked normal. So our first thought was that adding CPU cores would do it, and we did just that. It worked! See, is there anything more resources can't fix? Then night came, the CPU pinned itself at 100% again, and the panic set in.
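
For a quick look at per-index daily volume, the _cat API works well. The index pattern below is an assumption based on the output section of the pipeline config further down:

# list matching indices with document count and on-disk size, largest first
curl -s 'http://192.168.20.76:9200/_cat/indices/aaa-*?v&h=index,docs.count,store.size&s=docs.count:desc'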

2. Plan: restart the Logstash service to bring the CPU back down. This is production, so this kind of thing is normally done quietly, but it's a two-node HA setup, so we just went for it. And then, panic: the moment Logstash came back up, CPU was pinned and the runtime log was nothing but errors. I asked Z; he said it was fine, it had always looked like that. I was skeptical, but chose to trust the old hand's calm. Then we restarted the other node and it threw a pile of errors too, and even the old hand couldn't stay calm anymore. In the scramble we noticed the logs had fallen behind during our restarts. At first we thought shipping had stopped altogether, but after letting it run for a while it turned out events were still flowing; there was just a backlog in Kafka that cleared after some time. We even had our story ready: Z noticed the log lag and came to me, I worked some magic, the logs are now catching up, and the CPU is high because it's chewing through the backlog. It sounded so convincing I almost believed it myself. Let's see how long that holds.
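
To confirm the backlog was draining rather than stuck, the consumer group's lag can be watched directly on Kafka. The broker and group below are the ones from the pipeline config further down; the script path depends on the Kafka install:

# the LAG column shows how far the Logstash consumers are behind on each partition
bin/kafka-consumer-groups.sh --bootstrap-server 192.168.20.73:9092 --describe --group bbb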

3. Ugh. Why is it alerting again? The logs were fully caught up when I checked the day before, so why is the CPU still alerting? Annoying, and I had no leads, so I gritted my teeth and went back through the logs. And actually found two problems:

[2025-02-13T00:00:08,677][WARN ][logstash.filters.kv      ][main][d56775fd1fcfbcec3a2c73fa866ae5246b1b4f0f25a71902554353148819ff6e] Timeout reached in KV filter with value entry too large; first 255 chars are `"HOST=[aaaa] TID=[11] TIME=[2025-02-12 23:59:35.384] Type=[I] REFFUN=[tmesis.http.server.doService(18)] Msg=[{\"S-_MODULE\":\"server\",\"S-_SERVICE\":\"Service\",\"S-_CLIENT\":\"DESKTOP-U4F8AND\",\"S-SERVICE_FROM\":\"COMC\",\"S-CALL_SERVI"`
[2025-02-13T00:00:11,441][WARN ][logstash.outputs.elasticsearch][main][fa3aff76f47ae58bbd1bd4d1453ae4f0a17e77151f4dc3ad86a69b512dbbf01e] Could not index event to Elasticsearch. {:status=>400, :action=>["create", {:_id=>nil, :_index=>"MMEESS-2025.02.12", :routing=>nil}, {"REFFUN"=>"tmesis.service.Service.run(205)", "CHANNEL"=>"/PRDA/Server", "Type"=>"I", "HOST"=>"PRDA01", "EQUIP"=>"AAA-09-151", "TIME"=>"2025-02-13 00:00:00.868", "fields"=>{"logtopic"=>"plan", "factory"=>"AAA8", "app"=>"AAA3"}, "log"=>{"file"=>{"path"=>"d:\\PlantU\\Server\\log\\log.log"}, "flags"=>["multiline"], "offset"=>45652}, "RID"=>"20251232135", "TID"=>"460", "@timestamp"=>2025-02-12T16:00:00.868Z, "BEGIN_END"=>"BEGIN", "message"=>"[SERVICE] HOST=[PRDA01] RID=[2025123123123123] TID=[123] CHANNEL=[/PRDA/Server] TIME=[2025-02-13 00:00:00.868] CLIENT=[] CLIENTVER=[] Type=[I] SVC=[E2MQueryCarrierInforForERack] BEGIN_END=[BEGIN] EQUIP=[AAA-09-151] REFFUN=[tmesis.service.Service.run(205)] \nPARAMETER=[{\n  \"S-_FACTORY\": \"AAA\",\n  \"S-_LANGUAGE\": \"EN\",\n  \"S-_PASSPORT\": \"\",\n  \"S-_PASSWORD\": \"\",\n  \"S-_USERID\": \"AAA\",\n  \"S-OPERATOR_ID\": \"AAA\",\n  \"S-EQUIP_ID\": \"AAA-09-151\",\n  \"LT-CARRIER_LIST\": [\n    {\n      \"S-CARRIER_ID\": \"AAA017\"\n    }\n  ],\n  \"STEP\": \"P\",\n  \"S-_SERVICE\": \"E2Mk\",\n  \"S-_MODULE\": \"EISM\"\n}]\n\tRequest Begin\n", "info"=>"HOST=[PRDA01] RID=[2025021123123123 TID=[460] CHANNEL=[/PRDA/Server] TIME=[2025-02-13 00:00:00.868] CLIENT=[] CLIENTVER=[] Type=[I] SVC=[EorERack] BEGIN_END=[BEGIN] EQUIP=[AAA-09-151] REFFUN=[tmesis.service.Service.run(205)] \n", "SVC"=>"ASDADASDADA", "MOLD"=>"SERVICE"}], :response=>{"create"=>{"_index"=>"AAA-2025.02.12", "_type"=>"_doc", "_id"=>"ZPzj-pQBWASPWzZbqJy-", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [EQUIP] of type [date] in document with id 'ZPzj-pQBWASPWzZbqJy-'. Preview of field's value: 'AAA-09-151'", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"failed to parse date field [AAA-09-151] with format [strict_date_optional_time||epoch_millis]", "caused_by"=>{"type"=>"date_time_parse_exception", "reason"=>"date_time_parse_exception: Failed to parse with all enclosed parsers"}}}}}}

Let's take a closer look:

        1) A value entry too large for the kv filter. I searched around, but after tweaking the limit-related parameter Logstash wouldn't even start (see the kv timeout sketch after this list).

        2) The 400 error: posts online blame the default cap of 1,000 shards per node. I checked; we had already raised that setting to 10,000 and currently have only a bit over 2,000 shards, so that can't be it (a quick way to verify is shown after this list).

        3) Malformed host-name and time values: in that day's index the EQUIP field is mapped as a date, so values like AAA-09-151 fail date parsing. Right, so the application had changed its log format (a mapping-side workaround is sketched below).
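
On 1): that warning comes from the kv filter's per-event timeout. The filter has a timeout_millis option (default 30000 ms; 0 disables the check entirely). Whether that is the same "limit" parameter we tried back then is an assumption; a minimal sketch looks like this:

kv {
  source => "info"
  # raise the per-event parse timeout for oversized value entries
  # (default 30000 ms; 0 disables the timeout)
  timeout_millis => 60000
}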
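
On 2): both the configured per-node shard cap and the current shard count can be checked against the cluster directly (the host is one of the ES nodes from the output section):

# show the effective cluster.max_shards_per_node setting, defaults included
curl -s 'http://192.168.20.76:9200/_cluster/settings?include_defaults=true&filter_path=**.max_shards_per_node'
# count the shards currently in the cluster
curl -s 'http://192.168.20.76:9200/_cat/shards?h=index' | wc -l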
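
On 3): the 400 in the second log entry is a mapper_parsing_exception: EQUIP ended up mapped as a date in that day's index, so values like AAA-09-151 can no longer be indexed. What we actually did is the grok change in step 4; an alternative (or complement) on the Elasticsearch side would be an index template that pins EQUIP to keyword and turns off date detection. The template name and index pattern below are assumptions:

# hypothetical legacy template; the new mapping only applies from the next daily index onwards
curl -s -X PUT 'http://192.168.20.76:9200/_template/aaa-field-types' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["aaa-*"],
  "order": 10,
  "mappings": {
    "date_detection": false,
    "properties": {
      "EQUIP": { "type": "keyword" }
    }
  }
}'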

4. At this point I sort of understood the errors and sort of didn't. So I took one log line, ran it through an online grok debugger to generate a matching pattern, added one more pattern to Logstash's match rules, restarted Logstash, and it took effect.

The online Grok Debug tool I used (Grok validation / debugging) seems to be unavailable now, unfortunately.

What we added is the first line below: the log format no longer matched and Logstash couldn't parse the Msg portion of the message, so we simply drop everything from Msg=[ onwards. With that, CPU utilization came back down.

            "message" , "(?m)^\[%{WORD:MOLD}\] (?<info>(.*?)(?= Msg=\[))",
            "message" , "(?m)^\[%{WORD:MOLD}\] (?<info>(.*?\n))"

input {
  kafka {
    id                => "kafka"
    client_id         => "kafka"
    bootstrap_servers => "192.168.20.73:9092,192.168.20.74:9092,192.168.20.75:9092"
    topics            => ["aaaaaa"]
    codec             => json { charset => "UTF-8" }
    group_id          => "bbb"
    consumer_threads  => 3
    auto_offset_reset => "earliest"
    decorate_events   => false
  }
}

filter {

  grok {
    # The first pattern captures everything before " Msg=[" into info, so the oversized
    # Msg payload never reaches the kv filter; the second pattern is the fallback and
    # captures only the first line of the message.
    match => [
      "message", "(?m)^\[%{WORD:MOLD}\] (?<info>(.*?)(?= Msg=\[))",
      "message", "(?m)^\[%{WORD:MOLD}\] (?<info>(.*?\n))"
    ]
  }

  # Split the extracted info string into key=value fields.
  kv {
    source => "info"
  }

  mutate {
    convert      => { "DurTime" => "float" }
    remove_field => ["host", "agent", "ecs", "tag", "@version", "input"]
  }

  # Use the TIME field from the log line as the event timestamp.
  date {
    match    => ["TIME", "yyyy-MM-dd HH:mm:ss.SSS", "ISO8601"]
    timezone => "Asia/Shanghai"
    target   => "@timestamp"
  }

}

output {
  #stdout { codec => rubydebug }
  elasticsearch {
    hosts           => ["192.168.20.76:9200", "192.168.20.86:9200", "192.168.20.87:9200"]
    manage_template => true
    action          => "create"
    data_stream     => "auto"
    index           => "aaa-%{[fields][logtopic]}-%{[fields][factory]}-%{[fields][app]}-%{+YYYY.MM.dd}"
  }
}
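
Before restarting the service in production, the edited pipeline can be syntax-checked first. The paths below assume a package install with pipelines under /etc/logstash/conf.d:

# dry-run the configuration and exit; nothing is consumed or shipped
/usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit --path.config /etc/logstash/conf.d/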

CPU utilization after the fix looked like this:
