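I. Background
We suddenly received an alert that a machine was down. Logging in showed that the machine was healthy: uptime was normal and there had been no reboot. The user with UID 20029 is tidb, su - tidb took a long time to complete, and node_exporter reported the following error: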
couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:13
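The error above points at the number of DBus connections held by one user. A minimal sketch for counting connections per user is shown here; it assumes the busctl tool from systemd is available and that the fourth column of its list output is the user name. The function name dbus_connections_per_user is hypothetical and not part of the original post.

```shell
#!/bin/sh
# Hypothetical helper: read `busctl list` rows on stdin and print
# "<count> <user>" pairs, highest count first.
# Assumption: column 4 is the USER column and process names contain no spaces.
dbus_connections_per_user() {
    awk 'NF >= 4 { count[$4]++ } END { for (u in count) print count[u], u }' |
        sort -rn
}

# Possible usage (run as root, since the affected user may be unable to connect):
#   busctl list --unique --no-legend | dbus_connections_per_user
```

If the tidb user sits at the top with a count in the hundreds, that would be consistent with the per-user connection limit having been reached.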
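II. Troubleshooting
1. Check the system log
cat /var/log/syslog
# The log shows a large number of failures where port 9100 was already in use, which caused node_exporter to fail to start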
Apr 20 06:07:26 timon-test01eu-p001 node_exporter[17353]: ts=2023-04-20T06:07:26.413Z caller=node_exporter.go:202 level=error err="listen tcp 0.0.0.0:9100: bind: address already in use"
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Unit entered failed state.
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: node_exporter.service: Service hold-off time over, scheduling restart.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Stopped Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Started Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.3.1, branch=HEAD, revision=a2321e7b940ddcff26873612bccdf7cd4c42b6b6)"
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.1
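Rather than reading the whole syslog, the bind failures can be counted directly. The following is a small sketch under the assumption that the log lines look like the excerpt above; the function name count_bind_failures and its default arguments are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: count "address already in use" bind failures
# for a given port in a syslog-style file.
# $1: log file (assumed default /var/log/syslog), $2: port (assumed default 9100)
count_bind_failures() {
    log_file="${1:-/var/log/syslog}"
    port="${2:-9100}"
    # -c prints the number of matching lines; -F matches the pattern literally.
    # grep exits non-zero when the count is 0, but still prints "0".
    grep -cF ":${port}: bind: address already in use" "$log_file"
}

# Possible usage:
#   count_bind_failures /var/log/syslog 9100
```

A count that keeps growing by roughly one per second would match the restart loop visible in the log, where systemd schedules a restart right after each failed start.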

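Summary: after receiving an alert that the machine was down, checking the system log showed repeated "address already in use" failures on port 9100, which prevented the node_exporter service from starting. The root cause was that two node_exporter services on the system were both using port 9100, and this pushed the number of DBus connections over the limit. The fix is to clean up the DBus connections, then stop and remove the redundant node_exporter service instance.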
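To find which process currently holds port 9100, the output of ss can be filtered. This is a sketch only: it assumes ss -ltnp output in which the fourth column is the local address and port, and the function name pids_on_port is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -ltnp` output on stdin and print the unique
# PIDs of processes listening on the given port.
# Assumption: column 4 is "Local Address:Port" (true when only TCP is listed).
pids_on_port() {
    port="$1"
    awk -v p=":${port}" '$4 ~ (p "$") { print }' |
        grep -o 'pid=[0-9]*' |
        cut -d= -f2 |
        sort -u
}

# Possible usage (run as root so that process details are visible):
#   ss -ltnp | pids_on_port 9100
#   systemctl status <PID>    # shows the unit that the process belongs to
```

Passing a PID to systemctl status reports the owning unit, which helps tell the node_exporter service that holds the port apart from the one stuck in the restart loop.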