I. Background
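We suddenly received an alert that a machine was down. On logging in, the machine looked healthy: uptime was normal and there had been no reboot. UID 20029 belongs to the tidb user, `su - tidb` took a long time to complete, and node_exporter was reporting the following error: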
couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:13
II. Troubleshooting
1. Check the system log
cat /var/log/syslog
# The log shows many node_exporter startup failures because port 9100 is already in use
Apr 20 06:07:26 timon-test01eu-p001 node_exporter[17353]: ts=2023-04-20T06:07:26.413Z caller=node_exporter.go:202 level=error err="listen tcp 0.0.0.0:9100: bind: address already in use"
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Unit entered failed state.
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: node_exporter.service: Service hold-off time over, scheduling restart.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Stopped Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Started Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.3.1, branch=HEAD, revision=a2321e7b940ddcff26873612bccdf7cd4c42b6b6)"
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.17.3, user=root@243aafa5525c, date=20211205-11:09:49)"
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.676Z caller=systemd_linux.go:150 level=info collector=systemd msg="Parsed flag --collector.systemd.unit-include" flag=.+
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.676Z caller=systemd_linux.go:152 level=info collector=systemd msg="Parsed flag --collector.systemd.unit-exclude" flag=.+\.(automount|device|mount|scope|slice)
Apr 20 06:12:47 timon-test01eu-p001 run_alertmanager.sh[4079]: level=warn ts=2023-04-20T06:12:47.525820383Z caller=delegate.go:239 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: DHCPREQUEST of 192.168.1.1 on ens5 to 192.168.24.1 port 67 (xid=0x23f81881)
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: DHCPACK of 192.168.1.1 from 192.168.24.1
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: bound to 192.168.1.1 -- renewal in 1688 seconds.
Apr 20 06:15:01 timon-test01eu-p001 CRON[17530]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 20 06:17:01 timon-test01eu-p001 CRON[17554]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 20 06:19:02 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:02Z" level=error msg="ERROR: systemd collector failed after 0.001009s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
Apr 20 06:19:06 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:06Z" level=error msg="ERROR: systemd collector failed after 0.001918s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Starting Cleanup of Temporary Directories...
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:19:06 timon-test01eu-p001 systemd-tmpfiles[17617]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Started Cleanup of Temporary Directories.
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:19:16 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:16Z" level=error msg="ERROR: systemd collector failed after 0.002401s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
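To gauge how often each failure occurs without reading the whole log, the two error messages seen above can be counted. This is a minimal sketch: the function name `summarize_node_exporter_errors` is made up for illustration, and the log path is assumed to be `/var/log/syslog` as in the step above.

```shell
# summarize_node_exporter_errors FILE
# Counts the two failure signatures from the log excerpt above.
# (Hypothetical helper name; the patterns are copied from the log lines.)
summarize_node_exporter_errors() {
  log_file=$1
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  bind_errors=$(grep -c 'bind: address already in use' "$log_file" || true)
  dbus_errors=$(grep -c 'maximum number of active connections' "$log_file" || true)
  printf 'bind_errors=%s dbus_errors=%s\n' "$bind_errors" "$dbus_errors"
}

# Usage: summarize_node_exporter_errors /var/log/syslog
```

A high `bind_errors` count suggests a restart loop on the port, while `dbus_errors` shows how often the systemd collector was refused a D-Bus connection.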
2. Inspect the D-Bus connections
busctl list
# The output contains a large number of node_exporter connections
NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
:1.0 1 systemd root :1.0 init.scope - -
:1.18644 17183 accounts-daemon root :1.18644 accounts-daemon.service - -
:1.2 1183 systemd-logind root :1.2 systemd-logind.service - -
:1.3 1206 polkitd root :1.3 polkitd.service - -
:1.4 1205 unattended-upgr root :1.4 unattended-upgrades.se... - -
:1.91406151 27059 snapd root :1.91406151 snapd.service - -
:1.97799812 8202 node_exporter tidb :1.97799812 node_exporter-9100.ser... - -
:1.97799813 8202 node_exporter tidb :1.97799813 node_exporter-9100.ser... - -
:1.97799816 8202 node_exporter tidb :1.97799816 node_exporter-9100.ser... - -
:1.97799817 8202 node_exporter tidb :1.97799817 node_exporter-9100.ser... - -
:1.97799818 8202 node_exporter tidb :1.97799818 node_exporter-9100.ser... - -
:1.97799819 8202 node_exporter tidb :1.97799819 node_exporter-9100.ser... - -
:1.97799820 8202 node_exporter tidb :1.97799820 node_exporter-9100.ser... - -
:1.97799821 8202 node_exporter tidb :1.97799821 node_exporter-9100.ser... - -
:1.97799822 8202 node_exporter tidb :1.97799822 node_exporter-9100.ser... - -
:1.97799823 8202 node_exporter tidb :1.97799823 node_exporter-9100.ser... - -
:1.97799824 8202 node_exporter tidb :1.97799824 node_exporter-9100.ser... - -
:1.97799825 8202 node_exporter tidb :1.97799825 node_exporter-9100.ser... - -
:1.97799826 8202 node_exporter tidb :1.97799826 node_exporter-9100.ser... - -
:1.97799827 8202 node_exporter tidb :1.97799827 node_exporter-9100.ser... - -
:1.97799828 8202 node_exporter tidb :1.97799828 node_exporter-9100.ser... - -
:1.97799829 8202 node_exporter tidb :1.97799829 node_exporter-9100.ser... - -
:1.97799830 8202 node_exporter tidb :1.97799830 node_exporter-9100.ser... - -
:1.97799831 8202 node_exporter tidb :1.97799831 node_exporter-9100.ser... - -
:1.97799832 8202 node_exporter tidb :1.97799832 node_exporter-9100.ser... - -
:1.97799833 8202 node_exporter tidb :1.97799833 node_exporter-9100.ser... - -
:1.97799834 8202 node_exporter tidb :1.97799834 node_exporter-9100.ser... - -
:1.97799835 8202 node_exporter tidb :1.97799835 node_exporter-9100.ser... - -
:1.97799836 8202 node_exporter tidb :1.97799836 node_exporter-9100.ser... - -
:1.97799837 8202 node_exporter tidb :1.97799837 node_exporter-9100.ser... - -
:1.97799838 8202 node_exporter tidb :1.97799838 node_exporter-9100.ser... - -
:1.97799839 8202 node_exporter tidb :1.97799839 node_exporter-9100.ser... - -
:1.97799840 8202 node_exporter tidb :1.97799840 node_exporter-9100.ser... - -
:1.97799841 8202 node_exporter tidb :1.97799841 node_exporter-9100.ser... - -
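Rather than scrolling through the listing, the connections can be grouped by process and user. The sketch below assumes the column layout shown above (PROCESS in column 3, USER in column 4) and counts only unique names, which start with `:`. The function name `count_dbus_connections` is hypothetical.

```shell
# count_dbus_connections
# Reads `busctl list` output on stdin and prints "COUNT PROCESS USER",
# busiest first. Only unique connection names (":1.x") are counted.
count_dbus_connections() {
  awk '$1 ~ /^:/ && NF >= 4 { n[$3 " " $4]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}

# Usage: busctl list --no-legend | count_dbus_connections
```

If one process owned by a single user dominates the result, as node_exporter under tidb does here, that process is the likely source of the "maximum number of active connections" error.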
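3. Cause
This happens because two node_exporter services that both claim port 9100 were started, either both under systemd or one under systemd and one in Docker. The instance that cannot bind the port keeps failing and being restarted, and these frequent retries lead to the situation above.
# Check for node_exporter services under Docker or systemd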
systemctl list-unit-files|grep node
docker ps
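It can also help to see which process actually holds port 9100. The sketch below filters `ss -ltnp` output; it assumes the usual `ss` column layout (local address and port in column 4, process info in the last column), and the function name `port_listeners` is made up for illustration.

```shell
# port_listeners PORT
# Reads `ss -ltnp` output on stdin and prints the local address and the
# owning process for sockets listening on PORT.
port_listeners() {
  # Match ":PORT" at the end of the local address so 19100 does not match 9100.
  awk -v p=":$1" '$4 ~ (p "$") { print $4, $NF }'
}

# Usage (root is usually needed to see other users' processes):
#   sudo ss -ltnp | port_listeners 9100
```

If the owner turns out to be a Docker proxy process rather than node_exporter itself, the conflicting instance is probably the container listed by `docker ps`.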
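III. Resolution
The situation above arose because systemd hit an unexpected error, which left the node_exporter connections piled up on D-Bus, so it needs to be repaired. Run the commands below to fix it (based on the referenced post).
# Restore D-Bus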
systemctl daemon-reexec
# Stop and remove the redundant exporter service
# If the service was started by Docker
docker stop node_exporter
docker rm node_exporter
# If the service was started by systemd
systemctl stop node_exporter
systemctl disable node_exporter
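After the cleanup, it may be worth confirming that the tidb user is no longer near the per-user connection limit. The sketch below is an assumption-laden check: the default limit of 256 is what dbus-daemon commonly ships as `max_connections_per_user`, but the actual value on a given host may differ, and the function names are hypothetical.

```shell
# dbus_connections_for_user USER
# Reads `busctl list` output on stdin; prints how many unique names USER owns.
dbus_connections_for_user() {
  awk -v u="$1" '$1 ~ /^:/ && $4 == u { n++ } END { print n + 0 }'
}

# check_dbus_limit USER [LIMIT]
# LIMIT defaults to 256 (assumed dbus-daemon default; verify on your host).
check_dbus_limit() {
  user=$1
  limit=${2:-256}
  count=$(dbus_connections_for_user "$user")
  if [ "$count" -ge "$limit" ]; then
    printf 'LIMIT_REACHED %s %s/%s\n' "$user" "$count" "$limit"
    return 1
  fi
  printf 'OK %s %s/%s\n' "$user" "$count" "$limit"
}

# Usage: busctl list --no-legend | check_dbus_limit tidb
```

An `OK` line with a small count after the fix suggests the leaked connections are gone; a count that keeps climbing would point to a remaining duplicate exporter.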