Linux: node_exporter suddenly goes down (couldn't get dbus connection)

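This article describes a machine-down alert. Checking the system log showed that node_exporter repeatedly failed to start because port 9100 was already in use. The root cause was two node_exporter services competing for port 9100, which pushed the number of D-Bus connections over the limit. The fix is to clean up the D-Bus connections, then stop and remove the redundant node_exporter instance.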


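I. Background

We suddenly received an alert that a machine was down. On logging in, the machine itself looked fine: uptime was normal and there had been no reboot. UID 20029 belongs to the tidb user, `su - tidb` took a long time to complete, and node_exporter was reporting the following error: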

"couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"

II. Troubleshooting

1. Check the system log

cat /var/log/syslog
# The log shows many cases where node_exporter failed to start because port 9100 was already in use
Apr 20 06:07:26 timon-test01eu-p001 node_exporter[17353]: ts=2023-04-20T06:07:26.413Z caller=node_exporter.go:202 level=error err="listen tcp 0.0.0.0:9100: bind: address already in use"
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Unit entered failed state.
Apr 20 06:07:26 timon-test01eu-p001 systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: node_exporter.service: Service hold-off time over, scheduling restart.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Stopped Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Started Prometheus Node Exporter.
Apr 20 06:07:27 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:182 level=info msg="Starting node_exporter" version="(version=1.3.1, branch=HEAD, revision=a2321e7b940ddcff26873612bccdf7cd4c42b6b6)"
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=node_exporter.go:183 level=info msg="Build context" build_context="(go=go1.17.3, user=root@243aafa5525c, date=20211205-11:09:49)"
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.675Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.676Z caller=systemd_linux.go:150 level=info collector=systemd msg="Parsed flag --collector.systemd.unit-include" flag=.+
Apr 20 06:07:27 timon-test01eu-p001 node_exporter[17360]: ts=2023-04-20T06:07:27.676Z caller=systemd_linux.go:152 level=info collector=systemd msg="Parsed flag --collector.systemd.unit-exclude" flag=.+\.(automount|device|mount|scope|slice)
Apr 20 06:12:47 timon-test01eu-p001 run_alertmanager.sh[4079]: level=warn ts=2023-04-20T06:12:47.525820383Z caller=delegate.go:239 component=cluster msg="dropping messages because too many are queued" current=4098 limit=4096
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: DHCPREQUEST of 192.168.1.1 on ens5 to 192.168.24.1 port 67 (xid=0x23f81881)
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: DHCPACK of 192.168.1.1 from 192.168.24.1
Apr 20 06:13:25 timon-test01eu-p001 dhclient[958]: bound to 192.168.1.1 -- renewal in 1688 seconds.
Apr 20 06:15:01 timon-test01eu-p001 CRON[17530]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Apr 20 06:17:01 timon-test01eu-p001 CRON[17554]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 20 06:19:02 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:02Z" level=error msg="ERROR: systemd collector failed after 0.001009s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
Apr 20 06:19:06 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:06Z" level=error msg="ERROR: systemd collector failed after 0.001918s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Starting Cleanup of Temporary Directories...
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:19:06 timon-test01eu-p001 systemd-tmpfiles[17617]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Started Cleanup of Temporary Directories.
Apr 20 06:19:06 timon-test01eu-p001 systemd[1]: Failed to propagate agent release message: Operation not supported
Apr 20 06:19:16 timon-test01eu-p001 run_node_exporter.sh[8202]: time="2023-04-20T06:19:16Z" level=error msg="ERROR: systemd collector failed after 0.002401s: couldn't get units: couldn't get dbus connection: The maximum number of active connections for UID 20029 has been reached" source="collector.go:132"
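
To quantify how often the restart loop hit the port conflict, the log can be filtered rather than read by eye. The following is a minimal sketch, assuming a syslog-style file like the one above; `count_bind_failures` is a hypothetical helper name, not part of node_exporter.

```shell
# Count node_exporter "address already in use" failures in log text read from stdin.
# count_bind_failures is a hypothetical helper name used only for illustration.
count_bind_failures() {
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" keeps a zero count from being reported as a failed command.
    grep -c 'node_exporter.*bind: address already in use' || true
}

# Example usage (log path taken from the article):
#   count_bind_failures < /var/log/syslog
#   ss -ltnp 'sport = :9100'    # show which process currently holds port 9100
```

A large count together with a different PID on each failing line suggests that systemd keeps restarting a unit that can never bind the port.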

2. Inspect the D-Bus connections

busctl list
# The output contains a large number of node_exporter connections
NAME                                 PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.0                                   1 systemd         root             :1.0          init.scope                -          -
:1.18644                           17183 accounts-daemon root             :1.18644      accounts-daemon.service   -          -
:1.2                                1183 systemd-logind  root             :1.2          systemd-logind.service    -          -
:1.3                                1206 polkitd         root             :1.3          polkitd.service           -          -
:1.4                                1205 unattended-upgr root             :1.4          unattended-upgrades.se... -          -
:1.91406151                        27059 snapd           root             :1.91406151   snapd.service             -          -
:1.97799812                         8202 node_exporter   tidb             :1.97799812   node_exporter-9100.ser... -          -
:1.97799813                         8202 node_exporter   tidb             :1.97799813   node_exporter-9100.ser... -          -
:1.97799816                         8202 node_exporter   tidb             :1.97799816   node_exporter-9100.ser... -          -
:1.97799817                         8202 node_exporter   tidb             :1.97799817   node_exporter-9100.ser... -          -
:1.97799818                         8202 node_exporter   tidb             :1.97799818   node_exporter-9100.ser... -          -
:1.97799819                         8202 node_exporter   tidb             :1.97799819   node_exporter-9100.ser... -          -
:1.97799820                         8202 node_exporter   tidb             :1.97799820   node_exporter-9100.ser... -          -
:1.97799821                         8202 node_exporter   tidb             :1.97799821   node_exporter-9100.ser... -          -
:1.97799822                         8202 node_exporter   tidb             :1.97799822   node_exporter-9100.ser... -          -
:1.97799823                         8202 node_exporter   tidb             :1.97799823   node_exporter-9100.ser... -          -
:1.97799824                         8202 node_exporter   tidb             :1.97799824   node_exporter-9100.ser... -          -
:1.97799825                         8202 node_exporter   tidb             :1.97799825   node_exporter-9100.ser... -          -
:1.97799826                         8202 node_exporter   tidb             :1.97799826   node_exporter-9100.ser... -          -
:1.97799827                         8202 node_exporter   tidb             :1.97799827   node_exporter-9100.ser... -          -
:1.97799828                         8202 node_exporter   tidb             :1.97799828   node_exporter-9100.ser... -          -
:1.97799829                         8202 node_exporter   tidb             :1.97799829   node_exporter-9100.ser... -          -
:1.97799830                         8202 node_exporter   tidb             :1.97799830   node_exporter-9100.ser... -          -
:1.97799831                         8202 node_exporter   tidb             :1.97799831   node_exporter-9100.ser... -          -
:1.97799832                         8202 node_exporter   tidb             :1.97799832   node_exporter-9100.ser... -          -
:1.97799833                         8202 node_exporter   tidb             :1.97799833   node_exporter-9100.ser... -          -
:1.97799834                         8202 node_exporter   tidb             :1.97799834   node_exporter-9100.ser... -          -
:1.97799835                         8202 node_exporter   tidb             :1.97799835   node_exporter-9100.ser... -          -
:1.97799836                         8202 node_exporter   tidb             :1.97799836   node_exporter-9100.ser... -          -
:1.97799837                         8202 node_exporter   tidb             :1.97799837   node_exporter-9100.ser... -          -
:1.97799838                         8202 node_exporter   tidb             :1.97799838   node_exporter-9100.ser... -          -
:1.97799839                         8202 node_exporter   tidb             :1.97799839   node_exporter-9100.ser... -          -
:1.97799840                         8202 node_exporter   tidb             :1.97799840   node_exporter-9100.ser... -          -
:1.97799841                         8202 node_exporter   tidb             :1.97799841   node_exporter-9100.ser... -          -
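
With this many rows, a per-user summary makes it easier to see who is exhausting the connection quota. The sketch below assumes the column layout of the `busctl list` header shown above (NAME, PID, PROCESS, USER, ...); `count_dbus_connections` is a hypothetical helper name.

```shell
# Summarise `busctl list` output (read from stdin) as "count user process" lines.
# count_dbus_connections is a hypothetical helper name used only for illustration.
count_dbus_connections() {
    # Only unique connection names (":1.N") are counted, so a well-known name
    # owned by the same connection is not counted a second time.
    # Column 3 is PROCESS and column 4 is USER.
    awk '$1 ~ /^:/ && NF >= 4 { n[$4 " " $3]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Example usage:
#   busctl list --no-legend | count_dbus_connections | head
```

The per-user ceiling that triggers the "maximum number of active connections" error is the `max_connections_per_user` limit in the dbus-daemon configuration, so a count for tidb close to that value matches the error in the log.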

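3. Cause

The cause is that two node_exporter services bound to port 9100 had been started, either both under systemd or one under systemd and one in Docker. The second instance could not bind the port and was restarted over and over, which produced the situation above.

# Check for node_exporter services in Docker or systemd
systemctl list-unit-files|grep node
docker ps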
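
The same check can be narrowed so that only node_exporter units are printed. This is a minimal sketch: `find_node_exporter_units` is a hypothetical helper name, and the match on "node_exporter" assumes the unit names contain that string, as they do in the logs above.

```shell
# Print systemd unit files whose name contains "node_exporter", reading
# `systemctl list-unit-files` output from stdin.
# find_node_exporter_units is a hypothetical helper name used only for illustration.
find_node_exporter_units() {
    # Column 1 of list-unit-files is the unit file name.
    awk '$1 ~ /node_exporter/ { print $1 }'
}

# Example usage:
#   systemctl list-unit-files --no-legend | find_node_exporter_units
#   docker ps --format '{{.Names}} {{.Image}} {{.Ports}}' | grep -i node
```

If more than one unit is printed, or one unit plus a Docker container, they are candidates for the port conflict described above.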

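III. Solution

The situation above arose because systemd hit an unexpected error, which left the node_exporter connections piled up on D-Bus, so this needs to be repaired. Run the commands below to fix it, following a post on the same issue.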

# Restore D-Bus
systemctl daemon-reexec
# Stop and remove the redundant exporter service
# If the service was started by Docker
docker stop node_exporter
docker rm node_exporter
# If the service was started by systemd
systemctl stop node_exporter
systemctl disable node_exporter
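
After the cleanup, it can be worth confirming that only one process still listens on the port. The following is a sketch under the assumption that `ss -ltn` prints the usual columns (State, Recv-Q, Send-Q, Local Address:Port, ...); `count_listeners` is a hypothetical helper name.

```shell
# Count listening sockets on a port from `ss -ltn` output read from stdin.
# count_listeners is a hypothetical helper name; the port defaults to 9100
# as used in the article.
count_listeners() {
    local port="${1:-9100}"
    # Column 4 of `ss -ltn` is "Local Address:Port"; match a trailing ":<port>".
    awk -v p=":${port}\$" '$1 == "LISTEN" && $4 ~ p { n++ } END { print n + 0 }'
}

# Example usage:
#   ss -ltn | count_listeners 9100
```

A single listener on port 9100, together with a `busctl list` output that no longer grows, suggests the duplicate instance is gone.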