Prometheus on k8s fails with: err="log series: open /prometheus/wal: no such file or directory"

Troubleshooting:
Check the logs of both pods: prometheus-k8s-0 and prometheus-operator-78bd98fc99-bmkbq.
The prometheus-k8s-0 log shows err="log series: open /prometheus/wal: no such file or directory":
k logs prometheus-k8s-0 -n monitoring -c prometheus -f
(k is presumably a shell alias for kubectl.)
The kubelet log, viewed with journalctl -f -u kubelet, reports: remove xxx : device or resource busy
A likely cause is corrupted data that prevents the pod from modifying its data directory.
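The two log checks above can be scripted. The following is a minimal sketch rather than the tooling used in this post: the function names has_wal_error and has_busy_error are made up for illustration, and both simply read log text on standard input and look for the error strings quoted above.

```shell
# Hypothetical helpers: each reads log text on stdin and succeeds (exit 0)
# if the error string quoted in this post appears in it.
has_wal_error() {
    grep -F -q 'open /prometheus/wal: no such file or directory'
}

has_busy_error() {
    # -i because the message may be capitalized differently.
    grep -F -i -q 'device or resource busy'
}
```

Possible usage, assuming kubectl and journalctl are available on the node:
kubectl logs prometheus-k8s-0 -n monitoring -c prometheus | has_wal_error && echo "WAL directory missing"
journalctl -u kubelet --no-pager | has_busy_error && echo "kubelet cannot remove a busy path"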
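Attempted fixes:
1. Delete or move the data directory by hand.
2. Find the processes using the data directory, kill them, then remove the directory by hand.
Fix 1, step by step:
Locate the data directory and check whether it contains any data: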

# This path comes from an internal environment set up to reproduce the customer's environment for testing
ll /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus/2

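The directory is empty, so delete d8a23fa5-98ea-11ea-b346-52540018083f_bak02 directly.
This fails with: device or resource busy
Another process is probably holding the directory, so stop the docker service and try again.
It still fails with: device or resource busy. Move on to the second fix.
Fix 2, step by step:
Find the processes using the directory, kill them, then operate on the directory: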

lsof -w +D d8a23fa5-98ea-11ea-b346-52540018083f_bak02

Then run kill -9 PID and retry rm -rf d8a23fa5-98ea-11ea-b346-52540018083f_bak02.
This still fails with: device or resource busy
Analysis: the directory is probably a mount point, so it cannot be deleted directly.

df /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus
df -h /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus

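Sure enough, df shows the path is on a mounted filesystem under /data, so it has to be unmounted before the directory can be modified. The steps are: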
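df reports the filesystem that contains a path. To check whether the path itself is a mount point, one option is to look it up in the kernel's mount table. The sketch below assumes a Linux host where /proc/mounts is readable and paths contain no whitespace; is_mount_point and its optional second argument are illustrative names, not part of any existing tool.

```shell
# is_mount_point PATH [MOUNTS_FILE]
# Succeeds (exit 0) if PATH appears as a mount target in the mount table.
# MOUNTS_FILE defaults to /proc/mounts; the parameter exists only so the
# check can be tried against a sample file.
# Assumption: paths contain no whitespace or backslashes.
is_mount_point() {
    imp_path="${1%/}"                 # drop one trailing slash
    [ -n "$imp_path" ] || imp_path="/"
    awk -v p="$imp_path" '$2 == p { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}
```

Possible usage:
is_mount_point /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus/2 && echo "still mounted"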

# Check docker status
service docker status
# Check kubelet status
systemctl status kubelet
# Stop docker
service docker stop
# Unmount the mounted directory
umount /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus/2
# If umount fails, list every process using the path with lsof -w +D xxx, kill them, then run umount again.
# Confirm the directory is no longer mounted
df /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02/volume-subpaths/prometheus-k8s-db/prometheus/2
# Delete the broken data directory
rm -rf /data/381/k8s/lib/kubelet/pods/d8a23fa5-98ea-11ea-b346-52540018083f_bak02
# Start docker again
service docker start
# Check docker status
service docker status
# Check kubelet status
systemctl status kubelet
# Start kubelet
systemctl start kubelet
# List the pods in the monitoring namespace
k get pods -n monitoring
# Restart the Prometheus pods
k delete pod prometheus-k8s-0 -n monitoring
k delete pod prometheus-operator-6dfc87c946-5p7mp -n monitoring
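The unmount-then-delete part of the steps above could be wrapped in a small script. This is only a sketch under stated assumptions: the function names (run_or_print, list_mounts_under, cleanup_pod_dir) and the DRY_RUN variable are invented for illustration, paths are assumed to contain no whitespace, and stopping or starting docker and kubelet is left out. By default it only prints the commands it would run, so the plan can be reviewed first.

```shell
# run_or_print CMD...: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# list_mounts_under DIR [MOUNTS_FILE]: mount targets at or below DIR,
# deepest first, so nested mounts are unmounted before their parents.
list_mounts_under() {
    lmu_dir="${1%/}"
    [ -n "$lmu_dir" ] || return 2     # refuse an empty path or "/"
    awk -v d="$lmu_dir" '$2 == d || index($2, d "/") == 1 { print $2 }' \
        "${2:-/proc/mounts}" | LC_ALL=C sort -r
}

# cleanup_pod_dir DIR [MOUNTS_FILE]: unmount everything under DIR, then
# remove DIR itself. Assumes paths contain no whitespace.
cleanup_pod_dir() {
    cpd_dir="${1%/}"
    if [ -z "$cpd_dir" ]; then
        echo "usage: cleanup_pod_dir DIR [MOUNTS_FILE]" >&2
        return 2
    fi
    for cpd_mount in $(list_mounts_under "$cpd_dir" "${2:-/proc/mounts}"); do
        run_or_print umount "$cpd_mount" || return 1
    done
    run_or_print rm -rf "$cpd_dir"
}
```

Calling cleanup_pod_dir with the pod directory prints the planned umount and rm commands; setting DRY_RUN=0 before the call executes them instead. Passing the full path matters here for the same reason it matters for umount itself.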

Note: always pass the full path to umount.
If umount fails, and still fails when it is run from outside the target directory, the fix is:

Use lsof -w +D xxx to find every related process and kill them all.
Once they have all been killed, umount succeeds.
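kill -9 gives a process no chance to exit cleanly. A gentler variant of the kill step is to send SIGTERM first and fall back to SIGKILL only for processes that survive. The sketch below is one way to do that; stop_pids and GRACE_SECONDS are illustrative names, not existing commands.

```shell
# stop_pids PID...: send SIGTERM to each PID, wait GRACE_SECONDS
# (default 2), then send SIGKILL to any process that is still alive.
stop_pids() {
    for sp_pid in "$@"; do
        kill -TERM "$sp_pid" 2>/dev/null || true
    done
    sleep "${GRACE_SECONDS:-2}"
    for sp_pid in "$@"; do
        if kill -0 "$sp_pid" 2>/dev/null; then
            kill -KILL "$sp_pid" 2>/dev/null || true
        fi
    done
    return 0
}
```

Possible usage, relying on lsof -t to print PIDs only:
stop_pids $(lsof -w -t +D /full/path/to/the/directory)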
While running these steps, it also helps to follow the kubelet log with journalctl -f -u kubelet.
After all the steps above, the pod is in the Running state.
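The final check can be made less manual by filtering the pod list for anything that is not Running. A minimal sketch, assuming the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE); pods_not_running is a hypothetical helper name.

```shell
# pods_not_running: read `kubectl get pods` output on stdin and print the
# name of every pod whose STATUS column is not "Running".
# Note: pods in the Completed state are reported as well.
pods_not_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}
```

Possible usage: kubectl get pods -n monitoring | pods_not_running
Empty output means every pod in the namespace reports Running.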

[root@dljycs-rancher1 categraf-v0.4.3-linux-amd64]# ./categraf --test --inputs prometheus
2025/03/10 15:00:18 main.go:150: I! runner.binarydir: /data/categraf-v0.4.3-linux-amd64
2025/03/10 15:00:18 main.go:151: I! runner.hostname: dljycs-rancher1
2025/03/10 15:00:18 main.go:152: I! runner.fd_limits: (soft=65536, hard=65536)
2025/03/10 15:00:18 main.go:153: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2025/03/10 15:00:18 provider_manager.go:60: I! use input provider: [local]
2025/03/10 15:00:18 ibex_agent.go:19: I! ibex agent disabled!
2025/03/10 15:00:18 agent.go:38: I! agent starting
2025/03/10 15:00:18 agent.go:46: I! [*agent.MetricsAgent] started
2025/03/10 15:00:18 prometheus_agent.go:27: I! prometheus scraping started!
2025/03/10 15:00:18 agent.go:46: I! [*agent.PrometheusAgent] started
2025/03/10 15:00:18 agent.go:49: I! agent started
ts=2025-03-10T07:00:18.831Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=127.0.0.1:0
ts=2025-03-10T07:00:18.831Z caller=prometheus.go:843 level=info msg="Starting WAL storage ..."
ts=2025-03-10T07:00:18.831Z caller=dir_locker.go:77 level=warn msg="A lockfile from a previous execution already existed. It was replaced" file=/data/categraf-v0.4.3-linux-amd64/data-agent/lock
ts=2025-03-10T07:00:18.831Z caller=prometheus.go:726 level=info msg="Stopping scrape discovery manager..."
ts=2025-03-10T07:00:18.831Z caller=prometheus.go:740 level=info msg="Stopping notify discovery manager..."
ts=2025-03-10T07:00:18.831Z caller=prometheus.go:762 level=info msg="Stopping scrape manager..."
ts=2025-03-10T07:00:18.831Z caller=prometheus.go:736 level=info msg="Notify discovery manager stopped"
ts=2025-03-10T07:00:18.832Z caller=notifier.go:608 level=info component=notifier msg="Stopping notification manager..."
ts=2025-03-10T07:00:18.832Z caller=prometheus.go:918 level=info msg="Notifier manager stopped"
ts=2025-03-10T07:00:18.832Z caller=prometheus.go:722 level=info msg="Scrape discovery manager stopped"
ts=2025-03-10T07:00:18.832Z caller=tls_config.go:313 level=info component=web msg="Listening on" address=127.0.0.1:35523
ts=2025-03-10T07:00:18.832Z caller=tls_config.go:316 level=info component=web msg="TLS is disabled." http2=false address=127.0.0.1:35523
ts=2025-03-10T07:00:18.832Z caller=prometheus.go:756 level=info msg="Scrape manager stopped"
ts=2025-03-10T07:00:18.832Z caller=prometheus.go:928 level=error err="opening storage failed lock DB directory: resource temporarily unavailable"