Kubernetes master node outage recovery notes: MountVolume.SetUp failed for volume "kube-dns-config"

This morning I found that a Kubernetes cluster that had been working fine would no longer function: the dashboard would not open, and docker ps on the master node showed no running containers. Restarting kubelet brought things back briefly, but the cluster soon became unusable again. After repeatedly restarting and watching what happened, I found that etcd kept restarting and eventually failing, which then took the other components down with it.
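For reference, the etcd log below was pulled roughly like this (a sketch only; it assumes kubelet runs under systemd with Docker as the container runtime, and <etcd-container-id> stands for whatever ID docker ps -a reports for the etcd container):

# journalctl -u kubelet -n 200 --no-pager
# docker ps -a | grep etcd
# docker logs --tail=100 <etcd-container-id>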

The etcd log looked like this:

2018-02-06 02:25:24.564269 I | etcdmain: etcd Version: 3.0.17
2018-02-06 02:25:24.564531 I | etcdmain: Git SHA: cc198e2
2018-02-06 02:25:24.564544 I | etcdmain: Go Version: go1.6.4
2018-02-06 02:25:24.564554 I | etcdmain: Go OS/Arch: linux/amd64
2018-02-06 02:25:24.564563 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2018-02-06 02:25:24.564636 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-02-06 02:25:24.565021 I | etcdmain: listening for peers on http://localhost:2380
2018-02-06 02:25:24.565114 I | etcdmain: listening for client requests on 127.0.0.1:2379
2018-02-06 02:25:24.568360 I | etcdserver: recovered store from snapshot at index 5900621
2018-02-06 02:25:24.568399 I | etcdserver: name = default
2018-02-06 02:25:24.568406 I | etcdserver: data dir = /var/lib/etcd
2018-02-06 02:25:24.568413 I | etcdserver: member dir = /var/lib/etcd/member
2018-02-06 02:25:24.568418 I | etcdserver: heartbeat = 100ms
2018-02-06 02:25:24.568423 I | etcdserver: election = 1000ms
2018-02-06 02:25:24.568428 I | etcdserver: snapshot count = 10000
2018-02-06 02:25:24.568464 I | etcdserver: advertise client URLs = http://127.0.0.1:2379
2018-02-06 02:25:24.760641 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 5904480
2018-02-06 02:25:24.760850 I | raft: 8e9e05c52164694d became follower at term 12
2018-02-06 02:25:24.760904 I | raft: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 5904480, applied: 5900621, lastindex: 5904480, lastterm: 12]
2018-02-06 02:25:24.761062 I | api: enabled capabilities for version 3.0
2018-02-06 02:25:24.761111 I | membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2018-02-06 02:25:24.761125 I | membership: set the cluster version to 3.0 from store
2018-02-06 02:25:24.829180 I | mvcc: restore compact to 2314830
2018-02-06 02:25:24.975979 I | etcdserver: starting server... [version: 3.0.17, cluster version: 3.0]
2018-02-06 02:25:25.661519 I | raft: 8e9e05c52164694d is starting a new election at term 12
2018-02-06 02:25:25.661581 I | raft: 8e9e05c52164694d became candidate at term 13
2018-02-06 02:25:25.661596 I | raft: 8e9e05c52164694d received vote from 8e9e05c52164694d at term 13
2018-02-06 02:25:25.661620 I | raft: 8e9e05c52164694d became leader at term 13
2018-02-06 02:25:25.661645 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 13
2018-02-06 02:25:25.663336 I | etcdserver: published {Name:default ClientURLs:[http://127.0.0.1:2379]} to cluster cdf818194e3a8c32
2018-02-06 02:25:25.663379 I | etcdmain: ready to serve client requests
2018-02-06 02:25:25.663797 N | etcdmain: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2018-02-06 02:25:37.403730 W | etcdserver: apply entries took too long [25.315155ms for 1 entries]
2018-02-06 02:25:37.403752 W | etcdserver: avoid queries with large range/delete range!
2018-02-06 02:25:49.930163 N | osutil: received terminated signal, shutting down...

I went through a lot of posts on this; most blamed the etcd version, but the cluster had been running fine on this exact version, so that clearly wasn't the problem.


After quite a bit of digging, the real cause turned out to be that the master node was running out of disk space: usage on the / filesystem had hit 92%.
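For reference, the disk problem is easy to confirm with df, and du helps find the biggest offenders (the paths below are only illustrative examples of common space hogs):

# df -h /
# du -sh /var/log /var/lib/docker 2>/dev/null | sort -h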


After deleting some files, disk usage dropped to 70%. I restarted kubelet and etcd stopped restarting and exiting, but kube-dns still would not come up.
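A quick sanity check that etcd has settled down after the cleanup (assuming Docker as the runtime; etcdctl only works here if it happens to be installed on the host, talking to the 127.0.0.1:2379 endpoint shown in the log above):

# docker ps | grep etcd
# etcdctl --endpoints=http://127.0.0.1:2379 cluster-health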


The kube-dns pod kept reporting the following errors:
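Event listings like the one below typically come from describing the pod; the pod name is taken from the kubelet errors and the namespace is kube-system:

# kubectl describe pod kube-dns-2425271678-bgzzp -n kube-system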

Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----				-------------	--------	------			-------
  1h		35m		36	kubelet, inm-bj-vip-ms04			Warning		FailedMount		Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
  1h		35m		36	kubelet, inm-bj-vip-ms04			Warning		FailedSync		Error syncing pod
  1h		34m		48	kubelet, inm-bj-vip-ms04			Warning		FailedMount		MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
  30m		30m		1	kubelet, inm-bj-vip-ms04			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc" 
  28m		10m		9	kubelet, inm-bj-vip-ms04			Warning		FailedMount		Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
  28m		10m		9	kubelet, inm-bj-vip-ms04			Warning		FailedSync		Error syncing pod
  30m		10m		18	kubelet, inm-bj-vip-ms04			Warning		FailedMount		MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
  10m		10m		1	kubelet, inm-bj-vip-ms04			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc" 
  10m		9m		6	kubelet, inm-bj-vip-ms04			Warning		FailedMount		MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
  9m		9m		1	kubelet, inm-bj-vip-ms04			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc" 
  7m		5m		2	kubelet, inm-bj-vip-ms04			Warning		FailedMount		Unable to mount volumes for pod "kube-dns-2425271678-bgzzp_kube-system(d7381d67-fb52-11e7-a2f6-0050568b6632)": timeout expired waiting for volumes to attach/mount for pod "kube-system"/"kube-dns-2425271678-bgzzp". list of unattached/unmounted volumes=[kube-dns-config]
  7m		5m		2	kubelet, inm-bj-vip-ms04			Warning		FailedSync		Error syncing pod
  9m		3m		11	kubelet, inm-bj-vip-ms04			Warning		FailedMount		MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory
  1m		1m		1	kubelet, inm-bj-vip-ms04			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "kube-dns-token-3hdnc" 
  1m		50s		8	kubelet, inm-bj-vip-ms04			Warning		FailedMount		MountVolume.SetUp failed for volume "kube-dns-config" : stat /var/lib/kubelet/pods/d7381d67-fb52-11e7-a2f6-0050568b6632/volumes/kubernetes.io~configmap/kube-dns-config: no such file or directory


It looked like a volume mount problem. I tried reinstalling the network add-on, but it made no difference. ^~^|||


In the end I went for the big hammer: stop kubelet and Docker, delete all containers, go into /var/lib/kubelet/pods and move the kube-dns pod directory out of the way (the pod UID is in the error log, here d7381d67-fb52-11e7-a2f6-0050568b6632), then restart Docker and kubelet. The DNS service came back!!!


The exact commands used to restore DNS, in order:

# systemctl stop kubelet
# docker kill $(docker ps -a -q)
# docker rm $(docker ps -a -q)
# systemctl stop docker
# cd /var/lib/kubelet/pods
# mv d7381d67-fb52-11e7-a2f6-0050568b6632 /tmp
# systemctl start docker
# systemctl start kubelet
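After the restart, a quick check that kube-dns actually reaches Running:

# kubectl get pods -n kube-system | grep kube-dns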
