Troubleshooting a calico-kube-controllers startup failure

calico-kube-controllers kept restarting because it hit a permission error when trying to write the /status/status.json file. Log analysis narrowed the problem down to that failed write. The fix was to create the directory, adjust its permissions, and re-apply the configuration, after which the service recovered.


Fault description

calico-kube-controllers is unhealthy and keeps restarting.

The log output is as follows:

2023-02-21 01:26:47.085 [INFO][1] main.go 92: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0221 01:26:47.086980       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-02-21 01:26:47.087 [INFO][1] main.go 113: Ensuring Calico datastore is initialized
2023-02-21 01:26:47.106 [INFO][1] main.go 153: Getting initial config snapshot from datastore
2023-02-21 01:26:47.120 [INFO][1] main.go 156: Got initial config snapshot
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 89: Start called
2023-02-21 01:26:47.120 [INFO][1] main.go 173: Starting status report routine
2023-02-21 01:26:47.120 [INFO][1] main.go 182: Starting Prometheus metrics server on port 9094
2023-02-21 01:26:47.120 [INFO][1] main.go 418: Starting controller ControllerType="Node"
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2023-02-21 01:26:47.120 [INFO][1] node_syncer.go 65: Node controller syncer status updated: wait-for-ready
2023-02-21 01:26:47.120 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2023-02-21 01:26:47.120 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2023-02-21 01:26:47.120 [INFO][1] node_controller.go 143: Starting Node controller
2023-02-21 01:26:47.121 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-02-21 01:26:47.121 [INFO][1] resources.go 349: Main client watcher loop
2023-02-21 01:26:47.121 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.121 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.121 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.121 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.124 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2023-02-21 01:26:47.125 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2023-02-21 01:26:47.125 [INFO][1] node_syncer.go 65: Node controller syncer status updated: resync
2023-02-21 01:26:47.125 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-02-21 01:26:47.125 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.125 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.129 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-02-21 01:26:47.129 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-02-21 01:26:47.129 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2023-02-21 01:26:47.129 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2023-02-21 01:26:47.129 [INFO][1] node_syncer.go 65: Node controller syncer status updated: in-sync
2023-02-21 01:26:47.137 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints
2023-02-21 01:26:47.221 [INFO][1] node_controller.go 159: Node controller is now running
2023-02-21 01:26:47.226 [INFO][1] ipam.go 69: Synchronizing IPAM data
2023-02-21 01:26:47.236 [INFO][1] ipam.go 78: Node and IPAM data is in sync

The problem is located here: the controller cannot write /status/status.json, so the liveness and readiness probes (/usr/bin/check-status), which read that file, keep failing and kubelet restarts the container.

Failed to write status error=open /status/status.json: permission denied

Inspect the directory inside the container

Tried to exec into the container, but the image ships without common commands such as cat and ls, so the directory could not be inspected from inside (see the sketch below for one workaround).
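
One workaround, since the node runs Docker, is to look at the container's root filesystem from the host instead of from inside it. A minimal sketch, assuming the overlay2 storage driver; replace the container ID placeholder with the real one from kubectl describe or docker ps:

# On the node hosting the pod, find the container and its merged overlay directory
docker ps | grep calico-kube-controllers
docker inspect --format '{{ .GraphDriver.Data.MergedDir }}' <container-id>
# Then inspect the /status path from the host side
ls -ld <merged-dir>/status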

Check the configuration

Compared the pod's configuration with the other clusters; it is identical, so nothing wrong there.

[grg@i-A8259010 ~]$ kubectl describe pod calico-kube-controllers-9f49b98f6-njs2f -n kube-system
Name:                 calico-kube-controllers-9f49b98f6-njs2f
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 10.254.39.2/10.254.39.2
Start Time:           Thu, 16 Feb 2023 11:14:35 +0800
Labels:               k8s-app=calico-kube-controllers
                      pod-template-hash=9f49b98f6
Annotations:          cni.projectcalico.org/podIP: 10.244.29.73/32
                      cni.projectcalico.org/podIPs: 10.244.29.73/32
Status:               Running
IP:                   10.244.29.73
IPs:
  IP:           10.244.29.73
Controlled By:  ReplicaSet/calico-kube-controllers-9f49b98f6
Containers:
  calico-kube-controllers:
    Container ID:   docker://21594e3517a3fc8ffc5224496cec373117138acf5417d9a335a1c5e80e0c3802
    Image:          registry.custom.local:12480/kubeadm-ha/calico_kube-controllers:v3.19.1
    Image ID:       docker-pullable://registry.cn-beijing.aliyuncs.com/dotbalo/kube-controllers@sha256:2ff71ba65cd7fe10e183ad80725ad3eafb59899d6f1b2610446b90c84bf2425a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 21 Feb 2023 09:34:06 +0800
      Finished:     Tue, 21 Feb 2023 09:35:15 +0800
    Ready:          False
    Restart Count:  1940
    Liveness:       exec [/usr/bin/check-status -l] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:      exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-55jbn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-55jbn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                        From     Message
  ----     ------     ----                       ----     -------
  Warning  Unhealthy  31m (x15164 over 4d22h)    kubelet  Readiness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input
  Warning  BackOff    6m23s (x23547 over 4d22h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  79s (x11571 over 4d22h)    kubelet  Liveness probe failed: Failed to read status file /status/status.json: unexpected end of JSON input

Compare the image

The image version matches the other clusters, so the image itself is not the problem (a quick way to print it for comparison is sketched after the image details).

Image:          registry.custom.local:12480/kubeadm-ha/calico_kube-controllers:v3.19.1    
Image ID:       docker-pullable://registry.cn-beijing.aliyuncs.com/dotbalo/kube-controllers@sha256:2ff71ba65cd7fe10e183ad80725ad3eafb59899d6f1b2610446b90c84bf2425a
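
For a quick side-by-side comparison across clusters, the image and image ID can also be printed directly; a sketch using the pod label shown in the describe output above:

kubectl -n kube-system get pod -l k8s-app=calico-kube-controllers \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{.status.containerStatuses[0].imageID}{"\n"}{end}'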

Check for configuration differences with the other clusters

Comparing this machine with the other clusters, the only notable difference is Docker: it was installed earlier here and is version 19, while the other machines run a freshly installed version 20.
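
For reference, the runtime versions on the affected node and a healthy node can be compared with something like:

# Run on each node; only the server (daemon) version differs in this case
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'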

Resolution

Since Docker could not be reinstalled:

Restarting the pod had no effect.

Searching Baidu turned up no relevant information.

Adjust the calico-kube-controllers configuration

The configuration file is /etc/kubernetes/plugins/network-plugin/calico-typha.yaml.

Since the /status directory cannot be written, we add a volume mapping for it (see the sketch below).
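
The original does not reproduce the manifest change itself; a minimal sketch of what the added mapping might look like in the calico-kube-controllers Deployment, assuming the standard container/pod spec layout and using the host directory created in the next step (the volume name "status" is arbitrary, and indentation must be adapted to the file):

        volumeMounts:
          # writable location for /status/status.json used by check-status
          - name: status
            mountPath: /status
      volumes:
        # host directory created and opened up by the commands below
        - name: status
          hostPath:
            path: /var/run/calico/status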

Apply the configuration

mkdir /var/run/calico/status
chmod 777 /var/run/calico/status
kubectl apply -f /etc/kubernetes/plugins/network-plugin/calico-typha.yaml

At this point, the system is back to normal.
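
As a sanity check (the label selector comes from the pod metadata shown earlier):

# The pod should become Ready and the restart count should stop growing
kubectl -n kube-system get pod -l k8s-app=calico-kube-controllers
# On the node, the readiness file should now contain valid JSON
cat /var/run/calico/status/status.json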
