K8s Taint-Based Eviction Explained: Source Code Analysis

This article analyzes how Kubernetes' NodeLifecycleController works, walking through startNodeLifecycleController, NewNodeLifecycleController, and NodeLifecycleController.Run. It focuses on the taintManager's processing flow — worker handling and the nodeUpdate/podUpdate logic — and on how NoExecute taints are applied and pods evicted based on node and pod state. It also covers node classification, node status handling, zone/cluster health monitoring, and the rate-limiting strategies involved.


Code version: 1.17.4

1. startNodeLifecycleController

As the code below shows, startNodeLifecycleController boils down to two steps:

  • NewNodeLifecycleController

  • NodeLifecycleController.Run

func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
  lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
    ctx.InformerFactory.Coordination().V1().Leases(),
    ctx.InformerFactory.Core().V1().Pods(),
    ctx.InformerFactory.Core().V1().Nodes(),
    ctx.InformerFactory.Apps().V1().DaemonSets(),
    // node lifecycle controller uses existing cluster role from node-controller
    ctx.ClientBuilder.ClientOrDie("node-controller"),
    
    // the node-monitor-period flag
    ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,

    // the node-startup-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,

    // the node-monitor-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,

    // the pod-eviction-timeout flag
    ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,

    // the node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,

    // the secondary-node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,

    // the large-cluster-size-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,

    // the unhealthy-zone-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,

    // the enable-taint-manager flag (true by default)
    ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,

    // whether --feature-gates=TaintBasedEvictions=true is set (true by default)
    utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
  )
  if err != nil {
    return nil, true, err
  }
  go lifecycleController.Run(ctx.Stop)
  return nil, true, nil
}
​

The flags in detail:

  • enable-taint-manager: defaults to true. Enables NoExecute taint handling; pods that do not tolerate a node's NoExecute taints will be evicted.

  • large-cluster-size-threshold: defaults to 50. The threshold for deciding whether the cluster (per zone) counts as large. When the node count is at or below this value, --secondary-node-eviction-rate is forced to 0.

  • secondary-node-eviction-rate: defaults to 0.01. When a zone is unhealthy, how many nodes per second have their pods evicted. This secondary rate deliberately slows eviction when too many nodes in the cluster are already down.

  • node-eviction-rate: float32, defaults to 0.1. The eviction rate, implemented with a token-bucket rate limiter. Note that this is the rate of draining nodes, not of evicting individual pods: 0.1 means roughly one node is processed every 10 seconds (see the sketch after this list).

  • node-monitor-grace-period: duration, defaults to 40s. How long a node may go without reporting status before it is considered unhealthy.

  • node-startup-grace-period: duration, defaults to 1m. How long a freshly started node may be unresponsive before it is considered unhealthy.

  • pod-eviction-timeout: duration, defaults to 5m. How long after a node becomes unhealthy the pods on it are deleted (only takes effect when the taint manager is disabled).

  • unhealthy-zone-threshold: float32, defaults to 0.55. The fraction of unhealthy nodes at which the whole zone is considered unhealthy.
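To make the token-bucket semantics behind node-eviction-rate concrete, here is a minimal, self-contained sketch using client-go's flowcontrol package, which is what the node lifecycle controller feeds into its RateLimitedTimedQueue (the burst value of 1 mirrors scheduler.EvictionRateLimiterBurst; the printed timings are approximate):

package main

import (
  "fmt"
  "time"

  "k8s.io/client-go/util/flowcontrol"
)

func main() {
  // qps=0.1 is the default --node-eviction-rate; burst=1 means at most one
  // node can be admitted immediately, then one more every ~10 seconds.
  limiter := flowcontrol.NewTokenBucketRateLimiter(0.1, 1)

  for i := 0; i < 3; i++ {
    start := time.Now()
    limiter.Accept() // blocks until a token is available
    fmt.Printf("node %d admitted after %v\n", i, time.Since(start).Round(time.Second))
  }
  // prints roughly: node 0 after 0s, node 1 after ~10s, node 2 after ~10s
}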

2. NewNodeLifecycleController

2.1 The Controller struct

// Controller is the controller that manages node's life cycle.
type Controller struct {
  // taintManager watches nodes' Taint/Toleration changes and evicts pods
  taintManager *scheduler.NoExecuteTaintManager
  
  // watches pods
  podLister         corelisters.PodLister
  podInformerSynced cache.InformerSynced
  kubeClient        clientset.Interface
​
  // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
  // to avoid the problem with time skew across the cluster.
  now func() metav1.Time
  
  // returns the secondary-node-eviction-rate value: for a large cluster it
  // returns secondary-node-eviction-rate, otherwise 0
  enterPartialDisruptionFunc func(nodeNum int) float32

  // returns evictionLimiterQPS (the node-eviction-rate value)
  enterFullDisruptionFunc    func(nodeNum int) float32

  // returns how many nodes in the cluster are NotReady, plus a ZoneState value
  // saying whether the zone is healthy, based on the unhealthyZoneThreshold parameter
  computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
  
  // map of known nodes
  knownNodeSet map[string]*v1.Node
  
  // per Node map storing last observed health together with a local time when it was observed.
  nodeHealthMap *nodeHealthMap
  
  
  // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
  // TODO(#83954): API calls shouldn't be executed under the lock.
  evictorLock     sync.Mutex
  
  // records whether pods on each node have been evicted; the states read from
  // here are unmarked, toBeEvicted and evicted
  nodeEvictionMap *nodeEvictionMap

  // workers that evict pods from unresponsive nodes:
  // per-zone queues of nodes whose pods need evicting
  zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
  
  // workers that are responsible for tainting nodes:
  // per-zone token-bucket queues of not-ready nodes whose taints need updating
  zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue
  
  // nodes to retry
  nodesToRetry sync.Map
  
  // health state of each zone: one of stateFullDisruption, statePartialDisruption, stateNormal or stateInitial
  zoneStates map[string]ZoneState
  
  // watches daemonsets
  daemonSetStore          appsv1listers.DaemonSetLister
  daemonSetInformerSynced cache.InformerSynced
  
  // watches leases and nodes
  leaseLister         coordlisters.LeaseLister
  leaseInformerSynced cache.InformerSynced
  nodeLister          corelisters.NodeLister
  nodeInformerSynced  cache.InformerSynced
  
  getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)
​
  recorder record.EventRecorder
  
  // the pair of parameters mentioned earlier
  // Value controlling Controller monitoring period, i.e. how often does Controller
  // check node health signal posted from kubelet. This value should be lower than
  // nodeMonitorGracePeriod.
  // TODO: Change node health monitor to watch based.
  nodeMonitorPeriod time.Duration
  
  // When node is just created, e.g. cluster bootstrap or node creation, we give
  // a longer grace period.
  nodeStartupGracePeriod time.Duration
​
  // Controller will not proactively sync node health, but will monitor node
  // health signal updated from kubelet. There are 2 kinds of node healthiness
  // signals: NodeStatus and NodeLease. NodeLease signal is generated only when
  // NodeLease feature is enabled. If it doesn't receive update for this amount
  // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
  // time before which Controller start evicting pods is controlled via flag
  // 'pod-eviction-timeout'.
  // Note: be cautious when changing the constant, it must work with
  // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
  // controller. The node health signal update frequency is the minimal of the
  // two.
  // There are several constraints:
  // 1. nodeMonitorGracePeriod must be N times more than  the node health signal
  //    update frequency, where N means number of retries allowed for kubelet to
  //    post node status/lease. It is pointless to make nodeMonitorGracePeriod
  //    be less than the node health signal update frequency, since there will
  //    only be fresh values from Kubelet at an interval of node health signal
  //    update frequency. The constant must be less than podEvictionTimeout.
  // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
  //    value takes longer for user to see up-to-date node health.
  nodeMonitorGracePeriod time.Duration
​
  podEvictionTimeout          time.Duration
  evictionLimiterQPS          float32
  secondaryEvictionLimiterQPS float32
  largeClusterThreshold       int32
  unhealthyZoneThreshold      float32
​
  // if set to true Controller will start TaintManager that will evict Pods from
  // tainted nodes, if they're not tolerated.
  runTaintManager bool
​
  // if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
  // taints instead of evicting Pods itself.
  useTaintBasedEvictions bool
  
  // node and pod work queues
  nodeUpdateQueue workqueue.Interface
  podUpdateQueue  workqueue.RateLimitingInterface
}
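The per-zone maps above (zonePodEvictor, zoneNoExecuteTainter, zoneStates) are keyed by a zone string derived from each node's failure-domain labels. A minimal sketch of how that key is built, modeled on pkg/util/node.GetZoneKey in 1.17 (the name getZoneKey here is illustrative):

// assumes: import v1 "k8s.io/api/core/v1"

// getZoneKey derives the zone key used by the controller's per-zone maps
// from a node's failure-domain labels.
func getZoneKey(node *v1.Node) string {
  region := node.Labels["failure-domain.beta.kubernetes.io/region"]
  zone := node.Labels["failure-domain.beta.kubernetes.io/zone"]
  if region == "" && zone == "" {
    return "" // nodes with neither label all land in the "" zone
  }
  // the non-printable separator keeps region/zone pairs unambiguous
  return region + ":\x00:" + zone
}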

2.2 NewNodeLifecycleController

The core logic:

(1) Initialize the Controller from the parameters.

(2) Register the pod event handlers. Every pod event first goes through nc.podUpdated; when enable-taint-manager=true it is additionally handled by nc.taintManager.PodUpdated (see the sketch after the code below).

(3) Build the helper that returns all pods assigned to a given node.

(4) When enable-taint-manager=true, every node change also goes through nc.taintManager.NodeUpdated.

(5) Register the node event handlers; these are registered whether or not the taint manager is enabled.

(6) Wire up the node, daemonset and lease listers used to fetch objects.

// NewNodeLifecycleController returns a new taint controller.
func NewNodeLifecycleController(
  leaseInformer coordinformers.LeaseInformer,
  podInformer coreinformers.PodInformer,
  nodeInformer coreinformers.NodeInformer,
  daemonSetInformer appsv1informers.DaemonSetInformer,
  kubeClient clientset.Interface,
  nodeMonitorPeriod time.Duration,
  nodeStartupGracePeriod time.Duration,
  nodeMonitorGracePeriod time.Duration,
  podEvictionTimeout time.Duration,
  evictionLimiterQPS float32,
  secondaryEvictionLimiterQPS float32,
  largeClusterThreshold int32,
  unhealthyZoneThreshold float32,
  runTaintManager bool,
  useTaintBasedEvictions bool,
) (*Controller, error) {
​
  // 1. initialize the Controller from the parameters
  nc := &Controller{
    // code omitted
    ....
  }
  
  if useTaintBasedEvictions {
    klog.Infof("Controller is using taint based evictions.")
  }
  nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
  nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
  nc.computeZoneStateFunc = nc.ComputeZoneState

  // ... the rest of the function (pod/node event-handler registration and the
  // node, daemonset and lease listers) is omitted; see the sketches below
}
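For reference, the three helpers assigned above are small; in 1.17 they look roughly like this (lightly trimmed):

// ReducedQPSFunc: for a large cluster, evict at the secondary (slower) rate;
// for a small one, stop evictions altogether.
func (nc *Controller) ReducedQPSFunc(nodeNum int) float32 {
  if int32(nodeNum) > nc.largeClusterThreshold {
    return nc.secondaryEvictionLimiterQPS
  }
  return 0
}

// HealthyQPSFunc: the default eviction rate (node-eviction-rate).
func (nc *Controller) HealthyQPSFunc(nodeNum int) float32 {
  return nc.evictionLimiterQPS
}

// ComputeZoneState: the zone is fullDisruption when no node is Ready,
// partialDisruption when more than 2 nodes are NotReady and their fraction
// reaches unhealthyZoneThreshold, and normal otherwise.
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
  readyNodes := 0
  notReadyNodes := 0
  for _, readyCondition := range nodeReadyConditions {
    if readyCondition != nil && readyCondition.Status == v1.ConditionTrue {
      readyNodes++
    } else {
      notReadyNodes++
    }
  }
  switch {
  case readyNodes == 0 && notReadyNodes > 0:
    return notReadyNodes, stateFullDisruption
  case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
    return notReadyNodes, statePartialDisruption
  default:
    return notReadyNodes, stateNormal
  }
}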
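And a simplified sketch of the event-handler wiring that the omitted part of NewNodeLifecycleController performs for steps (2) and (4). Only the Update handlers are shown; Add/Delete are analogous, DeletedFinalStateUnknown handling is dropped, and it assumes the imports cache "k8s.io/client-go/tools/cache" and nodeutil "k8s.io/kubernetes/pkg/controller/util/node":

podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
  UpdateFunc: func(prev, obj interface{}) {
    prevPod := prev.(*v1.Pod)
    newPod := obj.(*v1.Pod)
    // always enqueue into nc.podUpdateQueue ...
    nc.podUpdated(prevPod, newPod)
    // ... and, when the taint manager is running, notify it as well
    if nc.taintManager != nil {
      nc.taintManager.PodUpdated(prevPod, newPod)
    }
  },
})

if nc.runTaintManager {
  nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
      nc.taintManager.NodeUpdated(oldNode, newNode)
      return nil
    }),
  })
}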