K8s Taint-Based Eviction Explained: Source Code Analysis

This article analyzes how Kubernetes' NodeLifecycleController works, walking through startNodeLifecycleController, NewNodeLifecycleController, and NodeLifecycleController.Run. It focuses on the taintManager's processing flow — worker handling and the nodeUpdate/podUpdate logic — and on how NoExecute taints are applied and pods evicted based on node and pod state. It also covers node classification, node status handling, zone/cluster health monitoring, and the rate-limiting strategies involved.


Code version: 1.17.4

1. startNodeLifecycleController

As the code below shows, startNodeLifecycleController boils down to two steps:

  • NewNodeLifecycleController

  • NodeLifecycleController.Run

func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
  lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
    ctx.InformerFactory.Coordination().V1().Leases(),
    ctx.InformerFactory.Core().V1().Pods(),
    ctx.InformerFactory.Core().V1().Nodes(),
    ctx.InformerFactory.Apps().V1().DaemonSets(),
    // node lifecycle controller uses existing cluster role from node-controller
    ctx.ClientBuilder.ClientOrDie("node-controller"),
    
    // the node-monitor-period flag
    ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,

    // the node-startup-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,

    // the node-monitor-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,

    // the pod-eviction-timeout flag
    ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,

    // the node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,

    // the secondary-node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,

    // the large-cluster-size-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,

    // the unhealthy-zone-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,

    // the enable-taint-manager flag (true by default)
    ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,

    // whether --feature-gates=TaintBasedEvictions=true is set (true by default)
    utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
  )
  if err != nil {
    return nil, true, err
  }
  go lifecycleController.Run(ctx.Stop)
  return nil, true, nil
}
​

The flags in detail:

  • enable-taint-manager: defaults to true. Enables NoExecute taint handling; pods that do not tolerate a node's NoExecute taints will be evicted.

  • large-cluster-size-threshold: defaults to 50. The threshold for deciding whether the cluster (per zone) counts as large. When the node count is at or below this value, --secondary-node-eviction-rate is forced to 0.

  • secondary-node-eviction-rate: defaults to 0.01. When a zone is unhealthy, how many nodes per second have their pods evicted. This secondary rate deliberately slows eviction when too many nodes in the cluster are already down.

  • node-eviction-rate: float32, defaults to 0.1. The eviction rate, implemented with a token-bucket rate limiter. Note that this is the rate of draining nodes, not of evicting individual pods: 0.1 means roughly one node is processed every 10 seconds (see the sketch after this list).

  • node-monitor-grace-period: duration, defaults to 40s. How long a node may go without reporting status before it is considered unhealthy.

  • node-startup-grace-period: duration, defaults to 1m. How long a freshly started node may be unresponsive before it is considered unhealthy.

  • pod-eviction-timeout: duration, defaults to 5m. How long after a node becomes unhealthy the pods on it are deleted (only takes effect when the taint manager is disabled).

  • unhealthy-zone-threshold: float32, defaults to 0.55. The fraction of unhealthy nodes at which the whole zone is considered unhealthy.
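To make the token-bucket semantics behind node-eviction-rate concrete, here is a minimal, self-contained sketch using client-go's flowcontrol package, which is what the node lifecycle controller feeds into its RateLimitedTimedQueue (the burst value of 1 mirrors scheduler.EvictionRateLimiterBurst; the printed timings are approximate):

package main

import (
  "fmt"
  "time"

  "k8s.io/client-go/util/flowcontrol"
)

func main() {
  // qps=0.1 is the default --node-eviction-rate; burst=1 means at most one
  // node can be admitted immediately, then one more every ~10 seconds.
  limiter := flowcontrol.NewTokenBucketRateLimiter(0.1, 1)

  for i := 0; i < 3; i++ {
    start := time.Now()
    limiter.Accept() // blocks until a token is available
    fmt.Printf("node %d admitted after %v\n", i, time.Since(start).Round(time.Second))
  }
  // prints roughly: node 0 after 0s, node 1 after ~10s, node 2 after ~10s
}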

2. NewNodeLifecycleController

2.1 The Controller struct

// Controller is the controller that manages node's life cycle.
type Controller struct {
  // taintManager watches nodes' Taint/Toleration changes and evicts pods
  taintManager *scheduler.NoExecuteTaintManager
  
  // watches pods
  podLister         corelisters.PodLister
  podInformerSynced cache.InformerSynced
  kubeClient        clientset.Interface
​
  // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
  // to avoid the problem with time skew across the cluster.
  now func() metav1.Time
  
  // returns the secondary-node-eviction-rate value: for a large cluster it
  // returns secondary-node-eviction-rate, otherwise 0
  enterPartialDisruptionFunc func(nodeNum int) float32

  // returns evictionLimiterQPS (the node-eviction-rate value)
  enterFullDisruptionFunc    func(nodeNum int) float32

  // returns how many nodes in the cluster are NotReady, plus a ZoneState value
  // saying whether the zone is healthy, based on the unhealthyZoneThreshold parameter
  computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)
  
  // map of known nodes
  knownNodeSet map[string]*v1.Node
  
  // per Node map storing last observed health together with a local time when it was observed.
  nodeHealthMap *nodeHealthMap
  
  
  // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
  // TODO(#83954): API calls shouldn't be executed under the lock.
  evictorLock     sync.Mutex
  
  // records whether pods on each node have been evicted; the states read from
  // here are unmarked, toBeEvicted and evicted
  nodeEvictionMap *nodeEvictionMap

  // workers that evict pods from unresponsive nodes:
  // per-zone queues of nodes whose pods need evicting
  zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue
  
  // workers that are responsible for tainting nodes:
  // per-zone token-bucket queues of not-ready nodes whose taints need updating
  zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue
  
  // nodes to retry
  nodesToRetry sync.Map
  
  // health state of each zone: one of stateFullDisruption, statePartialDisruption, stateNormal or stateInitial
  zoneStates map[string]ZoneState
  
  // watches daemonsets
  daemonSetStore          appsv1listers.DaemonSetLister
  daemonSetInformerSynced cache.InformerSynced
  
  // watches leases and nodes
  leaseLister         coordlisters.LeaseLister
  leaseInformerSynced cache.InformerSynced
  nodeLister          corelisters.NodeLister
  nodeInformerSynced  cache.InformerSynced
  
  getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)
​
  recorder record.EventRecorder
  
  // the pair of parameters mentioned earlier
  // Value controlling Controller monitoring period, i.e. how often does Controller
  // check node health signal posted from kubelet. This value should be lower than
  // nodeMonitorGracePeriod.
  // TODO: Change node health monitor to watch based.
  nodeMonitorPeriod time.Duration
  
  // When node is just created, e.g. cluster bootstrap or node creation, we give
  // a longer grace period.
  nodeStartupGracePeriod time.Duration
​
  // Controller will not proactively sync node health, but will monitor node
  // health signal updated from kubelet. There are 2 kinds of node healthiness
  // signals: NodeStatus and NodeLease. NodeLease signal is generated only when
  // NodeLease feature is enabled. If it doesn't receive update for this amount
  // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
  // time before which Controller start evicting pods is controlled via flag
  // 'pod-eviction-timeout'.
  // Note: be cautious when changing the constant, it must work with
  // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
  // controller. The node health signal update frequency is the minimal of the
  // two.
  // There are several constraints:
  // 1. nodeMonitorGracePeriod must be N times more than  the node health signal
  //    update frequency, where N means number of retries allowed for kubelet to
  //    post node status/lease. It is pointless to make nodeMonitorGracePeriod
  //    be less than the node health signal update frequency, since there will
  //    only be fresh values from Kubelet at an interval of node health signal
  //    update frequency. The constant must be less than podEvictionTimeout.
  // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
  //    value takes longer for user to see up-to-date node health.
  nodeMonitorGracePeriod time.Duration
​
  podEvictionTimeout          time.Duration
  evictionLimiterQPS          float32
  secondaryEvictionLimiterQPS float32
  largeClusterThreshold       int32
  unhealthyZoneThreshold      float32
​
  // if set to true Controller will start TaintManager that will evict Pods from
  // tainted nodes, if they're not tolerated.
  runTaintManager bool
​
  // if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
  // taints instead of evicting Pods itself.
  useTaintBasedEvictions bool
  
  // node and pod work queues
  nodeUpdateQueue workqueue.Interface
  podUpdateQueue  workqueue.RateLimitingInterface
}
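The per-zone maps above (zonePodEvictor, zoneNoExecuteTainter, zoneStates) are keyed by a zone string derived from each node's failure-domain labels. A minimal sketch of how that key is built, modeled on pkg/util/node.GetZoneKey in 1.17 (the name getZoneKey here is illustrative):

// assumes: import v1 "k8s.io/api/core/v1"

// getZoneKey derives the zone key used by the controller's per-zone maps
// from a node's failure-domain labels.
func getZoneKey(node *v1.Node) string {
  region := node.Labels["failure-domain.beta.kubernetes.io/region"]
  zone := node.Labels["failure-domain.beta.kubernetes.io/zone"]
  if region == "" && zone == "" {
    return "" // nodes with neither label all land in the "" zone
  }
  // the non-printable separator keeps region/zone pairs unambiguous
  return region + ":\x00:" + zone
}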

2.2 NewNodeLifecycleController

The core logic:

(1) Initialize the Controller from the parameters.

(2) Register the pod event handlers. Every pod event first goes through nc.podUpdated; when enable-taint-manager=true it is additionally handled by nc.taintManager.PodUpdated (see the sketch after the code below).

(3) Build the helper that returns all pods assigned to a given node.

(4) When enable-taint-manager=true, every node change also goes through nc.taintManager.NodeUpdated.

(5) Register the node event handlers; these are registered whether or not the taint manager is enabled.

(6) Wire up the node, daemonset and lease listers used to fetch objects.

// NewNodeLifecycleController returns a new taint controller.
func NewNodeLifecycleController(
  leaseInformer coordinformers.LeaseInformer,
  podInformer coreinformers.PodInformer,
  nodeInformer coreinformers.NodeInformer,
  daemonSetInformer appsv1informers.DaemonSetInformer,
  kubeClient clientset.Interface,
  nodeMonitorPeriod time.Duration,
  nodeStartupGracePeriod time.Duration,
  nodeMonitorGracePeriod time.Duration,
  podEvictionTimeout time.Duration,
  evictionLimiterQPS float32,
  secondaryEvictionLimiterQPS float32,
  largeClusterThreshold int32,
  unhealthyZoneThreshold float32,
  runTaintManager bool,
  useTaintBasedEvictions bool,
) (*Controller, error) {
​
  // 1. initialize the Controller from the parameters
  nc := &Controller{
    // code omitted
    ....
  }
  
  if useTaintBasedEvictions {
    klog.Infof("Controller is using taint based evictions.")
  }
  nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
  nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
  nc.computeZoneStateFunc = nc.ComputeZoneState

  // ... the rest of the function (pod/node event-handler registration and the
  // node, daemonset and lease listers) is omitted; see the sketches below
}
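For reference, the three helpers assigned above are small; in 1.17 they look roughly like this (lightly trimmed):

// ReducedQPSFunc: for a large cluster, evict at the secondary (slower) rate;
// for a small one, stop evictions altogether.
func (nc *Controller) ReducedQPSFunc(nodeNum int) float32 {
  if int32(nodeNum) > nc.largeClusterThreshold {
    return nc.secondaryEvictionLimiterQPS
  }
  return 0
}

// HealthyQPSFunc: the default eviction rate (node-eviction-rate).
func (nc *Controller) HealthyQPSFunc(nodeNum int) float32 {
  return nc.evictionLimiterQPS
}

// ComputeZoneState: the zone is fullDisruption when no node is Ready,
// partialDisruption when more than 2 nodes are NotReady and their fraction
// reaches unhealthyZoneThreshold, and normal otherwise.
func (nc *Controller) ComputeZoneState(nodeReadyConditions []*v1.NodeCondition) (int, ZoneState) {
  readyNodes := 0
  notReadyNodes := 0
  for _, readyCondition := range nodeReadyConditions {
    if readyCondition != nil && readyCondition.Status == v1.ConditionTrue {
      readyNodes++
    } else {
      notReadyNodes++
    }
  }
  switch {
  case readyNodes == 0 && notReadyNodes > 0:
    return notReadyNodes, stateFullDisruption
  case notReadyNodes > 2 && float32(notReadyNodes)/float32(notReadyNodes+readyNodes) >= nc.unhealthyZoneThreshold:
    return notReadyNodes, statePartialDisruption
  default:
    return notReadyNodes, stateNormal
  }
}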
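And a simplified sketch of the event-handler wiring that the omitted part of NewNodeLifecycleController performs for steps (2) and (4). Only the Update handlers are shown; Add/Delete are analogous, DeletedFinalStateUnknown handling is dropped, and it assumes the imports cache "k8s.io/client-go/tools/cache" and nodeutil "k8s.io/kubernetes/pkg/controller/util/node":

podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
  UpdateFunc: func(prev, obj interface{}) {
    prevPod := prev.(*v1.Pod)
    newPod := obj.(*v1.Pod)
    // always enqueue into nc.podUpdateQueue ...
    nc.podUpdated(prevPod, newPod)
    // ... and, when the taint manager is running, notify it as well
    if nc.taintManager != nil {
      nc.taintManager.PodUpdated(prevPod, newPod)
    }
  },
})

if nc.runTaintManager {
  nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {
      nc.taintManager.NodeUpdated(oldNode, newNode)
      return nil
    }),
  })
}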