Kubernetes Scheduler

Kubernetes relies on the scheduler component to find a suitable node for every pod in the cluster and keep the pod running in its desired state. During scheduling the scheduler does not modify the Pod resource; it reads the Pod's data, picks the most suitable node according to the configured policies, and then binds the Pod to that node through an API call to complete the scheduling process.


Working Logic

  1. Overview of how kubelet works
    When a user request reaches the scheduler through the APIserver, the scheduler's algorithm works out the node best suited to run the pod; the result is passed back to the APIserver and stored in Etcd. Unless the node fails or the pod is evicted (for example by OOM), the pod keeps running on that node, and even if the pod is rebuilt the scheduling result does not change. The kubelet on each node keeps watching the APIserver, and as soon as an event concerning its own node appears, it fetches the resource manifest declared on the APIserver and creates the pod: pulling or starting the local image, mounting storage volumes, and so on
  2. Overview of how kube-proxy works
    Creating a service follows the same pattern as creating a pod; the only difference is that a service is just iptables or lvs rules on the nodes, and those rules are generated by kube-proxy on each node watching the APIserver
  3. Data serialization at the APIserver
    To the APIserver every request is a client and goes through authentication and authorization; the only difference is how each client serializes data: kubectl serializes data as JSON, while communication between cluster components uses Protobuf, developed by Google
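
The watch behavior described above can be observed directly; a minimal illustration (any namespace with running pods will do, and the proxy port is arbitrary):

    # Stream pod events the way kubelet and kube-proxy stream changes from the APIserver
    [root@master-0 ~]# kubectl get pods --watch

    # Or go through a local proxy and hit the REST watch endpoint; the response is a
    # stream of JSON-serialized watch events
    [root@master-0 ~]# kubectl proxy --port=8001 &
    [root@master-0 ~]# curl "http://127.0.0.1:8001/api/v1/namespaces/default/pods?watch=true"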

Scheduler Algorithm

Kubernetes ships with a default scheduler that fits the Pod scheduling needs of most scenarios. It can combine built-in and customizable policies to pick the single node in the cluster best suited to run the current Pod, with the core goal of distributing Pods fairly across nodes based on resource availability. The default scheduler, also called the generic scheduler, completes scheduling in three steps: node predicates (Predicate), node priority ranking (Priority), and node selection (Select)


Predicate

A container can be constrained along two dimensions: the first is the baseline resource request, which must be satisfied for the container to run; the second is the resource limit, beyond which no more memory is allocated, while the container itself reports its current usage. Nodes that cannot satisfy the baseline request are eliminated in the Predicate phase, as are nodes in situations such as a container needing to listen on a host port that is already occupied on the node. In short, this step removes from all nodes those that cannot possibly meet the pod's basic requirements; predicate policies follow a one-vote-veto mechanism
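
The two dimensions above map to resources.requests and resources.limits in the container spec; a minimal sketch (the pod name and the values are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: resource-demo               # hypothetical pod name
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
        resources:
          requests:                     # baseline the scheduler must find free on a node (PodFitsResources)
            cpu: "250m"
            memory: "128Mi"
          limits:                       # hard ceiling enforced at runtime
            cpu: "500m"
            memory: "256Mi"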

Of the predicate policies supported in kubernetes 1.10, only a subset is enabled by default in the scheduler; to make the other policies take effect they must be enabled at deployment time or added to the configuration later

Common predicate policies
  1. CheckNodeCondition: checks whether the pod may be scheduled onto a node whose reported disk or network condition is unavailable or not ready; enabled by default
  2. GeneralPredicates: a policy subset, enabled by default, containing several predicates:
    • HostName: checks pod.spec.hostname; if the pod defines a hostname, checks whether another pod on the node already occupies that name
    • PodFitsHostPorts: checks pod.spec.containers.ports.hostPort; if a container defines a hostPort, checks whether another pod on the node already occupies that port
    • MatchNodeSelector: checks whether the node carries the labels required by the pod's node selector
    • PodFitsResources: checks whether the node can satisfy the pod's resource requests, as listed under Allocated resources in describe node (see the example after this list)
  3. NoDiskConflict: checks that there is no disk conflict, i.e. whether the node can satisfy the pod's storage volume requirements; not enabled by default
  4. PodToleratesNodeTaints: checks whether the pod's pod.spec.tolerations cover the node's taints; enabled by default
  5. PodToleratesNodeNoExecuteTaints: checks whether the pod's pod.spec.tolerations cover the node's NoExecute taints; not enabled by default
  6. CheckNodeLabelPresence: checks for the presence of given node labels; not enabled by default
  7. CheckServiceAffinity: decides whether to schedule the pod onto a node based on whether other pods of the Service it belongs to already run there; not enabled by default
  8. Three volume-count predicates for cloud-provider disks (AWS EBS, GCE PD, Azure Disk), enabled by default:
    • MaxEBSVolumeCount
    • MaxGCEPDVolumeCount
    • MaxAzureDiskVolumeCount
  9. CheckVolumeBinding: checks whether the node's bound and unbound PVCs can satisfy the pod's storage volume needs; enabled by default
  10. NoVolumeZoneConflict: checks, within the current zone, whether the node's storage volumes conflict with the pod object; enabled by default
  11. CheckNodeMemoryPressure: checks whether the node is under memory pressure; enabled by default
  12. CheckNodePIDPressure: checks whether the node is under excessive PID pressure; enabled by default
  13. CheckNodeDiskPressure: checks whether the node is under excessive disk I/O pressure; enabled by default
  14. MatchInterPodAffinity: checks whether the node satisfies the pod's affinity or anti-affinity requirements; enabled by default
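
As noted for PodFitsResources above, a node's capacity and the resources already requested on it can be inspected with kubectl describe node (output omitted; the node name is from this environment):

    [root@master-0 ~]# kubectl describe node slave-0.shared
    # the Capacity, Allocatable and "Allocated resources" sections show what PodFitsResources compares against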

Priority

After the predicate phase filters the nodes and produces a candidate list, the second phase, priority ranking, begins. In this phase the scheduler runs a series of priority functions against every node that passed the predicates to compute a priority score between 0 and 10, where 0 means unsuitable and 10 means best suited to host the Pod object


Common priority functions
  1. LeastRequested: scores by the ratio of a node's free resources to its total capacity; a higher score means more spare capacity and a better fit. Its algorithm is:

    (cpu((capacity - sum(requested)) * 10 / capacity) +
     memory((capacity - sum(requested)) * 10 / capacity)) / 2


    Each term is multiplied by 10 because every priority function scores out of 10; the CPU and memory scores are then added and the sum divided by 2 because two dimensions are involved. For example, a node with 4 CPU cores of which 2 are requested and 8Gi of memory of which 4Gi is requested scores ((4-2)*10/4 + (8-4)*10/8)/2 = 5

  2. BalancedResourceAllocation: the closer the CPU and memory utilization ratios on a node are to each other, the higher the score; it needs to be used together with LeastRequested to evaluate the node's resource usage

  3. NodePreferAvoidPods: this priority function has a default weight of 10000; it scores a node according to whether the node carries the annotation scheduler.alpha.kubernetes.io/preferAvoidPods, as follows:

    • If the node does not have the annotation, its score is 10 multiplied by the weight 10000
    • If the annotation is present, Pod objects managed by a ReplicationController or ReplicaSet controller score 0, while other Pod objects are unaffected (and get the highest score)
  4. NodeAffinity: evaluates nodes against the node affinity scheduling preference; it checks how well the given node matches the nodeSelector terms in the Pod resource, and the more entries match, the higher the node's score. Note that the evaluation uses the preferred (soft) rather than the required selector, PreferredDuringSchedulingIgnoredDuringExecution

  5. TaintToleration: evaluates node priority based on how well the Pod object tolerates the node's taints; it matches the Pod's tolerations list against the node's Taints, and the more entries match, the lower the node's score

  6. SelectorSpread: label selector spread. It looks up the Service, ReplicationController, ReplicaSet (RS), and StatefulSet objects that match the current pod, then finds the existing Pod objects matched by their selectors and the nodes those Pods run on; nodes running fewer of those Pods score higher. In short, as its name suggests, this priority function tries to spread Pods matched by the same label selector across different nodes

  7. InterPodAffinity: iterates over the pod's affinity terms and adds up the ones the given node can satisfy; the larger the sum, the higher the score

  8. MostRequested: uses the same inputs as LeastRequested but scores in the opposite direction; this function tries to fill a node's resources as completely as possible and is generally not used together with LeastRequested

  9. NodeLabel: scores nodes by whether they carry certain labels: a node scores if the label exists and does not score otherwise, or the score can be based on the number of matching labels

  10. ImageLocality: scores a node by the container images required by the current Pod object that the node already holds. A node with none of the required image files scores 0; among nodes that do hold some, the larger the total size of the required images already present, the higher the score, which saves download bandwidth

Priority evaluation:
A pod is evaluated by every enabled priority function; the scores are summed and the highest total wins, and if several nodes tie they move on to the select phase. The scheduler also lets each priority function be given a simple positive weight: when computing a node's priority it first multiplies each function's score by its weight (most functions default to a weight of 1) and then sums all the weighted scores to obtain the node's final priority. Weights give administrators a way to express which priority functions they favor. The final priority score of each node is computed as:

finalScoreNode=(weight1*priorityFunc1)+(weight2*priorityFunc2)+ ...

Select

The pod is bound to the chosen node; if the priority phase leaves more than one best candidate, one of them is picked at random

Special Preferences

These are ways for particular pods to influence node selection: they participate in or alter the outcome of the predicate and priority phases, enabling more advanced scheduling. There are three kinds of special preferences, described below

Node labels

When certain pods need to run on specific nodes, the nodes should first be classified with labels; the pod definition can then add a preference through pods.spec.nodeName or pods.spec.nodeSelector, which is evaluated during the Predicate phase

  1. Manifest template

    [root@master-0 ~]# kubectl explain pod.spec.nodeSelector
    KIND:     Pod
    VERSION:  v1
    
    FIELD:    nodeSelector <map[string]string>
    
    DESCRIPTION:
        NodeSelector is a selector which must be true for the pod to fit on a node.
        Selector which must match a node's labels for the pod to be scheduled on
        that node. More info:
        https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
    
  2. Example

    [root@master-0 ~]# cat nodeselector.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-demo
      namespace: default
      labels:
          app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
      nodeSelector:
        disktype: ssd
    [root@master-0 ~]# kubectl apply -f nodeselector.yaml
    pod/pod-demo created
    [root@master-0 ~]# kubectl label nodes slave-0.shared disktype=ssd
    node/slave-0.shared labeled
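
A quick way to confirm the effect of the label and the nodeSelector (output omitted):

    [root@master-0 ~]# kubectl get nodes --show-labels | grep disktype
    [root@master-0 ~]# kubectl get pod pod-demo -o wide        # the NODE column should show slave-0.shared once the label is in place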
    

Affinity

Affinity is configured under pod.spec.affinity and can be defined along two dimensions: node and pod

[root@master-0 ~]# kubectl explain pod.spec.affinity
KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).
Node affinity

There are two types of node affinity rules:

  • Hard affinity (required): a mandatory rule that must be satisfied when scheduling the Pod; if no node satisfies it, the Pod object is left in the Pending state
  • Soft affinity (preferred): a flexible scheduling constraint that prefers to run the Pod object on certain nodes; the scheduler tries to satisfy it, but when it cannot, it falls back to a node that does not match the rule

For both required and preferred rules, once a Pod has been scheduled onto a node, the scheduler will not move it off that node if the node's labels later change and no longer satisfy the node affinity rule

Node hard affinity
  1. Node hard affinity: pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution

    [root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>
    
    DESCRIPTION:
        If the affinity requirements specified by this field are not met at
        scheduling time, the pod will not be scheduled onto the node. If the
        affinity requirements specified by this field cease to be met at some point
        during pod execution (e.g. due to an update), the system may or may not try
        to eventually evict the pod from its node.
    
        A node selector represents the union of the results of one or more label
        queries over a set of nodes; that is, it represents the OR of the selectors
        represented by the node selector terms.
    
    FIELDS:
      nodeSelectorTerms <[]Object> -required-       # the nodes this pod is affine to
        Required. A list of node selector terms. The terms are ORed.
    [root@master-0 ~]# cat nodeaffinity.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-nodeaffinity
      namespace: default
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
      affinity:
        nodeAffinity:
            nodeSelectorTerms:
            - matchExpressions:
              - key: zone
                operator: In
                values:
                - foo
                - bar
    [root@master-0 ~]# kubectl apply -f nodeaffinity.yaml
    pod/pod-nodeaffinity created                  # the pod only goes Running if some node has the label key zone with value foo or bar
    
  2. The two matching methods under pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms

    • matchExpressions: a list of node selector requirements expressed against node labels
    • matchFields: lets you filter Kubernetes resources by the value of one or more resource fields, for example
      1. metadata.name=my-service
      2. metadata.namespace!=default
      3. status.phase=Pending
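
For node affinity, matchFields is typically used against the node's metadata.name field; a minimal sketch (the node name is illustrative):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchFields:
            - key: metadata.name
              operator: In
              values:
              - slave-0.shared          # pin the pod to this exact node
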
Node soft affinity

Node soft affinity: pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution

[root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1

RESOURCE: preferredDuringSchedulingIgnoredDuringExecution <[]Object>

DESCRIPTION:
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.

     An empty preferred scheduling term matches all objects with implicit weight
     0 (i.e. it's a no-op). A null preferred scheduling term matches no objects
     (i.e. is also a no-op).

FIELDS:
   preference <Object> -required-             # the preferred node selector term
     A node selector term, associated with the corresponding weight.

   weight <integer> -required-                # the weight of this preference
     Weight associated with matching the corresponding nodeSelectorTerm, in the
     range 1-100.
[root@master-0 ~]# cat nodeaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
        weight: 60
[root@master-0 ~]# kubectl apply -f nodeaffinity-demo.yaml
pod/pod-nodeaffinity-demo created

Pod affinity

Pod affinity lets pods that need to be related run together. Node affinity could achieve this too, but it takes careful orchestration; with pod affinity the scheduler can place the first pod anywhere, and the pods that have affinity or anti-affinity with it are then positioned dynamically relative to it. Some mechanism, such as node labels, is still needed so that pod affinity and anti-affinity have something concrete to key on

If certain pods prefer to run in the same location they have affinity; if they prefer not to run in the same location they have anti-affinity, for example two Nginx pods that both listen on port 80, or pods isolated from each other for security reasons

Pod hard affinity
  1. Pod hard affinity: pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution

    [root@master-0 ~]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>
    
    DESCRIPTION:
        If the affinity requirements specified by this field are not met at
        scheduling time, the pod will not be scheduled onto the node. If the
        affinity requirements specified by this field cease to be met at some point
        during pod execution (e.g. due to a pod label update), the system may or
        may not try to eventually evict the pod from its node. When there are
        multiple elements, the lists of nodes corresponding to each podAffinityTerm
        are intersected, i.e. all terms must be satisfied.
    
        Defines a set of pods (namely those matching the labelSelector relative to
        the given namespace(s)) that this pod should be co-located (affinity) or
        not co-located (anti-affinity) with, where co-located is defined as running
        on a node whose value of the label with key <topologyKey> matches that of
        any node on which a pod of the set of pods is running
    
    FIELDS:
      labelSelector <Object>             # which pods to be co-located with: selects the target pod resources
        A label query over a set of resources, in this case pods.
    
      namespaces <[]string>              # which namespaces the label selector applies to; if unspecified, defaults to the namespace of the pod being created
        namespaces specifies which namespaces the labelSelector applies to (matches
        against); null or empty list means "this pod's namespace"
    
      topologyKey <string> -required-    # the key that defines the location topology
        This pod should be co-located (affinity) or not co-located (anti-affinity)
        with the pods matching the labelSelector in the specified namespaces, where
        co-located is defined as running on a node whose value of the label with
        key topologyKey matches that of any node on which any of the selected pods
        is running. Empty topologyKey is not allowed.
    
  2. Define the base pod and a pod with hard affinity to it

    [root@master-0 ~]# cat pod-requiredaffinity-demo.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-first
      namespace: default
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v1
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-second
      namespace: default
      labels:
        app: db
    spec:
      containers:
      - name: busybox
        image: busybox:latest
        imagePullPolicy: IfNotPresent
        command: ["sh","-c","sleep 3600"]
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - {"key": "app", "operator": "In", "values": ["myapp"]}     # 选择基准 pod 的 label
            topologyKey: kubernetes.io/hostname         # 后置 busybox 的 pod 可以运行在那些节点,这里的条件则为 hostname 一致则只能是基准 pod 运行的那个节点
    [root@master-0 ~]# kubectl apply -f pod-requiredaffinity-demo.yaml
    pod/pod-first created
    pod/pod-second created
    

Pod affinity keyed on a single node is useful only in rare cases; topology constraints based on the same region, zone, or rack are more common. For example, when deploying an application's service Pods together with its database Pods, the db Pod might land on some node in region foo or bar, and the myapp Pod that depends on the data service can then be deployed onto a node within the db Pod's region; of course, if the db Pod has replicas running in both foo and bar, the myapp Pod may run on any node in either region
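
A sketch of the zone-based co-location described above (label keys and values are illustrative); replacing kubernetes.io/hostname with a zone label as the topologyKey lets myapp run on any node in the same zone as the db pod:

    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - {"key": "app", "operator": "In", "values": ["db"]}
          topologyKey: zone             # any node whose zone label matches the db pod's node qualifies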


Pod anti-affinity

The difference lies in the topologyKey: with anti-affinity the matched pods must not end up sharing the same topologyKey value (i.e. they must land in different topology domains); apart from that there is no difference

[root@master-0 ~]# kubectl label nodes slave-0.shared zone=foo
node/slave-0.shared labeled
[root@master-0 ~]# kubectl label nodes slave-1.shared zone=foo
node/slave-1.shared labeled
[root@master-0 ~]# cat pod-required-antiaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  namespace: default
  labels:
    app: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh","-c","sleep 3600"]
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {"key": "app", "operator": "In", "values": ["myapp"]}
        topologyKey: zone
[root@master-0 ~]# kubectl apply -f pod-required-antiaffinity-demo.yaml
pod/pod-first created
pod/pod-second created
[root@master-0 ~]# kubectl get pod
NAME         READY   STATUS    RESTARTS   AGE
pod-first    1/1     Running   0          3s
pod-second   0/1     Pending   0          3s
(Both nodes carry zone=foo, so they are in the same topology domain as pod-first; the anti-affinity rule therefore leaves pod-second with no eligible node and it stays Pending.)

Pod soft affinity and soft anti-affinity

Works the same way as node soft affinity, so it is not described again; a brief sketch follows
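
For completeness, a minimal sketch of a soft (preferred) pod affinity term, mirroring the node soft affinity example (weight, labels, and topologyKey are illustrative):

    affinity:
      podAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 60
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - {"key": "app", "operator": "In", "values": ["db"]}
            topologyKey: zone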

Taints and Tolerations

Taints are key/value attributes added to a node; Tolerations are a list, defined on a pod, of the Taints it can tolerate. A node can mark itself with taints, and whether a pod can run on that node depends on whether the pod tolerates those taints


Taint checking is involved in both the predicate and priority phases. When a new taint that existing pods do not tolerate appears on a node, the outcome depends on the repelling effect defined in Taints.effect

  • NoSchedule: affects only the scheduling process; existing pods are not affected
  • NoExecute: affects both scheduling and existing pods; pods that do not tolerate the taint are actively evicted. An eviction grace period can be set in pods.spec.tolerations.tolerationSeconds (zero or negative values mean evict immediately)
  • PreferNoSchedule: a soft version of NoSchedule
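
Before adding taints it is worth checking what a node already carries; master nodes deployed with kubeadm typically have a node-role.kubernetes.io/master:NoSchedule taint (output omitted):

    [root@master-0 ~]# kubectl describe node master-0.shared | grep -i taint
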
Taints
  1. Defined on the node; the template first

    [root@master-0 ~]# kubectl explain node.spec.taints
    KIND:     Node
    VERSION:  v1
    
    RESOURCE: taints <[]Object>
    
    DESCRIPTION:
        If specified, the node's taints.
    
        The node this Taint is attached to has the "effect" on any pod that does
        not tolerate the Taint.
    
    FIELDS:
      effect <string> -required-
        Required. The effect of the taint on pods that do not tolerate the taint.
        Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
    
      key <string> -required-
        Required. The taint key to be applied to a node.
    
      timeAdded <string>
        TimeAdded represents the time at which the taint was added. It is only
        written for NoExecute taints.
    
      value <string>
        The taint value corresponding to the taint key.
    
  2. Command-line form

    Usage:
      kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]
    [root@master-0 ~]# kubectl taint node slave-0.shared node-type=production:NoSchedule
    node/slave-0.shared tainted
    [root@master-0 ~]# kubectl get pod -owide         # none of the pods has a matching toleration
    NAME                           READY   STATUS              RESTARTS   AGE     IP       NODE             NOMINATED NODE   READINESS GATES
    myapp-98skj                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-7dsg5   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-fm8tm   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-deploy-5d645d645-wskql   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
    myapp-ms6lv                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
    [root@master-0 ~]# kubectl taint node slave-1.shared node-type=dev:NoExecute
    node/slave-1.shared tainted
    [root@master-0 ~]# kubectl get pod
    NAME                           READY   STATUS    RESTARTS   AGE
    myapp-deploy-5d645d645-dppsh   0/1     Pending   0          23s
    myapp-deploy-5d645d645-pcpfp   0/1     Pending   0          23s
    myapp-deploy-5d645d645-rtghf   0/1     Pending   0          23s
    myapp-gmxm6                    0/1     Pending   0          23s
    myapp-j8dhg                    0/1     Pending   0          23s
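
To undo the taints used in this demo, append a trailing minus to the same key and effect (illustrative; matches the taints added above):

    [root@master-0 ~]# kubectl taint node slave-0.shared node-type=production:NoSchedule-
    [root@master-0 ~]# kubectl taint node slave-1.shared node-type=dev:NoExecute-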
    
Tolerations

Two operators are supported when defining tolerations on a Pod object

  • Equality comparison: the toleration must match the taint exactly on key, value, and effect
  • Existence check: the key and effect must match exactly, and the toleration's value field must be left empty
  1. Toleration template

    [root@master-0 ~]# kubectl explain pods.spec.tolerations
    KIND:     Pod
    VERSION:  v1
    
    RESOURCE: tolerations <[]Object>
    
    DESCRIPTION:
        If specified, the pod's tolerations.
    
        The pod this Toleration is attached to tolerates any taint that matches the
        triple <key,value,effect> using the matching operator <operator>.
    
    FIELDS:
      effect <string>
        Effect indicates the taint effect to match. Empty means match all taint
        effects. When specified, allowed values are NoSchedule, PreferNoSchedule
        and NoExecute.
    
      key <string>
        Key is the taint key that the toleration applies to. Empty means match all
        taint keys. If the key is empty, operator must be Exists; this combination
        means to match all values and all keys.
    
      operator <string>            # Equal for equality comparison, Exists for the existence check
        Operator represents a key's relationship to the value. Valid operators are
        Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for
        value, so that a pod can tolerate all taints of a particular category.
    
      tolerationSeconds <integer>    # how long the taint is tolerated
        TolerationSeconds represents the period of time the toleration (which must
        be of effect NoExecute, otherwise this field is ignored) tolerates the
        taint. By default, it is not set, which means tolerate the taint forever
        (do not evict). Zero and negative values will be treated as 0 (evict
        immediately) by the system.
    
      value <string>
        Value is the taint value the toleration matches to. If the operator is
        Exists, the value should be empty, otherwise just a regular string.
    
  2. A toleration list using equality comparison

    [root@master-0 ~]# cat deploy.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-deploy
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
          release: canary
      template:
        metadata:
          labels:
            app: myapp
            release: canary
        spec:
          containers:
          - name: myapp
            image: nginx:1.7
            ports:
            - name: http
              containerPort: 80
          tolerations:
          - key: "node-type"
            operator: "Equal"
            value: "production"
            effect: "NoSchedule"
    [root@master-0 ~]# kubectl apply -f deploy.yaml
    deployment.apps/myapp-deploy configured
    [root@master-0 ~]# kubectl get pod  -owide
    NAME                           READY   STATUS              RESTARTS   AGE     IP       NODE             NOMINATED NODE   READINESS GATES
    myapp-deploy-9f9d6df86-8w6qb   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    myapp-deploy-9f9d6df86-d6vjg   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    myapp-deploy-9f9d6df86-lhh78   0/1     ContainerCreating   0          2s      <none>   slave-0.shared   <none>           <none>
    
  3. A toleration list using the existence check

    [root@master-0 ~]# cat deploy.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp-deploy
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
          release: canary
      template:
        metadata:
          labels:
            app: myapp
            release: canary
        spec:
          containers:
          - name: myapp
            image: nginx:1.7
            ports:
            - name: http
              containerPort: 80
          tolerations:
          - key: "node-type"
            operator: "Exists"
            value: ""
            effect: ""                  # Exists 状态下 value 默认为通配符,所以可以通过 effect 来匹配节点,比如此时如果值为 NoSchedule 则 pod 会被全部调度到 slave-0 上
    [root@master-0 ~]# kubectl apply -f deploy.yaml
    deployment.apps/myapp-deploy configured
    [root@master-0 ~]# kubectl get pod -owide
    NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
    myapp-deploy-7c7968f87c-d6b69   1/1     Running   0          12s   10.244.1.24   slave-1.shared   <none>           <none>
    myapp-deploy-7c7968f87c-f798g   1/1     Running   0          12s   10.244.2.21   slave-0.shared   <none>           <none>
    myapp-deploy-7c7968f87c-nvf9m   1/1     Running   0          12s   10.244.2.22   slave-0.shared   <none>           <none>
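
tolerationSeconds only takes effect together with NoExecute; a hedged snippet showing a toleration that keeps the pod on the tainted node for a limited time before eviction (values are illustrative):

    tolerations:
    - key: "node-type"
      operator: "Equal"
      value: "dev"
      effect: "NoExecute"
      tolerationSeconds: 3600           # tolerate the taint for up to one hour, then be evicted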
    

Problem Node Taints

Since version 1.6, Kubernetes supports marking problem nodes automatically with taints: the node controller adds taint information to a node under specific conditions. These taints all use the NoExecute effect, so existing Pod objects that cannot tolerate them are also evicted. The built-in taints of this kind currently include the following

  • node.kubernetes.io/not-ready: added automatically when the node enters the NotReady state
  • node.alpha.kubernetes.io/unreachable: added automatically when the node becomes unreachable
  • node.kubernetes.io/out-of-disk: added automatically when the node enters the OutOfDisk state
  • node.kubernetes.io/memory-pressure: the node is under memory pressure
  • node.kubernetes.io/disk-pressure: the node is under disk pressure
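
Pods can tolerate these automatically added taints explicitly; a sketch of a toleration similar to the ones the default admission controller attaches for not-ready/unreachable nodes (the 300-second window mirrors the upstream default):

    tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300            # keep running for 5 minutes after the node becomes NotReady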