The Kubernetes Scheduler
Kubernetes relies on the scheduler component to find a suitable node for each pod awaiting scheduling and to get that pod running in its desired state. During scheduling the scheduler never modifies the Pod resource: it reads the Pod, picks the most suitable node according to the configured policies, and then binds the Pod to that node through an API call, which completes the scheduling process.
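The binding itself is a small object the scheduler posts to the pod's binding subresource on the API server. A minimal sketch of its shape, with hypothetical pod and node names:
apiVersion: v1
kind: Binding
metadata:
  name: pod-demo            # the pod being bound (hypothetical name)
target:
  apiVersion: v1
  kind: Node
  name: slave-0.shared      # the node the scheduler selected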
How it works
- Overview of the kubelet's role
When a user request reaches the scheduler through the API server, the scheduler's algorithm works out the node best suited to run the pod and writes the result back to the API server, which persists it in etcd. Unless the node goes down or the pod is evicted (for example by an OOM), the pod keeps running on that node, and even if the pod is rebuilt the scheduling result does not change. The kubelet on each node keeps watching the API server, and as soon as an event concerning its own node shows up it fetches the declared resource manifest from the API server and creates the pod: pulling or starting the local image, mounting any required volumes, and so on.
- Overview of kube-proxy's role
Creating a service works the same way as creating a pod; the only difference is that a service exists only as iptables or LVS rules on each node, and those rules are generated by the node's kube-proxy, which also watches the API server.
- Data serialization in the API server
To the API server every request is a client and is subject to authentication and authorization; the only difference between clients is how they serialize data: kubectl serializes to JSON, while communication between in-cluster components uses Protobuf, a serialization format developed by Google.
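As a rough illustration of the two wire formats (the bearer token variable and the API server address below are assumptions for this sketch), the same object can be fetched as JSON via kubectl or as Protobuf by setting the Accept header explicitly:
[root@master-0 ~]# kubectl get pod pod-demo -o json                    # rendered as JSON
[root@master-0 ~]# curl -k -H "Authorization: Bearer $TOKEN" \
      -H "Accept: application/vnd.kubernetes.protobuf" \
      https://master-0:6443/api/v1/namespaces/default/pods/pod-demo   # protobuf-encoded response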
Scheduler Algorithms
Kubernetes ships with a default scheduler that fits the Pod-scheduling needs of the vast majority of scenarios; it can combine built-in and customizable policies to pick the single node in the cluster best suited to run the current Pod, and its core goal is to distribute Pods fairly across the cluster's nodes according to resource availability. This default scheduler, also called the generic scheduler, performs scheduling in three steps: node pre-selection (Predicate), node priority ordering (Priority), and final node selection (Select).
Predicate
A container can be constrained along two dimensions: the first is its baseline resource request, which must be satisfiable for the container to run at all; the second is its resource limit, beyond which no further memory is granted, while the container itself reports its current usage. Nodes that cannot meet a pod's baseline requests are eliminated in the Predicate phase, as are nodes that, for example, already have in use a host port one of the containers wants to listen on. In short, this step removes every node that cannot possibly satisfy the pod's basic requirements; predicate policies follow a one-vote-veto rule.
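A minimal sketch of those two dimensions in a pod manifest (the pod name and values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo              # hypothetical name
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    resources:
      requests:                    # baseline the node must be able to provide (dimension one)
        cpu: 100m
        memory: 64Mi
      limits:                      # hard ceiling for the container (dimension two)
        cpu: 500m
        memory: 128Mi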
Of the predicate policies supported in Kubernetes 1.10, only a subset is enabled in the scheduler by default; to bring other policies into effect they have to be added when the scheduler is deployed or reconfigured later, as the policy-file sketch below illustrates.
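A minimal sketch of such a configuration using the legacy scheduler policy file, assuming kube-scheduler is started with --policy-config-file pointing at it (the selection of names here is illustrative, not a recommended set):
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}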
Common predicate policies
- CheckNodeCondition: checks whether a pod may be scheduled onto a node that is reporting its disk or network as unavailable or not ready; enabled by default.
- GeneralPredicates: a subset of predicates, enabled by default, which includes:
  - HostName: checks pod.spec.hostname; if the pod defines a hostname, verify that no other pod on the node already occupies that name.
  - PodFitsHostPorts: checks pod.spec.containers.ports.hostPort; if a container defines host ports, verify that no other pod on the node already occupies them.
  - MatchNodeSelector: checks whether the node carries the labels required by the pod's label selector.
  - PodFitsResources: checks whether the node can satisfy the pod's resource requests, as reported under Allocated resources in kubectl describe node.
- NoDiskConflict: checks that there is no disk conflict, i.e. whether the node can satisfy the pod's volume requirements; not enabled by default.
- PodToleratesNodeTaints: checks whether the pod's pod.spec.tolerations cover the node's taints; enabled by default.
- PodToleratesNodeNoExecuteTaints: checks whether the pod's pod.spec.tolerations cover the node's NoExecute taints; not enabled by default.
- CheckNodeLabelPresence: checks for the presence of given node labels; not enabled by default.
- CheckServiceAffinity: decides whether to schedule the pod onto a node according to whether other pods of the pod's Service already run there; not enabled by default.
- Three cloud-provider volume-count predicates, enabled by default:
  - MaxEBSVolumeCount
  - MaxGCEPDVolumeCount
  - MaxAzureDiskVolumeCount
- CheckVolumeBinding: checks whether the node's bound and unbound PVCs can satisfy the pod's volume requirements; enabled by default.
- NoVolumeZoneConflict: checks, within the current zone, whether the node's volumes conflict with the pod's requirements; enabled by default.
- CheckNodeMemoryPressure: checks whether the node is under memory pressure; enabled by default.
- CheckNodePIDPressure: checks whether the node is under PID pressure; enabled by default.
- CheckNodeDiskPressure: checks whether the node is under disk I/O pressure; enabled by default.
- MatchInterPodAffinity: checks whether the node satisfies the pod's affinity or anti-affinity requirements; enabled by default.
Priority
Once the predicates have filtered the candidates and produced a node list, scheduling enters its second phase, priority scoring. In this phase the scheduler runs a series of priority functions against every node that passed pre-selection to compute its priority score. Scores range from 0 to 10, where 0 means unsuitable and 10 means best suited to host the Pod.
Common priority functions
- LeastRequested: scores a node by the ratio of its free resources to its total capacity; the larger the free capacity, the higher the score (a worked example follows this list). Its formula is
  (cpu((capacity - sum(requested)) * 10 / capacity) + mem((capacity - sum(requested)) * 10 / capacity)) / 2
  Each term is multiplied by 10 because every priority function scores on a 0-10 scale; the CPU and memory scores are added and the sum is divided by 2 because two resource dimensions are involved.
- BalancedResourceAllocation: the closer a node's CPU and memory utilization ratios are to each other, the higher the score; it is meant to be used together with LeastRequested to evaluate how a node's resources are being consumed.
- NodePreferAvoidPods: this priority function has a default weight of 10000 and scores a node according to whether it carries the annotation scheduler.alpha.kubernetes.io/preferAvoidPods:
  - If the node does not carry the annotation, its score is 10 multiplied by the weight 10000.
  - If the node does carry the annotation, Pods managed by a ReplicationController or ReplicaSet score 0 on it, while other Pods ignore the annotation (and receive the highest score).
- NodeAffinity: evaluates node-affinity scheduling preferences; it checks the given node against the nodeSelector terms in the Pod resource, and the more entries match, the higher the node scores. The evaluation uses the preferred (soft) rather than the required (hard) selector, i.e. PreferredDuringSchedulingIgnoredDuringExecution.
- TaintToleration: evaluates priority based on how well the Pod tolerates the node's taints; it matches the Pod's tolerations list against the node's taints, and the more entries match, the lower the node scores.
- SelectorSpread: selector spreading; it finds the Service, ReplicationController, ReplicaSet (RS) and StatefulSet objects that match the current pod, then finds the existing Pods matched by those selectors and the nodes they run on; nodes running fewer of those Pods score higher. In short, as its name suggests, this function tries to spread Pods matched by the same selector across different nodes.
- InterPodAffinity: iterates over the pod's affinity terms and sums up those the given node can satisfy; the larger the sum, the higher the score.
- MostRequested: uses the same inputs as LeastRequested but scores in the opposite direction; it tries to pack a node as full as possible and is normally not used together with LeastRequested.
- NodeLabel: scores a node according to whether it carries certain labels (a score when present, none when absent), or according to the number of matching labels.
- ImageLocality: scores a node based on the images already present locally that the current Pod's containers depend on; a node with none of the required images scores 0, and among nodes that do hold some, the larger the total size of the locally present required images, the higher the score, which saves download bandwidth.
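A quick worked example of LeastRequested under assumed numbers: a node with 4 CPU cores (4000m) and 8Gi of memory, on which the scheduled pods already request 2000m of CPU and 2Gi of memory:
cpu score = (4000 - 2000) * 10 / 4000 = 5
mem score = (8 - 2) * 10 / 8 = 7.5
LeastRequested score = (5 + 7.5) / 2 = 6.25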
Priority evaluation:
A pod is evaluated by every enabled priority function and the scores are summed; the node with the highest total is the best choice, and if several nodes tie they move on to the Select phase. The scheduler also allows each priority function to carry a simple positive-integer weight: when computing a node's priority it first multiplies each function's score by that function's weight (most priority functions default to a weight of 1) and then adds up all the results to obtain the node's final priority score. Weights give administrators a way to express which priority functions they favor. The final score of each node is computed as:
finalScoreNode = (weight1 * priorityFunc1) + (weight2 * priorityFunc2) + ...
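Continuing the assumed numbers above with two functions, both at the default weight of 1, two hypothetical nodes might come out as:
node A: 1 * 6.25 (LeastRequested) + 1 * 8 (BalancedResourceAllocation) = 14.25
node B: 1 * 5.00 (LeastRequested) + 1 * 9 (BalancedResourceAllocation) = 14.00
Node A has the higher final score and is handed to the Select phase.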
Select
The pod is bound to the node that won the priority phase; if more than one node comes out on top, one of them is chosen at random.
Special preferences
These are ways for particular pods to influence node selection; through them a pod can take part in, or change the outcome of, the predicate and priority phases, enabling more advanced scheduling. There are three kinds of special preference, covered below.
Node labels
When certain pods need to run on specific nodes, the nodes should first be classified with labels; the pod definition can then state its preference through pods.spec.nodeName or pods.spec.nodeSelector. This is evaluated during the Predicate phase.
- Manifest reference
[root@master-0 ~]# kubectl explain pod.spec.nodeSelector
KIND:     Pod
VERSION:  v1
FIELD:    nodeSelector <map[string]string>
DESCRIPTION:
     NodeSelector is a selector which must be true for the pod to fit on a
     node. Selector which must match a node's labels for the pod to be
     scheduled on that node. More info:
     https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
- Example
[root@master-0 ~]# cat nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  nodeSelector:
    disktype: ssd
[root@master-0 ~]# kubectl apply -f nodeselector.yaml
pod/pod-demo created
[root@master-0 ~]# kubectl label nodes slave-0.shared disktype=ssd
node/slave-0.shared labeled
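pods.spec.nodeName skips label matching and pins the pod to a named node directly; a minimal sketch (the pod name is illustrative, the node name matches the cluster above):
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename-demo        # hypothetical name
  namespace: default
spec:
  nodeName: slave-0.shared       # run directly on this node
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1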
Affinity
Affinity is defined under pod.spec.affinity and comes in two dimensions, node and pod:
[root@master-0 ~]# kubectl explain pod.spec.affinity
KIND: Pod
VERSION: v1
RESOURCE: affinity <Object>
DESCRIPTION:
If specified, the pod's scheduling constraints
Affinity is a group of affinity scheduling rules.
FIELDS:
nodeAffinity <Object>
Describes node affinity scheduling rules for the pod.
podAffinity <Object>
Describes pod affinity scheduling rules (e.g. co-locate this pod in the
same node, zone, etc. as some other pod(s)).
podAntiAffinity <Object>
Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
in the same node, zone, etc. as some other pod(s)).
Node affinity
There are two types of node-affinity rules:
- Hard affinity (required): a mandatory rule that must be satisfied for the Pod to be scheduled; if no node satisfies it, the Pod object is left in the Pending state.
- Soft affinity (preferred): a soft scheduling constraint that prefers to run the Pod on a certain class of node; the scheduler tries to satisfy it, but when it cannot, it falls back to a node that does not match the rule.
For both required and preferred rules, once a Pod has been scheduled onto a node, the scheduler will not move it off that node if the node's labels later change and no longer satisfy the affinity rule.
Node hard affinity
- Node hard affinity: pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
[root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>
DESCRIPTION:
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some
     point during pod execution (e.g. due to an update), the system may or
     may not try to eventually evict the pod from its node.
     A node selector represents the union of the results of one or more label
     queries over a set of nodes; that is, it represents the OR of the
     selectors represented by the node selector terms.
FIELDS:
   nodeSelectorTerms <[]Object> -required-    # the node terms to match
     Required. A list of node selector terms. The terms are ORed.
[root@master-0 ~]# cat nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
[root@master-0 ~]# kubectl apply -f nodeaffinity.yaml
pod/pod-nodeaffinity created
# The pod only goes Running if some node carries a zone label whose value is foo or bar
- The two matching methods available under pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms:
  - matchExpressions: a list of node selector requirements expressed against node labels
  - matchFields: lets you filter Kubernetes resources by the values of one or more resource fields, for example:
    - metadata.name=my-service
    - metadata.namespace!=default
    - status.phase=Pending
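A minimal sketch of matchFields inside a node-affinity term (the pod name is illustrative; in node selector terms, metadata.name is the field typically matched):
apiVersion: v1
kind: Pod
metadata:
  name: pod-matchfields-demo       # hypothetical name
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name     # match against the node object's name field
            operator: In
            values:
            - slave-0.shared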
Node soft affinity
Node soft affinity: pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
[root@master-0 ~]# kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1
RESOURCE: preferredDuringSchedulingIgnoredDuringExecution <[]Object>
DESCRIPTION:
The scheduler will prefer to schedule pods to nodes that satisfy the
affinity expressions specified by this field, but it may choose a node that
violates one or more of the expressions. The node that is most preferred is
the one with the greatest sum of weights, i.e. for each node that meets all
of the scheduling requirements (resource request, requiredDuringScheduling
affinity expressions, etc.), compute a sum by iterating through the
elements of this field and adding "weight" to the sum if the node matches
the corresponding matchExpressions; the node(s) with the highest sum are
the most preferred.
An empty preferred scheduling term matches all objects with implicit weight
0 (i.e. it's a no-op). A null preferred scheduling term matches no objects
(i.e. is also a no-op).
FIELDS:
preference <Object> -required-    # the preferred node selector term
A node selector term, associated with the corresponding weight.
weight <integer> -required-    # the weight of this preference
Weight associated with matching the corresponding nodeSelectorTerm, in the
range 1-100.
[root@master-0 ~]# cat nodeaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-demo
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
        weight: 60
[root@master-0 ~]# kubectl apply -f nodeaffinity-demo.yaml
pod/pod-nodeaffinity-demo created
Pod affinity
Pod affinity lets pods that belong together run together. This can also be achieved with node affinity, but that requires careful orchestration; with pod affinity the scheduler may place the first pod anywhere, and pods with an affinity or anti-affinity relationship to it are then positioned relative to it. Some mechanism, such as node labels, is still needed so that pod affinity and anti-affinity have a topology to reason about.
If certain pods prefer to run in the same location, they have affinity for each other; if they prefer not to share a location, they have anti-affinity, for example two Nginx pods that both listen on port 80, or pods kept apart for security reasons.
Pod hard affinity
- Pod hard affinity: pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
[root@master-0 ~]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND:     Pod
VERSION:  v1
RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>
DESCRIPTION:
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some
     point during pod execution (e.g. due to a pod label update), the system
     may or may not try to eventually evict the pod from its node. When there
     are multiple elements, the lists of nodes corresponding to each
     podAffinityTerm are intersected, i.e. all terms must be satisfied.
     Defines a set of pods (namely those matching the labelSelector relative
     to the given namespace(s)) that this pod should be co-located (affinity)
     or not co-located (anti-affinity) with, where co-located is defined as
     running on a node whose value of the label with key <topologyKey>
     matches that of any node on which a pod of the set of pods is running
FIELDS:
   labelSelector <Object>    # which pods to be affine with, i.e. selects the target pods
     A label query over a set of resources, in this case pods.
   namespaces <[]string>     # which namespace the selected pods live in; defaults to the namespace of the pod being created
     namespaces specifies which namespaces the labelSelector applies to
     (matches against); null or empty list means "this pod's namespace"
   topologyKey <string> -required-    # the label key that defines a topological location
     This pod should be co-located (affinity) or not co-located (anti-affinity)
     with the pods matching the labelSelector in the specified namespaces,
     where co-located is defined as running on a node whose value of the label
     with key topologyKey matches that of any node on which any of the
     selected pods is running. Empty topologyKey is not allowed.
- Define a reference pod and a pod with hard affinity to it
[root@master-0 ~]# cat pod-requiredaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  namespace: default
  labels:
    app: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh","-c","sleep 3600"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {"key": "app", "operator": "In", "values": ["myapp"]}   # select the reference pod by its label
        topologyKey: kubernetes.io/hostname   # which nodes the busybox pod may run on; requiring an identical hostname restricts it to the node where the reference pod runs
[root@master-0 ~]# kubectl apply -f pod-requiredaffinity-demo.yaml
pod/pod-first created
pod/pod-second created
Pod affinity keyed to a single node is only useful in rare cases; far more common are topology constraints based on a region, zone or rack. For example, when deploying the Pods of an application tier alongside its database tier, the db Pod might land on some node in zone foo or zone bar, and the myapp Pod that depends on the data service can then be placed on any node within the zone where the db Pod runs; of course, if the db Pod has replicas running in both foo and bar, the myapp Pod may run on any node in either zone.
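A minimal sketch of that zone-level co-location, assuming nodes carry a zone label as in the anti-affinity example that follows (the pod name is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: myapp-zone-demo            # hypothetical name
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["db"]         # follow the db pod
        topologyKey: zone          # any node in the same zone as the db pod qualifies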
Pod anti-affinity
The difference is that the matched pods must not share the same topologyKey value; apart from that it works exactly like pod affinity.
[root@master-0 ~]# kubectl label nodes slave-0.shared zone=foo
node/slave-0.shared labeled
[root@master-0 ~]# kubectl label nodes slave-1.shared zone=foo
node/slave-1.shared labeled
[root@master-0 ~]# cat pod-required-antiaffinity-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  namespace: default
  labels:
    app: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["sh","-c","sleep 3600"]
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {"key": "app", "operator": "In", "values": ["myapp"]}
        topologyKey: zone
[root@master-0 ~]# kubectl apply -f pod-required-antiaffinity-demo.yaml
pod/pod-first created
pod/pod-second created
[root@master-0 ~]# kubectl get pod
NAME READY STATUS RESTARTS AGE
pod-first 1/1 Running 0 3s
pod-second 0/1 Pending 0 3s
Pod soft affinity and soft anti-affinity
These behave the same way as node soft affinity, so they are not described again; a brief sketch follows.
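A minimal sketch of a soft pod-affinity rule (names and weight are illustrative), preferring but not requiring co-location with the db pod in the same zone:
apiVersion: v1
kind: Pod
metadata:
  name: pod-preferred-affinity-demo   # hypothetical name
  namespace: default
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                     # preference weight, range 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["db"]
          topologyKey: zone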
Taints and Tolerations
Taints are key/value attributes added to a node; tolerations are the list, defined on a pod, of taints the pod can tolerate. A node can be marked with taints, and whether a pod can run on that node depends on whether the pod tolerates those marks.
Taint checking is involved in both the predicate and the priority phases. When a new taint appears on a node and existing pods do not tolerate it, one of two outcomes follows, depending on the repelling effect defined in Taints.effect:
- NoSchedule: affects only the scheduling of new pods; existing pods are left alone.
- NoExecute: affects both scheduling and existing pods; pods that do not tolerate the taint are actively evicted. A toleration grace period can be set in pods.spec.tolerations.tolerationSeconds; if it is not set the taint is tolerated forever, while a value of 0 (or a negative value) means evict immediately.
- PreferNoSchedule: a soft version of NoSchedule.
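A minimal sketch of a toleration with an eviction grace period (names and values are illustrative): the pod tolerates the NoExecute taint for 60 seconds after it appears and is evicted afterwards:
apiVersion: v1
kind: Pod
metadata:
  name: toleration-seconds-demo    # hypothetical name
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "dev"
    effect: "NoExecute"
    tolerationSeconds: 60          # tolerate the taint for 60 seconds, then be evicted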
Taints
- Taints are defined on the node; the reference first:
[root@master-0 ~]# kubectl explain node.spec.taints
KIND:     Node
VERSION:  v1
RESOURCE: taints <[]Object>
DESCRIPTION:
     If specified, the node's taints.
     The node this Taint is attached to has the "effect" on any pod that does
     not tolerate the Taint.
FIELDS:
   effect <string> -required-
     Required. The effect of the taint on pods that do not tolerate the taint.
     Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
   key <string> -required-
     Required. The taint key to be applied to a node.
   timeAdded <string>
     TimeAdded represents the time at which the taint was added. It is only
     written for NoExecute taints.
   value <string>
     The taint value corresponding to the taint key.
- Command-line form
Usage: kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options]
[root@master-0 ~]# kubectl taint node slave-0.shared node-type=production:NoSchedule
node/slave-0.shared tainted
[root@master-0 ~]# kubectl get pod -owide    # none of the pods has a matching toleration
NAME                           READY   STATUS              RESTARTS   AGE     IP       NODE             NOMINATED NODE   READINESS GATES
myapp-98skj                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
myapp-deploy-5d645d645-7dsg5   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
myapp-deploy-5d645d645-fm8tm   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
myapp-deploy-5d645d645-wskql   0/1     ContainerCreating   0          30s     <none>   slave-1.shared   <none>           <none>
myapp-ms6lv                    0/1     ContainerCreating   0          6m27s   <none>   slave-1.shared   <none>           <none>
[root@master-0 ~]# kubectl taint node slave-1.shared node-type=dev:NoExecute
node/slave-1.shared tainted
[root@master-0 ~]# kubectl get pod
NAME                           READY   STATUS    RESTARTS   AGE
myapp-deploy-5d645d645-dppsh   0/1     Pending   0          23s
myapp-deploy-5d645d645-pcpfp   0/1     Pending   0          23s
myapp-deploy-5d645d645-rtghf   0/1     Pending   0          23s
myapp-gmxm6                    0/1     Pending   0          23s
myapp-j8dhg                    0/1     Pending   0          23s
Tolerations
When defining tolerations on a Pod object, two operators are supported:
- Equality comparison: the toleration must match the taint exactly on key, value and effect.
- Existence check: only the key and effect must match, and the toleration's value field must be left empty.
- Toleration reference
[root@master-0 ~]# kubectl explain pods.spec.tolerations
KIND:     Pod
VERSION:  v1
RESOURCE: tolerations <[]Object>
DESCRIPTION:
     If specified, the pod's tolerations.
     The pod this Toleration is attached to tolerates any taint that matches
     the triple <key,value,effect> using the matching operator <operator>.
FIELDS:
   effect <string>
     Effect indicates the taint effect to match. Empty means match all taint
     effects. When specified, allowed values are NoSchedule, PreferNoSchedule
     and NoExecute.
   key <string>
     Key is the taint key that the toleration applies to. Empty means match
     all taint keys. If the key is empty, operator must be Exists; this
     combination means to match all values and all keys.
   operator <string>              # Equal (value comparison) or Exists (existence test)
     Operator represents a key's relationship to the value. Valid operators
     are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard
     for value, so that a pod can tolerate all taints of a particular
     category.
   tolerationSeconds <integer>    # how long the taint is tolerated
     TolerationSeconds represents the period of time the toleration (which
     must be of effect NoExecute, otherwise this field is ignored) tolerates
     the taint. By default, it is not set, which means tolerate the taint
     forever (do not evict). Zero and negative values will be treated as 0
     (evict immediately) by the system.
   value <string>
     Value is the taint value the toleration matches to. If the operator is
     Exists, the value should be empty, otherwise just a regular string.
- A toleration list using the Equal operator
[root@master-0 ~]# cat deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: canary
  template:
    metadata:
      labels:
        app: myapp
        release: canary
    spec:
      containers:
      - name: myapp
        image: nginx:1.7
        ports:
        - name: http
          containerPort: 80
      tolerations:
      - key: "node-type"
        operator: "Equal"
        value: "production"
        effect: "NoSchedule"
[root@master-0 ~]# kubectl apply -f deploy.yaml
deployment.apps/myapp-deploy configured
[root@master-0 ~]# kubectl get pod -owide
NAME                           READY   STATUS              RESTARTS   AGE   IP       NODE             NOMINATED NODE   READINESS GATES
myapp-deploy-9f9d6df86-8w6qb   0/1     ContainerCreating   0          2s    <none>   slave-0.shared   <none>           <none>
myapp-deploy-9f9d6df86-d6vjg   0/1     ContainerCreating   0          2s    <none>   slave-0.shared   <none>           <none>
myapp-deploy-9f9d6df86-lhh78   0/1     ContainerCreating   0          2s    <none>   slave-0.shared   <none>           <none>
- A toleration list using the Exists operator
[root@master-0 ~]# cat deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: canary
  template:
    metadata:
      labels:
        app: myapp
        release: canary
    spec:
      containers:
      - name: myapp
        image: nginx:1.7
        ports:
        - name: http
          containerPort: 80
      tolerations:
      - key: "node-type"
        operator: "Exists"
        value: ""
        effect: ""    # with Exists the value acts as a wildcard, so effect can be used to narrow the match; if it were NoSchedule here, all pods would be scheduled onto slave-0
[root@master-0 ~]# kubectl apply -f deploy.yaml
deployment.apps/myapp-deploy configured
[root@master-0 ~]# kubectl get pod -owide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
myapp-deploy-7c7968f87c-d6b69   1/1     Running   0          12s   10.244.1.24   slave-1.shared   <none>           <none>
myapp-deploy-7c7968f87c-f798g   1/1     Running   0          12s   10.244.2.21   slave-0.shared   <none>           <none>
myapp-deploy-7c7968f87c-nvf9m   1/1     Running   0          12s   10.244.2.22   slave-0.shared   <none>           <none>
Marking problem nodes
Since version 1.6, Kubernetes can mark problem nodes automatically with taints: under specific conditions the node controller adds the taint to the node itself. These taints all carry the NoExecute effect, so existing Pods that do not tolerate them are also evicted. The built-in taints of this kind currently include the following (a toleration sketch follows the list):
- node.kubernetes.io/not-ready: added automatically when the node enters the NotReady state
- node.alpha.kubernetes.io/unreachable: added automatically when the node becomes unreachable
- node.kubernetes.io/out-of-disk: added automatically when the node enters the OutOfDisk state
- node.kubernetes.io/memory-pressure: the node is under memory pressure
- node.kubernetes.io/disk-pressure: the node is under disk pressure
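For reference, a hedged sketch of tolerations a pod can carry for the first two of these taints; the 300-second grace period mirrors what the DefaultTolerationSeconds admission plugin typically injects, but treat the exact values as an assumption here:
apiVersion: v1
kind: Pod
metadata:
  name: default-tolerations-demo   # hypothetical name
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300         # stay on a NotReady node for up to 5 minutes before eviction
  - key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300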