Environment
Cluster layout:
This build is a 3-master + 2-worker cluster (one worker is a 3090 GPU node), Kubernetes v1.30.4.
Machine list:
ip          | role        | OS         | kernel
10.144.11.1 | master1     | CentOS 7.9 | 4.20.0-1
10.144.11.2 | master2     | CentOS 7.9 | 4.20.0-1
10.144.11.3 | master3     | CentOS 7.9 | 4.20.0-1
10.144.11.4 | worker      | CentOS 7.9 | 4.20.0-1
10.144.11.5 | worker, gpu | CentOS 7.9 | 4.20.0-1
Preparing the environment
Note: run every step in this section on all machines.
Machine initialization:
Disable the firewall:
# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables
Disable SELinux:
# sed -i 's/enforcing/disabled/' /etc/selinux/config  # permanent
# setenforce 0  # temporary
Disable swap:
# swapoff -a  # temporary
# sed -ri 's/.*swap.*/#&/' /etc/fstab  # permanent
Pass bridged IPv4 traffic to the iptables chains:
cat > /etc/sysctl.d/k8s.conf << EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sysctl --system  # apply
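One step the snippet above assumes: the bridge sysctls only take effect once the br_netfilter kernel module is loaded, and containerd's overlayfs snapshotter needs overlay. A minimal sketch to load both now and on every boot:

```shell
# Load the kernel modules needed by the sysctls above (br_netfilter)
# and by containerd's overlayfs snapshotter (overlay), and persist the
# list so systemd reloads them after a reboot.
cat > /etc/modules-load.d/k8s.conf << EOF
overlay
br_netfilter
EOF
modprobe overlay
modprobe br_netfilter
sysctl --system
```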
# Configure hosts entries for the node names
Most guides rename the machines so the nodes end up named master1/2/3 and worker1/2/3.
Renaming, however, would interfere with internal asset tracking and other services here,
so I keep the existing machine names as node names and simply map each name to its IP
in /etc/hosts:
cat > /etc/hosts << EOF
10.144.11.1 hostname001
10.144.11.2 hostname002
10.144.11.3 hostname003
10.144.11.4 hostname004
10.144.11.5 hostname005
EOF
Installing the dependencies
Installing containerd
Two ways to install containerd:
1. Offline install (for restricted networks)
# download on an internet-connected machine, then copy the files to the cluster nodes
wget https://github.com/containerd/containerd/releases/download/v1.7.22/cri-containerd-cni-1.7.22-linux-amd64.tar.gz
wget https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
# unpack and install
sudo tar Cxzvf /usr/local cri-containerd-cni-1.7.22-linux-amd64.tar.gz
# install the systemd unit file
sudo mv containerd.service /usr/lib/systemd/system/
2. yum install (with internet access)
# add the official Docker repository
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# refresh the yum cache
sudo yum makecache
# install the containerd.io package
sudo yum install -y containerd.io
Or list the available versions first and install a specific one:
yum list containerd.io --showduplicates | sort
Adjust the containerd configuration:
# generate the default config file
sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml
# enable the systemd cgroup driver (when the host runs systemd)
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# registry credentials (replace $your-registry, $username and $password with your own):
[plugins."io.containerd.grpc.v1.cri".registry.configs]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."$your-registry".auth]
    username = "$username"
    password = "$password"
# when using an internal registry, also point the pause image at it:
sandbox_image = "$your-registry/pause:3.9"
# start containerd:
systemctl daemon-reload
systemctl start containerd
systemctl enable containerd
Pitfall:
CoreDNS later failed to start with
Listen: listen tcp :53: bind: permission denied
The fix is in the containerd configuration:
set enable_unprivileged_ports to true, then restart containerd.
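Scripted, the fix looks like this (a sketch assuming the default config path generated above, and that `containerd config default` emitted the `enable_unprivileged_ports = false` line; if it is absent, add it under [plugins."io.containerd.grpc.v1.cri"] instead):

```shell
# Let containers bind privileged ports (<1024) so CoreDNS can listen on :53.
sudo sed -i 's/enable_unprivileged_ports = false/enable_unprivileged_ports = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
```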
Installing kubectl, kubelet and kubeadm
# configure the yum repository
# create the Kubernetes repo file (this build uses the Aliyun mirror)
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=0
EOF
# list the available versions
yum list kubeadm --showduplicates | sort
# install
yum install kubelet-1.30.4-150500.1.1 kubeadm-1.30.4-150500.1.1 kubectl-1.30.4-150500.1.1 -y
systemctl enable kubelet
# this step may fail with:
Error: Package: kubelet-1.30.4-150500.1.1.x86_64 (kubernetes)
Requires: conntrack
You could try using --skip-broken to work around the problem
# installing the conntrack dependency fixes it
yum install conntrack-tools.x86_64
Installing the other dependencies
# install the libseccomp package, otherwise networking errors follow later
yum install libseccomp-2.5.1-1.el8.x86_64.rpm
# install the CNI plugins (create the target directory first)
mkdir -p /opt/cni/bin
tar -C /opt/cni/bin -xzf cni-plugins-linux-amd64-v1.5.1.tgz
####### download paths for these components to be added later #######
Load-balancer setup - to be added later
Building the cluster
Cluster initialization
Run the initialization on master1 first; the remaining nodes join afterwards.
# generate the default kubeadm config
kubeadm config print init-defaults > kubeadm-config.yaml
# adjusted configuration:
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.144.11.1
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: SCSP00745
  taints: null
---
apiServer:
  certSANs:
  - 10.144.11.1
  - 10.144.11.2
  - 10.144.11.3
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
controlPlaneEndpoint: "10.144.11.248:9443"
dns: {}
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: artifact.com/tfai/k8s  # point this at your own image registry
kind: ClusterConfiguration
kubernetesVersion: 1.30.4
networking:
  dnsDomain: sail-cloud.cluster.local
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
# pre-pull the images (with the custom imageRepository above, run these with
# --config kubeadm-config.yaml instead so the images come from your own registry)
kubeadm config images list --kubernetes-version=v1.30.4
kubeadm config images pull --kubernetes-version=v1.30.4
# run the cluster initialization
kubeadm init --config kubeadm-config.yaml --upload-certs
# save the join commands it prints
Adding a master:
kubeadm join 10.144.11.248:9443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:5ca13dfb065115d496f38a2f875e11cce5d269fc780bcea15e9fc59915b43d60 \
--control-plane --certificate-key 4dae5da59f83edf9b0fb7a815ac3315d7504d6c0dd78ba3bdfb31de744a1663b
Adding a worker:
kubeadm join 10.144.11.248:9443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:5ca13dfb065115d496f38a2f875e11cce5d269fc780bcea15e9fc59915b43d60
If the join command expired or was not saved, generate a new one with:
kubeadm token create --print-join-command
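Note that `kubeadm token create --print-join-command` prints a worker join line; joining a new control plane additionally needs a fresh certificate key from `kubeadm init phase upload-certs --upload-certs`. If only the CA hash is missing, it can also be recomputed by hand: the value is the SHA-256 digest of the DER-encoded public key of the cluster CA. A sketch (the `ca_cert_hash` helper name is mine):

```shell
# Recompute the --discovery-token-ca-cert-hash value from a CA certificate:
# extract the public key, DER-encode it, and take its SHA-256 digest.
ca_cert_hash() {
  openssl x509 -pubkey -in "$1" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //'
}

# On a control-plane node:
#   ca_cert_hash /etc/kubernetes/pki/ca.crt   # pass the result as sha256:<hash>
```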
Installing the cluster network add-on
# network add-on: Calico, deployed via the Tigera operator
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.2/manifests/tigera-operator.yaml
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.2/manifests/custom-resources.yaml
# kubectl create -f tigera-operator.yaml
# kubectl create -f custom-resources.yaml
# pitfall: the Calico images could not be pulled, so the image registry had to be changed
# edit the Installation resource
kubectl edit installation default -n calico-system
# add the following field at the same level as calicoNetwork:
registry: artifact.com/k8s/
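For reference, the edited Installation ends up looking roughly like this (a sketch; the calicoNetwork contents come from the default custom-resources.yaml and the ipPool here is adjusted to the podSubnet used in kubeadm-config.yaml):

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # pull all Calico images from the internal registry instead of docker.io
  registry: artifact.com/k8s/
  calicoNetwork:
    ipPools:
    - cidr: 10.244.0.0/16   # matches podSubnet in kubeadm-config.yaml
      encapsulation: VXLANCrossSubnet
```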
Check the cluster pod status:
Adding the remaining master nodes
Run on the two other master nodes:
# control-plane join command:
kubeadm join 10.144.11.248:9443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:5ca13dfb065115d496f38a2f875e11cce5d269fc780bcea15e9fc59915b43d60 \
--control-plane --certificate-key 4dae5da59f83edf9b0fb7a815ac3315d7504d6c0dd78ba3bdfb31de744a1663b
Adding the worker nodes
Run on the two worker nodes:
# worker join command:
kubeadm join 10.144.11.248:9443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:5ca13dfb065115d496f38a2f875e11cce5d269fc780bcea15e9fc59915b43d60
Check node and pod status:
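For example (once Calico is up, every node should report Ready, and all pods in kube-system and calico-system should be Running):

```shell
kubectl get nodes -o wide
kubectl get pods -A -o wide
```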
A follow-up will cover installing the GPU runtime and device plugin, and scheduling model workloads onto the cluster's GPU node.