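Background
bfb-build generates the OS image used by the DPU from the package repositories that NVIDIA provides. See the GitHub project for details: https://github.com/Mellanox/bfb-build/
Proceed with caution: there are countless pitfalls. That said, after filling in enough of them you come away with a different understanding of bfb-build and of how a DPU OS is built.
This article only records the errors encountered during the build and how each one was handled. It is fairly scattered, and is meant for my own future reference and as a help to others who run into the same problems. The key points are: 1. use an LTS branch; 2. for older cards, use older versions; 3. where possible, build Linux first.
The tests here were run on a BlueField-2. The build host runs Anolis Linux, and the build target is the Ubuntu 20.04 BFB. I first tried building Anolis 8.6 for the DPU directly from the main branch and hit all sorts of problems. In the end I gave up and switched to building from the DOCA 1.5 LTS branch.
Environment details: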
# OS:
Linux one.one.one.one 4.18.0-553.22.1.0.1.an8.x86_64 #1 SMP Wed Sep 25 18:10:22 CST 2024 x86_64 x86_64 x86_64 GNU/Linux
# BlueField
[PN] Part number: MBF2M355A-VESOT_SS
# BFB branch
remotes/origin/lts-1.5.3
# Final BFB build command
IMAGE_TYPE=dev ./bfb-build ubuntu 20.04
bfb-build exits immediately after starting
Symptom:
Adding -x for a closer look (command run: IMAGE_TYPE=dev INCLUDE_BF3BMC=no ./bfb-build ubuntu 20.04):
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Cause: the Docker service was not running.
Fix:
sudo systemctl restart docker
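Because the script exits without much explanation when the daemon is down, it can help to check for it before starting a long build. The following is only a minimal sketch: `ensure_docker` is a hypothetical helper name and is not part of bfb-build. It relies on `docker info` failing when the client cannot reach the daemon socket.

```shell
#!/bin/sh
# Sketch: confirm the Docker daemon answers before starting a build.
# ensure_docker is a hypothetical helper, not something bfb-build provides.
ensure_docker() {
    # `docker info` returns non-zero when the daemon socket is unreachable.
    if docker info >/dev/null 2>&1; then
        echo "docker daemon is reachable"
        return 0
    fi
    echo "docker daemon is not reachable; try: sudo systemctl restart docker" >&2
    return 1
}

# Usage (left commented out so sourcing this file has no side effects):
# ensure_docker && IMAGE_TYPE=dev ./bfb-build ubuntu 20.04
```

Chaining the build after the check with `&&` means the build only starts once the daemon responds.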
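docker.io is unreachable
Fix:
Add a registry mirror (image accelerator).
Taking Aliyun as an example:
https://cr.console.aliyun.com/cn-hangzhou/instances/mirrors
Click the image accelerator service and follow the instructions shown there. Taking CentOS as an example:
1. Install or upgrade the Docker client
Docker client version 1.10.0 or later is recommended; see the docker-ce documentation.
2. Configure the registry mirror
For users whose Docker client is newer than 1.10.0:
You can enable the accelerator by editing the daemon configuration file /etc/docker/daemon.json.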
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"registry-mirrors": ["https://69mp4xz9.mirror.aliyuncs.com"]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
After this change, the items fetched during the docker build download normally.
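To confirm the mirror entry actually landed in the configuration file, a quick text check along these lines may be enough. This is a sketch under assumptions: `mirror_configured` is a hypothetical helper, and it does a plain text match rather than parsing the JSON.

```shell
#!/bin/sh
# mirror_configured FILE URL
# Succeeds if FILE contains the "registry-mirrors" key and the quoted URL.
# This is a plain text check, not a JSON parser.
mirror_configured() {
    file=$1
    url=$2
    [ -r "$file" ] || return 1
    # Flatten to one line so the key and the URL may sit on different lines.
    tr -d '\n' < "$file" | grep -F '"registry-mirrors"' | grep -qF "\"$url\""
}

# Usage:
# mirror_configured /etc/docker/daemon.json https://69mp4xz9.mirror.aliyuncs.com \
#     && echo "mirror entry present"
```

After restarting the daemon, `docker info` should also list the mirror under its "Registry Mirrors" section.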
Error: ERROR: failed to solve: process
ERROR: failed to solve: process "/bin/sh -c echo 'excludepkgs=OpenIPMI' >> /etc/dnf/dnf.conf && dnf -y clean all && dnf install -y epel-release && dnf --exclude='mlxbf-bootimages*' -y update && dnf --exclude='mlxbf-bootimages*' -y install coreutils-single &&
dnf --exclude='mlxbf-bootimages*' -y install acpid audit chkconfig chrony containerd.io
cri-tools cryptsetup curl dhcp-client dosfstools dracut dracut-network dracut-tools e2fsprogs edac-utils efibootmgr findutils gawk glibc-langpack-en grub2
grubby i2c-tools iperf3 ipmitool iproute-tc iputils jq kexec-tools kmod
kubeadm kubelet kubernetes-cni libguestfs-tools libhugetlbfs-utils libvirt lm_sensors lm_sensors-sensord lsof ltrace lvm2 mmc-utils mokutil mstflint net-tools NetworkManager NetworkManager-config-server NetworkManager-ovs network-scripts nfs-utils
nvme-cli openssh-clients openssh-server openssl parted passwd pciutils perf
python3-pip qemu-kvm rasdaemon rsyslog sg3_utils shim sshpass sudo sysstat systemd-timesyncd system-lsb tar tcpdump unzip usbutils util-linux vim
virt-install watchdog wget which xfsprogs doca-runtime doca-devel strongswan-bf ${MLNX_FW_UPDATER} && dnf --exclude='mlxbf-bootimages*' -y reinstall bf-release && dnf -y clean all && rm -rf /var/cache/* && truncate -s0 /etc/machine-id && update-pciids" did not complete successfully: exit code: 1
Tracing back up the log:
959.9 [MIRROR] doca-sdk-telemetry-exporter-devel-2.8.0081-1.an8.aarch64.rpm: Curl error (18): Transferred a partial file for https://linux.mellanox.com/public/repo/doca/2.8.0/anolis8.6/arm64-dpu/doca-sdk-telemetry-exporter-devel-2.8.0081-1.an8.aarch64.rpm [transfer closed with 13164 bytes remaining to read]
959.9 [FAILED] doca-sdk-telemetry-exporter-devel-2.8.0081-1.an8.aarch64.rpm: No more mirrors to try - All mirrors were already tried without success
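The final "failed to solve" message only says the whole dnf step failed, so the packages that actually broke have to be found further up. Assuming the build output was saved to a file (for example by piping through `tee`; the file name `build.log` below is hypothetical), a small filter like this sketch can list them. `failed_packages` is an illustrative helper, not part of bfb-build.

```shell
#!/bin/sh
# failed_packages LOG
# Print the unique .rpm files that dnf marked [FAILED] in a saved build log.
failed_packages() {
    # Lines look like: "959.9 [FAILED] <file>.rpm: No more mirrors to try ..."
    sed -n 's/^.*\[FAILED\] \([^:]*\.rpm\):.*$/\1/p' "$1" | sort -u
}

# Usage (build.log is an assumed file name):
# IMAGE_TYPE=dev ./bfb-build ubuntu 20.04 2>&1 | tee build.log
# failed_packages build.log
```

The package names make it easy to see which repository the failing downloads come from.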
Kubernetes download fails
816.4 (387/586): hwdata-0.314-8.22.0.1.an8.noarch.rpm 1.5 MB/s | 1.8 MB 00:01
816.5 [MIRROR] kubelet-1.29.9-150500.1.1.aarch64.rpm: Curl error (18): Transferred a partial file for https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.29/rpm/aarch64/kubelet-1.29.9-150500.1.1.aarch64.rpm [transfer closed with 15234388 bytes remaining to read]
816.5 [FAILED] kubelet-1.29.9-150500.1.1.aarch64.rpm: No more mirrors to try - All mirrors were already tried without success
816.6
816.6 The downloaded packages were saved in cache until the next successful transaction.
816.6 You can remove cached packages by executing 'dnf clean packages'.
816.9 Error: Error downloading packages:
816.9 kubelet-1.29.9-150500.1.1.aarch64: Cannot download, all mirrors were already tried without success
------
Dockerfile:24
Fix: switch to the Aliyun mirror.
https://developer.aliyun.com/mirror/kubernetes?spm=a2c6h.13651102.0.0.73281b11UWPjI2
At the same time, switch the Docker repository to the Aliyun one as well:
https://developer.aliyun.com/mirror/docker-ce?spm=a2c6h.13651102.0.0.57e31b11smKgIl
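For the Docker repository, the switch can be scripted rather than edited by hand. The sketch below assumes the original repo file pointed at download.docker.com, which the source does not show. `use_aliyun_docker_repo` is a hypothetical helper name. The substitution is the same host-to-mirror rewrite the Aliyun docker-ce page describes.

```shell
#!/bin/sh
# use_aliyun_docker_repo FILE
# Rewrite download.docker.com URLs in FILE to the Aliyun docker-ce mirror.
# Assumes FILE originally points at download.docker.com.
use_aliyun_docker_repo() {
    file=$1
    # Write to a temporary file first, then copy back over the original.
    sed 's+download\.docker\.com+mirrors.aliyun.com/docker-ce+g' "$file" > "$file.tmp" &&
        cat "$file.tmp" > "$file" &&
        rm -f "$file.tmp"
}

# Usage, run from the bfb-build repos directory for the target OS:
# use_aliyun_docker_repo docker.repo
```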
Result:
After the change, the repo files read as follows:
[root@localhost repos]# pwd
/root/workspace/bfb-build/anolis/8.6/repos
[root@localhost repos]# for f in `ls`; do echo ;echo \#==$f==; cat $f; done
#==AnolisOS-Experimental.repo==
[Experimental]
name=AnolisOS-$releasever - BaseOS
baseurl=http://mirrors.openanolis.cn/anolis/8.6/Experimental/$basearch/os
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-ANOLIS
gpgcheck=1
#==doca.repo==
[doca]
name=Nvidia DOCA repository
baseurl=https://linux.mellanox.com/public/repo/doca/2.8.0/anolis8.6/arm64-dpu/
gpgcheck=0
enabled=1
#==docker.repo==
[docker-ce-stable]
name=Docker CE Stable - $basearch
baseurl=https://mirrors.aliyun.com/docker-ce/linux/centos/8/$basearch/stable
enabled=1
gpgcheck=1