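Background
DOCA stands for Data Center Infrastructure-on-a-Chip Architecture. It is NVIDIA's software framework for the DPU: DOCA is to the DPU what CUDA is to the GPU.
In October 2024 NVIDIA released DOCA 2.9, an LTS release. This post is a quick record of upgrading the OS on the DPU to the DOCA 2.9 image. It does not cover upgrading the NIC firmware or other firmware images; only the OS (including its bundled libraries) is updated, mainly to pick up the new DOCA libraries.
Hands-on
Download the DOCA 2.9 BFB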
https://developer.nvidia.com/doca-downloads

Identify the DPU device (if there are multiple BlueField devices)
My goal is to upgrade the device in slot 5, so I look up its BDF by slot number.
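
The original screenshot for this step is not available. As a rough sketch of one way to do the lookup, the helper below scans `lspci -D -v` style output for a `Physical Slot:` line and prints the BDF of the device it belongs to. It assumes the platform actually reports physical slot numbers through `lspci` (not all do), and the function name `bdf_for_slot` is made up for illustration; it is not necessarily the method used in this post.

```shell
# Hypothetical helper: print the BDF of every PCI function that reports
# the given physical slot. Reads `lspci -D -v` style text on stdin.
bdf_for_slot() (
  slot="$1"
  awk -v slot="$slot" '
    # Device header lines start at column 0, e.g. "0000:05:00.0 Ethernet controller: ..."
    /^[0-9a-fA-F]/ { bdf = $1 }
    # Indented detail line of the current device, e.g. "Physical Slot: 5"
    $1 == "Physical" && $2 == "Slot:" && $3 == slot { print bdf }
  '
)

# Usage on a real host (slot 5 as in this post):
#   lspci -D -v | bdf_for_slot 5
```

A dual-port card normally shows up as several PCI functions in the same slot, so more than one BDF may be printed.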

Use the BDF to find the matching rshim device: here it turns out to be rshim1.
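
The misc output shown later in this post contains a `DEV_NAME pcie-0000:05:00.1` line, so the mapping can be done by reading each rshim device's misc file. The sketch below is a minimal illustration under two assumptions: the rshim devices appear as `/dev/rshim*/misc`, and matching on PCI domain:bus:device (ignoring the function number) is enough to identify the card. The helper name `rshim_for_bdf` and its optional second argument are hypothetical.

```shell
# Hypothetical helper: print the rshim device name (e.g. "rshim1") whose
# misc file reports a DEV_NAME on the same PCI domain:bus:device as the BDF.
# The optional second argument (default /dev) only exists to ease testing.
rshim_for_bdf() (
  bdf="$1"
  root="${2:-/dev}"
  # Drop the function number: 0000:05:00.0 -> pcie-0000:05:00.
  prefix="pcie-${bdf%.*}."
  for misc in "$root"/rshim*/misc; do
    [ -r "$misc" ] || continue
    if awk -v p="$prefix" '
         $1 == "DEV_NAME" && index($2, p) == 1 { found = 1 }
         END { exit !found }
       ' "$misc"; then
      basename "$(dirname "$misc")"
    fi
  done
)

# Usage on a real host:
#   rshim_for_bdf 0000:05:00.0
```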

Enable the system log and open the serial console
- Set the log level
echo "DISPLAY_LEVEL 2" > /dev/rshim1/misc

- Open the serial console
screen /dev/rshim1/console
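Set the BFB installation parameters:
Here the NIC firmware is not upgraded, and neither are the BMC, the CEC, the DPU golden image, or the NIC firmware golden image. Only ATF and UEFI are upgraded (this may change some BIOS settings; also, upgrading the BMC together is generally recommended).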
[root@localhost ~]# cat bf.cfg-bak
WITH_NIC_FW_UPDATE="no"
UPDATE_ATF_UEFI="yes"
UPDATE_BMC_FW="no"
UPDATE_CEC_FW="no"
UPDATE_DPU_GOLDEN_IMAGE="no"
UPDATE_NIC_FW_GOLDEN_IMAGE="no"
bfb_modify_os()
{
log ===================== bfb_modify_os =====================
log "Disable OVS bridges creation upon boot"
}
bfb_pre_install()
{
log ===================== bfb_pre_install =====================
}
bfb_post_install()
{
log ===================== bfb_post_install =====================
}
Run bfb-install to perform the upgrade
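
The screenshot for this step is missing; the command line itself appears in the full log section at the end of this post. The sketch below wraps that same invocation with two pre-flight checks. The wrapper name `push_bfb` and the `DRY_RUN` variable are assumptions added for illustration, not part of bfb-install.

```shell
# Hypothetical wrapper around the bfb-install invocation used in this post.
# Set DRY_RUN=1 to print the command instead of running it.
push_bfb() (
  rshim="$1"; bfb="$2"; cfg="$3"
  # Fail early if either input file is missing.
  [ -f "$bfb" ] || { echo "BFB image not found: $bfb" >&2; exit 1; }
  [ -f "$cfg" ] || { echo "bf.cfg not found: $cfg" >&2; exit 1; }
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bfb-install --rshim $rshim --bfb $bfb -c $cfg"
  else
    bfb-install --rshim "$rshim" --bfb "$bfb" -c "$cfg"
  fi
)

# Usage on a real host (run as root):
#   push_bfb rshim1 bf-bundle-2.9.0-90_24.10_ubuntu-22.04_prod.bfb bf.cfg-bak
```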

Watch the serial console log at the same time:

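Wait for the upgrade to finish
Set the username and password after a successful upgrade
Default username: ubuntu
Default password: ubuntu
On first login you must enter the password and then change it immediately. The new password cannot be the same as ubuntu.
A workable approach is to set a temporary password first, switch to root, and then set the passwords of both root and ubuntu back to ubuntu.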

Switch back to the ubuntu password
Set the passwords of both root and ubuntu to ubuntu again, then test the login to make sure it works.

How to confirm success
The misc log shows INFO[MISC]: : DPU is ready
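
This check can be scripted by polling the misc file for that message. The following is a minimal sketch, assuming the log level has already been set with `echo "DISPLAY_LEVEL 2" > /dev/rshim1/misc` so that log lines show up in misc. The helper name `wait_dpu_ready` and the default timeout and interval are arbitrary choices, not values taken from NVIDIA tooling.

```shell
# Hypothetical helper: poll a misc file until "DPU is ready" shows up.
# Args: misc path, timeout in seconds (default 1800), poll interval (default 10).
wait_dpu_ready() (
  misc="$1"; timeout="${2:-1800}"; interval="${3:-10}"
  waited=0
  while :; do
    # Success as soon as the ready message is present.
    grep -q "DPU is ready" "$misc" 2>/dev/null && exit 0
    # Give up once the time budget is used.
    [ "$waited" -ge "$timeout" ] && break
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s waiting for 'DPU is ready'" >&2
  exit 1
)

# Usage on a real host:
#   wait_dpu_ready /dev/rshim1/misc
```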

Miscellaneous
Error: cat: 写入错误: 连接超时 (cat: write error: Connection timed out)
[root@localhost ~]# sudo bfb-install --rshim rshim1 --bfb bf-bundle-2.9.0-90_24.10_ubuntu-22.04_prod.bfb -c bf.cfg
Warn: 'pv' command not found. Continue without showing BFB progress.
Pushing bfb + cfg
cat: 写入错误: 连接超时
With the log level raised, a panic is visible at the same time
[root@localhost rshim1]# echo "DISPLAY_LEVEL 2" > /dev/rshim1/misc
[root@localhost rshim1]# cat misc
DISPLAY_LEVEL 2 (0:basic, 1:advanced, 2:log)
BOOT_MODE 1 (0:rshim, 1:emmc, 2:emmc-boot-swap)
BOOT_TIMEOUT 150 (seconds)
DROP_MODE 0 (0:normal, 1:drop)
SW_RESET 0 (1: reset)
DEV_NAME pcie-0000:05:00.1
DEV_INFO BlueField-2(Rev 1)
OPN_STR MBF2M345A-VENOT_
---------------------------------------
Log Messages
---------------------------------------
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: DDR POST passed
PANIC(BL2): PC = 0x4018bc
elr_el1 0x401000
esr_el1 0x0
far_el1 0x0
This happened because the ATF/UEFI update was not enabled. It is generally best to keep the default configuration.
In bf.cfg, change UPDATE_ATF_UEFI="no" to UPDATE_ATF_UEFI="yes".
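
The edit can be made by hand; as a small sketch, the helper below rewrites the setting in place and appends it if the file has no such line. The function name `enable_atf_uefi` is hypothetical, and the append branch is an assumption about what is wanted when the setting is absent.

```shell
# Hypothetical helper: force UPDATE_ATF_UEFI="yes" in a bf.cfg file,
# appending the setting if the file does not contain it at all.
enable_atf_uefi() (
  cfg="$1"
  if grep -q '^UPDATE_ATF_UEFI=' "$cfg"; then
    # Rewrite through a temporary file so no sed -i extension is needed.
    tmp="$cfg.tmp.$$"
    sed 's/^UPDATE_ATF_UEFI=.*/UPDATE_ATF_UEFI="yes"/' "$cfg" > "$tmp" && mv "$tmp" "$cfg"
  else
    echo 'UPDATE_ATF_UEFI="yes"' >> "$cfg"
  fi
)

# Usage:
#   enable_atf_uefi bf.cfg
```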
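Error: mlx5_cmd_out_err:835:(pid 1007): CREATE_FLOW_GROUP(0x933) op_mod(0x0) failed
The most likely cause is a mismatch between the firmware and the driver, or a feature the DPU hardware does not support. For example, the CREATE_FLOW_GROUP failure here may mean the BlueField-2 hardware does not support this flow feature (this issue was not analyzed further).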
[ 15.578751] mlx5_core 0000:03:00.0: mlx5_cmd_out_err:835:(pid 1007): CREATE_FLOW_GROUP(0x933) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x201c1c), err(-22)
[ 15.594023] mlx5_core 0000:03:00.0: mlx5_rdma_enable_roce_steering:71:(pid 1007): Failed to create RDMA RX flow group err(-22)
Full misc log during the upgrade
[root@localhost ~]# sudo bfb-install --rshim rshim1 --bfb bf-bundle-2.9.0-90_24.10_ubuntu-22.04_prod.bfb -c bf.cfg-bak
Warn: 'pv' command not found. Continue without showing BFB progress.
Pushing bfb + cfg
Collecting BlueField booting status. Press Ctrl+C to stop…
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Non-Secured
INFO[BL31]: runtime
INFO[UEFI]: UPVS valid
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (disabled)
INFO[UEFI]: Redfish enabled
WARN[UEFI]: UPVS reclaim start
WARN[UEFI]: UPVS reclaim done
INFO[UEFI]: exit Boot Service
INFO[MISC]: Found bf.cfg
INFO[MISC]: Erasing eMMC drive: /dev/mmcblk0
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Running bfb_pre_install from bf.cfg
INFO[MISC]: ===================== bfb_pre_install =====================
INFO[MISC]: Installing OS image
INFO[MISC]: Running bfb_modify_os from bf.cfg
INFO[MISC]: ===================== bfb_modify_os =====================
INFO[MISC]: Disable OVS bridges creation upon boot
INFO[MISC]: Ubuntu installation completed
INFO[BL2]: start
INFO[BL2]: boot mode (emmc)
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Non-Secured
INFO[BL31]: runtime
INFO[UEFI]: UPVS valid
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 6
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: UEFI Secure Boot (disabled)
INFO[UEFI]: Redfish enabled
INFO[UEFI
