OpenStack: notes on passing an NVIDIA T4 GPU accelerator through to instances on a Hygon C86 compute node


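Preface

Two Xinchuang (domestic-platform) machines arrived, and I tried adding them to the compute cluster for colleagues to test on.
Both have a Hygon C86 7285 32-core Processor CPU.
Passthrough failed on the first machine;
it succeeded on the second.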


I. Check the GPU accelerator's status on the physical host

Run lspci -v to check.

First machine

63:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Physical Slot: 29
        Flags: fast devsel, IRQ 255, NUMA node 3
        Memory at <unassigned> (64-bit, prefetchable) [disabled]
        Memory at <unassigned> (64-bit, prefetchable) [disabled]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] #00 [0080]
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau

Both memory regions (BARs) are unassigned and disabled, so I gave up on this machine.

Second machine

71:00.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Flags: bus master, fast devsel, latency 0, IRQ 747, NUMA node 7
        Memory at d9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 16fd0000000 (64-bit, prefetchable) [size=256M]
        Memory at 17000000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] #00 [0080]
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau

Output from an x86 machine, for comparison

d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
	Subsystem: NVIDIA Corporation Device 12a2
	Flags: bus master, fast devsel, latency 0, IRQ 372, NUMA node 1
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 39ffc0000000 (64-bit, prefetchable) [size=256M]
	Memory at 39fff0000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] #00 [0080]
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
	Capabilities: [100] Virtual Channel
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [bb0] #15
	Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau
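
The difference between the first machine and the other two comes down to whether the card's memory regions were assigned addresses. A minimal sketch of that check follows. The helper name count_unassigned_bars is made up for illustration; it only counts "Memory at <unassigned>" lines and deliberately ignores the Expansion ROM line, which is unassigned even on the machine where passthrough worked.

```shell
# Count "Memory at <unassigned>" lines in `lspci -v` output read from stdin.
# Prints 0 when every memory region has an address.
count_unassigned_bars() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c 'Memory at <unassigned>' || true
}

# Example on a real host (10de is NVIDIA's PCI vendor ID):
#   lspci -v -d 10de: | count_unassigned_bars
```

A non-zero count matches the situation on the first machine above.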

II. Configure passthrough (prerequisite: IOMMU is enabled in the BIOS)

1. Modify the kernel boot parameters

Add the amd_iommu=on iommu=pt parameters to /etc/default/grub:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto amd_iommu=on iommu=pt  rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

Generate the new grub configuration (the new kernel parameters take effect at the next reboot):

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
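
After rebooting, it is worth confirming that the IOMMU is actually active before touching Nova. The sketch below lists which IOMMU group each PCI device landed in. The function name list_iommu_groups and its optional directory argument are my own additions (the argument exists only so the function can be pointed at a test directory). If /sys/kernel/iommu_groups is empty, the function prints nothing, which suggests the IOMMU is not enabled.

```shell
# Print "<group> <pci-address>" for every device under an IOMMU groups directory.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue    # glob did not match: no groups present
        # Path layout is <root>/<group>/devices/<address>.
        group=$(basename "$(dirname "$(dirname "$dev")")")
        printf '%s %s\n' "$group" "$(basename "$dev")"
    done
}

# On the host, after the reboot:
#   grep -o 'amd_iommu=[^ ]*' /proc/cmdline
#   list_iommu_groups | grep '71:00.0'
```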

2. Add the PCI settings to nova-compute; after editing, a restart is all it takes

/etc/kolla/nova-compute/nova.conf

[pci]
passthrough_whitelist = {"vendor_id":"10de","product_id":"1eb8"}
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}

Note: if you are unsure whether your card is type-PF or type-PCI, it is best to leave device_type out; passthrough fails if the value is wrong.
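
One rough way to guess the type before filling it in is to look at sysfs. This is a heuristic only. It assumes Nova's classification follows the SR-IOV capability the host reports (a device that can host virtual functions is type-PF, a virtual function is type-VF, anything else is type-PCI). The function name guess_device_type and its second argument are hypothetical helpers for illustration; the dev_type column of the nova.pci_devices table remains the authoritative answer.

```shell
# Guess the Nova device_type for a PCI address from sysfs (heuristic).
guess_device_type() {
    addr="$1"
    root="${2:-/sys/bus/pci/devices}"    # second argument is only for testing
    if [ -e "$root/$addr/physfn" ]; then
        echo "type-VF"                   # has a parent physical function
    elif [ -e "$root/$addr/sriov_totalvfs" ]; then
        echo "type-PF"                   # SR-IOV capable physical function
    else
        echo "type-PCI"                  # plain PCI device
    fi
}

# Example on the second machine:
#   guess_device_type 0000:71:00.0
```

The T4's lspci output above lists an SR-IOV capability, which is consistent with the type-PF value used in the alias.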

3. Add the PCI settings on the control node

/etc/kolla/nova-api/nova.conf

scheduler_default_filters= AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter


[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}

/etc/kolla/nova-conductor/nova.conf

scheduler_default_filters= AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter

[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8" ,"device_type":"type-PF"}

/etc/kolla/nova-scheduler/nova.conf

[filter_scheduler]
enabled_filters=AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,PciPassthroughFilter,IsolatedHostsFilter


[pci]
alias={"name":"Tesla T4", "vendor_id":"10de", "product_id":"1eb8","device_type":"type-PF"}

Restart the corresponding services

docker restart nova_api nova_conductor nova_scheduler

III. Confirm the accelerator has been recorded in the database

Querying the nova.pci_devices table shows an entry for this card.
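
Since the original screenshot is not available, here is a sketch of a query that should show the same information. The helper name pci_devices_query is made up. The usage line assumes kolla's default database container name (mariadb) and a root password exported as DB_PASSWORD; both are assumptions to adjust for your deployment.

```shell
# Print a SQL query listing the PCI devices Nova has recorded for one vendor.
# The default vendor ID 10de is NVIDIA's.
pci_devices_query() {
    vendor="${1:-10de}"
    printf "SELECT address, vendor_id, product_id, dev_type, status, instance_uuid FROM nova.pci_devices WHERE vendor_id = '%s' AND deleted = 0;\n" "$vendor"
}

# Usage on the control node (container name and password variable are assumptions):
#   pci_devices_query | docker exec -i mariadb mysql -uroot -p"$DB_PASSWORD"
```

A row with status "available" and the expected dev_type means the compute node reported the card successfully.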


IV. Create a dedicated flavor for the accelerator and set its metadata

Metadata setting (key, then value):
pci_passthrough:alias
Tesla T4:1
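
The same flavor can be created from the command line instead of the dashboard. In the sketch below, the flavor name gpu.t4 and the vCPU, RAM and disk sizes are placeholders I picked for illustration; only the pci_passthrough:alias property and its value come from the configuration above.

```shell
# Create a flavor and attach the PCI alias request to it.
# Name and sizes are hypothetical; change them to suit your cluster.
create_gpu_flavor() {
    name="${1:-gpu.t4}"
    openstack flavor create --vcpus 8 --ram 16384 --disk 100 "$name"
    # "Tesla T4:1" requests one device matching the alias named "Tesla T4".
    openstack flavor set --property "pci_passthrough:alias"="Tesla T4:1" "$name"
}

# Usage (with admin credentials loaded):
#   create_gpu_flavor gpu.t4
```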

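V. Create a GPU instance with the dedicated flavor and check the passthrough result

Create a test instance.

Check the passthrough result inside the instance.

Passthrough succeeded, and the instance was handed over to colleagues.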
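
For reference, a minimal sketch of the in-guest check, assuming pciutils is installed in the instance. The helper name has_t4 is made up; it looks for the same vendor and product IDs (10de:1eb8) used in the whitelist.

```shell
# Succeed if `lspci -nn` output on stdin contains a device with ID 10de:1eb8.
has_t4() {
    grep -q '\[10de:1eb8\]'
}

# Inside the instance:
#   lspci -nn | has_t4 && echo "T4 visible"
```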


Summary

The passthrough steps on a Xinchuang server are the same as on an x86 server; only the grub parameters need changing.
