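Background
Mellanox NEO-Host is a powerful tool for host network orchestration and management, and it is commonly used to track down performance problems. This article records how neohost 1.5.0 was installed and brought up. Note that different NICs and different firmware versions require different neohost versions.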
Quick recap
# Download neohost
# (from the attachment to this article)
# Install python2
yum install -y python2
alternatives --install /usr/bin/python python /usr/bin/python2 50 # register the alternative (priority 50)
alternatives --config python # switch versions (interactive; choose python2)
# Configure a domestic pip mirror
mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
# Install the jsonschema module; do this first, otherwise parts of the neohost installation fail
pip2 install jsonschema==2.6.0
# Install the neohost sdk
rpm -ivh neohost-*
# If anything fails, check the neohost log
tail /var/log/neohost.log
# Look up the uid (domain + BDF)
ethtool -i enp1s0f0
# Run a test; note that the uid must include the domain
python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=0000:01:00.1 --DEBUG
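The note about the domain can be captured in a small helper. This is only a sketch and not part of neohost: `with_pci_domain` is a hypothetical name, and it assumes the device sits in PCI domain 0000, which should be confirmed against the `ethtool -i` output on the actual machine.

```shell
# Prefix a PCI address with domain 0000 when the domain part is missing.
# Assumption: the device is in PCI domain 0000 (typical on single-segment
# servers); confirm with the bus-info line of `ethtool -i`.
with_pci_domain() {
    case "$1" in
        *:*:*) printf '%s\n' "$1" ;;        # already domain:bus:device.function
        *:*)   printf '0000:%s\n' "$1" ;;   # bus:device.function only
        *)     printf 'not a PCI address: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

For example, `with_pci_domain 01:00.1` prints `0000:01:00.1`, and an address that already carries a domain is passed through unchanged.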
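Summary of conclusions
- neohost 1.5.0 is built on python2 and needs a python2 environment
- BF2, CX6DX, BF3 and similar cards cannot use neohost 1.5.0; in addition, these cards need a separate token applied to the device before the tool can be used
- This article mainly records problems encountered while installing and using neohost 1.5.0
- The work went through 2 stages: the first attempt on a BF2 environment failed, and the final attempt on CX5 succeeded
- Although neohost is gradually being phased out, it is still very helpful for troubleshooting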
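The compatibility findings above could be turned into a quick pre-check. The sketch below is illustrative only: `neohost150_status` is a hypothetical name, it assumes the `lspci` description contains the product name (for example "ConnectX-5"), and it encodes nothing beyond what this article observed.

```shell
# Classify a device description according to this article's findings only.
# Assumption: the description (e.g. a line of `lspci | grep -i mellanox`)
# contains the product name.
neohost150_status() {
    case "$1" in
        *BlueField-2*|*BlueField-3*|*"ConnectX-6 Dx"*) echo unsupported ;;
        *ConnectX-5*)                                  echo supported ;;
        *)                                             echo unknown ;;
    esac
}
```

Anything other than the cards named in this article is reported as `unknown`, because the article gives no evidence either way.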
Installation problems on BF2
Error: SyntaxError: Missing parentheses in call to 'print'. Did you mean print(message)?
Cause: the script was run with python3 by default; it needs python2.
Fix: yum install -y python2
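Before running the sdk scripts, it may help to check which major version the `python` command resolves to. The helper below is a minimal sketch; `python_major` is a hypothetical name, and it only parses a version banner passed in as a string.

```shell
# Print the major version number from a "Python X.Y.Z" banner.
# Prints nothing if the string is not such a banner.
python_major() {
    # $1: version banner, e.g. the output of: python --version 2>&1
    printf '%s\n' "$1" | sed -n 's/^Python \([0-9][0-9]*\)\..*/\1/p'
}

# Usage (not executed here). Python 2 writes its banner to stderr, hence 2>&1:
#   [ "$(python_major "$(python --version 2>&1)")" = 2 ] || echo "python is not python2"
```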
Error: -E- Missing option: --dev-uid.
Here --dev-uid is the PCIe BDF of the NIC, which can be looked up with: ethtool -i enp8s0f0
Fix: specify --dev-uid=0000:08:00.0
Command: python2 get_device_performance_counters.py --mode=shell --get-analysis --run-loop --dev-uid=0000:08:00.0
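The BDF lookup can be scripted by reading the `bus-info` line that `ethtool -i` prints. A minimal sketch, with `bdf_from_ethtool` as a hypothetical helper name:

```shell
# Read `ethtool -i <iface>` output on stdin and print the bus-info value,
# which is the PCIe address in domain:bus:device.function form.
bdf_from_ethtool() {
    awk '$1 == "bus-info:" { print $2; exit }'
}

# Usage (not executed here):
#   ethtool -i enp8s0f0 | bdf_from_ethtool
```

The result can then be passed to `--dev-uid`.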
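Error: -E- [Errno 2] No such file or directory
This problem was never resolved on BF2. At first I assumed that mft and mlnx_tools were missing, but installing them made no difference. For installing mft, see the companion article "How to install the Mellanox firmware management tool MFT, and the 66 commands in its RPM package". For installing mlnx_tools, see the companion article "How to manually build and install the mlnx_tools RPM package for Mellanox NICs, and which commands mlnx_tools contains".
Next I suspected that the neohost backend had not been installed correctly: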
rpm -ivh neohost-backend-1.5.0-102.x86_64.rpm
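The same error appeared after installing it, so I suspected that some backend component that creates the missing file had not been started.
I then went to the neohost installation directory and tried to start it by hand: cd /opt/neohost/backend/; sh neohost.sh
This failed with: -E- could not import : No module named jsonschema
Error: -E- could not import : No module named jsonschema
A Python module is missing: jsonschema
Installing the module failed with: Could not find a version that satisfies the requirement jsonschema
Fix: try a domestic mirror for this one install
Install from the Aliyun mirror: pip install -i https://mirrors.aliyun.com/pypi/simple/ jsonschema
This produced another error: IOError: [Errno 2] No such file or directory: '/tmp/pip-build-ImwDpq/jsonschema/setup.py'
Error: IOError: [Errno 2] No such file or directory: '/tmp/pip-build-ImwDpq/jsonschema/setup.py'
I suspected the one-off mirror option was not taking effect, so I tried configuring the mirror persistently: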
mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
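The same file can be written without opening an editor, which is convenient when the steps have to be repeated on a second server. A minimal sketch, assuming overwriting any existing pip.conf is acceptable; `write_pip_conf` is a hypothetical helper name.

```shell
# Write the pip mirror settings shown above into <dir>/pip.conf.
# Note: this overwrites an existing pip.conf in that directory.
write_pip_conf() {
    # $1: directory that holds pip.conf; defaults to ~/.pip
    conf_dir="${1:-$HOME/.pip}"
    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/pip.conf" <<'EOF'
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
EOF
}
```

Calling `write_pip_conf` with no argument targets `~/.pip`, matching the manual steps.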
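Try the install again: pip install jsonschema
It still failed with the same "not found" error, so I tried downloading the package by hand and unpacking it: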
wget https://mirrors.aliyun.com/pypi/packages/9c/99/9789c7fd0bb8876a7d624d903195ce11e5618b421bdb1bf7c975d17a9bc3/jsonschema-4.0.0.tar.gz
tar -xvf jsonschema-4.0.0.tar.gz
cd jsonschema-4.0.0/
python setup.py egg_info
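Still no luck. The mirror was no longer the problem, since the file exists there and can be downloaded, so the cause had to be something else. Because this environment is python2, I suspected a version incompatibility and tried installing older releases.
- Tried pip install jsonschema==3.2.0:
- Tried pip install jsonschema==2.6.0
This installed successfully. Running neohost again produced a different error: -E- Failed to get device frequency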
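After pinning the version, a quick import check confirms that the module is visible to the interpreter neohost will use. This is a sketch only; `has_py_module` is a hypothetical helper name.

```shell
# Return success if the given interpreter can import the given module.
has_py_module() {
    # $1: interpreter (e.g. python2), $2: module name (e.g. jsonschema)
    "$1" -c "import $2" >/dev/null 2>&1
}

# Usage (not executed here):
#   has_py_module python2 jsonschema || pip2 install jsonschema==2.6.0
```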
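Error: -E- Failed to get device frequency
I wanted to find out which script raises this error, but could not find the message in the sdk or in the backend.
I suspected that BF2 is simply not supported, so I changed tack and tried on CX5.
Installation on a CX5 server
The basic steps are the same as on BF2: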
# Install python2
yum install -y python2
# Configure a domestic pip mirror
mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
# Install the jsonschema module; do this first, otherwise parts of the neohost installation fail
pip2 install jsonschema==2.6.0
# Install the neohost sdk
rpm -ivh neohost-*
# If anything fails, check the neohost log
tail /var/log/neohost.log
# Look up the uid (BDF)
ethtool -i enp1s0f0
# Run a test
python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=01:00.1 --DEBUG
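Since the advice is to look at the neohost log whenever a step fails, the two actions can be combined in a wrapper. A minimal sketch; `run_or_show_log` is a hypothetical helper name, and the log path is the one given in this article.

```shell
# Run a command; if it fails, print the tail of a log file to stderr and
# return the command's original exit status.
run_or_show_log() {
    # $1: log file to show on failure; remaining arguments: the command
    log_file="$1"
    shift
    "$@" && return 0
    status=$?
    echo "command failed with status $status; last lines of $log_file:" >&2
    [ -r "$log_file" ] && tail -n 20 "$log_file" >&2
    return "$status"
}

# Usage (not executed here):
#   run_or_show_log /var/log/neohost.log \
#       python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=01:00.1
```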
Hands-on details:
- Install python2
- Install jsonschema
- Install neohost
- Look up the uid
- Run neohost
Full neohost output:
Screenshot version:
Text version:
=============================================================================================================================================================
|| Counter Name || Counter Value ||| Performance Analysis || Analysis Value [Units] ||
=============================================================================================================================================================
|| Level 0 MTT Cache Hit || 0 ||| Bandwidth ||
|| Level 0 MTT Cache Miss || 0 ||---------------------------------------------------------------------------
|| Level 1 MTT Cache Hit || 0 ||| RX BandWidth || 0 [Gb/s] ||
|| Level 1 MTT Cache Miss || 0 ||| TX BandWidth || 0 [Gb/s] ||
|| Level 0 MPT Cache Hit || 0 ||===========================================================================
|| Level 0 MPT Cache Miss || 0 ||| Memory ||
|| Level 1 MPT Cache Hit || 0 ||---------------------------------------------------------------------------
|| Level 1 MPT Cache Miss || 0 ||| RX Indirect Memory Keys Rate || 0 [Keys/Packet] ||
|| Indirect Memory Key Access || 0 ||===========================================================================
|| PCIe Internal Back Pressure || 0 ||| PCIe Bandwidth ||
|| Outbound Stalled Reads || 0 ||---------------------------------------------------------------------------
|| Outbound Stalled Writes || 0 ||| PCIe Inbound Available BW || 62.6928 [Gb/s] ||
|| ICM Cache Miss || 20,119 ||| PCIe Inbound BW Utilization || 0.0335 [%] ||
|| PCIe Read Stalled due to No Read Engines || 0 ||| PCIe Inbound Used BW || 0.021 [Gb/s] ||
|| PCIe Read Stalled due to No Completion Buffer || 0 ||| PCIe Outbound Available BW || 62.6928 [Gb/s] ||
|| PCIe Read Stalled due to Ordering || 0 ||| PCIe Outbound BW Utilization || 0.0164 [%] ||
|| Back Pressure from Packet Scatter to RX Packet Buffer || 0 ||| PCIe Outbound Used BW || 0.0103 [Gb/s] ||
|| Back Pressure from Packet Processing to RX Packet Buffer || 0 ||===========================================================================
|| RX Packet Buffer Full Port 1 || 0 ||| PCIe Latency ||
|| RX Packet Buffer Full Port 2 || 0 ||---------------------------------------------------------------------------
|| Chip Frequency || 332.0305 ||| PCIe Avg Latency || 5,909 [NS] ||
|| RX Steering Pipe 0 || 0 ||| PCIe Max Latency || 7,559 [NS] ||
|| RX Steering Pipe 1 || 0 ||| PCIe Min Latency || 243 [NS] ||
|| RX Steering Cache Hit Pipe 0 || 0 ||===========================================================================
|| RX Steering Cache Miss Pipe 0 || 0 ||| PCIe Unit Internal Latency ||
|| RX Steering Cache Hit Pipe 1 || 0 ||---------------------------------------------------------------------------
|| RX Steering Cache Miss Pipe 1 || 0 ||| PCIe Internal Avg Latency || 6 [NS] ||
|| RX Steering Cache Access Pipe 0 || 0 ||| PCIe Internal Max Latency || 6 [NS] ||
|| RX Steering Cache Access Pipe 1 || 0 ||| PCIe Internal Min Latency || 6 [NS] ||
|| RX Steering Learning Cache Lookups || 0 ||===========================================================================
|| RX Steering Learning Cache Hit || 0 ||| Packet Rate ||
|| RX Steering Learning Cache Miss || 0 ||---------------------------------------------------------------------------
|| RX Steering Learning Cache Learn || 0 ||| RX Packet Rate || 0 [Packets/Seconds] ||
|| Back Pressure from Internal MMU to RX Descriptor Handling || 0 ||| TX Packet Rate || 0 [Packets/Seconds] ||
|| Receive WQE Cache Hit || 0 ||===========================================================================
|| Receive WQE Cache Miss || 0 ||| eSwitch ||
|| Back Pressure from PCIe to Packet Scatter || 0 ||---------------------------------------------------------------------------
|| RX Steering Packets || 0 ||| RX Hops Per Packet || 0 [Hops/Packet] ||
|| RX Steering Packets Fast Path || 0 ||| RX Optimal Hops Per Packet Per Pipe || 0 [Hops/Packet] ||
|| EQ All State Machines Busy || 0 ||| RX Optimal Packet Rate Bottleneck || 0 [MPPS] ||
|| CQ All State Machines Busy || 0 ||| RX Packet Rate Bottleneck || 0 [MPPS] ||
|| MSI-X All State Machines Busy || 0 ||| TX Hops Per Packet || 0 [Hops/Packet] ||
|| CQE Compression Sessions || 0 ||| TX Optimal Hops Per Packet Per Pipe || 0 [Hops/Packet] ||
|| Compressed CQEs || 0 ||| TX Optimal Packet Rate Bottleneck || 0 [MPPS] ||
|| Compression Session Closed due to EQE || 0 ||| TX Packet Rate Bottleneck || 0 [MPPS] ||
|| Compression Session Closed due to Timeout || 0 ||===========================================================================
|| Compression Session Closed due to Mismatch || 0 ||
|| Compression Session Closed due to PCIe Idle || 0 ||
|| Compression Session Closed due to S2CQE || 0 ||
|| Compressed CQE Strides || 0 ||
|| Compression Session Closed due to LRO || 0 ||
|| TX Descriptor Handling Stopped due to Limited State || 0 ||
|| TX Descriptor Handling Stopped due to Limited VL || 0 ||
|| TX Descriptor Handling Stopped due to De-schedule || 0 ||
|| TX Descriptor Handling Stopped due to Work Done || 0 ||
|| TX Descriptor Handling Stopped due to E2E Credits || 0 ||
|| Line Transmitted Port 1 || 0 ||
|| Line Transmitted Port 2 || 0 ||
|| Line Transmitted Loop Back || 0 ||
|| TX Steering Pipe 0 || 0 ||
|| TX Steering Pipe 1 || 0 ||
|| TX Steering Hit Pipe 0 || 0 ||
|| TX Steering Miss Pipe 0 || 0 ||
|| TX Steering Hit Pipe 1 || 0 ||
|| TX Steering Miss Pipe 1 || 0 ||
|| TX Steering Cache Miss Pipe 0 || 0 ||
|| TX Steering Cache Miss Pipe 1 || 0 ||
|| TX Steering Cache Access Pipe 0 || 0 ||
|| TX Steering Cache Access Pipe 1 || 0 ||
|| TX Steering Learning Cache Lookups || 0 ||
|| TX Steering Learning Cache Hit || 0 ||
|| TX Steering Learning Cache Miss || 0 ||
|| TX Steering Learning Cache Learn || 0 ||
==================================================================================
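When the output is saved as text, a single counter can be pulled out of the left-hand column for scripting or comparison between runs. The sketch below assumes the `||`-delimited layout shown above; `counter_value` is a hypothetical helper name and not part of neohost.

```shell
# Read the table text on stdin and print the value of one counter from the
# left-hand column, with thousands separators removed.
counter_value() {
    # $1: exact counter name, e.g. "ICM Cache Miss"
    awk -F'[|][|]+' -v name="$1" '
        {
            key = $2; val = $3
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == name) { gsub(/,/, "", val); print val; exit }
        }'
}

# Usage (not executed here), assuming the output was saved to neohost.txt:
#   counter_value "ICM Cache Miss" < neohost.txt
```

The field separator treats any run of two or more `|` characters as one delimiter, so the `|||` that separates the counter columns from the analysis columns does not shift the fields.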
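Summary
- With an unfamiliar tool like this, it is best to have 2 servers, or two kinds of cards, to compare against each other; this troubleshooting approach applies in many situations and often works surprisingly well
- Keep analyzing and experimenting, with a clear goal in mind
- Check the error logs and similar information whenever possible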