[Troubleshooting log] Problems encountered installing neohost 1.5.0, the Mellanox NIC performance debugging tool (python2, get_device_performance_counters.py)

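Background

Mellanox Neo-Host is a powerful tool for host network orchestration and management, and it is commonly used to track down performance problems. This post records how neohost 1.5.0 was installed and brought up. Note that different NICs and different firmware versions require different neohost versions.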

Quick recap

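# Download neohost
Download it from the attachment to this post
# Install python2
yum install -y python2
alternatives --install /usr/bin/python python /usr/bin/python2 50 # register the alternative (priority 50)
alternatives --config python # switch version (interactive prompt); choose python2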

# Configure a pip mirror hosted in China
mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com

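# Install the jsonschema module; do this first, otherwise parts of the neohost install and run can fail
pip2 install jsonschema==2.6.0

# Install the neohost SDK
rpm -ivh neohost-*

# If a run fails, check the neohost log
/var/log/neohost.log

# Look up the device UID (PCI domain + BDF)
ethtool -i enp1s0f0

# Run the test; note that the UID must include the PCI domain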
python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=0000:01:00.1 --DEBUG

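Summary of conclusions

  1. neohost 1.5.0 is built on python2 and needs a python2 environment.
  2. BF2, CX6DX, BF3 and similar cards cannot use neohost 1.5.0. In addition, these cards need a separate token applied to the device before they can be used.
  3. This post mainly records the problems hit while installing and using neohost 1.5.0.
  4. The work went through two stages: the first attempt on a BF2 environment failed, and it finally succeeded on a CX5.
  5. Although neohost is gradually being phased out, it is still very helpful for troubleshooting.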

Installation problems on BF2

Error: SyntaxError: Missing parentheses in call to 'print'. Did you mean print(message)?

Cause: python3 was used by default; python2 is required.
Fix: yum install -y python2
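The SyntaxError above appears because the python2-style print statement is not valid python3 syntax. A minimal sketch of a pre-flight check is shown here; the helper names is_python2 and pick_interpreter are made up for illustration, and the candidate interpreter names are assumptions about what is installed on the host.

```python
import shutil
import sys


def is_python2(version_info):
    """Return True when the given version tuple belongs to a 2.x interpreter."""
    return version_info[0] == 2


def pick_interpreter(candidates=("python2", "python2.7")):
    """Return the path of the first candidate found on PATH, or None.

    Uses shutil.which, so this helper itself must run under python3.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    if is_python2(sys.version_info):
        print("already running under python2")
    else:
        # Hypothetical follow-up: use this path to launch the neohost scripts.
        print("python2 interpreter on PATH: %s" % pick_interpreter())
```

If pick_interpreter returns None, python2 is not installed yet and the yum step above is still needed.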

Error: -E- Missing option: --dev-uid.

Here --dev-uid is the PCIe BDF of the NIC, which can be read with: ethtool -i enp8s0f0
Fix: pass --dev-uid=0000:08:00.0. Command: python2 get_device_performance_counters.py --mode=shell --get-analysis --run-loop --dev-uid=0000:08:00.0
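Reading the BDF from ethtool by eye is error-prone, so the lookup can be scripted. The sketch below pulls the bus-info field out of `ethtool -i` text and pads a short BDF with a PCI domain. The helper names and the default domain "0000" are assumptions for illustration, not part of neohost.

```python
import re

# Optional 4-hex-digit domain, then bus:device.function
BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def bus_info_from_ethtool(output):
    """Return the value of the 'bus-info' line from `ethtool -i` output, or None."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "bus-info":
            return value.strip()
    return None


def with_domain(bdf, default_domain="0000"):
    """Return the BDF as domain:bus:device.function, adding a domain if missing."""
    match = BDF_RE.match(bdf.strip())
    if not match:
        raise ValueError("not a PCI BDF: %r" % bdf)
    domain, bus, device, function = match.groups()
    return "%s:%s:%s.%s" % (domain or default_domain, bus, device, function)


if __name__ == "__main__":
    # Illustrative text only; real output comes from `ethtool -i <interface>`.
    sample = "driver: mlx5_core\nbus-info: 0000:08:00.0\n"
    print("--dev-uid=%s" % with_domain(bus_info_from_ethtool(sample)))
```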

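Error: -E- [Errno 2] No such file or directory

This problem was never resolved. At first I assumed mft and mlnx_tools were missing, but installing them made no difference. For installing mft, see the companion post "How to install the Mellanox firmware management tool MFT, and the 66 commands in its RPM package". For installing mlnx_tools, see the companion post "How to manually build and install the mlnx_tools RPM package for Mellanox NICs, and which commands mlnx_tools contains".
I then suspected that the neohost backend had not been installed correctly: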

rpm -ivh neohost-backend-1.5.0-102.x86_64.rpm

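The error persisted after the install, so I suspected that some backend component responsible for creating the missing file had not been started.
I went to the neohost install directory and tried to start it manually: cd /opt/neohost/backend/; sh neohost.sh
That reported: -E- could not import : No module named jsonschema

Error: -E- could not import : No module named jsonschema

A Python module is missing: jsonschema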
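A missing module like this can be caught before starting the backend. The following is a small sketch of such a check; missing_modules is a hypothetical helper, and the module list passed to it is an assumption based only on the error seen here.

```python
def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run this with the same interpreter that neohost uses (python2 here),
    # because each interpreter has its own site-packages.
    absent = missing_modules(["jsonschema"])
    if absent:
        print("missing modules: %s" % ", ".join(absent))
    else:
        print("all required modules are importable")
```

The code avoids python3-only syntax so that it can run under python2 as well.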

Error when installing the Python module: Could not find a version that satisfies the requirement jsonschema

Fix: try a mirror hosted in China for this one command.
Install using the Aliyun mirror: pip install -i https://mirrors.aliyun.com/pypi/simple/ jsonschema
That led to another error: IOError: [Errno 2] No such file or directory: '/tmp/pip-build-ImwDpq/jsonschema/setup.py'

Error: IOError: [Errno 2] No such file or directory: '/tmp/pip-build-ImwDpq/jsonschema/setup.py'

I suspected the one-off mirror option was the problem, so I tried configuring the mirror persistently:

mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
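The same two settings can also be written without opening an editor, which helps when the setup has to be repeated on several hosts. This is a minimal sketch that assumes it runs under python3 (it uses the configparser module); render_pip_conf and write_pip_conf are illustrative names, not pip APIs.

```python
import configparser
import io
import os


def render_pip_conf(index_url, trusted_host):
    """Return pip.conf text with a [global] section holding the two settings."""
    parser = configparser.ConfigParser()
    parser["global"] = {"index-url": index_url, "trusted-host": trusted_host}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def write_pip_conf(path, index_url, trusted_host):
    """Create the parent directory if needed and write the config file."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as handle:
        handle.write(render_pip_conf(index_url, trusted_host))
```

A call such as write_pip_conf(os.path.expanduser("~/.pip/pip.conf"), "https://mirrors.aliyun.com/pypi/simple/", "mirrors.aliyun.com") would overwrite any existing file at that path, so back it up first if it already holds other settings.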

Tried the install again: pip install jsonschema
It still reported that the file could not be found, so I tried downloading the package manually and unpacking it:

wget https://mirrors.aliyun.com/pypi/packages/9c/99/9789c7fd0bb8876a7d624d903195ce11e5618b421bdb1bf7c975d17a9bc3/jsonschema-4.0.0.tar.gz
tar -xvf jsonschema-4.0.0.tar.gz
cd jsonschema-4.0.0/
python setup.py egg_info

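This did not help either. The mirror was clearly not the problem: the file exists there and can be downloaded. Something else was wrong, and since this is python2 I suspected a version issue and tried installing an older release.

  • Tried pip install jsonschema==3.2.0
  • Tried pip install jsonschema==2.6.0
    This installed successfully. Running again produced a different error: -E- Failed to get device frequency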

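Error: -E- Failed to get device frequency

I wanted to find out which script raised this error, but could not find it in the sdk or in the backend.
I suspect the cause is that BF2 is not supported, so I switched over and tried on a CX5.

Installation on the CX5 server

The same basic steps as on BF2: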

# Install python2
yum install -y python2
# Configure a pip mirror hosted in China
mkdir -p ~/.pip
touch ~/.pip/pip.conf
vim ~/.pip/pip.conf
# Add the following content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com

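# Install the jsonschema module; do this first, otherwise parts of the neohost install and run can fail
pip2 install jsonschema==2.6.0

# Install the neohost SDK
rpm -ivh neohost-*

# If a run fails, check the neohost log
/var/log/neohost.log

# Look up the device UID (BDF)
ethtool -i enp1s0f0

# Run the test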
python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=01:00.1 --DEBUG
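When the same invocation is repeated for several devices, it can help to assemble the command in one place. The sketch below only builds the argument list from the flags that appear in this post (--mode=shell, --dev-uid, --DEBUG, --get-analysis, --run-loop); build_command is a hypothetical helper, and it does not run anything.

```python
SDK_SCRIPT = "/opt/neohost/sdk/get_device_performance_counters.py"


def build_command(dev_uid, interpreter="python", debug=False,
                  analysis=False, loop=False):
    """Return the argv list for one run of the performance counter script."""
    command = [interpreter, SDK_SCRIPT, "--mode=shell", "--dev-uid=%s" % dev_uid]
    if analysis:
        command.append("--get-analysis")
    if loop:
        command.append("--run-loop")
    if debug:
        command.append("--DEBUG")
    return command


if __name__ == "__main__":
    # Print the command line instead of executing it.
    print(" ".join(build_command("01:00.1", debug=True)))
```

The resulting list can be handed to subprocess.call once the interpreter and device UID have been confirmed on the target host.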

Hands-on details:

  • Install python2

  • Install jsonschema

  • Install neohost

  • Look up the device UID

  • Run neohost

Full neohost output, text version:

=============================================================================================================================================================
|| Counter Name                                              || Counter Value   ||| Performance Analysis                || Analysis Value [Units]           ||
=============================================================================================================================================================
|| Level 0 MTT Cache Hit                                     || 0               |||                                Bandwidth                                ||
|| Level 0 MTT Cache Miss                                    || 0               ||---------------------------------------------------------------------------
|| Level 1 MTT Cache Hit                                     || 0               ||| RX BandWidth                        || 0             [Gb/s]             ||
|| Level 1 MTT Cache Miss                                    || 0               ||| TX BandWidth                        || 0             [Gb/s]             ||
|| Level 0 MPT Cache Hit                                     || 0               ||===========================================================================
|| Level 0 MPT Cache Miss                                    || 0               |||                                 Memory                                  ||
|| Level 1 MPT Cache Hit                                     || 0               ||---------------------------------------------------------------------------
|| Level 1 MPT Cache Miss                                    || 0               ||| RX Indirect Memory Keys Rate        || 0             [Keys/Packet]      ||
|| Indirect Memory Key Access                                || 0               ||===========================================================================
|| PCIe Internal Back Pressure                               || 0               |||                             PCIe Bandwidth                              ||
|| Outbound Stalled Reads                                    || 0               ||---------------------------------------------------------------------------
|| Outbound Stalled Writes                                   || 0               ||| PCIe Inbound Available BW           || 62.6928       [Gb/s]             ||
|| ICM Cache Miss                                            || 20,119          ||| PCIe Inbound BW Utilization         || 0.0335        [%]                ||
|| PCIe Read Stalled due to No Read Engines                  || 0               ||| PCIe Inbound Used BW                || 0.021         [Gb/s]             ||
|| PCIe Read Stalled due to No Completion Buffer             || 0               ||| PCIe Outbound Available BW          || 62.6928       [Gb/s]             ||
|| PCIe Read Stalled due to Ordering                         || 0               ||| PCIe Outbound BW Utilization        || 0.0164        [%]                ||
|| Back Pressure from Packet Scatter to RX Packet Buffer     || 0               ||| PCIe Outbound Used BW               || 0.0103        [Gb/s]             ||
|| Back Pressure from Packet Processing to RX Packet Buffer  || 0               ||===========================================================================
|| RX Packet Buffer Full Port 1                              || 0               |||                              PCIe Latency                               ||
|| RX Packet Buffer Full Port 2                              || 0               ||---------------------------------------------------------------------------
|| Chip Frequency                                            || 332.0305        ||| PCIe Avg Latency                    || 5,909         [NS]               ||
|| RX Steering Pipe 0                                        || 0               ||| PCIe Max Latency                    || 7,559         [NS]               ||
|| RX Steering Pipe 1                                        || 0               ||| PCIe Min Latency                    || 243           [NS]               ||
|| RX Steering Cache Hit Pipe 0                              || 0               ||===========================================================================
|| RX Steering Cache Miss Pipe 0                             || 0               |||                       PCIe Unit Internal Latency                        ||
|| RX Steering Cache Hit Pipe 1                              || 0               ||---------------------------------------------------------------------------
|| RX Steering Cache Miss Pipe 1                             || 0               ||| PCIe Internal Avg Latency           || 6             [NS]               ||
|| RX Steering Cache Access Pipe 0                           || 0               ||| PCIe Internal Max Latency           || 6             [NS]               ||
|| RX Steering Cache Access Pipe 1                           || 0               ||| PCIe Internal Min Latency           || 6             [NS]               ||
|| RX Steering Learning Cache Lookups                        || 0               ||===========================================================================
|| RX Steering Learning Cache Hit                            || 0               |||                               Packet Rate                               ||
|| RX Steering Learning Cache Miss                           || 0               ||---------------------------------------------------------------------------
|| RX Steering Learning Cache Learn                          || 0               ||| RX Packet Rate                      || 0             [Packets/Seconds]  ||
|| Back Pressure from Internal MMU to RX Descriptor Handling || 0               ||| TX Packet Rate                      || 0             [Packets/Seconds]  ||
|| Receive WQE Cache Hit                                     || 0               ||===========================================================================
|| Receive WQE Cache Miss                                    || 0               |||                                 eSwitch                                 ||
|| Back Pressure from PCIe to Packet Scatter                 || 0               ||---------------------------------------------------------------------------
|| RX Steering Packets                                       || 0               ||| RX Hops Per Packet                  || 0             [Hops/Packet]      ||
|| RX Steering Packets Fast Path                             || 0               ||| RX Optimal Hops Per Packet Per Pipe || 0             [Hops/Packet]      ||
|| EQ All State Machines Busy                                || 0               ||| RX Optimal Packet Rate Bottleneck   || 0             [MPPS]             ||
|| CQ All State Machines Busy                                || 0               ||| RX Packet Rate Bottleneck           || 0             [MPPS]             ||
|| MSI-X All State Machines Busy                             || 0               ||| TX Hops Per Packet                  || 0             [Hops/Packet]      ||
|| CQE Compression Sessions                                  || 0               ||| TX Optimal Hops Per Packet Per Pipe || 0             [Hops/Packet]      ||
|| Compressed CQEs                                           || 0               ||| TX Optimal Packet Rate Bottleneck   || 0             [MPPS]             ||
|| Compression Session Closed due to EQE                     || 0               ||| TX Packet Rate Bottleneck           || 0             [MPPS]             ||
|| Compression Session Closed due to Timeout                 || 0               ||===========================================================================
|| Compression Session Closed due to Mismatch                || 0               ||
|| Compression Session Closed due to PCIe Idle               || 0               ||
|| Compression Session Closed due to S2CQE                   || 0               ||
|| Compressed CQE Strides                                    || 0               ||
|| Compression Session Closed due to LRO                     || 0               ||
|| TX Descriptor Handling Stopped due to Limited State       || 0               ||
|| TX Descriptor Handling Stopped due to Limited VL          || 0               ||
|| TX Descriptor Handling Stopped due to De-schedule         || 0               ||
|| TX Descriptor Handling Stopped due to Work Done           || 0               ||
|| TX Descriptor Handling Stopped due to E2E Credits         || 0               ||
|| Line Transmitted Port 1                                   || 0               ||
|| Line Transmitted Port 2                                   || 0               ||
|| Line Transmitted Loop Back                                || 0               ||
|| TX Steering Pipe 0                                        || 0               ||
|| TX Steering Pipe 1                                        || 0               ||
|| TX Steering Hit Pipe 0                                    || 0               ||
|| TX Steering Miss Pipe 0                                   || 0               ||
|| TX Steering Hit Pipe 1                                    || 0               ||
|| TX Steering Miss Pipe 1                                   || 0               ||
|| TX Steering Cache Miss Pipe 0                             || 0               ||
|| TX Steering Cache Miss Pipe 1                             || 0               ||
|| TX Steering Cache Access Pipe 0                           || 0               ||
|| TX Steering Cache Access Pipe 1                           || 0               ||
|| TX Steering Learning Cache Lookups                        || 0               ||
|| TX Steering Learning Cache Hit                            || 0               ||
|| TX Steering Learning Cache Miss                           || 0               ||
|| TX Steering Learning Cache Learn                          || 0               ||
==================================================================================
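A report like the one above is easier to compare across runs once it is turned into data. The following sketch parses the text layout shown here, splitting each row on the "||" delimiters. It assumes the layout stays exactly as printed in this post; parse_report and to_number are illustrative helpers, not part of the neohost SDK.

```python
import re

# Matches an analysis cell such as "62.6928       [Gb/s]"
ANALYSIS_RE = re.compile(r"^([\d,.]+)\s*\[(.+)\]$")


def to_number(text):
    """Convert '20,119' or '332.0305' to int or float; raise ValueError otherwise."""
    cleaned = text.replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)


def parse_report(report):
    """Return (counters, analysis) dictionaries parsed from the text report."""
    counters, analysis = {}, {}
    for line in report.splitlines():
        if not line.startswith("||"):
            continue  # skip the '=====' separator rows
        parts = line.split("||")
        if len(parts) < 3:
            continue
        name, value = parts[1].strip(), parts[2].strip()
        try:
            counters[name] = to_number(value)
        except ValueError:
            pass  # header row: the value column is not numeric
        # Right-hand rows start with a third '|' left over from the '|||' divider.
        if len(parts) >= 5 and parts[3].startswith("|"):
            match = ANALYSIS_RE.match(parts[4].strip())
            if match:
                key = parts[3][1:].strip()
                analysis[key] = (to_number(match.group(1)), match.group(2))
    return counters, analysis
```

With the dictionaries in hand, a filter such as {k: v for k, v in counters.items() if v} keeps only the non-zero counters, which is usually the first thing to look at on an idle or lightly loaded link.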

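Summary

  1. With an unfamiliar tool like this, it is best to test on two servers, or with two different cards, and compare the results. This approach applies in many situations and works surprisingly well.
  2. Keep analyzing and experimenting, with a clear goal in mind.
  3. Check the error logs and similar output whenever possible.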