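1. Introduction
The previous article in this series, the overview, covered the overall architecture and usage flow of the ROCm performance tracing system. This article looks at how performance counter queries are implemented and walks through the counter types and where each one applies.
Querying the counters is the first step of any performance analysis: it tells the application which monitoring capabilities the GPU hardware exposes, and that is the basis for choosing counters later.
2. The hsaKmtPmcGetCounterProperties() API
2.1 Function signature
HSAKMT_STATUS HSAKMTAPI hsaKmtPmcGetCounterProperties(
    HSAuint32 NodeId,                         // in: GPU node ID
    HsaCounterProperties **CounterProperties  // out: receives a pointer to the counter properties
);
2.2 What it does
The function queries all performance counter properties of the given GPU node, including:
- The number and type of hardware blocks
- The number of counters in each block
- The detailed properties of each counter
- The concurrency limits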
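2.3 Return values
| Return value | Meaning |
|---|---|
| HSAKMT_STATUS_SUCCESS | The query succeeded |
| HSAKMT_STATUS_NO_MEMORY | Out of memory, or the counter subsystem has not been initialized |
| HSAKMT_STATUS_INVALID_PARAMETER | CounterProperties is a null pointer |
| HSAKMT_STATUS_INVALID_NODE_UNIT | NodeId is invalid |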
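3. Implementation walkthrough
3.1 Overall flow
hsaKmtPmcGetCounterProperties()
↓
[1] Check that the global counter_props array is initialized
↓
[2] Validate the CounterProperties argument
↓
[3] Validate NodeId and convert it to gpu_id
↓
[4] Check the cache entry counter_props[NodeId]
↓
Cached? ──Yes──→ return the cached data
↓ No
[5] Walk all hardware blocks and collect totals
├─ total_blocks (number of valid blocks)
├─ total_counters (total number of counters)
└─ total_concurrent (total number of concurrent slots)
↓
[6] Compute the required memory size
↓
[7] Allocate counter_props[NodeId]
↓
[8] Walk the blocks again and fill in the details
├─ block properties
├─ counter properties
└─ UUID mapping
↓
[9] Keep the result cached in counter_props[NodeId]
↓
[10] Return the pointer to the caller
3.2 Step-by-step source analysis
Steps 1-3: argument validation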
HSAKMT_STATUS HSAKMTAPI hsaKmtPmcGetCounterProperties(HSAuint32 NodeId,
                                                      HsaCounterProperties **CounterProperties)
{
    HSAKMT_STATUS rc = HSAKMT_STATUS_SUCCESS;
    uint32_t gpu_id, i, block_id;
    // Check that the global counter-properties array has been initialized
    if (!counter_props)
        return HSAKMT_STATUS_NO_MEMORY;
    // Check the output argument
    if (!CounterProperties)
        return HSAKMT_STATUS_INVALID_PARAMETER;
    // Validate NodeId and obtain the gpu_id
    if (hsakmt_validate_nodeid(NodeId, &gpu_id)
            != HSAKMT_STATUS_SUCCESS)
        return HSAKMT_STATUS_INVALID_NODE_UNIT;
Key points:
- counter_props is a global static array, allocated by hsakmt_init_counter_props() during system initialization
- NodeId is the node identifier at the HSA level; it has to be converted to the KFD-level gpu_id
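Step 4: cache check
    // Already queried: return the cached result directly
    if (counter_props[NodeId]) {
        *CounterProperties = counter_props[NodeId];
        return HSAKMT_STATUS_SUCCESS;
    }
Why the result is cached:
- The hardware blocks are not enumerated again
- Repeated calls skip the ioctl system-call overhead
- Every call after the first returns immediately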
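The caching pattern can be tried out in isolation. The following is a minimal sketch, not the library's code: props_t, MAX_NODES, query_hardware() and the values it returns are all stand-ins invented for the illustration. It only shows that the slow path runs once per node and that later calls hand back the same pointer.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES 4   /* hypothetical node limit for this sketch */

/* Stand-in for HsaCounterProperties. */
typedef struct { uint32_t num_blocks; } props_t;

static props_t *cache[MAX_NODES];   /* NULL until a node is first queried */
static int expensive_queries;       /* how often the slow path has run */

/* Stand-in for the hardware enumeration; returns NULL if malloc fails. */
static props_t *query_hardware(uint32_t node)
{
    props_t *p = malloc(sizeof(*p));
    if (p) {
        p->num_blocks = 8 + node;   /* made-up value */
        expensive_queries++;
    }
    return p;
}

/* Returns the cached properties, querying only on the first call per node. */
props_t *get_props(uint32_t node)
{
    if (node >= MAX_NODES)
        return NULL;
    if (!cache[node])
        cache[node] = query_hardware(node);
    return cache[node];
}

/* Exposes the slow-path counter so the caching effect can be observed. */
int query_count(void)
{
    return expensive_queries;
}
```

Calling get_props(1) twice returns the same pointer and leaves query_count() at 1. As with the library's cache, the caller must not free the returned pointer.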
Step 5: first pass, collecting totals
    uint32_t total_counters = 0;
    uint32_t total_concurrent = 0;
    struct perf_counter_block block = {0};
    uint32_t total_blocks = 0;
    // Walk every possible hardware block type
    for (i = 0; i < PERFCOUNTER_BLOCKID__MAX; i++) {
        // Query the block's properties
        rc = hsakmt_get_block_properties(NodeId, i, &block);
        if (rc != HSAKMT_STATUS_SUCCESS)
            return rc;
        total_concurrent += block.num_of_slots;   // accumulate concurrent slots
        total_counters += block.num_of_counters;  // accumulate counters
        // num_of_slots == 0 means the block does not exist on this GPU
        if (block.num_of_slots)
            total_blocks++;
    }
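The perf_counter_block structure:
struct perf_counter_block {
    uint32_t num_of_counters;       // number of counters in this block
    uint32_t num_of_slots;          // number of concurrent slots
    uint64_t *counter_ids;          // array of counter IDs (a pointer: a flexible array member is only legal as the last field)
    uint32_t counter_size_in_bits;  // counter width in bits
    uint64_t counter_mask;          // mask of valid counter bits
};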
Steps 6-7: dynamic memory allocation
    // Compute the total memory required
    counter_props_size = sizeof(HsaCounterProperties)               // base structure
        + sizeof(HsaCounterBlockProperties) * (total_blocks - 1)    // remaining blocks
        + sizeof(HsaCounter) * (total_counters - total_blocks);     // remaining counters
    // Allocate the buffer
    counter_props[NodeId] = malloc(counter_props_size);
    if (!counter_props[NodeId])
        return HSAKMT_STATUS_NO_MEMORY;
    // Fill in the top-level properties
    counter_props[NodeId]->NumBlocks = total_blocks;
    counter_props[NodeId]->NumConcurrent = total_concurrent;
Memory layout:
+---------------------------------------+
| HsaCounterProperties                  |
|   NumBlocks = N                       |
|   NumConcurrent = C                   |
|   Blocks[0]                           | ← inline storage for the first block
+---------------------------------------+
| HsaCounterBlockProperties (Block 0)   |
|   BlockId (UUID)                      |
|   NumCounters = M0                    |
|   NumConcurrent = S0                  |
|   Counters[0..M0-1]                   | ← counter array stored inline
+---------------------------------------+
| HsaCounterBlockProperties (Block 1)   |
|   BlockId (UUID)                      |
|   NumCounters = M1                    |
|   NumConcurrent = S1                  |
|   Counters[0..M1-1]                   |
+---------------------------------------+
|   ...                                 |
+---------------------------------------+
| HsaCounterBlockProperties (Block N-1) |
|   ...                                 |
+---------------------------------------+
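How the formula works:
- sizeof(HsaCounterProperties) already includes Blocks[0], so the block array needs only (N-1) additional entries
- sizeof(HsaCounter) * (total_counters - total_blocks): every block already includes Counters[0], so one counter per block is subtracted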
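The formula can be checked against the number of bytes the packed layout actually occupies. The sketch below uses simplified stand-in types (counter_t, block_t, props_t) rather than the real HSA structures, and assumes at least one block and at least one counter per block, since the unsigned subtractions would otherwise wrap around.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for HsaCounter, HsaCounterBlockProperties
 * and HsaCounterProperties. */
typedef struct { uint64_t id; uint32_t bits; } counter_t;
typedef struct {
    uint32_t  num_counters;
    uint32_t  num_concurrent;
    counter_t counters[1];
} block_t;
typedef struct {
    uint32_t num_blocks;
    uint32_t num_concurrent;
    block_t  blocks[1];
} props_t;

/* The article's formula: one block, and one counter per block,
 * are already part of the base structures. */
size_t props_alloc_size(uint32_t total_blocks, uint32_t total_counters)
{
    return sizeof(props_t)
         + sizeof(block_t) * (total_blocks - 1)
         + sizeof(counter_t) * (total_counters - total_blocks);
}

/* Bytes occupied when the blocks are packed back to back. */
size_t props_walked_size(const uint32_t *counters_per_block,
                         uint32_t total_blocks)
{
    size_t bytes = offsetof(props_t, blocks);
    for (uint32_t i = 0; i < total_blocks; i++)
        bytes += offsetof(block_t, counters)
               + sizeof(counter_t) * counters_per_block[i];
    return bytes;
}
```

For two blocks with 3 and 2 counters, props_alloc_size(2, 5) is never smaller than props_walked_size() for the same layout; any structure padding only makes the allocation slightly larger than the packed data.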
Step 8: second pass, filling in the details
    HsaCounterBlockProperties *block_prop = &counter_props[NodeId]->Blocks[0];
    for (block_id = 0; block_id < PERFCOUNTER_BLOCKID__MAX; block_id++) {
        rc = hsakmt_get_block_properties(NodeId, block_id, &block);
        if (rc != HSAKMT_STATUS_SUCCESS) {
            free(counter_props[NodeId]);
            counter_props[NodeId] = NULL;
            return rc;
        }
        if (!block.num_of_slots)  // skip blocks that do not exist
            continue;
        // Convert the internal block_id to its standard UUID
        blockid2uuid(block_id, &block_prop->BlockId);
        block_prop->NumCounters = block.num_of_counters;
        block_prop->NumConcurrent = block.num_of_slots;
        // Fill in the properties of every counter
        for (i = 0; i < block.num_of_counters; i++) {
            block_prop->Counters[i].BlockIndex = block_id;
            block_prop->Counters[i].CounterId = block.counter_ids[i];
            block_prop->Counters[i].CounterSizeInBits = block.counter_size_in_bits;
            block_prop->Counters[i].CounterMask = block.counter_mask;
            block_prop->Counters[i].Flags.ui32.Global = 1;
            block_prop->Counters[i].Type = HSA_PROFILE_TYPE_NONPRIV_IMMEDIATE;
        }
        // Advance to where the next block starts (pointer arithmetic)
        block_prop = (HsaCounterBlockProperties *)
            &block_prop->Counters[block_prop->NumCounters];
    }
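The pointer-advance trick:
    // block_prop points at the current block
    // Skip all of the current block's counters to reach the next block
    block_prop = (HsaCounterBlockProperties *)&block_prop->Counters[block_prop->NumCounters];
Because Counters is a trailing variable-length array, the next block begins right after the last counter of the current one, so all of the data lives in one contiguous allocation.
Why two passes are needed: a single malloc needs the final size up front, and that size is known only after every block has been visited.
- First pass: count the totals and compute the memory requirement
- Second pass: fill in the detailed data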
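The count-then-fill pattern can be sketched end to end with stand-in types. Nothing below is library code: block_desc_t, build_props() and count_counters() are hypothetical names, and unlike the excerpt above, the first pass here skips absent blocks entirely so that the size formula never underflows.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t block_index; uint64_t id; } counter_t;
typedef struct {
    uint32_t  num_counters;
    uint32_t  num_concurrent;
    counter_t counters[1];
} block_t;
typedef struct {
    uint32_t num_blocks;
    uint32_t num_concurrent;
    block_t  blocks[1];
} props_t;

/* Hypothetical per-block description; slots == 0 means "absent". */
typedef struct { uint32_t slots; uint32_t counters; } block_desc_t;

props_t *build_props(const block_desc_t *desc, uint32_t n)
{
    uint32_t blocks = 0, counters = 0, slots = 0;

    /* Pass 1: count, so the buffer can be sized exactly once. */
    for (uint32_t i = 0; i < n; i++) {
        if (!desc[i].slots || !desc[i].counters)
            continue;
        blocks++;
        counters += desc[i].counters;
        slots += desc[i].slots;
    }
    if (!blocks)
        return NULL;

    size_t size = sizeof(props_t) + sizeof(block_t) * (blocks - 1)
                + sizeof(counter_t) * (counters - blocks);
    props_t *p = calloc(1, size);
    if (!p)
        return NULL;
    p->num_blocks = blocks;
    p->num_concurrent = slots;

    /* Pass 2: fill each block, then step past its counters. */
    block_t *b = &p->blocks[0];
    for (uint32_t i = 0; i < n; i++) {
        if (!desc[i].slots || !desc[i].counters)
            continue;
        b->num_counters = desc[i].counters;
        b->num_concurrent = desc[i].slots;
        for (uint32_t c = 0; c < b->num_counters; c++) {
            b->counters[c].block_index = i;
            b->counters[c].id = c;
        }
        b = (block_t *)&b->counters[b->num_counters];
    }
    return p;
}

/* Walks the packed buffer and returns the number of counters seen. */
uint32_t count_counters(const props_t *p)
{
    uint32_t total = 0;
    const block_t *b = &p->blocks[0];
    for (uint32_t i = 0; i < p->num_blocks; i++) {
        total += b->num_counters;
        b = (const block_t *)&b->counters[b->num_counters];
    }
    return total;
}
```

Given the descriptions {4, 3}, {0, 0}, {2, 2}, build_props() produces two blocks with six concurrent slots in total, and count_counters() reports five counters. Note that the reader bounds its walk by num_blocks, the number of blocks actually stored.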
4. Mapping block IDs to UUIDs
4.1 The blockid2uuid() function
static int blockid2uuid(enum perf_block_id block_id, HSA_UUID *uuid)
{
    int rc = 0;
    switch (block_id) {
    case PERFCOUNTER_BLOCKID__CB:
        *uuid = HSA_PROFILEBLOCK_AMD_CB;
        break;
    case PERFCOUNTER_BLOCKID__SQ:
        *uuid = HSA_PROFILEBLOCK_AMD_SQ;
        break;
    // ... mappings for 20-odd more block types ...
    default:
        rc = -1;  // unknown block type (indicates a bug)
        break;
    }
    return rc;
}
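4.2 Why UUIDs?
- Standardization: the HSA standard identifies hardware blocks by UUID
- Compatibility: internal IDs may differ between GPU architectures, while the UUID stays the same
- Portability: applications can use the same UUID across platforms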
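A table-driven version of the mapping, together with the reverse lookup an application would need, might look like the sketch below. The block_uuid_t type, the two-entry enum and the UUID byte values are invented for the illustration; the real identifiers come from the HSA headers. Comparing with memcmp keeps the code independent of how the UUID structure names its members.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for HSA_UUID: 16 opaque bytes. */
typedef struct { uint8_t bytes[16]; } block_uuid_t;

/* Hypothetical internal block IDs. */
enum block_id { BLOCK_CB, BLOCK_SQ, BLOCK_MAX };

/* Made-up UUID values, for illustration only. */
static const block_uuid_t uuid_table[BLOCK_MAX] = {
    [BLOCK_CB] = { { 0x01 } },
    [BLOCK_SQ] = { { 0x02 } },
};

/* Internal ID -> UUID; returns 0 on success, -1 for an unknown ID. */
int block_to_uuid(int id, block_uuid_t *out)
{
    if (id < 0 || id >= BLOCK_MAX)
        return -1;
    *out = uuid_table[id];
    return 0;
}

/* UUID -> internal ID by byte-wise comparison; -1 if not found. */
int uuid_to_block(const block_uuid_t *uuid)
{
    for (int id = 0; id < BLOCK_MAX; id++)
        if (memcmp(uuid, &uuid_table[id], sizeof(*uuid)) == 0)
            return id;
    return -1;
}
```

A round trip through block_to_uuid() and uuid_to_block() returns the original ID, while an ID or UUID outside the table yields -1.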
4.3 UUIDs of the main hardware blocks
| Block | Internal ID | UUID | Function |
|---|---|---|---|
| CB | PERFCOUNTER_BLOCKID__CB | HSA_PROFILEBLOCK_AMD_CB | Color buffer |
| DB | PERFCOUNTER_BLOCKID__DB | HSA_PROFILEBLOCK_AMD_DB | Depth buffer |
| SQ | PERFCOUNTER_BLOCKID__SQ | HSA_PROFILEBLOCK_AMD_SQ | Shader sequencer |
| TCC | PERFCOUNTER_BLOCKID__TCC | HSA_PROFILEBLOCK_AMD_TCC | L2 cache |
| TCP | PERFCOUNTER_BLOCKID__TCP | HSA_PROFILEBLOCK_AMD_TCP | L1 cache |
| MC | PERFCOUNTER_BLOCKID__MC | HSA_PROFILEBLOCK_AMD_MC | Memory controller |
| GRBM | PERFCOUNTER_BLOCKID__GRBM | HSA_PROFILEBLOCK_AMD_GRBM | Graphics register bus manager |
5. Counter types
5.1 The HSA_PROFILE_TYPE enumeration
typedef enum {
    HSA_PROFILE_TYPE_NONPRIV_IMMEDIATE = 0,  // non-privileged, immediate read
    HSA_PROFILE_TYPE_NONPRIV_STREAMING,      // non-privileged, streaming
    HSA_PROFILE_TYPE_PRIVILEGED_IMMEDIATE,   // privileged, immediate read
    HSA_PROFILE_TYPE_PRIVILEGED_STREAMING    // privileged, streaming
} HSA_PROFILE_TYPE;
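5.2 The counter types in detail
5.2.1 Non-privileged immediate (NONPRIV_IMMEDIATE)
Characteristics:
- No root privileges required
- A read returns the value accumulated so far
- Fits most use cases
Typical uses:
- User-space profiling tools
- Performance monitoring embedded in an application
- Benchmarking
Example:
    // Counters of most GPU hardware blocks are of this type
    block_prop->Counters[i].Type = HSA_PROFILE_TYPE_NONPRIV_IMMEDIATE;
5.2.2 Non-privileged streaming (NONPRIV_STREAMING)
Characteristics:
- No root privileges required
- Supports periodic sampling
- Can record time-series data
Typical uses:
- Timeline analysis
- Locating performance hotspots
- Long-running monitoring
5.2.3 Privileged immediate (PRIVILEGED_IMMEDIATE)
Characteristics:
- Requires root privileges or the CAP_SYS_ADMIN capability
- Reaches lower-level hardware counters
- Provides more detailed performance information
Typical uses:
- System-level performance tuning
- Driver development and debugging
- In-depth performance analysis
5.2.4 Privileged streaming (PRIVILEGED_STREAMING)
Characteristics:
- Requires privileged access
- Streaming sampling
- The most complete monitoring capability
Typical uses:
- System-level timeline analysis
- Power and thermal management analysis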
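The four types are really two independent properties: whether privilege is needed and whether the counter streams. A tool can test each property separately when filtering counters, as in the sketch below. The enum is a stand-in for HSA_PROFILE_TYPE in which only the names matter, and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Stand-in for HSA_PROFILE_TYPE; the numeric values are irrelevant here. */
typedef enum {
    TYPE_NONPRIV_IMMEDIATE,
    TYPE_NONPRIV_STREAMING,
    TYPE_PRIVILEGED_IMMEDIATE,
    TYPE_PRIVILEGED_STREAMING
} profile_type_t;

/* True for the two types that need root or an equivalent capability. */
bool needs_privilege(profile_type_t t)
{
    return t == TYPE_PRIVILEGED_IMMEDIATE || t == TYPE_PRIVILEGED_STREAMING;
}

/* True for the two types that support periodic sampling. */
bool is_streaming(profile_type_t t)
{
    return t == TYPE_NONPRIV_STREAMING || t == TYPE_PRIVILEGED_STREAMING;
}

/* Counts the counters an unprivileged tool could read on demand. */
int count_user_immediate(const profile_type_t *types, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (!needs_privilege(types[i]) && !is_streaming(types[i]))
            count++;
    return count;
}
```

Comparing against the enumerator names rather than numeric values keeps such helpers correct even if the header assigns different numbers.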
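5.3 Counter flags
typedef union {
    struct {
        unsigned int Global:1;     // global counter (shared by all CUs)
        unsigned int ReadOnly:1;   // read-only counter
        unsigned int Reserved:30;
    } ui32;                        // bit-field view, accessed as Flags.ui32.Global
    HSAuint32 Value;               // the same bits as a single 32-bit word
} HsaCounterFlags;
The Global flag:
- Global = 1: the counter monitors the whole GPU (the default)
- Global = 0: the counter monitors a specific CU or unit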
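6. The HsaCounterProperties data structures
6.1 Structure definitions
The types are listed innermost first, so that each one is defined before it is used.
typedef struct {
    HSAuint32 BlockIndex;         // ID of the owning block (internal use)
    HSAuint64 CounterId;          // hardware ID of the counter
    HSAuint32 CounterSizeInBits;  // width in bits (usually 32 or 64)
    HSAuint64 CounterMask;        // mask of valid bits
    HsaCounterFlags Flags;        // flags
    HSA_PROFILE_TYPE Type;        // counter type
} HsaCounter;
typedef struct {
    HSA_UUID BlockId;             // UUID of the block
    HSAuint32 NumCounters;        // number of counters in this block
    HSAuint32 NumConcurrent;      // number of concurrent slots in this block
    HsaCounter Counters[1];       // counter array (variable length)
} HsaCounterBlockProperties;
typedef struct {
    HSAuint32 NumBlocks;                  // number of hardware blocks
    HSAuint32 NumConcurrent;              // total number of concurrent slots
    HsaCounterBlockProperties Blocks[1];  // block array (variable length)
} HsaCounterProperties;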
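6.2 Example: iterating over the counters
HsaCounterProperties *props = NULL;
HSAKMT_STATUS status = hsaKmtPmcGetCounterProperties(nodeId, &props);
if (status != HSAKMT_STATUS_SUCCESS) {
    fprintf(stderr, "Query failed: %d\n", status);
    return;  // props is not valid on failure (assumes a void enclosing function)
}
printf("GPU Node %u has %u blocks, %u concurrent slots\n",
       nodeId, props->NumBlocks, props->NumConcurrent);
HsaCounterBlockProperties *block = &props->Blocks[0];
for (HSAuint32 i = 0; i < props->NumBlocks; i++) {
    // Print the UUID as raw bytes, so the code does not depend on HSA_UUID's member names
    const unsigned char *uuid = (const unsigned char *)&block->BlockId;
    printf("\nBlock %u: UUID=0x", i);
    for (size_t k = 0; k < sizeof(block->BlockId); k++)
        printf("%02x", uuid[k]);
    printf("\n  Counters: %u, Concurrent: %u\n",
           block->NumCounters, block->NumConcurrent);
    for (HSAuint32 j = 0; j < block->NumCounters; j++) {
        HsaCounter *counter = &block->Counters[j];
        printf("    Counter %u: ID=0x%llx, Size=%u bits, Type=%d\n",
               j, (unsigned long long)counter->CounterId,
               counter->CounterSizeInBits, counter->Type);
    }
    // Advance to the next block
    block = (HsaCounterBlockProperties *)&block->Counters[block->NumCounters];
}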
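7. Concurrent slot limits
7.1 What is a concurrent slot?
Concurrent slots are the hardware's limit on how many performance counters can be active at the same time. A block with NumConcurrent = 4, for example, can sample at most four of its counters in one run, however many counters it exposes.
Reasons:
- Hardware resources are limited
- Every active counter consumes hardware logic and power
- Too many active counters would affect GPU performance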
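7.2 The get_block_concurrent_limit() function
static HSAuint32 get_block_concurrent_limit(uint32_t node_id, HSAuint32 block_id)
{
    uint32_t i;
    HsaCounterBlockProperties *block = &counter_props[node_id]->Blocks[0];
    // Walk the blocks until one matches block_id
    for (i = 0; i < PERFCOUNTER_BLOCKID__MAX; i++) {
        if (block->Counters[0].BlockIndex == block_id)
            return block->NumConcurrent;
        // Advance to the next block
        block = (HsaCounterBlockProperties *)
            &block->Counters[block->NumCounters];
    }
    return 0;  // not found
}
Note that the loop is bounded by PERFCOUNTER_BLOCKID__MAX rather than by NumBlocks, while the buffer holds only NumBlocks entries, so the function relies on block_id belonging to a block that is present.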
7.3 Coping strategies
Strategy 1: monitor in batches
    // Batch 1: memory-related counters
    HsaCounter batch1[] = {TCC_counter1, TCC_counter2, MC_counter1};
    // Batch 2: compute-related counters
    HsaCounter batch2[] = {SQ_counter1, SQ_counter2, SQ_counter3};
Strategy 2: multiple runs
    for (int pass = 0; pass < num_passes; pass++) {
        // Each run monitors a different set of counters
        hsaKmtPmcRegisterTrace(nodeId, num_counters[pass],
                               counters[pass], &traceRoot);
        // ... run the test workload ...
    }
Strategy 3: prioritize
    // Monitor the most important counters first
    HsaCounter priority_counters[] = {
        most_important_counter,
        second_important_counter,
        // ...
    };
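Batching and multiple runs both come down to the same arithmetic: how many runs are needed, and which counters go into each run. The sketch below works on plain indices into a priority-ordered counter list; num_passes() and pass_range() are hypothetical helper names, not part of the library.

```c
#include <stdint.h>

/* Number of runs needed when only `slots` counters fit into one run
 * (ceiling division). Returns 0 if the block has no slots at all. */
uint32_t num_passes(uint32_t wanted, uint32_t slots)
{
    if (slots == 0)
        return 0;
    return wanted / slots + (wanted % slots != 0);
}

/* Half-open index range [*first, *last) of the counters measured in
 * run `pass`, clamped to the number of counters wanted. */
void pass_range(uint32_t pass, uint32_t wanted, uint32_t slots,
                uint32_t *first, uint32_t *last)
{
    uint64_t start = (uint64_t)pass * slots;   /* 64-bit to avoid overflow */
    uint64_t end = start + slots;

    if (start > wanted)
        start = wanted;
    if (end > wanted)
        end = wanted;
    *first = (uint32_t)start;
    *last = (uint32_t)end;
}
```

With 10 counters and 4 slots this gives three runs covering indices [0, 4), [4, 8) and [8, 10). Because the list is ordered by priority, the most important counters land in the first run.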
8. Initialization and cleanup
8.1 hsakmt_init_counter_props()
HSAKMT_STATUS hsakmt_init_counter_props(unsigned int NumNodes)
{
    // Allocate the array of per-node counter-property pointers
    counter_props = calloc(NumNodes, sizeof(struct HsaCounterProperties *));
    if (!counter_props) {
        pr_warn("Profiling is not available.\n");
        return HSAKMT_STATUS_NO_MEMORY;
    }
    counter_props_count = NumNodes;
    return HSAKMT_STATUS_SUCCESS;
}
When it is called: from hsaKmtOpenKFD() or during system initialization
Data structure:
counter_props → [Node0 Props Ptr] → NULL (before the first query)
                [Node1 Props Ptr] → NULL
                [Node2 Props Ptr] → NULL
                ...
                [NodeN Props Ptr] → NULL
After a query:
counter_props → [Node0 Props Ptr] → [HsaCounterProperties structure]
                [Node1 Props Ptr] → [HsaCounterProperties structure]
                ...
8.2 hsakmt_destroy_counter_props()
void hsakmt_destroy_counter_props(void)
{
    unsigned int i;
    if (!counter_props)
        return;
    // Free each node's property structure
    for (i = 0; i < counter_props_count; i++)
        if (counter_props[i]) {
            free(counter_props[i]);
            counter_props[i] = NULL;
        }
    // Free the top-level array
    free(counter_props);
}
When it is called: from hsaKmtCloseKFD() or during system cleanup
9. Worked example: querying and selecting counters
9.1 Complete example code
#include <stdio.h>
#include <string.h>
#include <hsakmt.h>
// Print a UUID as raw bytes, so the code does not depend on HSA_UUID's member names
static void print_uuid(const HSA_UUID *uuid)
{
    const unsigned char *bytes = (const unsigned char *)uuid;
    for (size_t k = 0; k < sizeof(*uuid); k++)
        printf("%02x", bytes[k]);
}
void analyze_gpu_counters(HSAuint32 nodeId)
{
    HsaCounterProperties *props = NULL;
    HSAKMT_STATUS status;
    // Query the counter properties
    status = hsaKmtPmcGetCounterProperties(nodeId, &props);
    if (status != HSAKMT_STATUS_SUCCESS) {
        printf("Failed to get counter properties: %d\n", status);
        return;
    }
    printf("=== GPU Node %u Counter Properties ===\n", nodeId);
    printf("Total Blocks: %u\n", props->NumBlocks);
    printf("Total Concurrent Slots: %u\n\n", props->NumConcurrent);
    // Walk all blocks
    HsaCounterBlockProperties *block = &props->Blocks[0];
    for (HSAuint32 i = 0; i < props->NumBlocks; i++) {
        printf("Block %u:\n", i);
        printf("  UUID: 0x");
        print_uuid(&block->BlockId);
        printf("\n");
        printf("  Counters: %u\n", block->NumCounters);
        printf("  Concurrent Slots: %u\n", block->NumConcurrent);
        // Show the details of the first 5 counters
        HSAuint32 limit = block->NumCounters < 5 ? block->NumCounters : 5;
        for (HSAuint32 j = 0; j < limit; j++) {
            HsaCounter *counter = &block->Counters[j];
            printf("  Counter %u:\n", j);
            printf("    ID: 0x%llx\n", (unsigned long long)counter->CounterId);
            printf("    Size: %u bits\n", counter->CounterSizeInBits);
            printf("    Mask: 0x%llx\n", (unsigned long long)counter->CounterMask);
            printf("    Type: %d\n", counter->Type);
            printf("    Global: %d\n", counter->Flags.ui32.Global);
        }
        if (block->NumCounters > 5) {
            printf("  ... and %u more counters\n",
                   block->NumCounters - 5);
        }
        printf("\n");
        // Advance to the next block
        block = (HsaCounterBlockProperties *)
            &block->Counters[block->NumCounters];
    }
}
// Find the block with a given UUID; returns NULL if it is not present
HsaCounterBlockProperties *find_block_by_uuid(
    HsaCounterProperties *props,
    HSA_UUID target_uuid)
{
    HsaCounterBlockProperties *block = &props->Blocks[0];
    for (HSAuint32 i = 0; i < props->NumBlocks; i++) {
        // Byte-wise comparison of the two UUIDs
        if (memcmp(&block->BlockId, &target_uuid, sizeof(HSA_UUID)) == 0)
            return block;
        block = (HsaCounterBlockProperties *)
            &block->Counters[block->NumCounters];
    }
    return NULL;
}
// Select the first counters of the SQ block, limited by max_count
// and by the block's concurrency limit
int select_sq_counters(HSAuint32 nodeId, HsaCounter *selected, int max_count)
{
    HsaCounterProperties *props;
    if (hsaKmtPmcGetCounterProperties(nodeId, &props) != HSAKMT_STATUS_SUCCESS)
        return -1;
    // Look up the SQ block
    HsaCounterBlockProperties *sq_block =
        find_block_by_uuid(props, HSA_PROFILEBLOCK_AMD_SQ);
    if (!sq_block) {
        printf("SQ block not found\n");
        return -1;
    }
    if (max_count <= 0)
        return 0;
    // Clamp to the concurrency limit and to the number of counters available
    HSAuint32 count = (HSAuint32)max_count;
    if (count > sq_block->NumConcurrent)
        count = sq_block->NumConcurrent;
    if (count > sq_block->NumCounters)
        count = sq_block->NumCounters;
    // Copy the counter descriptions
    for (HSAuint32 i = 0; i < count; i++) {
        selected[i] = sq_block->Counters[i];
    }
    printf("Selected %u SQ counters (max concurrent: %u)\n",
           count, sq_block->NumConcurrent);
    return (int)count;
}
int main(void)
{
    HSAKMT_STATUS status;
    HSAuint32 nodeId = 0;  // assumes node 0 is a GPU node
    // Initialize HSA KMT
    status = hsaKmtOpenKFD();
    if (status != HSAKMT_STATUS_SUCCESS) {
        printf("Failed to open KFD\n");
        return 1;
    }
    // Analyze the GPU counters
    analyze_gpu_counters(nodeId);
    // Select counters
    HsaCounter selected[16];
    int count = select_sq_counters(nodeId, selected, 16);
    if (count > 0) {
        printf("\n=== Selected Counters ===\n");
        for (int i = 0; i < count; i++) {
            printf("Counter %d: Block=%u, ID=0x%llx\n",
                   i, selected[i].BlockIndex,
                   (unsigned long long)selected[i].CounterId);
        }
    }
    // Clean up
    hsaKmtCloseKFD();
    return 0;
}
9.2 Example output
=== GPU Node 0 Counter Properties ===
Total Blocks: 8
Total Concurrent Slots: 48
Block 0:
UUID: 0x0000000000000001
Counters: 524
Concurrent Slots: 16
Counter 0:
ID: 0x0
Size: 64 bits
Mask: 0xffffffffffffffff
Type: 0
Global: 1
Counter 1:
ID: 0x1
Size: 64 bits
Mask: 0xffffffffffffffff
Type: 0
Global: 1
... and 519 more counters
Block 1:
UUID: 0x0000000000000002
Counters: 256
Concurrent Slots: 4
...
Selected 16 SQ counters (max concurrent: 16)
=== Selected Counters ===
Counter 0: Block=13, ID=0x0
Counter 1: Block=13, ID=0x1
...
Counter 15: Block=13, ID=0xf
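10. FAQ and best practices
10.1 Frequently asked questions
Q1: Why is the first query slow?
- The first call has to query the hardware through ioctl
- Every hardware block has to be visited
- A large buffer is allocated and filled
Q2: Can counter properties change?
- Not while the GPU hardware stays the same
- A driver update may change which counters are available
- Different GPU models expose different counter sets
Q3: How should the concurrency limit be handled?
- Check the NumConcurrent property
- Monitor in batches or over multiple runs
- Give priority to the key counters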
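10.2 Best practices
1. Cache the query result
#define MAX_NODES 64  // assumed upper bound for this sketch
static HsaCounterProperties *cached_props[MAX_NODES] = {NULL};
HsaCounterProperties *get_counter_props(HSAuint32 nodeId) {
    if (nodeId >= MAX_NODES)
        return NULL;
    if (!cached_props[nodeId] &&
        hsaKmtPmcGetCounterProperties(nodeId, &cached_props[nodeId]) != HSAKMT_STATUS_SUCCESS)
        cached_props[nodeId] = NULL;  // leave the slot empty so a later call can retry
    return cached_props[nodeId];
}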
2. Validate the counter selection
#include <stdbool.h>
bool validate_counter_selection(HSAuint32 nodeId,
                                HsaCounter *counters,
                                int count)
{
    HsaCounterProperties *props = NULL;
    if (hsaKmtPmcGetCounterProperties(nodeId, &props) != HSAKMT_STATUS_SUCCESS)
        return false;
    // Count the selected counters per block
    int block_counts[PERFCOUNTER_BLOCKID__MAX] = {0};
    for (int i = 0; i < count; i++) {
        if (counters[i].BlockIndex >= PERFCOUNTER_BLOCKID__MAX)
            return false;  // block index out of range
        block_counts[counters[i].BlockIndex]++;
    }
    // Check each block against its concurrency limit
    HsaCounterBlockProperties *block = &props->Blocks[0];
    for (HSAuint32 i = 0; i < props->NumBlocks; i++) {
        HSAuint32 block_id = block->Counters[0].BlockIndex;
        if ((HSAuint32)block_counts[block_id] > block->NumConcurrent) {
            printf("Block %u: %d counters exceed limit %u\n",
                   block_id, block_counts[block_id],
                   block->NumConcurrent);
            return false;
        }
        block = (HsaCounterBlockProperties *)
            &block->Counters[block->NumCounters];
    }
    return true;
}
3. Select counters by analysis goal
// Choose counters according to what is being analyzed
typedef enum {
    PROFILE_TARGET_MEMORY,   // memory performance
    PROFILE_TARGET_COMPUTE,  // compute performance
    PROFILE_TARGET_CACHE,    // cache performance
    PROFILE_TARGET_POWER     // power analysis
} ProfileTarget;
int select_counters_by_target(ProfileTarget target,
                              HsaCounterProperties *props,
                              HsaCounter *selected,
                              int max_count)
{
    int count = 0;
    switch (target) {
    case PROFILE_TARGET_MEMORY:
        // Select MC- and TCC-related counters
        // ...
        break;
    case PROFILE_TARGET_COMPUTE:
        // Select SQ- and CU-related counters
        // ...
        break;
    case PROFILE_TARGET_CACHE:
        // Select TCP- and TCC-related counters
        // ...
        break;
    case PROFILE_TARGET_POWER:
        // Select power-related counters
        // ...
        break;
    }
    return count;
}
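11. Summary
This article examined the implementation details of ROCm performance counter queries:
- API design: hsaKmtPmcGetCounterProperties() provides a compact query interface
- Caching: repeated queries are served from the cache
- Memory layout: one contiguous allocation, walked with pointer arithmetic
- Counter types: several access modes and privilege levels are supported
- Concurrency limits: the key constraint imposed by limited hardware resources
- UUID mapping: a standardized way to identify hardware blocks
With this background you can:
- Query a GPU's performance monitoring capabilities accurately
- Choose counters that fit the analysis at hand
- Understand the concurrency limits and plan a monitoring strategy around them
- Write efficient performance analysis tools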