Technical Documentation: HsaMemFlags Structure in ROCm HSAKMT

Overview

The HsaMemFlags structure is a fundamental part of the ROCm (Radeon Open Compute) stack, specifically within the HSAKMT (HSA Kernel Mode Thunk) layer. It encapsulates a rich set of flags that control the behavior and properties of memory allocations for both CPU and GPU usage. These flags allow fine-grained control over how memory is allocated, mapped, accessed, and managed in heterogeneous systems, supporting a wide range of use cases from compute kernels to graphics interop and system-level optimizations.

This document provides a detailed explanation of the HsaMemFlags structure, its fields, their meanings, and how they influence memory allocation and usage in ROCm environments.

Structure Definition

typedef struct _HsaMemFlags
{
    union
    {
        struct
        {
            unsigned int NonPaged    : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly    : 1;
            unsigned int PageSize    : 2;
            unsigned int HostAccess  : 1;
            unsigned int NoSubstitute: 1;
            unsigned int GDSMemory   : 1;
            unsigned int Scratch     : 1;
            unsigned int AtomicAccessFull: 1;
            unsigned int AtomicAccessPartial: 1;
            unsigned int ExecuteAccess: 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory: 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind:    1;
            unsigned int Uncached:      1;
            unsigned int NoAddress:     1;
            unsigned int OnlyAddress:   1;
            unsigned int ExtendedCoherent: 1;
            unsigned int GTTAccess:     1;
            unsigned int Contiguous:    1;
            unsigned int ExecuteBlit:   1;
            unsigned int Reserved:      8;
        } ui32;
        HSAuint32 Value;
    };
} HsaMemFlags;
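
Because the bit-fields and Value share one union, the whole flag set can be cleared, copied, or compared as a single 32-bit word. The sketch below illustrates this with a local mirror of the structure (using uint32_t in place of HSAuint32) so that it builds without the ROCm headers; flags_none and flags_equal are hypothetical helper names, not part of HSAKMT. Bit-field layout is implementation-defined in C, so code should compare Value words rather than rely on specific bit positions.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of HsaMemFlags so this sketch builds without ROCm headers.
 * With ROCm installed, use the definition from hsakmttypes.h instead. */
typedef struct {
    union {
        struct {
            unsigned int NonPaged : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly : 1;
            unsigned int PageSize : 2;
            unsigned int HostAccess : 1;
            unsigned int NoSubstitute : 1;
            unsigned int GDSMemory : 1;
            unsigned int Scratch : 1;
            unsigned int AtomicAccessFull : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess : 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory : 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind : 1;
            unsigned int Uncached : 1;
            unsigned int NoAddress : 1;
            unsigned int OnlyAddress : 1;
            unsigned int ExtendedCoherent : 1;
            unsigned int GTTAccess : 1;
            unsigned int Contiguous : 1;
            unsigned int ExecuteBlit : 1;
            unsigned int Reserved : 8;
        } ui32;
        uint32_t Value;
    };
} HsaMemFlags;

/* All bit-fields must pack into the same 32 bits as Value. */
static_assert(sizeof(HsaMemFlags) == sizeof(uint32_t),
              "HsaMemFlags must be exactly 32 bits");

/* Hypothetical helper: clear every flag at once through the raw view. */
static HsaMemFlags flags_none(void)
{
    HsaMemFlags f;
    f.Value = 0;
    return f;
}

/* Hypothetical helper: two flag sets match when their raw words match. */
static int flags_equal(HsaMemFlags a, HsaMemFlags b)
{
    return a.Value == b.Value;
}
```

Starting from flags_none() and then setting individual ui32 fields avoids carrying stale bits in the Reserved range.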

Field-by-Field Explanation

1. NonPaged (1 bit)

  • Purpose: Controls whether the allocated memory is pageable or non-paged (locked in RAM).
  • Usage: Non-paged memory is required for certain GPU operations to avoid page faults and ensure deterministic access latency.

2. CachePolicy (2 bits)

  • Purpose: Specifies the caching policy for the memory (e.g., cached, non-cached, write-combined).
  • Usage: Influences performance and coherency. For example, cached memory is faster for CPU access, while non-cached may be required for device access.

3. ReadOnly (1 bit)

  • Purpose: Marks the memory as read-only.
  • Usage: Useful for buffers that should not be modified, such as constant data or code segments.

4. PageSize (2 bits)

  • Purpose: Selects the page size for the allocation (e.g., 4KB, 64KB, 2MB, 1GB).
  • Usage: Larger page sizes can reduce TLB misses and improve performance for large buffers.
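
As an illustration of how the 2-bit PageSize field might be consumed, the sketch below maps it to a byte count and rounds a request up to whole pages. The enum is a local mirror of HSA_PAGE_SIZE; its numeric values are an assumption that should be checked against hsakmttypes.h, and page_size_bytes and round_up_to_page are hypothetical helpers rather than HSAKMT functions.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the HSA_PAGE_SIZE enum. The numeric values are an
 * assumption; verify them against hsakmttypes.h before relying on them. */
typedef enum {
    HSA_PAGE_SIZE_4KB  = 0,
    HSA_PAGE_SIZE_64KB = 1,
    HSA_PAGE_SIZE_2MB  = 2,
    HSA_PAGE_SIZE_1GB  = 3
} HSA_PAGE_SIZE;

/* Hypothetical helper: translate the 2-bit PageSize field into bytes. */
static uint64_t page_size_bytes(unsigned int page_size_field)
{
    switch (page_size_field & 0x3u) {
    case HSA_PAGE_SIZE_4KB:  return 4ull << 10;
    case HSA_PAGE_SIZE_64KB: return 64ull << 10;
    case HSA_PAGE_SIZE_2MB:  return 2ull << 20;
    default:                 return 1ull << 30; /* HSA_PAGE_SIZE_1GB */
    }
}

/* Hypothetical helper: round a request up to a whole number of pages. */
static uint64_t round_up_to_page(uint64_t size, unsigned int page_size_field)
{
    uint64_t page = page_size_bytes(page_size_field);
    assert(size <= UINT64_MAX - (page - 1)); /* guard against overflow */
    return (size + page - 1) / page * page;
}
```

For example, a 65537-byte request with 64 KB pages rounds up to 131072 bytes, which shows the memory cost that larger pages can carry for small buffers.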

5. HostAccess (1 bit)

  • Purpose: Indicates whether the memory should be accessible by the host CPU.
  • Usage: If not set (the default), the memory is accessible to the GPU only. Set it when the CPU must read or write the buffer.

6. NoSubstitute (1 bit)

  • Purpose: Prevents fallback to system memory if the requested memory type is unavailable.
  • Usage: Ensures strict allocation semantics, failing the allocation if the preferred memory is not available.

7. GDSMemory (1 bit)

  • Purpose: Requests allocation from the Global Data Store (GDS) heap.
  • Usage: GDS is a special GPU memory region for synchronization and atomic operations.

8. Scratch (1 bit)

  • Purpose: Requests allocation from the GPU scratch area.
  • Usage: Scratch memory is used for temporary storage during kernel execution.

9. AtomicAccessFull (1 bit)

  • Purpose: Allocates and maps memory to support all atomic operations.
  • Usage: On APUs, this uses the ATC path for system memory, ensuring full atomic support.

10. AtomicAccessPartial (1 bit)

  • Purpose: Allocates memory for partial atomic support, focused on PCIe atomics for dGPUs.
  • Usage: Supports only a subset of atomic operations (SWAP, CAS, FetchAdd).

11. ExecuteAccess (1 bit)

  • Purpose: Indicates the memory will be used for executable code.
  • Usage: Influences page attributes and may be required for queue memory or code buffers.

12. CoarseGrain (1 bit)

  • Purpose: Specifies coarse-grained memory consistency.
  • Usage: Memory consistency is enforced at synchronization points, not on every access.

13. AQLQueueMemory (1 bit)

  • Purpose: Indicates the memory will be used for AQL queue storage.
  • Usage: Ensures optimal location and alignment for queue memory.

14. FixedAddress (1 bit)

  • Purpose: Requests allocation at a specific virtual address.
  • Usage: Useful for interop scenarios or when address layout is important.

15. NoNUMABind (1 bit)

  • Purpose: Prevents binding system memory to a specific NUMA node.
  • Usage: Allows the OS to allocate memory from any node, which may improve flexibility.

16. Uncached (1 bit)

  • Purpose: Requests uncached memory for fine-grained allocations.
  • Usage: Important on A+A hardware platforms (AMD CPU paired with AMD GPU), where cache effects on fine-grained memory must be controlled.

17. NoAddress (1 bit)

  • Purpose: Allocates VRAM and returns a handle, but does not allocate virtual address space.
  • Usage: Used for scenarios where only a memory handle is needed, not a mapped address.

18. OnlyAddress (1 bit)

  • Purpose: Allocates virtual address space without backing VRAM.
  • Usage: Useful for address reservation or deferred allocation.

19. ExtendedCoherent (1 bit)

  • Purpose: Enables system-scope coherence for atomic instructions.
  • Usage: Ensures atomic operations are coherent across all devices.

20. GTTAccess (1 bit)

  • Purpose: Indicates that the memory will be mapped to GART (Graphics Address Remapping Table) for the MES (Micro Engine Scheduler).
  • Usage: Ensures the memory is allocated in GTT space so that it can be GART-mapped.

21. Contiguous (1 bit)

  • Purpose: Requests contiguous VRAM allocation.
  • Usage: Required for certain hardware features or performance optimizations.

22. ExecuteBlit (1 bit)

  • Purpose: Indicates the memory will hold blit kernel objects.
  • Usage: Lets the driver allocate and map the buffer appropriately for blit kernel code.

23. Reserved (8 bits)

  • Purpose: Reserved for future expansion.
  • Usage: Allows the structure to be extended without breaking ABI compatibility.

Usage Scenarios

1. GPU Buffer Allocation

When allocating a buffer for GPU computation, flags such as NonPaged, CachePolicy, PageSize, and HostAccess are set to ensure the buffer is accessible and performant for both CPU and GPU.

2. Queue Memory

For queue memory (used for dispatching commands to the GPU), ExecuteAccess and AQLQueueMemory are set to ensure the memory is executable and meets alignment requirements.
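
A minimal sketch of that combination follows. The structure is mirrored locally so the example builds without the ROCm headers, make_queue_flags is a hypothetical helper name, and setting HostAccess is an assumption (the CPU typically writes packets into the queue) rather than something the flag definitions require.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of HsaMemFlags so this sketch builds without ROCm headers. */
typedef struct {
    union {
        struct {
            unsigned int NonPaged : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly : 1;
            unsigned int PageSize : 2;
            unsigned int HostAccess : 1;
            unsigned int NoSubstitute : 1;
            unsigned int GDSMemory : 1;
            unsigned int Scratch : 1;
            unsigned int AtomicAccessFull : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess : 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory : 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind : 1;
            unsigned int Uncached : 1;
            unsigned int NoAddress : 1;
            unsigned int OnlyAddress : 1;
            unsigned int ExtendedCoherent : 1;
            unsigned int GTTAccess : 1;
            unsigned int Contiguous : 1;
            unsigned int ExecuteBlit : 1;
            unsigned int Reserved : 8;
        } ui32;
        uint32_t Value;
    };
} HsaMemFlags;

static_assert(sizeof(HsaMemFlags) == sizeof(uint32_t),
              "HsaMemFlags must be exactly 32 bits");

/* Hypothetical helper: flags for an AQL queue buffer, following the text. */
static HsaMemFlags make_queue_flags(void)
{
    HsaMemFlags f;
    f.Value = 0;                /* start with every flag cleared */
    f.ui32.ExecuteAccess = 1;   /* queue memory is treated as executable */
    f.ui32.AQLQueueMemory = 1;  /* ask for optimal placement and alignment */
    f.ui32.HostAccess = 1;      /* assumption: the CPU writes the packets */
    return f;
}
```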

3. Atomic Operations

When atomic operations are required, AtomicAccessFull or AtomicAccessPartial are set to ensure the memory supports the necessary atomic semantics.
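
One way to keep the two atomic flags mutually consistent is to derive both from a single setting. The sketch below is illustrative only: the atomic_need enum and set_atomic_flags are hypothetical names, and the structure is mirrored locally so the example builds without the ROCm headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of HsaMemFlags so this sketch builds without ROCm headers. */
typedef struct {
    union {
        struct {
            unsigned int NonPaged : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly : 1;
            unsigned int PageSize : 2;
            unsigned int HostAccess : 1;
            unsigned int NoSubstitute : 1;
            unsigned int GDSMemory : 1;
            unsigned int Scratch : 1;
            unsigned int AtomicAccessFull : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess : 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory : 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind : 1;
            unsigned int Uncached : 1;
            unsigned int NoAddress : 1;
            unsigned int OnlyAddress : 1;
            unsigned int ExtendedCoherent : 1;
            unsigned int GTTAccess : 1;
            unsigned int Contiguous : 1;
            unsigned int ExecuteBlit : 1;
            unsigned int Reserved : 8;
        } ui32;
        uint32_t Value;
    };
} HsaMemFlags;

/* Hypothetical description of what the workload needs from atomics. */
typedef enum {
    ATOMICS_NONE,        /* no atomic access required */
    ATOMICS_PCIE_SUBSET, /* SWAP, CAS and FetchAdd are enough */
    ATOMICS_ALL          /* every atomic operation must work */
} atomic_need;

/* Hypothetical helper: set exactly one (or neither) of the atomic flags. */
static void set_atomic_flags(HsaMemFlags *f, atomic_need need)
{
    assert(f != NULL);
    f->ui32.AtomicAccessFull = (need == ATOMICS_ALL);
    f->ui32.AtomicAccessPartial = (need == ATOMICS_PCIE_SUBSET);
}
```

Routing both bits through one function means a buffer cannot accidentally request full and partial atomic support at the same time.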

4. Graphics Interop

For graphics interop scenarios, flags like GTTAccess, Contiguous, and ExecuteBlit are used to allocate memory compatible with graphics engines.

5. NUMA Optimization

NoNUMABind can be used to allow the OS to allocate memory from any NUMA node, which may be beneficial for certain workloads.

6. Specialized Memory Regions

Flags like GDSMemory and Scratch are used to allocate memory from specialized GPU regions for synchronization or temporary storage.
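
Requests for these regions can be sanity-checked before they reach the driver. The sketch below assumes a rule taken from the upstream header comments rather than from this article: when GDSMemory or Scratch is set, HostAccess must be 0 and every other flag except NoSubstitute should be 0. Treat that rule as an assumption to verify against your ROCm version. specialized_flags_ok is a hypothetical helper, and the structure is mirrored locally so the example builds without the ROCm headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of HsaMemFlags so this sketch builds without ROCm headers. */
typedef struct {
    union {
        struct {
            unsigned int NonPaged : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly : 1;
            unsigned int PageSize : 2;
            unsigned int HostAccess : 1;
            unsigned int NoSubstitute : 1;
            unsigned int GDSMemory : 1;
            unsigned int Scratch : 1;
            unsigned int AtomicAccessFull : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess : 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory : 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind : 1;
            unsigned int Uncached : 1;
            unsigned int NoAddress : 1;
            unsigned int OnlyAddress : 1;
            unsigned int ExtendedCoherent : 1;
            unsigned int GTTAccess : 1;
            unsigned int Contiguous : 1;
            unsigned int ExecuteBlit : 1;
            unsigned int Reserved : 8;
        } ui32;
        uint32_t Value;
    };
} HsaMemFlags;

static_assert(sizeof(HsaMemFlags) == sizeof(uint32_t),
              "HsaMemFlags must be exactly 32 bits");

/* Hypothetical check: returns 1 if the flags satisfy the assumed rule for
 * GDS and scratch allocations, 0 otherwise. f is a by-value copy, so the
 * caller's flags are left untouched. */
static int specialized_flags_ok(HsaMemFlags f)
{
    if (!f.ui32.GDSMemory && !f.ui32.Scratch)
        return 1;                 /* rule does not apply */
    if (f.ui32.GDSMemory && f.ui32.Scratch)
        return 0;                 /* only one special region at a time */

    /* Clear the bits that are allowed, then require everything else to
     * be zero. Comparing Value avoids depending on bit positions. */
    f.ui32.GDSMemory = 0;
    f.ui32.Scratch = 0;
    f.ui32.NoSubstitute = 0;
    return f.Value == 0;
}
```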

How Flags Affect Allocation

The flags in HsaMemFlags are interpreted by the HSAKMT thunk and the underlying kernel driver (KFD). They influence:

  • Where the memory is allocated: System RAM, VRAM, GDS, scratch, etc.
  • How the memory is mapped: Page size, cache policy, atomic support, etc.
  • Who can access the memory: CPU, GPU, or both.
  • Consistency and coherency: Fine-grained vs. coarse-grained, extended coherence.
  • Performance characteristics: Contiguity, caching, NUMA binding.

The combination of flags allows the application or runtime to tailor memory allocations to the specific needs of the workload and hardware.
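
When many flags are combined, it can help to log which ones are set before the allocation call. The following is a minimal debugging sketch, not part of HSAKMT: describe_flags and append_name are hypothetical helpers, only the single-bit flags are reported, and the structure is mirrored locally so the example builds without the ROCm headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local mirror of HsaMemFlags so this sketch builds without ROCm headers. */
typedef struct {
    union {
        struct {
            unsigned int NonPaged : 1;
            unsigned int CachePolicy : 2;
            unsigned int ReadOnly : 1;
            unsigned int PageSize : 2;
            unsigned int HostAccess : 1;
            unsigned int NoSubstitute : 1;
            unsigned int GDSMemory : 1;
            unsigned int Scratch : 1;
            unsigned int AtomicAccessFull : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess : 1;
            unsigned int CoarseGrain : 1;
            unsigned int AQLQueueMemory : 1;
            unsigned int FixedAddress : 1;
            unsigned int NoNUMABind : 1;
            unsigned int Uncached : 1;
            unsigned int NoAddress : 1;
            unsigned int OnlyAddress : 1;
            unsigned int ExtendedCoherent : 1;
            unsigned int GTTAccess : 1;
            unsigned int Contiguous : 1;
            unsigned int ExecuteBlit : 1;
            unsigned int Reserved : 8;
        } ui32;
        uint32_t Value;
    };
} HsaMemFlags;

/* Append name to the NUL-terminated string in out, separated by '|'.
 * snprintf truncates safely if the buffer is too small. */
static void append_name(char *out, size_t cap, const char *name)
{
    size_t len = strlen(out);
    snprintf(out + len, cap - len, "%s%s", len ? "|" : "", name);
}

#define NAME_IF_SET(field) \
    do { if (f.ui32.field) append_name(out, cap, #field); } while (0)

/* Hypothetical helper: write the names of all set single-bit flags. */
static void describe_flags(HsaMemFlags f, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    NAME_IF_SET(NonPaged);
    NAME_IF_SET(ReadOnly);
    NAME_IF_SET(HostAccess);
    NAME_IF_SET(NoSubstitute);
    NAME_IF_SET(GDSMemory);
    NAME_IF_SET(Scratch);
    NAME_IF_SET(AtomicAccessFull);
    NAME_IF_SET(AtomicAccessPartial);
    NAME_IF_SET(ExecuteAccess);
    NAME_IF_SET(CoarseGrain);
    NAME_IF_SET(AQLQueueMemory);
    NAME_IF_SET(FixedAddress);
    NAME_IF_SET(NoNUMABind);
    NAME_IF_SET(Uncached);
    NAME_IF_SET(NoAddress);
    NAME_IF_SET(OnlyAddress);
    NAME_IF_SET(ExtendedCoherent);
    NAME_IF_SET(GTTAccess);
    NAME_IF_SET(Contiguous);
    NAME_IF_SET(ExecuteBlit);
}
```

Testing each field by name, instead of masking Value with hard-coded bit positions, keeps the helper independent of how the compiler lays out the bit-fields.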

Example: Allocating a GPU Buffer for Compute

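HsaMemFlags flags = {0};                      /* start with every flag cleared */
flags.ui32.NonPaged = 1;                      /* non-paged: pages stay locked in RAM */
flags.ui32.CachePolicy = HSA_CACHING_CACHED;  /* cached memory */
flags.ui32.PageSize = HSA_PAGE_SIZE_4KB;      /* 4 KB pages */
flags.ui32.HostAccess = 1;                    /* the CPU may access the buffer too */
flags.ui32.ExecuteAccess = 0;                 /* data buffer, not executable code */
flags.ui32.CoarseGrain = 1;                   /* consistency enforced at sync points */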

This configuration requests a non-paged, cached, 4KB page size buffer that is accessible by both CPU and GPU, with coarse-grained consistency.

Conclusion

The HsaMemFlags structure is a powerful and flexible mechanism for controlling memory allocation in ROCm's heterogeneous computing environment. By providing a comprehensive set of flags, it enables precise control over memory properties, access patterns, and performance characteristics. Understanding and using these flags appropriately is essential for developing high-performance, reliable, and portable applications on AMD platforms.

Whether allocating buffers for compute, graphics, synchronization, or interop, HsaMemFlags provides the necessary tools to meet the diverse requirements of modern GPU-accelerated workloads.
