Overview
The HsaMemFlags structure is a fundamental part of the ROCm (Radeon Open Compute) stack, specifically within the HSAKMT (HSA Kernel Mode Thunk) layer. It encapsulates a rich set of flags that control the behavior and properties of memory allocations for both CPU and GPU usage. These flags allow fine-grained control over how memory is allocated, mapped, accessed, and managed in heterogeneous systems, supporting a wide range of use cases from compute kernels to graphics interop and system-level optimizations.
This document provides a detailed explanation of the HsaMemFlags structure, its fields, their meanings, and how they influence memory allocation and usage in ROCm environments.
Structure Definition
typedef struct _HsaMemFlags
{
    union
    {
        struct
        {
            unsigned int NonPaged            : 1;
            unsigned int CachePolicy         : 2;
            unsigned int ReadOnly            : 1;
            unsigned int PageSize            : 2;
            unsigned int HostAccess          : 1;
            unsigned int NoSubstitute        : 1;
            unsigned int GDSMemory           : 1;
            unsigned int Scratch             : 1;
            unsigned int AtomicAccessFull    : 1;
            unsigned int AtomicAccessPartial : 1;
            unsigned int ExecuteAccess       : 1;
            unsigned int CoarseGrain         : 1;
            unsigned int AQLQueueMemory      : 1;
            unsigned int FixedAddress        : 1;
            unsigned int NoNUMABind          : 1;
            unsigned int Uncached            : 1;
            unsigned int NoAddress           : 1;
            unsigned int OnlyAddress         : 1;
            unsigned int ExtendedCoherent    : 1;
            unsigned int GTTAccess           : 1;
            unsigned int Contiguous          : 1;
            unsigned int ExecuteBlit         : 1;
            unsigned int Reserved            : 8;
        } ui32;
        HSAuint32 Value;
    };
} HsaMemFlags;
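Because the bitfield view (ui32) and the 32-bit Value share a union, the flags can be manipulated either bit by bit or as one word. A minimal sketch using only the definitions above:

HsaMemFlags flags;
flags.Value = 0;                  /* clear all 32 bits at once through the union */
flags.ui32.NonPaged   = 1;        /* then set individual bits through the bitfield view */
flags.ui32.HostAccess = 1;
/* flags.Value now holds the packed word that is passed down to the driver */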
Field-by-Field Explanation
1. NonPaged (1 bit)
- Purpose: Controls whether the allocated memory is pageable or non-paged (locked in RAM).
- Usage: Non-paged memory is required for certain GPU operations to avoid page faults and ensure deterministic access latency.
2. CachePolicy (2 bits)
- Purpose: Specifies the caching policy for the memory (e.g., cached, non-cached, write-combined).
- Usage: Influences performance and coherency. For example, cached memory is faster for CPU access, while non-cached may be required for device access.
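In practice the two CachePolicy bits are set with the HSA_CACHING_TYPE enumerators from hsakmttypes.h rather than raw numbers; a minimal sketch (enumerator names as found in the public header, verify against your ROCm version):

HsaMemFlags flags = {0};
flags.ui32.CachePolicy = HSA_CACHING_CACHED;   /* CPU-cached access */
/* alternatives: HSA_CACHING_NONCACHED, HSA_CACHING_WRITECOMBINED */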
3. ReadOnly (1 bit)
- Purpose: Marks the memory as read-only.
- Usage: Useful for buffers that should not be modified, such as constant data or code segments.
4. PageSize (2 bits)
- Purpose: Selects the page size for the allocation (e.g., 4KB, 64KB, 2MB, 1GB).
- Usage: Larger page sizes can reduce TLB misses and improve performance for large buffers.
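Similarly, the PageSize bits take the HSA_PAGE_SIZE enumerators from hsakmttypes.h; a minimal sketch:

HsaMemFlags flags = {0};
flags.ui32.PageSize = HSA_PAGE_SIZE_4KB;   /* smallest, most flexible page size */
/* larger options: HSA_PAGE_SIZE_64KB, HSA_PAGE_SIZE_2MB, HSA_PAGE_SIZE_1GB */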
5. HostAccess (1 bit)
- Purpose: Indicates whether the memory should be accessible by the host CPU.
- Usage: If not set, the memory is GPU-only, which can improve security and performance.
6. NoSubstitute (1 bit)
- Purpose: Prevents fallback to system memory if the requested memory type is unavailable.
- Usage: Ensures strict allocation semantics, failing the allocation if the preferred memory is not available.
7. GDSMemory (1 bit)
- Purpose: Requests allocation from the Global Data Store (GDS) heap.
- Usage: GDS is a special GPU memory region for synchronization and atomic operations.
8. Scratch (1 bit)
- Purpose: Requests allocation from the GPU scratch area.
- Usage: Scratch memory is used for temporary storage during kernel execution.
9. AtomicAccessFull (1 bit)
- Purpose: Allocates and maps memory to support all atomic operations.
- Usage: On APUs, this uses the ATC path for system memory, ensuring full atomic support.
10. AtomicAccessPartial (1 bit)
- Purpose: Allocates memory for partial atomic support, focused on PCIe atomics for dGPUs.
- Usage: Supports only a subset of atomic operations (SWAP, CAS, FetchAdd).
11. ExecuteAccess (1 bit)
- Purpose: Indicates the memory will be used for executable code.
- Usage: Influences page attributes and may be required for queue memory or code buffers.
12. CoarseGrain (1 bit)
- Purpose: Specifies coarse-grained memory consistency.
- Usage: Memory consistency is enforced at synchronization points, not on every access.
13. AQLQueueMemory (1 bit)
- Purpose: Indicates the memory will be used for AQL queue storage.
- Usage: Ensures optimal location and alignment for queue memory.
14. FixedAddress (1 bit)
- Purpose: Requests allocation at a specific virtual address.
- Usage: Useful for interop scenarios or when address layout is important.
15. NoNUMABind (1 bit)
- Purpose: Prevents binding system memory to a specific NUMA node.
- Usage: Allows the OS to allocate memory from any node, which may improve flexibility.
16. Uncached (1 bit)
- Purpose: Requests uncached memory for fine-grained allocations.
- Usage: Important on certain hardware platforms (A+A, i.e., an AMD CPU paired with an AMD GPU) where cache effects on fine-grained allocations must be controlled.
17. NoAddress (1 bit)
- Purpose: Allocates VRAM and returns a handle, but does not allocate virtual address space.
- Usage: Used for scenarios where only a memory handle is needed, not a mapped address.
18. OnlyAddress (1 bit)
- Purpose: Allocates virtual address space without backing VRAM.
- Usage: Useful for address reservation or deferred allocation.
19. ExtendedCoherent (1 bit)
- Purpose: Enables system-scope coherence for atomic instructions.
- Usage: Ensures atomic operations are coherent across all devices.
20. GTTAccess (1 bit)
- Purpose: Requests memory to be mapped to GART (Graphics Address Remapping Table) for MES (the MicroEngine Scheduler).
- Usage: Ensures memory is allocated in GTT space, typically for graphics or SDMA operations.
21. Contiguous (1 bit)
- Purpose: Requests contiguous VRAM allocation.
- Usage: Required for certain hardware features or performance optimizations.
22. ExecuteBlit (1 bit)
- Purpose: Indicates the memory is for blit kernel objects.
- Usage: Ensures proper allocation for graphics blit operations.
23. Reserved (8 bits)
- Purpose: Reserved for future expansion.
- Usage: Allows the structure to be extended without breaking ABI compatibility.
Usage Scenarios
1. GPU Buffer Allocation
When allocating a buffer for GPU computation, flags such as NonPaged, CachePolicy, PageSize, and HostAccess are set to ensure the buffer is accessible and performant for both CPU and GPU.
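As a sketch, a device-local (VRAM) variant of such a buffer on a dGPU could be requested roughly as follows; the node number and size are placeholders, and hsaKmtAllocMemory is the thunk allocation entry point (a host-visible variant appears in the example near the end of this document):

HsaMemFlags vflags = {0};
vflags.ui32.NonPaged     = 1;                   /* keep the buffer resident */
vflags.ui32.PageSize     = HSA_PAGE_SIZE_4KB;
vflags.ui32.HostAccess   = 0;                   /* GPU-only: the CPU will not map it */
vflags.ui32.CoarseGrain  = 1;                   /* consistency enforced at sync points */
vflags.ui32.NoSubstitute = 1;                   /* fail instead of falling back to system RAM */

void *vbuf = NULL;
HSAKMT_STATUS st = hsaKmtAllocMemory(1 /* GPU node, placeholder */, 4096, vflags, &vbuf);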
2. Queue Memory
For queue memory (used for dispatching commands to the GPU), ExecuteAccess and AQLQueueMemory are set to ensure the memory is executable and meets alignment requirements.
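As an illustrative sketch (not necessarily the exact combination used by the ROCm runtime), queue memory flags could be set up as follows:

HsaMemFlags qflags = {0};
qflags.ui32.NonPaged       = 1;   /* queue storage must stay resident */
qflags.ui32.HostAccess     = 1;   /* CPU writes AQL packets, GPU reads them */
qflags.ui32.ExecuteAccess  = 1;   /* executable page attributes */
qflags.ui32.AQLQueueMemory = 1;   /* queue-specific placement and alignment in KFD */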
3. Atomic Operations
When atomic operations are required, AtomicAccessFull or AtomicAccessPartial are set to ensure the memory supports the necessary atomic semantics.
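A short sketch of the two options (normally only one of the bits is set):

HsaMemFlags aflags = {0};
aflags.ui32.AtomicAccessFull = 1;        /* full atomic support (ATC path on APUs) */
/* or, for dGPUs limited to PCIe atomics (Swap, CAS, FetchAdd): */
/* aflags.ui32.AtomicAccessPartial = 1; */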
4. Graphics Interop
For graphics interop scenarios, flags like GTTAccess, Contiguous, and ExecuteBlit are used to allocate memory compatible with graphics engines.
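An illustrative sketch of the individual interop-related bits (they are not necessarily combined in a single allocation):

HsaMemFlags gflags = {0};
gflags.ui32.Contiguous = 1;        /* physically contiguous VRAM, if the engine requires it */
/* gflags.ui32.GTTAccess = 1; */   /* alternatively, place the buffer in GART/GTT space */
/* gflags.ui32.ExecuteBlit = 1; */ /* or mark it as a blit kernel object */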
5. NUMA Optimization
NoNUMABind can be used to allow the OS to allocate memory from any NUMA node, which may be beneficial for certain workloads.
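A minimal sketch; the bit only relaxes NUMA placement and implies nothing else:

HsaMemFlags nflags = {0};
nflags.ui32.NoNUMABind = 1;   /* let the kernel place the system pages on any NUMA node */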
6. Specialized Memory Regions
Flags like GDSMemory and Scratch are used to allocate memory from specialized GPU regions for synchronization or temporary storage.
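An illustrative sketch; in practice GDS and scratch allocations are usually made by the runtime rather than by applications directly:

HsaMemFlags sflags = {0};
sflags.ui32.Scratch = 1;          /* back per-wave temporary (scratch) storage */
/* sflags.ui32.GDSMemory = 1; */  /* or allocate from the Global Data Store heap */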
How Flags Affect Allocation
The flags in HsaMemFlags are interpreted by the HSAKMT thunk and the underlying kernel driver (KFD). They influence:
- Where the memory is allocated: System RAM, VRAM, GDS, scratch, etc.
- How the memory is mapped: Page size, cache policy, atomic support, etc.
- Who can access the memory: CPU, GPU, or both.
- Consistency and coherency: Fine-grained vs. coarse-grained, extended coherence.
- Performance characteristics: Contiguity, caching, NUMA binding.
The combination of flags allows the application or runtime to tailor memory allocations to the specific needs of the workload and hardware.
Example: Allocating a GPU Buffer for Compute
HsaMemFlags flags = {0};
flags.ui32.NonPaged = 1;
flags.ui32.CachePolicy = HSA_CACHING_CACHED;
flags.ui32.PageSize = HSA_PAGE_SIZE_4KB;
flags.ui32.HostAccess = 1;
flags.ui32.ExecuteAccess = 0;
flags.ui32.CoarseGrain = 1;
This configuration requests a non-paged, cached, 4KB page size buffer that is accessible by both CPU and GPU, with coarse-grained consistency.
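Continuing from the flags variable above, a minimal end-to-end sketch using the public hsakmt.h entry points (node number, size, and error handling are simplified, and the thunk is assumed to be initialized already via hsaKmtOpenKFD):

void *buf = NULL;
HSAuint64 gpuVA = 0;
HSAuint64 size = 4096;

if (hsaKmtAllocMemory(0 /* node */, size, flags, &buf) == HSAKMT_STATUS_SUCCESS) {
    /* make the buffer visible to the GPU; gpuVA receives the GPU virtual address */
    if (hsaKmtMapMemoryToGPU(buf, size, &gpuVA) == HSAKMT_STATUS_SUCCESS) {
        /* ... use the buffer from CPU and GPU ... */
        hsaKmtUnmapMemoryToGPU(buf);
    }
    hsaKmtFreeMemory(buf, size);
}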
Conclusion
The HsaMemFlags structure is a powerful and flexible mechanism for controlling memory allocation in ROCm's heterogeneous computing environment. By providing a comprehensive set of flags, it enables precise control over memory properties, access patterns, and performance characteristics. Understanding and using these flags appropriately is essential for developing high-performance, reliable, and portable applications on AMD platforms.
Whether allocating buffers for compute, graphics, synchronization, or interop, HsaMemFlags provides the necessary tools to meet the diverse requirements of modern GPU-accelerated workloads.