AMD rocr-libhsakmt Analysis Series 4-2: Implementation Analysis of fmm_allocate_memory_object

DeeplyMind

Last modified 2025-09-29 10:44:29

Licensed under CC 4.0 BY-SA

Category: AMD ROCm Runtime Library Implementation Analysis. Tags: artificial intelligence, ROCm, rocr, KFD, kfd, AMDGPU

First published 2025-09-26 17:06:00

Article link: https://blog.youkuaiyun.com/shenjunpeng/article/details/152129426

Overview

The function fmm_allocate_memory_object is a critical part of the ROCm (Radeon Open Compute) stack, specifically within the HSAKMT (HSA Kernel Mode Thunk) memory management subsystem. Its primary responsibility is to allocate a memory object for a given GPU, managing both the virtual address space and the underlying device memory. This function is invoked during device memory allocation requests and is essential for tracking, managing, and releasing GPU memory resources.

This document provides a comprehensive analysis of the implementation, workflow, and technical details of fmm_allocate_memory_object, including its interactions with other components, error handling, and design considerations.

Function Signature

static vm_object_t *fmm_allocate_memory_object(
    uint32_t gpu_id,
    void *mem,
    uint64_t MemorySizeInBytes,
    manageable_aperture_t *aperture,
    uint64_t *mmap_offset,
    uint32_t ioc_flags
)

gpu_id: The target GPU identifier for the allocation.
mem: The base virtual address for the allocation.
MemorySizeInBytes: The total size of memory to allocate.
aperture: The aperture (address space region) in which the allocation is made.
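mmap_offset: In/out pointer. For userptr allocations, the value it points to is passed to the kernel in args.mmap_offset; after the first successful ioctl, the value returned in args.mmap_offset is written back through it.
ioc_flags: Flags controlling the allocation behavior (e.g., VRAM, userptr, etc.).

High-Level Workflow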

The function follows these major steps:
Input Validation: Returns NULL immediately if mem is a null pointer.
Preparation of IOCTL Arguments: Sets up the arguments for the kernel ioctl call to allocate GPU memory.
Handling Address Calculation: Adjusts the virtual address based on aperture and allocation flags.
Device Memory Allocation Loop: Issues the allocation ioctl in a loop; userptr requests larger than BIGGEST_SINGLE_BUF_SIZE are split into multiple buffers.
Object Creation and Registration: Creates a vm_object_t to track the allocation and registers it in the aperture's red-black tree.
Error Handling and Cleanup: Ensures proper cleanup in case of allocation failures.
Return Value: Returns the created vm_object_t on success, or NULL on failure.

Detailed Implementation Analysis

1. Input Validation

The function begins by validating the input parameters:

if (!mem)
    return NULL;

If the memory pointer is null, the function immediately returns NULL, indicating failure.

2. Preparation of IOCTL Arguments

The function prepares the arguments for the kernel ioctl call (AMDKFD_IOC_ALLOC_MEMORY_OF_GPU), which is responsible for allocating memory on the GPU:

struct kfd_ioctl_alloc_memory_of_gpu_args args = {0};
args.gpu_id = gpu_id;
args.flags = ioc_flags | KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE;
args.va_addr = (uint64_t)mem;

gpu_id: Specifies the target GPU.
flags: The caller's flags OR-ed with KFD_IOC_ALLOC_MEM_FLAGS_NO_SUBSTITUTE, which, as its name suggests, asks the kernel not to substitute a different memory type for the one requested.
va_addr: The virtual address for the allocation.

Address Adjustment

On non-dGPU (APU) systems, a VRAM allocation passes its address to the kernel as an offset from the aperture base rather than as an absolute virtual address:

if (!hsakmt_is_dgpu && (ioc_flags & KFD_IOC_ALLOC_MEM_FLAGS_VRAM))
    args.va_addr = VOID_PTRS_SUB(mem, aperture->base);

For allocations in the mem_handle_aperture, which are used for buffer handles without valid virtual addresses, the address is set to zero:

if (aperture == &mem_handle_aperture)
    args.va_addr = 0;
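
The two adjustments above can be condensed into one small pure function. This is a minimal sketch, not the library's code: `pick_va_addr`, `FLAG_VRAM`, and the boolean parameters are hypothetical stand-ins for `hsakmt_is_dgpu`, `KFD_IOC_ALLOC_MEM_FLAGS_VRAM`, and the `aperture == &mem_handle_aperture` comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for KFD_IOC_ALLOC_MEM_FLAGS_VRAM; value is illustrative. */
#define FLAG_VRAM (1u << 0)

/*
 * Mirrors the order of the checks in the excerpts above:
 *   1. default to the CPU virtual address,
 *   2. on an APU, express a VRAM address as an offset from the aperture base,
 *   3. for the handle-only aperture, pass 0 regardless of the earlier steps.
 */
static uint64_t pick_va_addr(uint64_t mem, uint64_t aperture_base,
                             bool is_dgpu, bool is_handle_aperture,
                             uint32_t ioc_flags)
{
    uint64_t va_addr = mem;

    if (!is_dgpu && (ioc_flags & FLAG_VRAM)) {
        assert(mem >= aperture_base); /* the offset must not underflow */
        va_addr = mem - aperture_base;
    }

    if (is_handle_aperture)
        va_addr = 0;

    return va_addr;
}
```

Because the handle-aperture check comes last, it overrides the APU adjustment, which matches the order of the two `if` statements in the excerpts.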

3. Device Memory Allocation Loop

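The loop can split a request into multiple buffers, each no larger than a predefined maximum (BIGGEST_SINGLE_BUF_SIZE). In the code shown below, only userptr allocations are capped this way; every other allocation type passes its full size to the kernel in a single ioctl. Splitting is needed because the kernel driver may not support very large single allocations.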

Initialization

total_size = 0;
if (ioc_flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
    size = MemorySizeInBytes < BIGGEST_SINGLE_BUF_SIZE ?
        MemorySizeInBytes : BIGGEST_SINGLE_BUF_SIZE;
    offset = *mmap_offset;
    args.mmap_offset = *mmap_offset;
} else {
    size = MemorySizeInBytes;
}

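For userptr allocations, the first chunk is capped at BIGGEST_SINGLE_BUF_SIZE, and *mmap_offset seeds both the local offset and args.mmap_offset.
For all other allocations, size is the full request, so the loop body runs exactly once.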

Allocation Loop

do {
    args.size = size;

    if (hsakmt_ioctl(hsakmt_kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args))
        goto err_hsakmt_ioctl_failed;

    // Object creation and registration
    ...
    args.va_addr += size;
    offset += size;

    if (ioc_flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR)
        args.mmap_offset = offset;

    total_size += size;
    if (total_size + BIGGEST_SINGLE_BUF_SIZE > MemorySizeInBytes)
        size = MemorySizeInBytes - total_size;
    else
        size = BIGGEST_SINGLE_BUF_SIZE;
} while (total_size < MemorySizeInBytes);

The loop continues until the total allocated size matches the requested size.
Each iteration allocates a chunk of memory and updates the address and offset for the next chunk.
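
The size bookkeeping in this loop is easier to follow in isolation. The sketch below replays it with the ioctl replaced by a recording step. `plan_chunks`, `MAX_CHUNK`, and the output array are hypothetical names, and `MAX_CHUNK` is a small illustrative value, not the real `BIGGEST_SINGLE_BUF_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative limit only; NOT the real BIGGEST_SINGLE_BUF_SIZE. */
#define MAX_CHUNK 4096u

/*
 * Replays the size bookkeeping of the allocation loop without calling the
 * kernel. Writes each chunk size into out[] (up to cap entries) and returns
 * the number of chunks the loop would request.
 */
static size_t plan_chunks(uint64_t total_bytes, bool is_userptr,
                          uint64_t *out, size_t cap)
{
    uint64_t size, done = 0;
    size_t n = 0;

    assert(out != NULL || cap == 0);

    /* Only userptr requests are capped up front, as in the excerpt. */
    if (is_userptr)
        size = total_bytes < MAX_CHUNK ? total_bytes : MAX_CHUNK;
    else
        size = total_bytes;

    do {
        if (n < cap)
            out[n] = size;  /* stands in for the ioctl with args.size = size */
        n++;

        done += size;
        if (done + MAX_CHUNK > total_bytes)
            size = total_bytes - done;  /* final, possibly short, chunk */
        else
            size = MAX_CHUNK;
    } while (done < total_bytes);

    return n;
}
```

With the 4096-byte limit assumed here, a 10000-byte userptr request is planned as 4096 + 4096 + 1808 bytes, while the same request without the userptr flag stays a single 10000-byte chunk.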

4. Object Creation and Registration

After a successful allocation, the function creates a vm_object_t to track the allocation:

if (!vm_obj) {
    pthread_mutex_lock(&aperture->fmm_mutex);
    vm_obj = aperture_allocate_object(aperture, mem, args.handle,
            MemorySizeInBytes, mflags);

    pthread_mutex_unlock(&aperture->fmm_mutex);
    if (!vm_obj)
        goto err_object_allocation_failed;

    if (mmap_offset)
        *mmap_offset = args.mmap_offset;
} else {
    vm_obj->handles[vm_obj->handle_num++] = args.handle;
}

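The first chunk creates the object while holding the aperture's fmm_mutex, registers it in the aperture's red-black tree, and, if mmap_offset is non-null, writes the kernel-returned args.mmap_offset back through it.
Subsequent chunks append their handles to the object's handle array.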
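
The create-once, append-afterwards pattern can be sketched with a simplified tracking structure. Everything here is hypothetical: `struct tracked_object` is a reduced stand-in for `vm_object_t`, `record_chunk` and `release_object` are illustrative helpers, and the capacity check is an addition of this sketch rather than something shown in the excerpt.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for vm_object_t; the real struct has more fields. */
struct tracked_object {
    uint64_t *handles;     /* one kernel handle per allocated chunk */
    uint32_t  handle_num;  /* number of valid entries in handles[] */
    uint32_t  capacity;    /* entries available in handles[] */
};

/*
 * Records the handle of one chunk. The first call (obj == NULL) creates the
 * tracking object sized for max_chunks handles; later calls append to it.
 * Returns the object, or NULL if allocation fails or the array is full
 * (in the latter case the caller still owns obj).
 */
static struct tracked_object *record_chunk(struct tracked_object *obj,
                                           uint64_t handle,
                                           uint32_t max_chunks)
{
    if (!obj) {
        if (max_chunks == 0)
            return NULL;
        obj = calloc(1, sizeof(*obj));
        if (!obj)
            return NULL;
        obj->handles = calloc(max_chunks, sizeof(*obj->handles));
        if (!obj->handles) {
            free(obj);
            return NULL;
        }
        obj->capacity = max_chunks;
    }

    assert(obj->handle_num <= obj->capacity);
    if (obj->handle_num == obj->capacity)
        return NULL;

    obj->handles[obj->handle_num++] = handle;
    return obj;
}

/* Frees the tracking object and its handle array; NULL is accepted. */
static void release_object(struct tracked_object *obj)
{
    if (!obj)
        return;
    free(obj->handles);
    free(obj);
}
```

Keeping every chunk's handle in one object means a later release path can walk `handles[0..handle_num)` and free each kernel buffer in turn.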

Example Workflow

A user requests a userptr allocation on GPU 0 that is larger than BIGGEST_SINGLE_BUF_SIZE.
fmm_allocate_memory_object is called with the appropriate parameters.
The function splits the request into chunks of at most BIGGEST_SINGLE_BUF_SIZE. (In the code shown above, a non-userptr request such as 10GB of VRAM is not split; it goes to the kernel as a single chunk.)
For each chunk:
- Prepares ioctl arguments.
- Calls the kernel to allocate memory.
- On success, the first chunk creates the object and registers it in the aperture's tree; each later chunk adds its handle to the object's handle array.
After all chunks are allocated, the function returns the object for further use.

Conclusion

fmm_allocate_memory_object manages the allocation of GPU memory objects in the ROCm HSAKMT subsystem. It hides chunked userptr allocation, aperture-specific address handling, and the KFD ioctl behind a single call, and it serializes object creation with the aperture's fmm_mutex. The resulting vm_object_t records every kernel handle that belongs to the allocation, so the memory can be tracked and released later.

The function sits at the center of ROCm memory management: it serves per-GPU allocation through gpu_id, userptr allocations, and handle-only allocations in mem_handle_aperture.
