Suggestion Description
hsa_amd_ipc_memory_create permanently pins GPU memory at the HSA driver level. The ROCm HSA runtime provides no API to release this pin from the creating process. hsa_amd_ipc_memory_detach only works on the receiving (attaching) side.
This leaves the GPU memory unrecoverable once the application is done with it: the allocation stays pinned even after the tensors are freed, allocator caches are emptied, and garbage collection has run.
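For reference, the relevant API surface (prototypes as declared in hsa_ext_amd.h) is asymmetric: attach and detach pair up on the importing side, but create on the exporting side has no counterpart:

```c
/* Exporter: pins the allocation at the driver level; no matching release exists. */
hsa_status_t hsa_amd_ipc_memory_create(void* ptr, size_t len,
                                       hsa_amd_ipc_memory_t* handle);

/* Importer: maps a view of the exported buffer into this process. */
hsa_status_t hsa_amd_ipc_memory_attach(const hsa_amd_ipc_memory_t* handle, size_t len,
                                       uint32_t num_agents, const hsa_agent_t* mapping_agents,
                                       void** mapped_ptr);

/* Importer: unmaps its own view only; does not touch the exporter-side pin. */
hsa_status_t hsa_amd_ipc_memory_detach(void* mapped_ptr);
```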
In vLLM (LLM serving engine), KV caches can consume 120+ GB of GPU memory. When the engine registers these buffers with UCX for potential inter-node transfers, hsa_amd_ipc_memory_create is called during uct_rocm_ipc_pack_key (via ucp_mem_map → mem_reg). After engine shutdown, this memory is permanently leaked at the driver level — even though PyTorch has freed all tensors and the CUDA cache is empty.
This makes it impossible to reuse the GPU within the same process (e.g., running multiple test iterations, reinitializing the engine, etc.).
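As a stopgap, UCX's transport selection can exclude the ROCm IPC transport so that ucp_mem_map never packs an IPC key. This is a sketch assuming UCX's standard UCX_TLS negation syntax, and it gives up intra-node device-to-device IPC transfers entirely:

```shell
# Exclude the rocm_ipc transport; registration then skips uct_rocm_ipc_pack_key
export UCX_TLS=^rocm_ipc
```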
Reproduction
import torch
import ctypes
import gc
hsa = ctypes.CDLL('/opt/rocm/lib/libhsa-runtime64.so.1')
class hsa_amd_ipc_memory_t(ctypes.Structure):
    _fields_ = [('handle', ctypes.c_uint32 * 8)]
# Measure baseline
free_before, total = torch.cuda.mem_get_info()
print(f"Before: {free_before / 1024**3:.1f} GB free / {total / 1024**3:.1f} GB total")
# Allocate 50 GB GPU tensor
t = torch.zeros(50 * 1024**3 // 4, dtype=torch.float32, device='cuda')
# Create IPC handle (this is what UCX does internally during ucp_mem_map)
handle = hsa_amd_ipc_memory_t()
status = hsa.hsa_amd_ipc_memory_create(
    ctypes.c_void_p(t.data_ptr()),
    ctypes.c_size_t(t.nelement() * t.element_size()),
    ctypes.byref(handle),
)
print(f"hsa_amd_ipc_memory_create status: {status}")
# Free everything
del t, handle
gc.collect()
torch.cuda.empty_cache()
# Measure after cleanup
free_after, _ = torch.cuda.mem_get_info()
leaked = (free_before - free_after) / 1024**3
print(f"After: {free_after / 1024**3:.1f} GB free")
print(f"Leaked: {leaked:.1f} GB")
# Expected: Leaked: ~50 GB (memory is permanently pinned)
Requested API
A function to release the IPC pin from the creating process, e.g.:
hsa_status_t hsa_amd_ipc_memory_destroy(void *ptr, size_t len);
or
hsa_status_t hsa_amd_ipc_memory_release(hsa_amd_ipc_memory_t *handle);
This would allow UCX (and other IPC consumers) to:
- Create IPC handles when needed for sharing
- Release them when the memory is no longer being shared
- Free the GPU memory normally afterward
Without this API, any GPU memory that has ever been IPC-shared is permanently pinned for the lifetime of the process.
Operating System
Ubuntu 22.04
GPU
MI355
ROCm Component
vLLM