Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX …#208
…Spark)
On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total == system MemTotal (~121 GB). This causes nvitop to display full system RAM as GPU memory instead of actually allocatable memory.
Fix: detect UMA by comparing the NVML total against the system virtual memory total. If total >= 90% of system RAM, treat the device as unified memory and use system virtual memory (MemAvailable) for display instead. Preserves existing behavior for discrete GPUs.
Note: requires validation on GB10 / DGX Spark hardware. The fix has not been independently validated on a coherent UMA system.
Hi @parallelArchitect, thanks for raising this! Before we investigate, can you elaborate on the current behavior on GB10 machines? For example, can you share the output of the following?
from nvitop import *
print(host.virtual_memory())
print(Device(0).as_snapshot())
Thanks for the quick response. One additional note: the existing unified memory path returns vm.total (MemTotal) as the display total. On GB10, MemAvailable would be more accurate since it reflects actually allocatable memory after kernel reservations and page cache. Happy to update that too if you agree, but wanted to flag it separately rather than expanding the PR scope without discussion.
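To make the MemTotal versus MemAvailable distinction concrete, here is a minimal sketch that parses /proc/meminfo-style text. The helper name `parse_meminfo` and the sample values are hypothetical and chosen only for illustration; they were not measured on GB10 hardware.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> bytes."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        # Most fields are reported in kB; convert those to bytes.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


# Illustrative sample only (assumed values, not captured from a real system).
SAMPLE = """\
MemTotal:       126877696 kB
MemFree:          8388608 kB
MemAvailable:   104857600 kB
"""

info = parse_meminfo(SAMPLE)
print(info["MemTotal"] - info["MemAvailable"])  # bytes not currently allocatable
```

On a real Linux system the same function could be fed the contents of /proc/meminfo. MemAvailable is the kernel's estimate of memory obtainable without swapping, which is why it is typically lower than MemTotal.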
Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)
On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total equal to system MemTotal (~121 GB). This causes nvitop to display full system RAM as GPU memory instead of actually allocatable memory.
The existing NVMLError_NotSupported path correctly handles some UMA platforms, but GB10 returns NVML_SUCCESS rather than NOT_SUPPORTED, so it falls through to the discrete GPU path and displays wrong values.
Issue Type
Description
Detect coherent UMA by comparing the NVML-reported total against the system virtual memory total. If total >= 90% of system RAM, classify the device as unified memory and use system virtual memory (MemAvailable) for display instead.
Preserves existing behavior for discrete GPUs.
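The heuristic described above can be sketched as two pure functions. This is an illustration under stated assumptions, not the code in this PR: the function names, the parameter names, and the way the threshold is passed are all hypothetical, and the inputs are plain byte counts rather than live NVML or host queries.

```python
def is_probably_unified_memory(nvml_total, system_total, threshold=0.9):
    """Return True when the NVML-reported total is >= threshold * system RAM.

    Hypothetical helper: both arguments are byte counts supplied by the caller.
    """
    if nvml_total <= 0 or system_total <= 0:
        return False  # missing or invalid readings: keep the discrete-GPU path
    return nvml_total >= threshold * system_total


def display_memory_total(nvml_total, system_total, system_available, threshold=0.9):
    """Pick the total to display: MemAvailable on UMA, the NVML total otherwise."""
    if is_probably_unified_memory(nvml_total, system_total, threshold):
        return system_available
    return nvml_total


GiB = 1024**3
# UMA-like case: NVML total matches system RAM, so MemAvailable is shown.
print(display_memory_total(121 * GiB, 121 * GiB, 100 * GiB) // GiB)
# Discrete-like case: NVML total is far below system RAM, so it is shown as-is.
print(display_memory_total(24 * GiB, 128 * GiB, 100 * GiB) // GiB)
```

One consequence of a ratio-based check is that a discrete GPU whose VRAM is at least 90% of host RAM (for example, a large-memory card in a host with little RAM) would also be classified as unified memory, so the threshold is a trade-off rather than a definitive test.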
Motivation and Context
Same root cause documented and fixed in:
Note
Requires validation on GB10 / DGX Spark hardware. The fix has not been independently validated on a coherent UMA system.