Fix NVML memory reporting regression on coherent UMA platforms (Fixes…#463
parallelArchitect wants to merge 1 commit into
For anyone who needs the NVML memory fix while this PR is under review, it is available in this fork: https://github.com/parallelArchitect/nvml-unified-shim
e756faf to e436c6b
Hey, sorry, I merged #466 before coming to your patch and resolved the merge conflicts. I'm a bit concerned about the workaround (>= 90% of RAM), as it could falsely classify many systems as UMA, e.g. one with 32GB of RAM and a dedicated GPU with 32GB of GDDR. Maybe for now the only system like this has 128GB of RAM (and discrete GPUs, afaik, don't go that far), so we could differentiate on this metric. I guess we'll have to see how NVIDIA exposes this through their NVML library before we can avoid this scenario properly.
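To make the false-positive concern concrete, here is a minimal sketch of the arithmetic. It is not code from the patch. The function names, the sample sizes (for instance 31 GiB of MemTotal on a nominal 32GB machine) and the 100 GiB size floor are all assumptions chosen for illustration; the floor only stands in for the "differentiate on total RAM size" idea mentioned above.

```python
GIB = 1024 ** 3

def ratio_check(nvml_total, mem_total, percent=90):
    """Heuristic under discussion: NVML total >= 90% of system MemTotal."""
    return mem_total > 0 and nvml_total * 100 >= mem_total * percent

def ratio_and_size_check(nvml_total, mem_total, min_total=100 * GIB):
    """Same heuristic, gated on a large MemTotal.

    min_total is a hypothetical floor, not a value taken from the PR.
    """
    return mem_total >= min_total and ratio_check(nvml_total, mem_total)

# (NVML total, system MemTotal) pairs; all sizes are illustrative.
cases = {
    "GB10-like, NVML total == MemTotal": (121 * GIB, 121 * GIB),
    "discrete 32 GiB GPU, 32GB RAM": (32 * GIB, 31 * GIB),
    "discrete 24 GiB GPU, 128GB RAM": (24 * GIB, 125 * GIB),
}

for name, (nvml_total, mem_total) in cases.items():
    print(f"{name}: ratio={ratio_check(nvml_total, mem_total)} "
          f"ratio+size={ratio_and_size_check(nvml_total, mem_total)}")
```

The second case is the problem: the ratio alone flags a discrete GPU as unified simply because its VRAM is about as large as system RAM. The size gate hides that case only for as long as no discrete GPU approaches the floor, so it is a stopgap rather than real detection.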
Valid concern on the threshold. The proper detection is via CUDA device attributes: cudaDevAttrPageableMemoryAccessUsesHostPageTables (true on GB10). Both together identify hardware-coherent UMA without relying on memory size comparison. An SM 12.1 check is also an option, but that's too narrow.
The NVML path is the problem: nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total == system MemTotal on GB10, which is correct behavior for UMA but breaks the has_unified_memory detection. The fix should detect UMA via device attributes first, then use /proc/meminfo MemAvailable for display on those platforms.
I don't have direct GB10 access, but azampatti and dustin1925 in the community do. Happy to coordinate validation if you want to iterate on a cleaner fix.
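A rough sketch of that ordering, with the attribute check deciding and MemAvailable used only afterwards. It assumes the attribute value has already been queried elsewhere (for example with cudaDeviceGetAttribute) and is passed in as an integer; display_capacity is a hypothetical helper, and only the one attribute named above is modeled.

```python
GIB = 1024 ** 3

def display_capacity(uses_host_page_tables, nvml_total, mem_available):
    """Return (is_unified, capacity_to_display).

    uses_host_page_tables: value of
        cudaDevAttrPageableMemoryAccessUsesHostPageTables, queried elsewhere.
    nvml_total: total reported by nvmlDeviceGetMemoryInfo, in bytes.
    mem_available: MemAvailable from /proc/meminfo, in bytes.
    """
    unified = bool(uses_host_page_tables)
    # Memory sizes play no part in the classification itself.
    return unified, (mem_available if unified else nvml_total)

# Illustrative values only: a GB10-like device, then a discrete GPU whose
# VRAM happens to match system RAM.
print(display_capacity(1, 121 * GIB, 100 * GIB))
print(display_capacity(0, 32 * GIB, 20 * GIB))
```

Because the decision never compares sizes, a discrete card with as much VRAM as system RAM stays classified as discrete.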
… #449)
On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total == system MemTotal (~121GB). This prevents has_unified_memory from being set, causing incorrect VRAM reporting and a broken memory graph since 3.3.1.
Fix: detect UMA by comparing NVML total against /proc/meminfo MemTotal. If total >= 90% of system RAM, classify as unified memory and use MemAvailable instead of MemTotal for display.
Note: requires validation on GB10 / DGX Spark hardware. Author does not have access to a coherent UMA system.
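For readers who want to see the heuristic end to end, the following is a minimal sketch of the logic described above, not the patch itself. The helper names and the sample meminfo text are assumptions for illustration; a real implementation would read /proc/meminfo and take the total from nvmlDeviceGetMemoryInfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # the kernel reports these fields in KiB
        fields[key.strip()] = value
    return fields

def looks_unified(nvml_total, mem_total, percent=90):
    """True when the NVML total is at least `percent` of system MemTotal."""
    return mem_total > 0 and nvml_total * 100 >= mem_total * percent

def display_capacity(nvml_total, meminfo):
    """Pick the figure to display: MemAvailable on UMA, NVML total otherwise."""
    if looks_unified(nvml_total, meminfo["MemTotal"]):
        return meminfo["MemAvailable"]
    return nvml_total

# Hypothetical sample: ~121 GiB MemTotal, with the NVML total equal to it.
sample = "MemTotal:       126877696 kB\nMemAvailable:   100000000 kB\n"
meminfo = parse_meminfo(sample)
print(display_capacity(meminfo["MemTotal"], meminfo))
```

Integer arithmetic is used for the 90% comparison so the result does not depend on floating-point rounding near the threshold.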
References
NVML API documentation on SOC/UMA behavior: https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html
Community NVML shim for GB10 UMA: https://forums.developer.nvidia.com/t/nvml-support-for-dgx-spark-grace-blackwell-unified-memory-community-solution/358869
NVML memory fix at the shim layer: https://github.com/parallelArchitect/nvml-unified-shim
btop PR: aristocratos/btop#1611
nvitop PR: XuehaiPan/nvitop#208