Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark) #208

Open

parallelArchitect wants to merge 2 commits into XuehaiPan:main from parallelArchitect:fix/gb10-coherent-uma-memory-reporting

Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX …#208
parallelArchitect wants to merge 2 commits into
XuehaiPan:mainfrom
parallelArchitect:fix/gb10-coherent-uma-memory-reporting

Conversation


parallelArchitect commented Apr 16, 2026

Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)

On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total equal to the system MemTotal (~121GB). This causes nvitop to display the full system RAM as GPU memory instead of the memory that is actually allocatable.

The existing NVMLError_NotSupported path correctly handles some UMA platforms, but GB10 returns NVML_SUCCESS — not NOT_SUPPORTED — so it falls through to the discrete GPU path and displays wrong values.

Issue Type

  • Bug fix

Description

Detect coherent UMA by comparing the NVML-reported total against the system virtual memory total. If the reported total is >= 90% of system RAM, classify the device as unified memory and use host memory statistics (MemAvailable) for the displayed values instead.

Preserves existing behavior for discrete GPUs.
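
A minimal sketch of the proposed heuristic (not the actual nvitop patch), assuming pynvml and psutil are available; the helper name and the 0.90 threshold constant are illustrative:

import pynvml
import psutil

UMA_TOTAL_RATIO = 0.90  # threshold from the description above; illustrative name

def memory_info_with_uma_fallback(handle):
    """Return (total, used, free) in bytes, falling back to host memory on coherent UMA.

    Assumes NVML is already initialized and `handle` was obtained via
    pynvml.nvmlDeviceGetHandleByIndex().
    """
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    vm = psutil.virtual_memory()
    # On GB10 / DGX Spark, NVML reports roughly the whole system RAM as GPU memory,
    # so a reported total close to the host total is treated as unified memory.
    if mem.total >= UMA_TOTAL_RATIO * vm.total:
        total = vm.total
        free = vm.available           # MemAvailable: memory that is actually allocatable
        used = total - free
        return total, used, free
    # Discrete GPU: keep NVML's numbers unchanged.
    return mem.total, mem.used, mem.free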

Motivation and Context

Same root cause documented and fixed in:

  • nvtop: Syllo/nvtop#463
  • btop: aristocratos/btop#1611

Note

Requires validation on GB10 / DGX Spark hardware. The fix has not been independently validated on a coherent UMA system.

XuehaiPan self-requested a review Apr 16, 2026 08:39
XuehaiPan self-assigned this Apr 16, 2026
XuehaiPan added the bug, upstream, pynvml, and api labels Apr 16, 2026
XuehaiPan (Owner) commented

Hi @parallelArchitect, thanks for raising this!

Before we investigate this, could you elaborate on the current behavior of the GB10 machines? Do you mean memory_info.total is consistently returning the wrong metric, namely the system MemTotal (~121GB)? I'm also wondering what the difference is between the system MemTotal and virtual_memory().total. Could you add a screenshot or the hardware specs?

from nvitop import *

print(host.virtual_memory())
print(Device(0).as_snapshot())

parallelArchitect (Author) commented Apr 16, 2026

Thanks for the quick response.
On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with memory_info.total approximately equal to the full system MemTotal (~121GB on a 128GB system). This is documented in the community NVML shim project: https://github.com/CINOAdam/nvml-unified-shim

The system MemTotal and virtual_memory().total should be equivalent — both reflect the total installed RAM. The issue is that on GB10, NVML reports this full value as GPU memory rather than returning NVML_ERROR_NOT_SUPPORTED or zero, which is what the existing code path expects for unified memory detection.

I don't have GB10 hardware to run the requested script. The fix is based on analysis of GB10 NVML behavior documented by community members with hardware. If someone with a GB10 can run the script you suggested, that would confirm the behavior.
The same fix has been applied to nvtop (Syllo/nvtop#463) and btop (aristocratos/btop#1611) where the root cause is identical.

One additional note: the existing unified-memory path returns vm.total (MemTotal) as the display total. On GB10, MemAvailable would be more accurate, since it reflects the memory that is actually allocatable after kernel reservations and page cache. Happy to update that too if you agree, but I wanted to flag it separately rather than expand the PR scope without discussion.
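
For reference, a quick way to see the two host-side numbers being discussed (an illustrative check only, assuming a Linux host with psutil installed): psutil's virtual_memory().total mirrors /proc/meminfo MemTotal, and .available is derived from MemAvailable.

import psutil

vm = psutil.virtual_memory()

# Parse /proc/meminfo into a dict of byte values.
meminfo = {}
with open('/proc/meminfo') as f:
    for line in f:
        key, value = line.split(':', 1)
        meminfo[key.strip()] = int(value.split()[0]) * 1024  # values are reported in kB

print(vm.total, meminfo['MemTotal'])          # expected to match
print(vm.available, meminfo['MemAvailable'])  # expected to be close; same kernel counter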
