Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark) #208

Open

parallelArchitect wants to merge 2 commits into XuehaiPan:main from parallelArchitect:fix/gb10-coherent-uma-memory-reporting

Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX …#208
parallelArchitect wants to merge 2 commits into
XuehaiPan:mainfrom
parallelArchitect:fix/gb10-coherent-uma-memory-reporting

Conversation


parallelArchitect commented Apr 16, 2026

Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)

On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total equal to the system MemTotal (~121GB). This causes nvitop to display the full system RAM as GPU memory instead of the memory that is actually allocatable.

The existing NVMLError_NotSupported path correctly handles some UMA platforms, but GB10 returns NVML_SUCCESS — not NOT_SUPPORTED — so it falls through to the discrete GPU path and displays wrong values.

Issue Type

  • Bug fix

Description

Detect coherent UMA by comparing the NVML-reported total against the system virtual memory total. If the reported total is >= 90% of system RAM, classify the device as unified memory and use host memory statistics (MemAvailable) for the displayed values instead.

Preserves existing behavior for discrete GPUs.
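
A minimal sketch of the proposed heuristic (not the actual nvitop patch), assuming pynvml and psutil are available; the helper name and the 0.90 threshold constant are illustrative:

import pynvml
import psutil

UMA_TOTAL_RATIO = 0.90  # threshold from the description above; illustrative name

def memory_info_with_uma_fallback(handle):
    """Return (total, used, free) in bytes, falling back to host memory on coherent UMA.

    Assumes NVML is already initialized and `handle` was obtained via
    pynvml.nvmlDeviceGetHandleByIndex().
    """
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    vm = psutil.virtual_memory()
    # On GB10 / DGX Spark, NVML reports roughly the whole system RAM as GPU memory,
    # so a reported total close to the host total is treated as unified memory.
    if mem.total >= UMA_TOTAL_RATIO * vm.total:
        total = vm.total
        free = vm.available           # MemAvailable: memory that is actually allocatable
        used = total - free
        return total, used, free
    # Discrete GPU: keep NVML's numbers unchanged.
    return mem.total, mem.used, mem.free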

Motivation and Context

Same root cause documented and fixed in:

  • nvtop: Syllo/nvtop#463
  • btop: aristocratos/btop#1611

Note

Requires validation on GB10 / DGX Spark hardware. The fix has not been independently validated on a coherent UMA system.

XuehaiPan self-requested a review Apr 16, 2026 08:39
XuehaiPan self-assigned this Apr 16, 2026
XuehaiPan added the bug, upstream, pynvml, and api labels Apr 16, 2026
XuehaiPan (Owner) commented

Hi @parallelArchitect, thanks for raising this!

Before we investigate this, could you elaborate on the current behavior of the GB10 machines? Do you mean memory_info.total is consistently returning the wrong metric, namely the system MemTotal (~121GB)? I'm also wondering what the difference is between the system MemTotal and virtual_memory().total. Could you add a screenshot or the hardware specs?

from nvitop import *

print(host.virtual_memory())
print(Device(0).as_snapshot())

parallelArchitect (Author) commented Apr 16, 2026

Thanks for the quick response.
On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with memory_info.total approximately equal to the full system MemTotal (~121GB on a 128GB system). This is documented in the community NVML shim project: https://github.com/CINOAdam/nvml-unified-shim

The system MemTotal and virtual_memory().total should be equivalent — both reflect the total installed RAM. The issue is that on GB10, NVML reports this full value as GPU memory rather than returning NVML_ERROR_NOT_SUPPORTED or zero, which is what the existing code path expects for unified memory detection.

I don't have GB10 hardware to run the requested script. The fix is based on analysis of GB10 NVML behavior documented by community members with hardware. If someone with a GB10 can run the script you suggested, that would confirm the behavior.
The same fix has been applied to nvtop (Syllo/nvtop#463) and btop (aristocratos/btop#1611) where the root cause is identical.

One additional note: the existing unified-memory path returns vm.total (MemTotal) as the display total. On GB10, MemAvailable would be more accurate, since it reflects the memory that is actually allocatable after kernel reservations and page cache. Happy to update that too if you agree, but I wanted to flag it separately rather than expand the PR scope without discussion.
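
For reference, a quick way to see the two host-side numbers being discussed (an illustrative check only, assuming a Linux host with psutil installed): psutil's virtual_memory().total mirrors /proc/meminfo MemTotal, and .available is derived from MemAvailable.

import psutil

vm = psutil.virtual_memory()

# Parse /proc/meminfo into a dict of byte values.
meminfo = {}
with open('/proc/meminfo') as f:
    for line in f:
        key, value = line.split(':', 1)
        meminfo[key.strip()] = int(value.split()[0]) * 1024  # values are reported in kB

print(vm.total, meminfo['MemTotal'])          # expected to match
print(vm.available, meminfo['MemAvailable'])  # expected to be close; same kernel counter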
