
fix: add missing delwin() calls in free_device_windows #467

Merged
Syllo merged 1 commit into Syllo:master from danbedford:fix/memory-leaks-in-free-device-windows
May 6, 2026

Conversation

Contributor

danbedford commented May 1, 2026

shader_cores, l2_cache_size, and exec_engines were allocated with newwin() in alloc_device_window() but never freed in free_device_windows(). This caused a memory leak every time device windows were torn down (e.g. on refresh or hot-unplug).

The function now frees all windows it allocates, preventing resource leaks in long-running sessions.

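For reference, a minimal sketch of the shape of the fix (the surrounding `delwin()` calls for the other windows in the real `free_device_windows()` are elided):

    #include <ncurses.h>

    /* Sketch: free_device_windows() now releases the three windows that
     * alloc_device_window() created with newwin() but never freed.
     * struct device_window is defined in interface_internal_common.h. */
    static void free_device_windows(struct device_window *dwin) {
      /* ... existing delwin() calls for the other windows ... */
      delwin(dwin->shader_cores);
      delwin(dwin->l2_cache_size);
      delwin(dwin->exec_engines);
    }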
danbedford added a commit to danbedford/nvtop that referenced this pull request May 3, 2026
…plot_window, and unsigned underflow guard

- free_device_windows: delwin() for shader_cores, l2_cache_size,
  exec_engines (upstream PR Syllo#467 fix/memory-leaks-in-free-device-windows)
- delete_all_windows: delwin() for plots[i].plot_window
  (upstream PR Syllo#468 fix/plot-window-memory-leak)
- nvtop_get_nvlink_info: guard against unsigned underflow in CLI
  throughput delta if hardware counter wraps or resets
danbedford added a commit to danbedford/nvtop that referenced this pull request May 3, 2026
Upstream: Syllo/nvtop (commit 095d91c "Remove unused function in ixml")
Fork: danbedford/nvtop, branch `nvlink`
GPU Tested: `NVIDIA GeForce RTX 3090`
Scope: 5 files changed, 706 insertions(+), 19 deletions(-)

---

## Overview

Extends nvtop with per-GPU NVLink info in otherwise unused space of the existing interface. When no NVLink-connected GPU is detected, layout and behavior are identical to upstream -- no visual or functional difference. The goal is to bring useful link data and throughput visibility to all nvtop users with NVLink-capable hardware, from consumer (2080, 3090 series) to datacenter (Ampere, Hopper, Blackwell series).

### Main bar (row 2, shown by default)

Appended at the end of row 2, after `power_info` -- NVLink version, link count, and aggregate throughput are displayed. Two display states:

NVLink-supported device with no bridge or no active links (0-link case; no row 2 padding compaction applied):

    NVL5 0x

With active links (row 2 padding compaction applied, throughput displayed). Example (theoretical fully saturated GB200 with NVLink 5.0):

    NVL5 18x 1.636 TiB/s

When NVLink is supported but no bridge is connected or links are inactive, only the version and link count are displayed -- no compaction is applied to reclaim space on row 2, since there is no throughput to show. The `NVL5 0x` text extends past the panel edge without affecting the layout. Only when active links exist does fan field compaction kick in (11 to 8 characters) to make room for the throughput value.

- **Label**: `NVL`, a minimal label for NVLink.
- **Version**: Marketing NVLink version via `nvmlDeviceGetNvLinkVersion`; raw NVML enum values require remapping (see the sketch after this list):
  - Raw 1 -> NVLink V1.0 -> Display 1
  - Raw 2 -> V2.0 -> 2
  - Raw 3 -> V2.2 -> 2
  - Raw 4 -> V3.0 -> 3
  - Raw 5 -> V3.1 -> 3
  - Raw 6 -> V4.0 -> 4
  - Raw 7 -> V5.0 -> 5
  - Raw 8 -> V6.0 -> 6 (assumed Rubin)
  Display shows single-digit major version due to limited space.
- **Link count**: Total physical links on the device (static hardware property). Maximum is 36 to future-proof for planned Nvidia Rubin.
- **Throughput**: Aggregate transmit plus receive utilization, currently read via an `nvidia-smi` CLI fallback for all NVLink-connected GPUs. This carries measurable overhead from forking a full binary and parsing its text output, but providing real throughput visibility to consumer GPU users outweighs the cost, and users without NVLink are isolated from it entirely. The 2-second interval is hardcoded and independent of the global nvtop refresh rate, capping CLI calls regardless of display speed. Uses "r" (raw) counters, which include payload plus protocol overhead and so reflect true bandwidth utilization. Parses "Link N: Raw Tx: NNNNN KiB" / "Raw Rx" per link. Delta = `(current - previous) / time_delta` per link, summed for the aggregate; an unsigned underflow guard checks `new >= old` before subtraction. No smoothing is applied -- raw accuracy over display smoothness. **TODO:** On datacenter GPUs with `nvmlDeviceGetNvLinkUtilizationCounter`, replace with a direct API call; keep the CLI fallback for consumer GPUs.
- **Layout compaction**: The Fan field shrinks from 11 to 8 characters ONLY when `any_device_has_nvlink_active` is true (at least one monitored GPU has active NVLink links). GPUs with NVLink hardware but no bridge (0-link case) do NOT get compaction -- `NVL3 0x` extends past the panel edge without needing reclaimed space. Panel width is determined by device name length (`device_name` column = `largest_device_name + 11`), so longer names produce more room for NVLink.
- **Throughput display**: Uses `print_data_at_scale()` (renamed from `print_pcie_at_scale()`) with IEC binary prefixes. The array bounds check was extended from `< 5` to `< 6` to support up to tebibytes/s (TiB/s) for Blackwell NVLink 5.0 devices at ~1.636 TiB/s aggregate. The `memory_prefix[]` array already contains entries up to "Pi" -- only the loop guard needed updating.
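A compact sketch of the raw-to-display remapping from the version table above (the helper name is illustrative; the real mapping lives in `src/extract_gpuinfo_nvidia.c`):

    /* Illustrative helper: collapse the raw nvmlDeviceGetNvLinkVersion
     * value to the single-digit marketing major version shown after "NVL". */
    static unsigned nvlink_display_version(unsigned raw) {
      switch (raw) {
      case 1:         return 1; /* NVLink 1.0 */
      case 2: case 3: return 2; /* 2.0, 2.2 */
      case 4: case 5: return 3; /* 3.0, 3.1 */
      case 6:         return 4; /* 4.0 */
      case 7:         return 5; /* 5.0 */
      case 8:         return 6; /* 6.0 (assumed Rubin) */
      default:        return 0; /* unknown/unsupported */
      }
    }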

### Extra GPU info bar (row 4, not shown by default)

Appended at the end of row 4, after `exec_engines` -- error and correction counters accumulated since nvtop launch. Example with zeroed counters:

    NVL E:00000 C:00000

Example with non-zero counters (errors in red, corrections in yellow):

    NVL E:00420 C:00069

- **Label**: `NVL`, a minimal label for NVLink.
- **Error counters**: Replay, recovery, CRC FLIT, and CRC DATA errors via `nvmlDeviceGetNvLinkErrorCounter`, summed across all links. Baseline subtraction ensures counters start at zero on nvtop launch.
- **CRC corrections**: Per-lane CRC flit corrections via `nvmlDeviceGetFieldValues` (field IDs 32-247 for links 0-35), summed across all links. Uses the modern signature `(device, valuesCount, fieldValues)` with field IDs populated in place in the `nvmlFieldValue_t` buffer (48 bytes on NVML 11.515+: fieldId at offset 0, scopeId at 4, timestamp at 8, latencyUsec at 16, valueType at 24, nvmlReturn at 28, value union at 32). Offsets are handled manually since `nvml.h` is not available in the nvtop build; a layout sketch follows this list.
- Error counters are read during the refresh cycle (`gpuinfo_nvidia_refresh_dynamic_info()`), not during the startup probe (`nvtop_probe_nvlink_list()` calls `nvtop_get_nvlink_info()` before the display is drawn). This ensures the baseline is established at the moment of the first display refresh, guaranteeing counters read zero on launch. `nvtop_get_nvlink_info()` does NOT read error counters in the display path.
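A sketch of the manual `nvmlFieldValue_t` layout handling described above. The offsets are those listed for NVML 11.515+; the buffer type and helper names here are illustrative, not the PR's actual code:

    #include <stdint.h>
    #include <string.h>

    #define FIELD_VALUE_SIZE  48 /* sizeof(nvmlFieldValue_t), NVML 11.515+ */
    #define OFF_FIELD_ID       0
    #define OFF_SCOPE_ID       4
    #define OFF_TIMESTAMP      8
    #define OFF_LATENCY_USEC  16
    #define OFF_VALUE_TYPE    24
    #define OFF_NVML_RETURN   28
    #define OFF_VALUE_UNION   32

    /* Populate the field ID of slot i in the raw buffer before calling
     * nvmlDeviceGetFieldValues(device, valuesCount, fieldValues)... */
    static void field_set_id(unsigned char *buf, unsigned i, uint32_t id) {
      memcpy(buf + i * FIELD_VALUE_SIZE + OFF_FIELD_ID, &id, sizeof id);
    }

    /* ...and read the 64-bit counter back out of the value union after. */
    static uint64_t field_get_ull(const unsigned char *buf, unsigned i) {
      uint64_t v;
      memcpy(buf + i * FIELD_VALUE_SIZE + OFF_VALUE_UNION, &v, sizeof v);
      return v;
    }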

---

## Files Changed

### include/nvtop/extract_gpuinfo_common.h (+31 lines, -1 line)

- `NVTOP_NVLINK_MAX_LINKS` defined to 36
- Flat struct `nvlink_info` (sketched after this list): `num_links`, `version`, `supported`, `has_throughput`, `aggregate_tx`, `aggregate_rx`, `total_errors`, `total_corrections`
- `nvtop_get_nvlink_info()`: return cached NVLink data; vendor guard skips non-NVIDIA GPUs before `container_of()`
- `nvtop_get_nvlink_error_counts()`: public getter for display-ready error/correction counts; bridges `interface.c` to per-device error state in `extract_gpuinfo_nvidia.c`
- `nvtop_probe_nvlink_list()`: probe all devices for NVLink support before curses init; short-circuits if `any_device_has_nvlink` already true
- `nvtop_set_nvlink_probe()`: set `any_device_has_nvlink` global flag only (leaves `any_device_has_nvlink_active` untouched)
- `nvtop_reset_nvlink_cache()`: reset all per-device NVLink caching (probe flag, cached linkcount, cached version, cached info struct) on monitored GPU set change; vendor guard for non-NVIDIA
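A sketch of the flat struct as described above; the field names and the 36-link cap are from this PR, the exact types are assumptions:

    #include <stdbool.h>

    #define NVTOP_NVLINK_MAX_LINKS 36 /* future-proofed for planned Rubin */

    struct nvlink_info {
      unsigned num_links;              /* physical links on the device  */
      unsigned version;                /* display major version, 1..6   */
      bool supported;                  /* device exposes NVLink at all  */
      bool has_throughput;             /* CLI throughput data available */
      unsigned long long aggregate_tx; /* summed Tx rate                */
      unsigned long long aggregate_rx; /* summed Rx rate                */
      unsigned long long total_errors; /* baseline-adjusted totals      */
      unsigned long long total_corrections;
    };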

### include/nvtop/interface_internal_common.h (+4 lines, -1 line)

- `WINDOW *nvlink_info` added to `struct device_window` (row 2 throughput)
- `WINDOW *nvlink_errors` added to `struct device_window` (row 4 errors)
- `device_nvlink_errors` added to `enum device_field` with size 19

### src/extract_gpuinfo_nvidia.c (+451 insertions, -1 deletion)

- Four NVML (NVIDIA Management Library) symbols via `dlsym()`: `nvmlDeviceGetNvLinkState`, `nvmlDeviceGetNvLinkVersion`, `nvmlDeviceGetNvLinkErrorCounter`, `nvmlDeviceGetFieldValues` (modern 3-param signature)
- Per-device state: `device_index`, `cli_poll_active`, per-link CLI counters, baseline/display error fields, probe cache (`nvlink_probed`, `nvlink_cached_linkcount`, `nvlink_cached_version`), full struct cache (`cached_nvlink_info`, `cached_nvlink_info_populated`)
- Link discovery: probes links 0-35 via `nvmlDeviceGetNvLinkState`, counts consecutive successes, stops on the first hard error or `NVML_ERROR_NOT_SUPPORTED`; only active links (`isActive == 1`) are counted -- physical slots with no bridge are excluded (see the sketch after this list)
- Caching: 3 layers -- (1) link count/version via `nvlink_probe_and_cache()`, (2) full struct via `nvlink_refresh_cached_info()`, (3) list-level probe short-circuit in `nvtop_probe_nvlink_list()`; all reset by `nvtop_reset_nvlink_cache()` on GPU set change
- Throughput: `nvidia-smi nvlink --getthroughput r -i <dev>` every 2 seconds (hardcoded, independent of display refresh rate); delta-based rate computation with an unsigned underflow guard (also sketched after this list)
- Error reading via `nvlink_read_errors()`: called from `gpuinfo_nvidia_refresh_dynamic_info()` (not `nvtop_get_nvlink_info()`) to ensure baseline is established at first display refresh; reads errors via `nvmlDeviceGetNvLinkErrorCounter` and corrections via `nvmlDeviceGetFieldValues`; unsigned underflow guard prevents counter wrap artifacts
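Sketches of the link-discovery loop and the per-link rate delta described above. Helper names are illustrative, and nvtop resolves the NVML symbols via `dlsym()` rather than including `nvml.h`:

    #include <nvml.h> /* for illustration; nvtop loads these via dlsym() */

    #define NVTOP_NVLINK_MAX_LINKS 36

    /* Probe links 0..35, counting only active (bridged) links; stop on
     * the first hard error or NVML_ERROR_NOT_SUPPORTED. */
    static unsigned count_active_links(nvmlDevice_t device) {
      unsigned active = 0;
      for (unsigned link = 0; link < NVTOP_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t state;
        if (nvmlDeviceGetNvLinkState(device, link, &state) != NVML_SUCCESS)
          break;
        if (state == NVML_FEATURE_ENABLED)
          active++;
      }
      return active;
    }

    /* Per-link rate in bytes/s from two successive raw KiB counters parsed
     * out of nvidia-smi, with the unsigned underflow guard for a counter
     * that wrapped or reset between samples. */
    static unsigned long long link_rate(unsigned long long old_kib,
                                        unsigned long long new_kib,
                                        double seconds) {
      if (new_kib < old_kib || seconds <= 0.0)
        return 0; /* wrapped/reset: drop the sample instead of underflowing */
      return (unsigned long long)((double)(new_kib - old_kib) * 1024.0 / seconds);
    }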

### src/interface.c (+220 insertions, -20 deletions)

- Conditional layout: `any_device_has_nvlink` controls window allocation; `any_device_has_nvlink_active` controls fan compaction (shrinks from 11 to 8 chars only when active links exist -- 0-link devices do not get compaction)
- `device_length()` always uses base layout (clock + mem_clock + temp + fan + power + 5) regardless of NVLink state; NVLink window on line 2 extends past nominal panel edge, which ncurses handles gracefully
- `nvtop_adjust_field_sizes_for_nvlink()` checks `any_device_has_nvlink_active` (not `any_device_has_nvlink`) for fan compaction
- `interface_check_monitored_gpu_change()` resets ALL mutable NVLink state: both global flags plus `sizeof_device_field[device_fan_speed] = 11`, then calls per-device `nvtop_reset_nvlink_cache()`
- Fan N/A fallback branch uses `any_device_has_nvlink_active` for correct 11-character format on 0-link devices
- NVLink info window (row 2): displays `print_data_at_scale()`-formatted throughput (renamed from `print_pcie_at_scale()`; bounds check extended to `< 6` for the TiB/s ceiling); a minimal scaling sketch follows this list
- NVLink errors window (row 4): reads via `nvtop_get_nvlink_error_counts()` (does NOT call `nvtop_get_nvlink_info()` in display path)
- Memory leak fixes: added missing `delwin()` for `shader_cores`, `l2_cache_size`, `exec_engines`, `plots[i].plot_window`, and `nvlink_errors`. Two of these are also submitted as standalone upstream PRs: `free_device_windows()` fix (PR Syllo#467) and `plots[i].plot_window` fix (PR Syllo#468).
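A minimal sketch of the scaling idea behind `print_data_at_scale()` (the real function prints via ncurses; the prefix table contents and the exact guard expression here are assumptions):

    #include <stdio.h>

    static const char *memory_prefix[] = {"", "Ki", "Mi", "Gi", "Ti", "Pi"};

    /* Divide by 1024 until the value fits; the widened bound (< 6 rather
     * than < 5) keeps the upper table entries reachable, so ~1.636 TiB/s
     * prints as "1.636 TiB/s" instead of an oversized GiB/s figure. */
    static void print_rate(double bytes_per_s) {
      unsigned i = 0;
      while (bytes_per_s >= 1024.0 && i + 1 < 6) {
        bytes_per_s /= 1024.0;
        i++;
      }
      printf("%.3f %sB/s\n", bytes_per_s, memory_prefix[i]);
    }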

### src/nvtop.c (+5 lines, -1 line)

- `nvtop_probe_nvlink_list()` and `nvtop_set_nvlink_probe()` called before curses init (first layout pass)
- Re-evaluated in main loop after `interface_check_monitored_gpu_change()` for GPU hotplug

---

## Design Decisions

### Flat struct over nested

Single struct per device. Error and correction counters are cumulative totals (unsigned long long) summed across all links. Avoids per-link arrays and dynamic allocation in the hot refresh path.

### Two-tier error state: baseline plus display

Five fields in `struct gpu_info_nvidia` track error state: `baseline_errors`, `baseline_corrections`, `nvlink_errors_baseline_read` (bool), `display_errors`, `display_corrections`. Baselines persist for the entire process lifetime. Display values are computed each refresh as `cumulative - baseline` (see the sketch below).
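A sketch of the baseline-subtraction step (the fields live in `struct gpu_info_nvidia`; the standalone struct and function here are for illustration only):

    #include <stdbool.h>

    struct nvlink_error_state { /* fields as named in the PR */
      unsigned long long baseline_errors, baseline_corrections;
      bool nvlink_errors_baseline_read;
      unsigned long long display_errors, display_corrections;
    };

    /* First refresh latches the cumulative counters as the baseline, so the
     * display starts at zero; later refreshes show cumulative - baseline. */
    static void update_error_display(struct nvlink_error_state *s,
                                     unsigned long long cum_errors,
                                     unsigned long long cum_corrections) {
      if (!s->nvlink_errors_baseline_read) {
        s->baseline_errors = cum_errors;
        s->baseline_corrections = cum_corrections;
        s->nvlink_errors_baseline_read = true;
      }
      s->display_errors = cum_errors - s->baseline_errors;
      s->display_corrections = cum_corrections - s->baseline_corrections;
    }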

### total_errors / total_corrections retained for API compatibility

Populated from `display_errors`/`display_corrections` in `nvlink_refresh_cached_info()`. Primary display path uses `nvtop_get_nvlink_error_counts()`, but both carry the same data.

### No new dependencies

Uses only NVML symbols already present in the `nvidia-ml` driver library, plus the `nvidia-smi` binary already on the system.

---

## What Was Not Changed

Process listing, memory/GPU charts, configuration options, keyboard shortcuts, menu behavior, and all non-NVLink display fields remain identical to upstream.

---

## Testing

Dual RTX 3090 Founders Edition 24GB with a 3-slot NVLink Bridge (RTXA6000NVLINK3S-KIT), displayed in the UI as `NVIDIA GeForce RTX 3090`. 4 physical links per GPU. `enum nvmlNvlinkVersion_t` returns `5`, representing NVLink v3.1. When idle, NVLink shows ~1.2 MiB/s aggregate residual throughput, presumably from protocol keep-alives/link maintenance. Errors/corrections correctly display `E:00000 C:00000` on every launch and should increment only when new errors occur (no errors were observed during testing, so increment behavior is not fully confirmed).
danbedford mentioned this pull request May 3, 2026
Syllo merged commit b7be412 into Syllo:master May 6, 2026
3 checks passed
Owner

Syllo commented May 6, 2026

Thank you
