Add NVLink #469
- Add `nvlink_info` and `nvlink_link_info` structs to `extract_gpuinfo_common.h`
- Add NVML function pointers for NVLink (link count, state, throughput, errors, ECC)
- Add `nvtop_get_nvlink_info()` function in `extract_gpuinfo_nvidia.c`
- Track throughput counters for delta-based rate calculation
- Gracefully handle missing NVLink support (no hard failure on consumer GPUs)
- Add `nvlink_info` window to the `device_window` struct
- Allocate the NVLink window on line 2 of the device info block
- Shift all subsequent rows down by 1 to accommodate
- Add NVLink rendering: per-link status (A/x), TX/RX throughput, error indicators
- Color coding: green = active, red = inactive or errors present
- Update `device_header_rows` from 3/4 to 4/5 in the layout calculation
- Replace `nvmlDeviceGetNvLinkLinkCount` (doesn't exist) with link discovery via an `nvmlDeviceGetNvLinkState` probe loop
- Replace `nvmlDeviceGetNvLinkThroughput` (doesn't exist) with `nvmlDeviceGetNvLinkUtilizationCounter` (returns both RX and TX)
- Remove `nvmlDeviceGetNvLinkRemoteDeviceInfo` (doesn't exist)
- Remove `nvmlDeviceGetNvLinkEccCounter` (doesn't exist; covered by `nvmlDeviceGetNvLinkErrorCounter` with type `DL_ECC_DATA`)
- Skip throughput display on consumer GPUs where utilization counters return `NVML_ERROR_NOT_SUPPORTED`
- Show all 4 links (L0A L1A L2A L3A) on the RTX 3090 instead of N/A
- Flatten `struct nvlink_info` (no nested `link_info` array)
- Throughput via the `nvidia-smi` CLI (polled every 2s) instead of NVML utilization counters
- Conditional layout: the `any_device_has_nvlink` flag controls spacing
- Revert to the exact upstream layout when no NVLink GPU is detected
- Marketing version remapping with device name overrides for the RTX 3090
- Reuse `print_pcie_at_scale()` for throughput formatting
- Only 2 dlsym'd NVML symbols: `GetNvLinkState`, `GetNvLinkVersion`
…e scale function

Add NVLink error and CRC correction counter display on line 4 of the device panel, showing cumulative errors (replay, recovery, CRC FLIT, CRC DATA) and per-lane CRC flit corrections with conditional coloring (errors in red, corrections in yellow). Counters use baseline subtraction so they start at zero on nvtop launch and only increment when new errors/corrections occur. Display format: `NVL E:00000 C:00000` (19 chars), with "NVL" in cyan and numeric values conditionally colored. The window is allocated only for devices with NVLink support.

- `src/extract_gpuinfo_nvidia.c`: Add an `nvlink_read_errors()` function using `nvmlDeviceGetNvLinkErrorCounter` and `nvmlDeviceGetFieldValues` with per-device baseline tracking.
- `include/nvtop/extract_gpuinfo_common.h`: Add `total_errors` and `total_corrections` fields to `struct nvlink_info`.
- `src/interface.c`: Add `nvlink_errors` window allocation, deallocation, and display logic in `draw_devices()`.
- `include/nvtop/interface_internal_common.h`: Add the `nvlink_errors` window pointer and `device_nvlink_errors` enum entry.
- Rename `print_pcie_at_scale()` to `print_data_at_scale()` and extend its loop from 5 to 6 to support TiB/s (future-proofing for NVLink 5.0).
- Fix the FAN field width (8 -> 11 chars) and reduce spacing to fit all fields within the device panel width.
- Fix `device_length()` to use max() across all three device lines instead of only lines 1 and 2.
…ayout, update display getter

Move `nvlink_read_errors()` out of `nvtop_get_nvlink_info()` and into `gpuinfo_nvidia_refresh_dynamic_info()` so the baseline is not established during the startup probe. This ensures `E:00000 C:00000` on every nvtop launch.

- Add `display_errors`/`display_corrections` fields to `struct gpu_info_nvidia`
- Add the `nvtop_get_nvlink_error_counts()` public getter (`extract_gpuinfo_common.h`)
- Update `interface.c` to use the getter instead of `nvtop_get_nvlink_info()`
- Fix `nvmlFieldValue_t` struct offsets: 48 bytes (not 12), `ullVal` at offset 32
- Fix the dlsym signature for `nvmlDeviceGetFieldValues` (remove the `fieldIds` parameter)
- Populate `fieldId` in place in the raw buffer before calling `GetFieldValues`
…hroughput r)

Use raw (payload + protocol overhead) counters instead of data-only for the `nvidia-smi` CLI fallback path. This ensures fully saturated links show the rated link speed (e.g. ~14.062 GB/s per link on NVLink 3.0) rather than roughly half that from data-only counters.

- Changed `--getthroughput d` to `--getthroughput r`
- Updated parsing from 'Data Tx/Rx' to 'Raw Tx/Rx'
- Added explanatory code comments
…ment Add a code comment guiding future developers with datacenter NVLink hardware (A100, H100) to replace the CLI fallback with the NVML nvmlDeviceGetNvLinkUtilizationCounter API, while keeping the CLI as a consumer GPU fallback. Also remove the misleading 'EMA smoothing' comment on the aggregate throughput output since no smoothing is actually applied.
…ccuracy Remove Exponential Moving Average smoothing from the nvidia-smi CLI throughput fallback path. Raw delta/time_delta is used directly without smoothing — accuracy is more important than display smoothness for a monitoring tool.
…nterval

- `NVTOP_NVLINK_MAX_LINKS` and `NVML_NVLINK_MAX_LINKS_INTERNAL` increased from 18 to 36 for future-proof support of devices with up to 36 NVLink links.
- Add an explicit comment: the 2-second `nvidia-smi` CLI poll interval is hardcoded and independent of the global refresh rate, minimizing resource usage for this resource-heavy process (full binary fork + text parsing).
- Update code comments with the expanded field ID range (32-247 for links 0-35).
…plot_window, and unsigned underflow guard

- `free_device_windows`: `delwin()` for `shader_cores`, `l2_cache_size`, `exec_engines` (upstream PR Syllo#467 fix/memory-leaks-in-free-device-windows)
- `delete_all_windows`: `delwin()` for `plots[i].plot_window` (upstream PR Syllo#468 fix/plot-window-memory-leak)
- `nvtop_get_nvlink_info`: guard against unsigned underflow in the CLI throughput delta if the hardware counter wraps or resets
- `print_data_at_scale`: change parameter from `unsigned int` to `unsigned long long` to prevent 32-bit truncation on high-throughput NVLink hardware (e.g. B100/GB200)
- Remove duplicate `#include <stdio.h>` / `#include <string.h>` mid-file (already included at the top of `extract_gpuinfo_nvidia.c`)
- Remove the redundant forward declaration of `struct gpu_info_nvidia` (the struct is already fully defined earlier in the same file)
- Remove the unused `NVM_LVALUE_VALUE_TYPE_OFF` macro
…efresh cycle

- Add `nvlink_cached_linkcount` and `nvlink_cached_version` to `struct gpu_info_nvidia` (static hardware properties, probed once)
- Add an `nvlink_probe_and_cache()` helper that probes all links and caches results on the first call, returning the cached value thereafter
- Replace the inline probe loop in `refresh_dynamic_info()` with an `nvlink_probe_and_cache()` call (also fixes the hardcoded limit of 18 -> now uses the full `NVML_NVLINK_MAX_LINKS_INTERNAL` of 36)
- Replace the inline probe loop in `nvtop_get_nvlink_info()` with an `nvlink_probe_and_cache()` call, reading the version from the cache
- Eliminates up to 36 NVML API calls per GPU per refresh cycle
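The probe-once caching can be sketched roughly as follows. The cache struct and the link-state array are simplified stand-ins (the real helper calls `nvmlDeviceGetNvLinkState` per link), so treat names and bodies as illustrative assumptions:

```c
#include <stdbool.h>

// Sketch of nvlink_probe_and_cache(): link count is a static hardware
// property, so probe every link once and serve the cached count thereafter.
// link_active[i] stands in for an nvmlDeviceGetNvLinkState(dev, i, ...) call.
#define NVML_NVLINK_MAX_LINKS_INTERNAL 36

struct nvlink_cache {
  bool probed;
  unsigned link_count;
};

static unsigned nvlink_probe_and_cache(struct nvlink_cache *cache,
                                       const bool *link_active) {
  if (cache->probed)
    return cache->link_count; // cached: no NVML calls on later refreshes
  unsigned count = 0;
  for (unsigned link = 0; link < NVML_NVLINK_MAX_LINKS_INTERNAL; link++)
    if (link_active[link])
      count++;
  cache->link_count = count;
  cache->probed = true;
  return count;
}
```

This is what saves the up-to-36 NVML calls per GPU per refresh: after the first call, the loop is never entered again.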
…t-change reset

- Add an early return in `nvtop_probe_nvlink_list()` when `any_device_has_nvlink` is already true -- NVLink support is a static hardware property, so there is no need to re-probe every refresh cycle
- Reset `any_device_has_nvlink` in `interface_check_monitored_gpu_change()` when the monitored set changes, so the user can switch between NVLink and non-NVLink GPUs without the cache becoming stale
…th for non-hot-swap hardware
Add case 8: return 6 to nvlink_marketing_version() to handle NVLink 6.0 raw NVML enum value from NVIDIA Rubin platform. Also adds descriptive comments to existing version mapping cases.
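The full remapping, including the new case, can be sketched as a switch. The function name comes from the commit message and the raw-to-marketing mapping follows the table in the PR description; the body here is a sketch, not the fork's exact code:

```c
// Raw nvmlDeviceGetNvLinkVersion enum value -> single-digit marketing
// major version for the narrow "NVLn" display slot.
static unsigned nvlink_marketing_version(unsigned raw) {
  switch (raw) {
  case 1: return 1; // NVLink 1.0
  case 2:           // NVLink 2.0
  case 3: return 2; // NVLink 2.2
  case 4:           // NVLink 3.0
  case 5: return 3; // NVLink 3.1
  case 6: return 4; // NVLink 4.0
  case 7: return 5; // NVLink 5.0
  case 8: return 6; // NVLink 6.0 (assumed NVIDIA Rubin)
  default: return 0; // unknown raw value: display nothing rather than guess
  }
}
```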
Probe NVLink version before link state loop so "supported but no bridge" is detected. Display shows "NVL3 0x" for GPUs with NVLink hardware but no bridge connected. Layout compaction only applies when active links are present (0-link display needs no padding reduction). Adds any_device_has_nvlink_active flag to distinguish NVLink hardware support from active connections.
…ctive links When NVLink is supported but no bridge connected, the NVLink info window is still allocated on line 2 (displaying 'NVL3 0x'). The old code only expanded the panel width when links were active, causing the NVLink window to overflow the panel boundary in the 0-link case. Now device_length() checks any_device_has_nvlink (not any_device_has_nvlink_active) to include the NVLink window width in the panel calculation. Fan field padding (11 chars) is preserved since no throughput display is needed.
nvtop_probe_nvlink_list() correctly sets any_device_has_nvlink_active=false when NVLink hardware is present but no links are active. But nvtop_set_nvlink_probe() then blindly overwrites it with the return value (true), destroying the distinction. This caused fan field to shrink to 8 chars even with 0 active links, making line 3 bar charts (GPU/MEM/Enc/Dec) expand incorrectly.
Panel width (device_length) controls all rows including line 3 bar charts. Expanding it for the 0-link case was making GPU/MEM/Enc/Dec bars too wide. Now panel width only expands when any_device_has_nvlink_active (actual links with throughput to display). For the 0-link "NVL3 0x" case, the NVLink window extends past the nominal panel edge which is fine — ncurses handles overlapping windows correctly and line 3 bars stay at proper width.
…any_device_has_nvlink
For NVLink-supported GPUs with 0 active links (no bridge), the fan field was
using compact format ("FAN %3u%%") instead of the upstream padded format
(" FAN %3u%% "). Changed all three fan format conditionals from
any_device_has_nvlink to any_device_has_nvlink_active so the 0-link case
preserves the standard spacing and field width.
…n non-NVIDIA GPUs nvtop_get_nvlink_info(), nvtop_get_nvlink_error_counts(), and nvtop_reset_nvlink_cache() use container_of() to cast gpu_info to gpu_info_nvidia. On a non-NVIDIA device this is undefined behavior. Add a strcmp() guard at the top of each function to return early for non-NVIDIA GPUs. This avoids the unsafe cast entirely and makes the code correct for mixed-vendor or NVIDIA-free systems.
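The guard-before-cast pattern can be sketched as below. The structs and the vendor-name field are simplified stand-ins for nvtop's real `gpu_info`/`gpu_info_nvidia` types, and the `container_of` macro is a local copy; the point is only that the cast is never reached for a non-NVIDIA device:

```c
#include <stddef.h>
#include <string.h>

// Local copy of the usual container_of idiom: recover the outer struct
// from a pointer to one of its members.
#define container_of(ptr, type, member) \
  ((type *)((char *)(ptr) - offsetof(type, member)))

struct gpu_info {
  const char *vendor_name; // simplified stand-in for gpu_info->vendor->name
};

struct gpu_info_nvidia {
  unsigned num_links;
  struct gpu_info base;
};

static unsigned get_nvlink_link_count(struct gpu_info *info) {
  if (strcmp(info->vendor_name, "NVIDIA") != 0)
    return 0; // non-NVIDIA device: the cast below would be undefined behavior
  struct gpu_info_nvidia *nv = container_of(info, struct gpu_info_nvidia, base);
  return nv->num_links;
}
```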
Line 3 bar charts (GPU/MEM/Enc/Dec) were 6 chars too wide with NVLink bridge installed. device_length() expanded panel to 90 (line2 with pcie field) instead of 84 (base layout). NVLink window on line 2 can extend past nominal panel edge — ncurses handles it fine, same as the 0-link case. Reverting to base layout keeps line 3 bar charts at the correct width.
Three related fixes:

1. The fan N/A fallback (line 912) uses `any_device_has_nvlink_active` instead of `any_device_has_nvlink` for consistent layout compaction.
2. Reset `any_device_has_nvlink_active` in `interface_check_monitored_gpu_change()` alongside `any_device_has_nvlink` to prevent stale flags from causing incorrect `nvlink_errors` window allocation during window rebuild.
3. Reset the fan field width to 11 in `interface_check_monitored_gpu_change()` so `initialize_curses()` allocates `fan_speed` windows at the correct default width after a monitored-set change.
Upstream: Syllo/nvtop (commit 095d91c "Remove unused function in ixml")
Fork: danbedford/nvtop, branch `nvlink`
GPU Tested: `NVIDIA GeForce RTX 3090`
Scope: 5 files changed, 706 insertions(+), 19 deletions(-)

---

## Overview

Extends nvtop with per-GPU NVLink info in unused space of the existing interface. When no NVLink-connected GPU is detected, layout and behavior are identical to upstream -- no visual or functional difference. The goal is to bring useful data and throughput to all users of nvtop with NVLink-supported hardware, from consumer (2080, 3090 series) to datacenter (Ampere, Hopper, Blackwell series).

### Main bar (row 2, shown by default)

Appended at the end after `power_info` -- NVLink version, link count, and aggregate throughput. Two display states:

NVLink-supported device with no bridge or no active links (0-link case, no row 2 padding compaction applied):

    NVL5 0x

With active links (row 2 padding compaction applied, throughput displayed). Example (theoretical fully saturated GB200 with NVLink 5.0):

    NVL5 18x 1.636 TiB/s

When NVLink is supported but no bridge is connected or links are inactive, only the version and link count are displayed -- no compaction is applied to reclaim space on row 2 since there is no throughput to display. The `NVL5 0x` text extends past the panel edge without affecting the layout. Only when active links exist does fan field compaction kick in (11 to 8 characters) to make room for the throughput value.

- **Label**: `NVL`, a minimal label for NVLink.
- **Version**: Marketing NVLink version via `nvmlDeviceGetNvLinkVersion` (raw NVML enum values require remapping):
  - Raw 1 -> NVLink 1.0 -> Display 1
  - Raw 2 -> NVLink 2.0 -> 2
  - Raw 3 -> NVLink 2.2 -> 2
  - Raw 4 -> NVLink 3.0 -> 3
  - Raw 5 -> NVLink 3.1 -> 3
  - Raw 6 -> NVLink 4.0 -> 4
  - Raw 7 -> NVLink 5.0 -> 5
  - Raw 8 -> NVLink 6.0 -> 6 (assumed Rubin)

  The display shows the single-digit major version due to limited space.
- **Link count**: Total physical links on the device (static hardware property). Maximum is 36 to future-proof for the planned NVIDIA Rubin platform.
- **Throughput**: Aggregate transmit plus receive utilization, currently read via the `nvidia-smi` CLI fallback for all NVLink-connected GPUs. This carries measurable overhead from forking a full binary and parsing its text output, but providing real throughput visibility to consumer GPU users outweighs the cost, and all non-NVLink users are isolated from it. The 2-second interval is hardcoded and independent of the global nvtop refresh rate to cap CLI calls regardless of display speed. Uses "r" (raw) counters, which include payload plus protocol overhead, reflecting true bandwidth utilization. Parses "Link N: Raw Tx: NNNNN KiB" / "Raw Rx" per link. Delta = `(current - previous) / time_delta` per link, summed for the aggregate; an unsigned underflow guard checks `new >= old` before subtraction. No smoothing is applied -- raw accuracy over display smoothness. **TODO:** On datacenter GPUs with `nvmlDeviceGetNvLinkUtilizationCounter`, replace this with a direct API call; keep the CLI fallback for consumer GPUs.
- **Layout compaction**: The fan field shrinks from 11 to 8 characters ONLY when `any_device_has_nvlink_active` is true (at least one monitored GPU has active NVLink links). GPUs with NVLink hardware but no bridge (0-link case) do NOT get compaction -- `NVL3 0x` extends past the panel edge without needing reclaimed space. Panel width is determined by device name length (`device_name` column = `largest_device_name + 11`), so longer names leave more room for NVLink.
- **Throughput display**: Uses `print_data_at_scale()` (renamed from `print_pcie_at_scale()`) with IEC binary prefixes. The array bounds check was extended from `< 5` to `< 6` to support up to tebibytes/s (TiB/s) for Blackwell NVLink 5.0 devices at ~1.636 TiB/s aggregate. The `memory_prefix[]` array already contains entries up to "Pi" -- only the loop guard needed updating.
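The scaling logic behind the throughput display can be sketched as below. The real `print_data_at_scale()` draws into an ncurses window; this variant formats into a caller-supplied buffer so the logic can be shown in isolation, and the exact threshold and rounding are assumptions:

```c
#include <stdio.h>
#include <string.h>

// IEC binary prefixes; the table already reaches "Pi", only the loop
// bound needed widening (the "< 6" guard mirrors the PR description).
static const char *memory_prefix[] = {"", "Ki", "Mi", "Gi", "Ti", "Pi"};

static void format_data_at_scale(unsigned long long bytes_per_sec,
                                 char *out, size_t out_sz) {
  double val = (double)bytes_per_sec;
  unsigned scale = 0;
  // Divide by 1024 until the value fits; scale + 1 < 6 keeps the index
  // inside the prefix table while letting TiB/s-range values through.
  while (val >= 1024.0 && scale + 1 < 6) {
    val /= 1024.0;
    scale++;
  }
  snprintf(out, out_sz, "%.3f %sB/s", val, memory_prefix[scale]);
}
```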
### Extra GPU info bar (row 4, not shown by default)

Appended at the end after `exec_engines` -- error and correction counters since nvtop launch.

Example with zeroed counters:

    NVL E:00000 C:00000

Example with non-zero counters (errors in red, corrections in yellow):

    NVL E:00420 C:00069

- **Label**: `NVL`, a minimal label for NVLink.
- **Error counters**: Replay, recovery, CRC FLIT, and CRC DATA errors via `nvmlDeviceGetNvLinkErrorCounter`, summed across all links. Baseline subtraction ensures counters start at zero on nvtop launch.
- **CRC corrections**: Per-lane CRC flit corrections via `nvmlDeviceGetFieldValues` (field IDs 32-247 for links 0-35), summed across all links. Uses the modern signature `(device, valuesCount, fieldValues)` with field IDs populated in place in the `nvmlFieldValue_t` buffer (48 bytes on NVML 11.515+: `fieldId` at offset 0, `scopeId` at 4, `timestamp` at 8, `latencyUsec` at 16, `valueType` at 24, `nvmlReturn` at 28, value union at 32). Offsets are laid out manually since `nvml.h` is not exposed in the nvtop build.
- Error counters are read during the refresh cycle (`gpuinfo_nvidia_refresh_dynamic_info()`), not during the startup probe (`nvtop_probe_nvlink_list()` calls `nvtop_get_nvlink_info()` before the display is drawn). This ensures the baseline is established at the moment of the first display refresh, guaranteeing counters read zero on launch. `nvtop_get_nvlink_info()` does NOT read error counters in the display path.
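The baseline-subtraction scheme above can be sketched as follows. The field names follow the PR text (`baseline_errors`, `display_errors`); the helper itself is an illustrative assumption, not the fork's exact code:

```c
#include <stdbool.h>

// First read records the launch baseline; later reads report
// cumulative - baseline, so the display starts at zero.
struct err_state {
  bool baseline_read;
  unsigned long long baseline_errors;
  unsigned long long display_errors;
};

static void update_errors(struct err_state *s, unsigned long long cumulative) {
  if (!s->baseline_read) {
    s->baseline_errors = cumulative; // first refresh after launch
    s->baseline_read = true;
  }
  // Unsigned underflow guard: if the hardware counter wrapped or reset,
  // report 0 rather than a huge bogus delta.
  s->display_errors = (cumulative >= s->baseline_errors)
                          ? cumulative - s->baseline_errors
                          : 0;
}
```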
---

## Files Changed

### include/nvtop/extract_gpuinfo_common.h (+31 lines, -1 line)

- `NVTOP_NVLINK_MAX_LINKS` defined to 36
- Flat struct `nvlink_info`: `num_links`, `version`, `supported`, `has_throughput`, `aggregate_tx`, `aggregate_rx`, `total_errors`, `total_corrections`
- `nvtop_get_nvlink_info()`: returns cached NVLink data; vendor guard skips non-NVIDIA GPUs before `container_of()`
- `nvtop_get_nvlink_error_counts()`: public getter for display-ready error/correction counts; bridges `interface.c` to per-device error state in `extract_gpuinfo_nvidia.c`
- `nvtop_probe_nvlink_list()`: probes all devices for NVLink support before curses init; short-circuits if `any_device_has_nvlink` is already true
- `nvtop_set_nvlink_probe()`: sets the `any_device_has_nvlink` global flag only (leaves `any_device_has_nvlink_active` untouched)
- `nvtop_reset_nvlink_cache()`: resets all per-device NVLink caching (probe flag, cached linkcount, cached version, cached info struct) on monitored GPU set change; vendor guard for non-NVIDIA

### include/nvtop/interface_internal_common.h (+4 lines, -1 line)

- `WINDOW *nvlink_info` added to `struct device_window` (row 2 throughput)
- `WINDOW *nvlink_errors` added to `struct device_window` (row 4 errors)
- `device_nvlink_errors` added to `enum device_field` with size 19

### src/extract_gpuinfo_nvidia.c (+451 insertions, -1 deletion)

- Four NVML (NVIDIA Management Library) symbols via `dlsym()`: `nvmlDeviceGetNvLinkState`, `nvmlDeviceGetNvLinkVersion`, `nvmlDeviceGetNvLinkErrorCounter`, `nvmlDeviceGetFieldValues` (modern 3-param signature)
- Per-device state: `device_index`, `cli_poll_active`, per-link CLI counters, baseline/display error fields, probe cache (`nvlink_probed`, `nvlink_cached_linkcount`, `nvlink_cached_version`), full struct cache (`cached_nvlink_info`, `cached_nvlink_info_populated`)
- Link discovery: probes links 0-35 via `nvmlDeviceGetNvLinkState`, counts consecutive successes, stops on the first hard error or `NVML_ERROR_NOT_SUPPORTED`; only active links (`isActive == 1`) are counted -- physical slots with no bridge are excluded
- Caching: 3 layers -- (1) link count/version via `nvlink_probe_and_cache()`, (2) full struct via `nvlink_refresh_cached_info()`, (3) list-level probe short-circuit in `nvtop_probe_nvlink_list()`; all reset by `nvtop_reset_nvlink_cache()` on GPU set change
- Throughput: `nvidia-smi nvlink --getthroughput r -i <dev>` every 2 seconds (hardcoded, independent of the display refresh rate), delta-based rate computation with an unsigned underflow guard
- Error reading via `nvlink_read_errors()`: called from `gpuinfo_nvidia_refresh_dynamic_info()` (not `nvtop_get_nvlink_info()`) to ensure the baseline is established at the first display refresh; reads errors via `nvmlDeviceGetNvLinkErrorCounter` and corrections via `nvmlDeviceGetFieldValues`; an unsigned underflow guard prevents counter wrap artifacts

### src/interface.c (+220 insertions, -20 deletions)

- Conditional layout: `any_device_has_nvlink` controls window allocation; `any_device_has_nvlink_active` controls fan compaction (shrinks from 11 to 8 chars only when active links exist -- 0-link devices do not get compaction)
- `device_length()` always uses the base layout (clock + mem_clock + temp + fan + power + 5) regardless of NVLink state; the NVLink window on line 2 extends past the nominal panel edge, which ncurses handles gracefully
- `nvtop_adjust_field_sizes_for_nvlink()` checks `any_device_has_nvlink_active` (not `any_device_has_nvlink`) for fan compaction
- `interface_check_monitored_gpu_change()` resets ALL mutable NVLink state: both global flags plus `sizeof_device_field[device_fan_speed] = 11`, then calls per-device `nvtop_reset_nvlink_cache()`
- The fan N/A fallback branch uses `any_device_has_nvlink_active` for the correct 11-character format on 0-link devices
- NVLink info window (row 2): displays `print_data_at_scale()`-formatted throughput (renamed from `print_pcie_at_scale()`; bounds check extended to `< 6` for the TiB/s ceiling)
- NVLink errors window (row 4): reads via `nvtop_get_nvlink_error_counts()` (does NOT call `nvtop_get_nvlink_info()` in the display path)
- Memory leak fixes: added missing `delwin()` for `shader_cores`, `l2_cache_size`, `exec_engines`, `plots[i].plot_window`, and `nvlink_errors`. Two of these are also submitted as standalone upstream PRs: the `free_device_windows()` fix (PR Syllo#467) and the `plots[i].plot_window` fix (PR Syllo#468).

### src/nvtop.c (+5 lines, -1 line)

- `nvtop_probe_nvlink_list()` and `nvtop_set_nvlink_probe()` called before curses init (first layout pass)
- Re-evaluated in the main loop after `interface_check_monitored_gpu_change()` for GPU hotplug

---

## Design Decisions

### Flat struct over nested

Single struct per device. Error and correction counters are cumulative totals (unsigned long long) summed across all links. This avoids per-link arrays and dynamic allocation in the hot refresh path.

### Two-tier error state: baseline plus display

Five fields in `struct gpu_info_nvidia` track error state: `baseline_errors`, `baseline_corrections`, `nvlink_errors_baseline_read` (bool), `display_errors`, `display_corrections`. Baselines persist for the entire process lifetime. Display values are computed each refresh as `cumulative - baseline`.

### total_errors / total_corrections retained for API compatibility

Populated from `display_errors`/`display_corrections` in `nvlink_refresh_cached_info()`. The primary display path uses `nvtop_get_nvlink_error_counts()`, but both carry the same data.

### No new dependencies

Uses only NVML symbols already in the `nvidia-ml` driver library and the `nvidia-smi` binary already on the system.

---

## What Was Not Changed

Process listing, memory/GPU charts, configuration options, keyboard shortcuts, menu behavior, and all non-NVLink display fields remain identical to upstream.

---

## Testing

Dual RTX 3090 Founders Edition 24GB with a 3-slot NVLink bridge (RTXA6000NVLINK3S-KIT). Displays in the UI as `NVIDIA GeForce RTX 3090`. 4 physical links per GPU.
`enum nvmlNvlinkVersion_t` returns `5`, representing NVLink 3.1. When idle, NVLink shows ~1.2 MiB/s aggregate residual throughput, assumed to come from protocol keep-alives/link maintenance. Errors/corrections correctly display `E:00000 C:00000` on every launch, incrementing only when new errors occur (no errors were experienced during testing, so the increment path is not fully confirmed).
danbedford left a comment:

Thanks for cleaning up the comments, all good here. 😁
```c
if (nvmlDeviceGetFieldValues) {
  for (unsigned int link = 0; link < linkCount; link++) {
    int base_field_id = 32 + link * 6;
    char raw[6 * NVM_LVALUE_SIZE];
```
Please declare and use a struct nvmlFieldValue_t instead of doing these raw bytes copy shenanigans.
The fact that it may change in the future is ok (usually the changes are backward compatible if implemented correctly by NVIDIA and the nvml library has had a pretty good record on doing exactly this).
Field layout is platform/target dependent so it's not a good idea to hard code the offsets and use memcpy like this.
Recreated the `nvmlFieldValue_t` typedef along with its dependencies (`nvmlValue_t`, `nvmlValueType_t`, `nvmlReturn_t`). All field access now uses proper struct member syntax (`.fieldId`, `.value.ullVal`, `.nvmlReturn`) instead of byte offsets.
See commit 47a6cf8.
```c
  unsigned long long total_corrections; // Cumulative-since-launch CRC corrections across all links
};

unsigned nvtop_get_nvlink_info(struct gpu_info *gpu_info, struct nvlink_info *nvlink_info);
```
You need to define these functions in another file (fallback ones returning unsupported/false/0) for when nvtop is not being compiled with NVIDIA support enabled. Currently they are only available in extract_gpuinfo_nvidia.c.
I would advise creating a new .c file and compiling it only when NVIDIA_SUPPORT is OFF.
Done. Created `src/nvlink_nvidia_disabled.c` with 4 no-op stub functions:

- `nvtop_get_nvlink_info()`: returns 0
- `nvtop_get_nvlink_error_counts()`: returns false
- `nvtop_probe_nvlink_list()`: returns false
- `nvtop_reset_nvlink_cache()`: no-op
The stub file is wired into `src/CMakeLists.txt` in the `else(NVIDIA_SUPPORT)` branch so it is only compiled when NVIDIA support is disabled. You originally mentioned 5 functions in your review, but the 5th (`nvtop_set_nvlink_probe`) was removed entirely per your suggestion in Comment 7, so there are now 4.
See commit 666ffed.
```c
memset(raw, 0, sizeof(raw));
for (int i = 0; i < 6; i++) {
  unsigned int fid = (unsigned int)(base_field_id + i);
  memcpy(raw + i * NVM_LVALUE_SIZE + NVM_LVALUE_FIELD_ID_OFF, &fid, sizeof(fid));
```
Please use the enum values of the field instead of this hard-coded `32 + link * 6`.

Also, from the same link there is a special value for fetching the error count for all lanes:

`#define` [NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL](https://docs.nvidia.com/deploy/nvml-api/group__nvmlFieldValueEnums.html#group__nvmlFieldValueEnums_1g03448b3fc6f250afe4e70782a1e6ea2c) `38` -- NVLink flow control CRC error counter total for all lanes.

There also seems to be this for ECC errors:

`#define` [NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL](https://docs.nvidia.com/deploy/nvml-api/group__nvmlFieldValueEnums.html#group__nvmlFieldValueEnums_1ge51d113266f33da0ca06bd85cc7b6818) `160` -- NVLink data ECC error counter total for all links.
Done. The hardcoded `32 + link * 6` pattern has been removed. The batched `nvmlDeviceGetFieldValues` call now uses official enum constants with `#ifndef` guards so it works with both older and newer driver headers:

- `NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_TX` (140)
- `NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_RX` (141)
- `NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL` (38)
- `NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL` (160)
I also followed your suggestion to add field 160 for ECC data errors. All 4 fields are queried in a single batch[4] call. Field 38 and field 160 both use scopeId=0 since they are already per-device aggregates across all lanes/links.
See commits 666ffed and a226ed7.
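The `#ifndef`-guard pattern described in this thread can be sketched with the field IDs quoted above (38, 138-141, 160, all from the NVML docs). If the build's headers already define a constant, the local fallback is skipped:

```c
// Fallback definitions for NVML field-value IDs, used only when the
// available headers do not already provide them.
#ifndef NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL
#define NVML_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL 38
#endif
#ifndef NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_TX
#define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_TX 138
#endif
#ifndef NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX
#define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX 139
#endif
#ifndef NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_TX
#define NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_TX 140
#endif
#ifndef NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_RX
#define NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_RX 141
#endif
#ifndef NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL
#define NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL 160
#endif
```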
```c
// Returns number of links parsed (0 on failure)
static unsigned nvlink_cli_get_throughput(int device_index, unsigned int link_count,
                                          unsigned long long *tx_out, unsigned long long *rx_out) {
  char cmd[256];
```
I don't like this CLI call and parsing.

Query `nvmlDeviceGetFieldValues` with the following IDs:

    #define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX 139
    #define NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_TX 138

(NVLink throughput counter field values.)

The link ID needs to be specified in the `scopeId` field of [nvmlFieldValue_t](https://docs.nvidia.com/deploy/nvml-api/structnvmlFieldValue__t.html#structnvmlFieldValue__t). A `scopeId` of `UINT_MAX` returns the aggregate value summed across all links for the counter type specified in `fieldId`.

Also, you can probably do a single call to `nvmlDeviceGetFieldValues` for the RX, TX, and error counts. The struct has an `nvmlReturn` field to check whether the requested values are valid. If not, we just report nothing as usual.
Done. All popen/pclose CLI code has been removed. Throughput and error counters are now read via a single batched nvmlDeviceGetFieldValues call with 4 fields (RAW TX, RAW RX, CRC corrections via field 38, and ECC data errors via field 160). Fields returning NVML_ERROR_NOT_SUPPORTED (e.g. throughput on consumer GPUs) are silently skipped.
See commit 666ffed.
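The batched-read pattern adopted here can be mocked as below. The types and the fake query function are simplified stand-ins for the real NVML API (which nvtop loads via `dlsym`), so this is an illustration of the control flow only: one call, a per-field `nvmlReturn` check, and silent skipping of unsupported fields:

```c
#include <limits.h>

typedef enum { NVML_SUCCESS = 0, NVML_ERROR_NOT_SUPPORTED = 3 } nvmlReturn_t;

// Simplified stand-in for nvmlFieldValue_t.
typedef struct {
  unsigned int fieldId;
  unsigned int scopeId;    // link ID, or UINT_MAX for all-link aggregate
  nvmlReturn_t nvmlReturn; // validity of this one field
  unsigned long long ullVal;
} field_value_t;

enum { FI_RAW_TX = 140, FI_RAW_RX = 141, FI_CRC_FLIT_TOTAL = 38, FI_ECC_DATA_TOTAL = 160 };

// Stand-in for the driver call: pretends throughput is unsupported
// (consumer GPU) while the error totals succeed.
static nvmlReturn_t fake_get_field_values(int count, field_value_t *values) {
  for (int i = 0; i < count; i++) {
    if (values[i].fieldId == FI_RAW_TX || values[i].fieldId == FI_RAW_RX) {
      values[i].nvmlReturn = NVML_ERROR_NOT_SUPPORTED;
    } else {
      values[i].nvmlReturn = NVML_SUCCESS;
      values[i].ullVal = 0;
    }
  }
  return NVML_SUCCESS;
}

// Returns how many of the four fields were usable; unsupported fields
// are silently skipped, as in the PR.
static int read_nvlink_fields(unsigned long long out[4]) {
  field_value_t batch[4] = {
      {FI_RAW_TX, UINT_MAX, NVML_SUCCESS, 0},
      {FI_RAW_RX, UINT_MAX, NVML_SUCCESS, 0},
      {FI_CRC_FLIT_TOTAL, 0, NVML_SUCCESS, 0},
      {FI_ECC_DATA_TOTAL, 0, NVML_SUCCESS, 0},
  };
  if (fake_get_field_values(4, batch) != NVML_SUCCESS)
    return 0;
  int usable = 0;
  for (int i = 0; i < 4; i++) {
    if (batch[i].nvmlReturn != NVML_SUCCESS)
      continue; // e.g. throughput fields on consumer GPUs
    out[usable++] = batch[i].ullVal;
  }
  return usable;
}
```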
```c
// Called from refresh_dynamic_info on every refresh cycle (refresh path).
// GPUs are non-hot-swappable, so all NVLink data is computed here and cached —
// nvtop_get_nvlink_info() in the draw path just returns the cached copy.
static void nvlink_refresh_cached_info(struct gpu_info_nvidia *gpu_info, unsigned int linkCount) {
```
Please remove this along with the CLI fallback (see previous comment in this file)
Done. The entire nvlink_cli_get_throughput() function and all associated struct fields have been removed: device_index, cli_poll_active, per-link CLI counters (nvlink_cli_tx[], nvlink_cli_rx[]), aggregate CLI values (cli_agg_tx, cli_agg_rx), and last_nvlink_cli_time.
See commit 666ffed.
```c
struct nvlink_info nvl;
memset(&nvl, 0, sizeof(nvl));
```

Suggested change:

```c
struct nvlink_info nvl = {0};
```
Done. All `struct nvlink_info` declarations now use the `= {0}` initializer:

- `nvtop_probe_nvlink_list()` in `interface.c` (line 82)
- `draw_devices()` in `interface.c` (line 950)
The old pattern of separate declaration followed by memset has been removed. The same {0} initializer is also applied to nvmlFieldValue_t declarations in extract_gpuinfo_nvidia.c (the batch[4] array and any single-field queries).
See commits b97dac8 and 666ffed.
```c
  }
}

bool nvtop_probe_nvlink_list(struct list_head *devices) {
```
This function calls `nvtop_adjust_field_sizes_for_nvlink`, which is used by `initialize_all_windows`.

I think it would be more fitting to call this inside `initialize_all_windows` instead of the spread-out fixes required by calling it in `nvtop.c`.

That way `nvtop_set_nvlink_probe` is not needed anymore.
Done. nvtop_adjust_field_sizes_for_nvlink() is now called at the top of initialize_all_windows() in interface.c. nvtop_set_nvlink_probe() has been removed entirely — both call sites in nvtop.c now invoke nvtop_probe_nvlink_list() directly.
See commit 666ffed.
Replace CLI-based NVLink throughput with the NVML API and refactor probe/layout initialization per maintainer feedback.

- Comment Syllo#1: include nvml.h, remove local typedefs, add `#ifndef` guards for enum constants, update the dlsym function pointer to use the proper `nvmlFieldValue_t` type, remove raw memcpy offset macros.
- Comment Syllo#2: wire the `nvlink_nvidia_disabled.c` stub file into the CMakeLists.txt else-branch for non-NVIDIA builds.
- Comment Syllo#3: remove the per-lane CRC corrections loop from `nvlink_read_errors()` (Phase 2) -- corrections are now read in the batched call in `nvlink_refresh_cached_info()`.
- Comment Syllo#4: replace the nvidia-smi CLI fallback with a single batched `nvmlDeviceGetFieldValues` call for RAW TX (140), RAW RX (141), and CRC corrections (38). Use `scopeId=UINT_MAX` for the throughput aggregate, `scopeId=0` for per-device corrections.
- Comment Syllo#5: remove the `nvlink_cli_get_throughput()` function and CLI struct fields (`device_index`, `cli_poll_active`, `nvlink_cli_tx`/`rx`, `last_nvlink_cli_time`, `cli_agg_tx`/`rx`). Replace with `nvlink_last_tx`, `nvlink_last_rx`, `nvlink_last_poll_time`.
- Comment Syllo#6: use the `struct nvlink_info nvl = {0}` initializer in `nvtop_probe_nvlink_list()`.
- Comment Syllo#7: move `nvtop_adjust_field_sizes_for_nvlink()` into `initialize_all_windows()`, remove `nvtop_set_nvlink_probe()` entirely, swap the probe and `interface_check_monitored_gpu_change()` call order in the nvtop.c main loop, add a re-probe in the monitored-set-change handler.
nvml.h cannot be included directly — nvtop uses dlsym function pointers for all NVML functions, and including nvml.h would conflict with 373 function prototypes and 12 struct/enum typedefs. Instead, manually declare nvmlFieldValue_t and its dependencies (nvmlValue_t, nvmlValueType_t, nvmlReturn_t, nvmlDevice_t) inline. This satisfies the maintainer requirement to use the proper struct type instead of raw memcpy offsets, without breaking the dlsym architecture. Also removes the unused find_path(NVML_INCLUDE_DIR) from CMakeLists.txt and fixes the forward declaration of nvlink_read_errors to use nvmlDevice_t instead of struct nvmlDevice*.
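The manually declared types can be reconstructed roughly as below, following the layout documented in NVIDIA's public `nvml.h`. This is a hedged sketch (the enums are collapsed to plain ints here for brevity); the authoritative definitions are in the NVML documentation:

```c
#include <stddef.h>

typedef int nvmlReturn_t;    // real header: enum with NVML_SUCCESS = 0, ...
typedef int nvmlValueType_t; // real header: enum of value-type tags
typedef struct nvmlDevice_st *nvmlDevice_t; // opaque device handle

typedef union nvmlValue_st {
  double dVal;
  unsigned int uiVal;
  unsigned long ulVal;
  unsigned long long ullVal;
  long long sllVal;
} nvmlValue_t;

typedef struct nvmlFieldValue_st {
  unsigned int fieldId;      // offset 0
  unsigned int scopeId;      // offset 4 (link ID, or UINT_MAX for aggregate)
  long long timestamp;       // offset 8
  long long latencyUsec;     // offset 16
  nvmlValueType_t valueType; // offset 24
  nvmlReturn_t nvmlReturn;   // offset 28, per-field status
  nvmlValue_t value;         // offset 32
} nvmlFieldValue_t;
```

On a typical LP64 target these member offsets line up with the ones the PR describes, which is why struct member access can replace the earlier byte-offset memcpy code.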
Per maintainer suggestion in PR Syllo#469 Comment 3, add field 160 (NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL) to the existing batched nvmlDeviceGetFieldValues call alongside throughput and CRC corrections.
- Batch expanded from 3 to 4 fields (scopeId=0, per-device aggregate)
- Added total_ecc_errors to struct nvlink_info
- Added baseline_ecc_errors/display_ecc_errors to struct gpu_info_nvidia
- Display format: NVL E:00000 C:00000 X:00000 (window width 19 -> 28)
- Updated nvtop_get_nvlink_error_counts() to return the ECC count
- Updated stub file and cache reset accordingly
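A sketch of how the 4-entry batch might be populated, with field IDs as quoted in these commits (140/141 raw TX/RX, 38 CRC corrections, 160 ECC data errors); the struct layout and helper name are assumptions, and the actual dlsym'd nvmlDeviceGetFieldValues call is elided:

```c
#include <limits.h>
#include <string.h>

/* Minimal stand-in for the NVML field-value struct (see nvml.h docs). */
typedef struct {
  unsigned int fieldId;
  unsigned int scopeId;
  long long timestamp;
  long long latencyUsec;
  int valueType;
  int nvmlReturn;
  union { double d; unsigned long long ull; } value;
} nvmlFieldValue_t;

/* Field IDs as stated in the changelog. */
enum {
  FI_NVLINK_RAW_TX = 140,          /* NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_TX */
  FI_NVLINK_RAW_RX = 141,          /* NVML_FI_DEV_NVLINK_THROUGHPUT_RAW_RX */
  FI_NVLINK_CRC_CORRECTIONS = 38,  /* per-device CRC corrections */
  FI_NVLINK_ECC_DATA_TOTAL = 160,  /* NVML_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL */
};

/* Fill the 4-entry batch: scopeId=UINT_MAX selects the link aggregate for
 * throughput, scopeId=0 the per-device counter for corrections/ECC. */
static void nvlink_fill_field_batch(nvmlFieldValue_t v[4]) {
  memset(v, 0, 4 * sizeof(*v));
  v[0].fieldId = FI_NVLINK_RAW_TX;          v[0].scopeId = UINT_MAX;
  v[1].fieldId = FI_NVLINK_RAW_RX;          v[1].scopeId = UINT_MAX;
  v[2].fieldId = FI_NVLINK_CRC_CORRECTIONS; v[2].scopeId = 0;
  v[3].fieldId = FI_NVLINK_ECC_DATA_TOTAL;  v[3].scopeId = 0;
  /* Real code then calls the dlsym'd nvmlDeviceGetFieldValues(device, 4, v). */
}
```

One batched call per refresh replaces four separate queries, which is the point of Comment Syllo#4.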
- FL: FLIT errors (red if >0)
- EE: ECC data errors (red if >0)
- CR: CRC corrections (yellow if >0)
- Errors grouped together (FL/EE), corrections last (CR)
- Window width expanded from 28 to 31 chars
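The grouping and coloring rules above can be sketched as small helpers; the color-pair names and the snprintf layout here are illustrative, not the actual interface.c code:

```c
#include <stdio.h>

/* Illustrative color choice matching the rules above:
 * errors (FL, EE) red when nonzero, corrections (CR) yellow when nonzero. */
enum nvl_color { NVL_GREEN, NVL_RED, NVL_YELLOW };

static enum nvl_color nvl_error_color(unsigned long long count) {
  return count > 0 ? NVL_RED : NVL_GREEN;
}

static enum nvl_color nvl_correction_color(unsigned long long count) {
  return count > 0 ? NVL_YELLOW : NVL_GREEN;
}

/* Format the row-4 field content (window width 31 per the changelog;
 * this string itself is 30 characters plus the terminator). */
static int nvl_format_row4(char *buf, size_t n, unsigned long long flit,
                           unsigned long long ecc, unsigned long long crc) {
  return snprintf(buf, n, "NVL FL:%05llu EE:%05llu CR:%05llu", flit, ecc, crc);
}
```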
NVTop NVLink Fork - Changelog
Upstream: Syllo/nvtop (commit 095d91c "Remove unused function in ixml")
Fork: danbedford/nvtop, branch nvlink
GPU Tested: NVIDIA GeForce RTX 3090
Scope: 5 files changed, 706 insertions(+), 19 deletions(-)
Overview
Extends nvtop with per-GPU NVLink info in the unused space of the existing interface. When no NVLink-supported GPU is detected, layout and behavior are identical to upstream -- no visual or functional difference. The goal is to bring useful NVLink status and throughput data to all nvtop users with NVLink-capable hardware, from consumer (2080, 3090 series) to datacenter (Ampere, Hopper, Blackwell series).
NVLink Supported Device Example:
NVLink Connected Device Example:
Main bar (row 2, shown by default)
Appended at end after `power_info` -- NVLink version, link count, and aggregate throughput displayed. Two display states:

NVLink supported device -- no bridge or no active links (0-link case, no row 2 padding compaction applied):
With active links (row 2 padding compaction applied, throughput displayed). Example (theoretical fully saturated GB200 with NVLink 5.0):
When NVLink is supported but no bridge is connected or links are inactive, only the version and link count display -- no compaction is applied to reclaim space on row 2 since there is no throughput to display. The `NVL5 0x` text extends past the panel edge without affecting the layout. Only when active links exist does fan field compaction kick in (11 to 8 characters) to make room for the throughput value.

- `NVL` serves as the minimal label for NVLink.
- Version via `nvmlDeviceGetNvLinkVersion` (raw NVML enum values require remapping). The display shows a single-digit major version due to limited space.
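The remapping step can be sketched as a lookup from the raw NVML value to the displayed major version. Only the raw-5 -> NVLink 3 entry is confirmed by the testing notes in this fork (RTX 3090); the passthrough default is an illustrative placeholder, not the fork's actual table:

```c
/* Map raw nvmlDeviceGetNvLinkVersion() output to the marketing major
 * version shown in the "NVLx" label. Sketch only: one confirmed entry. */
static unsigned nvlink_display_major(unsigned raw_version) {
  switch (raw_version) {
  case 5:
    return 3; /* observed on RTX 3090 (Ampere): raw 5 == NVLink 3.x */
  default:
    return raw_version; /* unknown raw values pass through unchanged */
  }
}
```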
An `nvidia-smi` CLI fallback provides throughput for all NVLink-connected GPUs. This carries measurable overhead from forking a full binary and parsing its text output, but providing real throughput visibility to consumer GPU users outweighs the cost, and all other non-NVLink users are isolated from it. The 2-second interval is hardcoded and independent of the global nvtop refresh rate, capping CLI calls regardless of display speed. Uses "r" (raw) counters, which include payload plus protocol overhead, reflecting true bandwidth utilization. Parses "Link N: Raw Tx: NNNNN KiB" / "Raw Rx" per link. Delta = `(current - previous) / time_delta` per link, summed for the aggregate; an unsigned underflow guard checks `new >= old` before subtraction. No smoothing is applied -- raw accuracy over display smoothness. TODO: On datacenter GPUs with `nvmlDeviceGetNvLinkUtilizationCounter`, replace with a direct API call; keep the CLI fallback for consumer GPUs.

Fan compaction applies only when `any_device_has_nvlink_active` is true (at least one monitored GPU has active NVLink links). GPUs with NVLink hardware but no bridge (0-link case) do NOT get compaction -- `NVL3 0x` extends past the panel edge without needing reclaimed space. Panel width is determined by device name length (`device_name` column = `largest_device_name + 11`), so longer names produce more room for NVLink.

Throughput is formatted by `print_data_at_scale()` (renamed from `print_pcie_at_scale()`) with IEC binary prefixes. The array bounds check was extended from `< 5` to `< 6` to support up to tebibytes/s (TiB/s) for Blackwell NVLink 5.0 devices at ~1.636 TiB/s aggregate. The `memory_prefix[]` array already contains entries up to "Pi" -- only the loop guard needed updating.

Extra GPU info bar (row 4, not shown by default)
Appended at end after `exec_engines` -- error and correction counters since nvtop launch.

Example with zeroed counters:

Example with non-zero counters (errors in red, corrections in yellow):
- `NVL` serves as the minimal label for NVLink.
- Errors via `nvmlDeviceGetNvLinkErrorCounter`, summed across all links. Baseline subtraction ensures counters start at zero on nvtop launch.
- Corrections via `nvmlDeviceGetFieldValues` (field IDs 32-247 for links 0-35), summed across all links. Uses the modern signature `(device, valuesCount, fieldValues)` with field IDs populated in place in the `nvmlFieldValue_t` buffer (48 bytes on NVML 11.515+: fieldId at offset 0, scopeId at 4, timestamp at 8, latencyUsec at 16, valueType at 24, nvmlReturn at 28, value union at 32). Offsets are handled manually since `nvml.h` is not exposed in the nvtop build.
- The baseline is captured on the first refresh (`gpuinfo_nvidia_refresh_dynamic_info()`), not during the startup probe (`nvtop_probe_nvlink_list()` calls `nvtop_get_nvlink_info()` before the display is drawn). This establishes the baseline at the moment of the first display refresh, guaranteeing counters read zero on launch. `nvtop_get_nvlink_info()` does NOT read error counters in the display path.

Files Changed
include/nvtop/extract_gpuinfo_common.h (+31 lines, -1 line)
- `NVTOP_NVLINK_MAX_LINKS` defined to 36
- `struct nvlink_info`: `num_links`, `version`, `supported`, `has_throughput`, `aggregate_tx`, `aggregate_rx`, `total_errors`, `total_corrections`
- `nvtop_get_nvlink_info()`: return cached NVLink data; vendor guard skips non-NVIDIA GPUs before `container_of()`
- `nvtop_get_nvlink_error_counts()`: public getter for display-ready error/correction counts; bridges `interface.c` to per-device error state in `extract_gpuinfo_nvidia.c`
- `nvtop_probe_nvlink_list()`: probe all devices for NVLink support before curses init; short-circuits if `any_device_has_nvlink` is already true
- `nvtop_set_nvlink_probe()`: set the `any_device_has_nvlink` global flag only (leaves `any_device_has_nvlink_active` untouched)
- `nvtop_reset_nvlink_cache()`: reset all per-device NVLink caching (probe flag, cached link count, cached version, cached info struct) on monitored GPU set change; vendor guard for non-NVIDIA

include/nvtop/interface_internal_common.h (+4 lines, -1 line)
- `WINDOW *nvlink_info` added to `struct device_window` (row 2 throughput)
- `WINDOW *nvlink_errors` added to `struct device_window` (row 4 errors)
- `device_nvlink_errors` added to `enum device_field` with size 19

src/extract_gpuinfo_nvidia.c (+451 insertions, -1 deletion)
- New symbols via `dlsym()`: `nvmlDeviceGetNvLinkState`, `nvmlDeviceGetNvLinkVersion`, `nvmlDeviceGetNvLinkErrorCounter`, `nvmlDeviceGetFieldValues` (modern 3-param signature)
- New per-device state: `device_index`, `cli_poll_active`, per-link CLI counters, baseline/display error fields, probe cache (`nvlink_probed`, `nvlink_cached_linkcount`, `nvlink_cached_version`), full struct cache (`cached_nvlink_info`, `cached_nvlink_info_populated`)
- Link discovery loops over `nvmlDeviceGetNvLinkState`, counts consecutive successes, and stops on the first hard error or `NVML_ERROR_NOT_SUPPORTED`; only active links (`isActive == 1`) are counted -- physical slots with no bridge are excluded
- Caching layers: (1) `nvlink_probe_and_cache()`, (2) full struct via `nvlink_refresh_cached_info()`, (3) list-level probe short-circuit in `nvtop_probe_nvlink_list()`; all reset by `nvtop_reset_nvlink_cache()` on GPU set change
- Throughput polls `nvidia-smi nvlink --getthroughput r -i <dev>` every 2 seconds (hardcoded, independent of display refresh rate), with delta-based rate computation and an unsigned underflow guard
- `nvlink_read_errors()`: called from `gpuinfo_nvidia_refresh_dynamic_info()` (not `nvtop_get_nvlink_info()`) to ensure the baseline is established at the first display refresh; reads errors via `nvmlDeviceGetNvLinkErrorCounter` and corrections via `nvmlDeviceGetFieldValues`; an unsigned underflow guard prevents counter wrap artifacts

src/interface.c (+220 insertions, -20 deletions)
- `any_device_has_nvlink` controls window allocation; `any_device_has_nvlink_active` controls fan compaction (shrinks from 11 to 8 chars only when active links exist -- 0-link devices do not get compaction)
- `device_length()` always uses the base layout (clock + mem_clock + temp + fan + power + 5) regardless of NVLink state; the NVLink window on line 2 extends past the nominal panel edge, which ncurses handles gracefully
- `nvtop_adjust_field_sizes_for_nvlink()` checks `any_device_has_nvlink_active` (not `any_device_has_nvlink`) for fan compaction
- `interface_check_monitored_gpu_change()` resets ALL mutable NVLink state: both global flags plus `sizeof_device_field[device_fan_speed] = 11`, then calls per-device `nvtop_reset_nvlink_cache()`
- The fan field checks `any_device_has_nvlink_active` for the correct 11-character format on 0-link devices
- Throughput rendered via `print_data_at_scale()` (renamed from `print_pcie_at_scale()`; bounds check extended to `< 6` for the TiB/s ceiling)
- Error display reads `nvtop_get_nvlink_error_counts()` (does NOT call `nvtop_get_nvlink_info()` in the display path)
- Added missing `delwin()` calls for `shader_cores`, `l2_cache_size`, `exec_engines`, `plots[i].plot_window`, and `nvlink_errors`. Two of these are also submitted as standalone upstream PRs: the `free_device_windows()` fix (PR #467, "fix: add missing delwin() calls in free_device_windows") and the `plots[i].plot_window` fix (PR #468, "fix: free plot_window in delete_all_windows").

src/nvtop.c (+5 lines, -1 line)
- `nvtop_probe_nvlink_list()` and `nvtop_set_nvlink_probe()` called before curses init (first layout pass)
- Re-probe hooked into `interface_check_monitored_gpu_change()` for GPU hotplug

Design Decisions
Flat struct over nested
Single struct per device. Error and correction counters are cumulative totals (unsigned long long) summed across all links. Avoids per-link arrays and dynamic allocation in the hot refresh path.
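The flattened struct can be sketched from the field list in the Files Changed section below; the exact C types here (bool vs. unsigned, field order) are assumptions:

```c
#include <stdbool.h>

#define NVTOP_NVLINK_MAX_LINKS 36

/* Flat per-device NVLink info: cumulative totals only, no per-link array,
 * so the hot refresh path never allocates. */
struct nvlink_info {
  bool supported;                       /* device has NVLink hardware */
  bool has_throughput;                  /* throughput counters readable */
  unsigned num_links;                   /* active (bridged) links */
  unsigned version;                     /* remapped display version */
  unsigned long long aggregate_tx;      /* summed TX rate across links */
  unsigned long long aggregate_rx;      /* summed RX rate across links */
  unsigned long long total_errors;      /* errors since launch (baselined) */
  unsigned long long total_corrections; /* corrections since launch */
};
```

Zero-initializing this struct (`struct nvlink_info nvl = {0}`) yields a valid "no NVLink" state, which is what the probe path relies on.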
Two-tier error state: baseline plus display
Five fields in `struct gpu_info_nvidia` track error state: `baseline_errors`, `baseline_corrections`, `nvlink_errors_baseline_read` (bool), `display_errors`, `display_corrections`. Baselines persist for the entire process lifetime. Display values are computed each refresh as `cumulative - baseline`.

total_errors / total_corrections retained for API compatibility
Populated from `display_errors` / `display_corrections` in `nvlink_refresh_cached_info()`. The primary display path uses `nvtop_get_nvlink_error_counts()`, but both carry the same data.

No new dependencies
Uses only NVML symbols already in the `nvidia-ml` driver library and the `nvidia-smi` binary already present on the system.

What Was Not Changed
Process listing, memory/GPU charts, configuration options, keyboard shortcuts, menu behavior, and all non-NVLink display fields remain identical to upstream.
Testing
Dual RTX 3090 Founders Edition 24GB with a 3-slot NVLink Bridge (RTXA6000NVLINK3S-KIT). Displays in the UI as NVIDIA GeForce RTX 3090, with 4 physical links per GPU. The raw `enum nvmlNvlinkVersion_t` value returned is 5, representing NVLink v3.1. When idle, NVLink shows ~1.2 MiB/s of aggregate residual throughput, assumed to come from protocol keep-alives/link maintenance. Errors/corrections correctly display E:00000 C:00000 on every launch and should increment only when new errors occur (no errors have been observed yet to fully confirm the incrementing behavior).
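The zero-on-launch behavior verified above comes from the baseline subtraction described under "Two-tier error state"; a minimal sketch, with a hypothetical helper name and only the error fields modeled:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-device baseline state mirroring the changelog's field names
 * (corrections fields omitted for brevity; layout assumed). */
struct nvlink_error_state {
  bool baseline_read;
  uint64_t baseline_errors;
  uint64_t display_errors;
};

/* First call captures the cumulative counter as the baseline, so the UI
 * starts at E:00000; later calls show only increments since launch. An
 * underflow guard handles counters that reset beneath the baseline. */
static void nvlink_apply_baseline(struct nvlink_error_state *s,
                                  uint64_t cumulative) {
  if (!s->baseline_read) {
    s->baseline_errors = cumulative;
    s->baseline_read = true;
  }
  s->display_errors =
      cumulative >= s->baseline_errors ? cumulative - s->baseline_errors : 0;
}
```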