feat(metrics): enable RETIRED_SBE and RETIRED_DBE counters by default #666

Open
75asu wants to merge 1 commit into NVIDIA:main from 75asu:feat/etc-enable-retired-sbe-dbe-metrics

75asu commented May 30, 2026

Resolves #646

  • Uncomments DCGM_FI_DEV_RETIRED_SBE and DCGM_FI_DEV_RETIRED_DBE in etc/dcp-metrics-included.csv, etc/default-counters.csv, and tests/integration/testdata/default-counters.csv.
  • These are the standard ECC retired-page counters that fleet operators use to detect GPUs that are starting to fail. They are available on all datacenter GPUs where ECC is enabled.
  • Git history (c6a7730b, 2021-02-25) shows these lines were inherited as commented-out defaults rather than deliberately disabled. No Go code references them, so the generic CSV-loaded counter path is sufficient.
  • DCGM_FI_DEV_RETIRED_PENDING (adjacent line) is left commented out: it was not requested in the issue, and enabling it here would be scope creep.
  • Note: the issue also mentions DCGM_FI_DEV_XID_ERRORS as missing, but that metric is already enabled by default in default-counters.csv and has dedicated handling in internal/pkg/collector/xid_collector.go. If a user isn't seeing it, that's likely environmental (custom counters file, version mismatch, or GPU/driver not reporting). Recommend a separate issue with version + hardware details to investigate.
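
For anyone checking their own counters file (for example when a metric such as DCGM_FI_DEV_XID_ERRORS does not show up), here is a minimal Go sketch of what "uncommenting" changes. It assumes the usual counters CSV layout (DCGM field, Prometheus metric type, help message) and that lines starting with `#` are comments. The `enabledFields` helper and the `sample` text are hypothetical and written for illustration only; they are not the exporter's actual loader, and the help strings are paraphrased rather than copied from the repo.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sample is an illustrative excerpt of a counters CSV after this change.
// Assumption: columns are "DCGM field, Prometheus metric type, help message".
const sample = `# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_RETIRED_SBE, counter, Retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Pages pending retirement.
`

// enabledFields (hypothetical helper) returns the field names found on
// non-empty, non-comment lines. The first comma-separated column is
// treated as the DCGM field name.
func enabledFields(csv string) map[string]bool {
	enabled := make(map[string]bool)
	scanner := bufio.NewScanner(strings.NewReader(csv))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank or commented-out entry: not collected
		}
		field, _, _ := strings.Cut(line, ",")
		enabled[strings.TrimSpace(field)] = true
	}
	return enabled
}

func main() {
	enabled := enabledFields(sample)
	for _, f := range []string{
		"DCGM_FI_DEV_XID_ERRORS",
		"DCGM_FI_DEV_RETIRED_SBE",
		"DCGM_FI_DEV_RETIRED_DBE",
		"DCGM_FI_DEV_RETIRED_PENDING",
	} {
		fmt.Printf("%-28s enabled=%t\n", f, enabled[f])
	}
}
```

Running the same check against a custom counters file is a quick way to rule out the "custom counters file" cause mentioned in the note above before opening a separate issue.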

Signed-off-by: Asutosh Panda <asutosh.pda@gmail.com>

Development

Successfully merging this pull request may close these issues.

Missing metrics DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE and DCGM_FI_DEV_XID_ERRORS