[draft] Add OtlpSink wrapper for ComponentStats OTLP/gRPC export (Phase 2) #1184

Draft
yutongzhang-microsoft wants to merge 3 commits into sonic-net:master from yutongzhang-microsoft:feature/otlp-sink-wrapper

yutongzhang-microsoft commented Apr 28, 2026

What

Introduce swss::OtlpSink — a thin C++ wrapper that converts a snapshot of ComponentStats counters into an OTLP/gRPC metric batch destined for a local OpenTelemetry Collector or Geneva mdm container. This is the OTLP half of the dual-sink design described in the Component Statistics HLD (sonic-net/SONiC#2312).

This PR is stacked on top of #1183 (build plumbing) and is intentionally opened as a draft — see Status below.

Why

Phase 1 (#1180 + sonic-net/sonic-swss#4516) already lands the in-memory counters and the COUNTERS_DB sink. Phase 2 needs to fan the same snapshot out over OTLP/gRPC so SONiC components can be observed by the standard OpenTelemetry pipeline (local Collector or Geneva mdm → any OTLP backend).

Splitting the wrapper out of the writer-thread integration keeps each PR small and reviewable: #1183 adds the build plumbing, this PR adds the wrapper, and PR C wires it into ComponentStats.

What this PR adds

common/component_stats_otlp.h

A small public header that exposes Config, DataPoint, exportBatch() and shutdown(), and intentionally hides every OpenTelemetry C++ SDK type behind the PIMPL idiom so callers (notably ComponentStats) do not transitively include any OTel headers.
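
As a rough illustration of the header shape described above, a PIMPL layout could look like the sketch below. Only the names Config, DataPoint, exportBatch(), shutdown(), isMonotonic and the 500 ms default come from this PR; the `sketch` namespace, every field name, the endpoint and component defaults, the return values for empty batches and moved-from instances, and the stub Impl are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace; the real class is swss::OtlpSink

class OtlpSink {
public:
    struct Config {
        std::string endpoint = "localhost:4317";       // assumed field and default
        std::string componentName = "orchagent";       // assumed field and default
        std::chrono::milliseconds exportTimeout{500};  // 500 ms default, per the PR text
    };

    struct DataPoint {
        std::string entity;        // exported as a label
        std::string metric;        // short metric name
        std::uint64_t value = 0;   // cumulative in-memory counter
        bool isMonotonic = true;   // true: Sum, false: Gauge
    };

    explicit OtlpSink(Config cfg);
    ~OtlpSink();

    // Move-only, as stated in the PR description.
    OtlpSink(OtlpSink&&) noexcept;
    OtlpSink& operator=(OtlpSink&&) noexcept;
    OtlpSink(const OtlpSink&) = delete;
    OtlpSink& operator=(const OtlpSink&) = delete;

    bool exportBatch(const std::vector<DataPoint>& points) noexcept;
    void shutdown() noexcept;

private:
    struct Impl;                   // defined only in the .cpp file
    std::unique_ptr<Impl> m_impl;  // no OTel types leak into this header
};

// Everything below would live in the .cpp file. The real Impl would own
// the SDK exporter; this stub only tracks whether shutdown() was called.
struct OtlpSink::Impl {
    explicit Impl(Config c) : cfg(std::move(c)) {}
    Config cfg;
    bool stopped = false;
};

OtlpSink::OtlpSink(Config cfg) : m_impl(new Impl(std::move(cfg))) {}
OtlpSink::~OtlpSink() = default;
OtlpSink::OtlpSink(OtlpSink&&) noexcept = default;
OtlpSink& OtlpSink::operator=(OtlpSink&&) noexcept = default;

bool OtlpSink::exportBatch(const std::vector<DataPoint>& points) noexcept {
    if (!m_impl || m_impl->stopped) return false;  // moved-from or shut down (assumed result)
    if (points.empty()) return true;               // empty batch is a no-op (assumed result)
    return true;                                   // stub: the real Impl calls the SDK here
}

void OtlpSink::shutdown() noexcept {
    if (m_impl) m_impl->stopped = true;            // safe to call repeatedly
}

}  // namespace sketch
```

Because the destructor and move operations are defined where Impl is complete, a caller that includes only the header never needs the definition of Impl or any SDK header.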

common/component_stats_otlp.cpp

  • Constructs an OtlpGrpcMetricExporter lazily and shares one gRPC channel for the lifetime of the sink.
  • Groups data points by metric name so that entity is exported as a label, not as part of the metric name (HLD §7.7). Final metric name is sonic.<componentName>.<metric>.
  • Maps isMonotonic=true → DELTA Sum; isMonotonic=false → Gauge with last-value semantics.
  • Cumulative-to-delta conversion is internal: callers pass the cumulative in-memory counter (so the Phase 1 COUNTERS_DB sink and this sink share the exact same input); the sink maintains a per-(entity,metric) cache and emits delta = current − last with a counter-reset guard (no uint64_t underflow).
  • Never throws: all SDK exceptions and Export() error results are caught, logged with SWSS_LOG_WARN, and converted to a false return so a dead Collector cannot stall the ComponentStats writer thread or affect the DB sink (HLD requirement R9).
  • Move-only; shutdown() is idempotent.
  • Per-export deadline is 500 ms by default — short enough not to overrun the 1 s writer tick.
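
The grouping rule in the list above (entity as a label, metric name of the form sonic.<componentName>.<metric>) can be sketched roughly as follows. This is not the code in this PR: the DataPoint layout, the Grouped alias, and the helper names metricName and groupByMetric are hypothetical, and the real implementation builds SDK metric structures rather than a std::map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Assumed layout of a data point; only the concept comes from the PR.
struct DataPoint {
    std::string entity;
    std::string metric;
    std::uint64_t value;
    bool isMonotonic;
};

// Full metric name -> list of (entity label, value) points.
using Grouped =
    std::map<std::string, std::vector<std::pair<std::string, std::uint64_t>>>;

// Builds the final metric name: sonic.<componentName>.<metric>.
inline std::string metricName(const std::string& component,
                              const std::string& metric) {
    return "sonic." + component + "." + metric;
}

// Groups points by metric name so the entity stays a per-point label
// instead of becoming part of the metric name.
inline Grouped groupByMetric(const std::string& component,
                             const std::vector<DataPoint>& points) {
    Grouped out;
    for (const auto& p : points) {
        out[metricName(component, p.metric)].emplace_back(p.entity, p.value);
    }
    return out;
}
```

With two entities reporting the same metric, the result holds one metric entry with two labelled points rather than two separately named metrics.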

Why DELTA, not CUMULATIVE

Geneva mdm currently rejects OTLP metrics whose Sum points carry AGGREGATION_TEMPORALITY_CUMULATIVE and silently drops them:

Raw metrics data were dropped because OTLP metrics with cumulative aggregation temporality is not supported. Data Dropped Count: 1

To make the very first production deployment functional rather than a silent no-op, this sink emits DELTA out of the box. Each MetricData's start_ts is the previous export's end_ts (or sink-creation time on the first export of a series) and end_ts is the current wall-clock — matching the OTLP delta contract.
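
The start/end timestamp rule can be illustrated with a small sketch. The WindowTracker class, its method names, and the use of std::chrono::system_clock are assumptions for illustration; the sink in this PR tracks these timestamps inside its own per-series cache.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

using TimePoint = std::chrono::system_clock::time_point;

struct Window {
    TimePoint start;  // start_ts of the delta interval
    TimePoint end;    // end_ts of the delta interval
};

// Hypothetical helper that hands out contiguous delta windows per series.
class WindowTracker {
public:
    explicit WindowTracker(TimePoint creationTs) : m_creationTs(creationTs) {}

    // Returns [previous end, now] for a known series, or
    // [sink creation time, now] on the first export of that series.
    Window next(const std::string& seriesKey, TimePoint now) {
        auto it = m_lastEnd.find(seriesKey);
        TimePoint start = (it == m_lastEnd.end()) ? m_creationTs : it->second;
        m_lastEnd[seriesKey] = now;
        return Window{start, now};
    }

private:
    TimePoint m_creationTs;
    std::map<std::string, TimePoint> m_lastEnd;
};
```

Consecutive windows for one series share a boundary, so the intervals neither overlap nor leave gaps.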

tests/component_stats_otlp_ut.cpp

Seven smoke tests covering the contract:

  1. construct/destruct does not crash;
  2. empty batch is a no-op;
  3. exporting to an unreachable Collector does not throw;
  4. shutdown() is idempotent;
  5. a moved-from instance is harmless;
  6. three consecutive cumulative snapshots convert to deltas without crashing;
  7. a counter reset (current < last) does not underflow.

A real in-process gRPC mock-server test is deferred to a follow-up.

Build wiring

Both common/Makefile.am and tests/Makefile.am add an if OTLP … endif block mirroring the existing YANGMODS pattern, so the new sources and OPENTELEMETRY_LIBS are only compiled and linked when --enable-otlp is passed.

Status — why this PR is a Draft

  • opentelemetry-cpp is not yet packaged in the SONiC build environment. CI will run with --disable-otlp and therefore exercise none of the new code; this PR is non-mergeable until the SDK is available in sonic-buildimage.
  • This PR is stacked on top of #1183 (Add --enable-otlp configure flag and 'otlp' debian build profile). Please review #1183 first.
  • The wire-up to ComponentStats lives in PR C and will land after this PR.
  • An alternative Phase 2 deployment path — telegraf reading COUNTERS_DB and exporting OTLP to the Geneva mdm container — keeps swss-common code unchanged and may be pursued in parallel; this PR remains useful for the eventual application-direct-emit path.

Compatibility

  • Default builds are unaffected. --enable-otlp is opt-in and ships disabled.
  • No public header outside this PR is changed.
  • No ABI or API of existing classes is touched.

Checklist

yutongzhang-microsoft and others added 2 commits April 28, 2026 15:14
Wire the build system to optionally pull in opentelemetry-cpp so a follow-up
PR can add an OTLP sink to ComponentStats. The default build is unchanged:

  * configure.ac gains a new --enable-otlp option (default: disabled).
    When enabled, the build probes for opentelemetry-cpp via pkg-config and
    falls back to a header check + a hard-coded -l<lib> list for SDKs that
    are not packaged with .pc files. HAVE_OTLP is defined and OTLP is
    exposed as an automake conditional, plus OPENTELEMETRY_CFLAGS /
    OPENTELEMETRY_LIBS substitutions for use by Makefile.am in later PRs.

  * debian/rules gains a new 'otlp' build profile. When the profile is
    active, --enable-otlp is passed to configure; otherwise --disable-otlp
    is passed, which is the current behaviour.

This is the build-system half of Phase 2 in the Component Statistics HLD
(sonic-net/SONiC#2312). It does not add any source files, does not change
the public API, and does not affect any default build path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Introduce a thin C++ wrapper, swss::OtlpSink, that converts a snapshot of
ComponentStats counters into an OTLP/gRPC batch destined for a local
OpenTelemetry Collector. The class is the OTLP half of the dual-sink
design described in the Component Statistics HLD (sonic-net/SONiC#2312).

What the wrapper provides:

  * A small public header (common/component_stats_otlp.h) that exposes
    a Config struct, a DataPoint struct, exportBatch(), and shutdown(),
    and intentionally hides every OpenTelemetry C++ SDK type behind PIMPL
    so callers (notably ComponentStats) do not transitively include any
    OTel headers.

  * An implementation (common/component_stats_otlp.cpp) that:
      - constructs an OtlpGrpcMetricExporter lazily and shares one gRPC
        channel for the lifetime of the sink;
      - groups data points by metric name so that 'entity' is exported as
        a label rather than as part of the metric name (HLD section 7.7);
      - maps isMonotonic=true to a CUMULATIVE Sum, isMonotonic=false to a
        Gauge with last-value semantics;
      - never throws: all SDK exceptions and Export() error results are
        caught, logged, and converted to a 'false' return so a dead
        Collector cannot stall the ComponentStats writer thread or affect
        the DB sink (HLD requirement R9);
      - is move-only and idempotent on shutdown.

  * Five smoke tests (tests/component_stats_otlp_ut.cpp) covering the
    contract: construct/destruct, empty batch is a no-op, exporting to an
    unreachable Collector does not throw, shutdown is idempotent, and a
    moved-from instance is harmless. A real in-process gRPC mock server
    test is deferred to a follow-up.
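
The never-throws rule in the list above could be expressed with a wrapper along these lines. This is a sketch, not the code in this commit: exportNoThrow is a hypothetical helper, the callable stands in for the SDK Export() call, and std::cerr stands in for the project's logging.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical helper: runs an export step and turns every failure mode
// (false result, std::exception, any other exception) into 'false'.
inline bool exportNoThrow(const std::function<bool()>& doExport) noexcept {
    try {
        if (!doExport()) {
            std::cerr << "OTLP export failed\n";  // stand-in for a warning log
            return false;
        }
        return true;
    } catch (const std::exception& e) {
        std::cerr << "OTLP export threw: " << e.what() << '\n';
    } catch (...) {
        std::cerr << "OTLP export threw a non-standard exception\n";
    }
    return false;
}
```

An empty std::function is also covered, because calling it throws std::bad_function_call, which derives from std::exception.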

Build wiring:

  * common/Makefile.am: when --enable-otlp is active, append
    component_stats_otlp.cpp to libswsscommon and link OPENTELEMETRY_LIBS.
  * tests/Makefile.am:  when --enable-otlp is active, append the unit
    test and link OPENTELEMETRY_LIBS.

Default builds are unaffected because --enable-otlp is opt-in and ships
disabled by default (added in sonic-net#1183).

Phase 2 follow-ups:

  * PR C: connect OtlpSink to ComponentStats writer-thread fan-out.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Geneva mdm rejects OTLP metrics whose Sum points carry CUMULATIVE
aggregation temporality (mdm log: 'Raw metrics data were dropped because
OTLP metrics with cumulative aggregation temporality is not supported.
Data Dropped Count: 1'). Switch the sink to DELTA so the very first
production deployment is not silently a no-op.

Behaviour change inside Impl::exportBatch():

  * Per-series cache (lastValue, lastEndTs) keyed by '<entity>\x1f<metric>'.
  * For each Sum point: delta = current - lastValue, with a counter-reset
    guard (current < lastValue is treated as 'delta = current', no
    uint64_t underflow). The cache then advances unconditionally so a
    transient Export() failure costs at most one batch.
  * Per-metric MetricData.start_ts is the previous end_ts (or
    creationTs on the first export of that series), end_ts is now -
    matching the OTLP delta contract.
  * Gauge points are unchanged (LastValuePointData has no temporality);
    their MetricData.aggregation_temporality is set to Unspecified.
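
The cumulative-to-delta step for Sum points can be sketched as below. The DeltaCache class and its method name are hypothetical, and treating an unseen series as having a previous value of 0 is an assumption, since the commit message does not say what the first delta of a series is.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-series cache of the last cumulative value seen.
class DeltaCache {
public:
    // Returns the delta for one cumulative sample and advances the cache.
    std::uint64_t delta(const std::string& entity,
                        const std::string& metric,
                        std::uint64_t current) {
        // Unit separator (0x1f) keeps entity and metric from colliding.
        const std::string key = entity + '\x1f' + metric;
        auto it = m_last.find(key);
        // Assumption: a series seen for the first time starts from 0.
        const std::uint64_t last = (it == m_last.end()) ? 0 : it->second;
        // Counter-reset guard: never compute current - last when it would wrap.
        const std::uint64_t d = (current >= last) ? current - last : current;
        // Advance unconditionally, whether or not the export later succeeds.
        m_last[key] = current;
        return d;
    }

private:
    std::unordered_map<std::string, std::uint64_t> m_last;
};
```

For cumulative samples 100, 150, 150 and then 20 on one series, this yields deltas of 100, 50, 0 and 20; the last value reflects the reset guard rather than an underflowed subtraction.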

API stays cumulative: callers (ComponentStats in PR C) still pass the
cumulative in-memory counter, so the Phase 1 COUNTERS_DB sink and this
sink share the exact same input. Cumulative-to-delta conversion is the
sink's responsibility, not the caller's.

Tests: two new GTest cases cover (1) three consecutive snapshots of the
same series across exportBatch() calls, and (2) a counter-reset
(current < last) without underflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
