test(smp): Agent vs Agent+ADP 1Hz dogstatsd benchmark #1592

Draft
jszwedko wants to merge 12 commits into luke/configurable-aggregation from jszwedko/agent-vs-adp-1hz-benchmark

jszwedko (Collaborator) commented May 5, 2026

Summary

Adds an SMP regression benchmark comparing the Datadog Agent (baseline) against a converged Agent+Agent-Data-Plane image (comparison) for raw dogstatsd ingest at a 1-second aggregation bucket interval. Stacked on #1459 so that the single `aggregator_bucket_size_seconds: 1` knob in `datadog.yaml` reaches both Agent core and ADP. Reuses `test/smp/regression/adp/` in place: the OTLP and quality_gates cases are removed, and the 15 `dsd_uds_*` cases are rewritten to drive the Agent entrypoint and a converged Dockerfile.

Based on #1327, the PR that compared tag filter experiments.

Draft because the meaningful review signal is the SMP report comment, which CI posts after the benchmark runs.

Test plan

  • CI green: `build-adp-baseline-image`, `build-adp-comparison-image`, `build-agent-adp-baseline-image`, `build-agent-adp-comparison-image`, `run-benchmarks-adp`, `binary-size-analysis`.
  • SMP report posted as a PR comment under header "Regression Detector (Agent vs Agent+ADP, 1Hz dogstatsd)".
  • Spot-check the report: comparison side shows ADP-side metrics (`sub_agent: adp` series), 1Hz bucketing visible (flush rate / output volume).

References

Stacked on #1459 (`luke/configurable-aggregation`). Modeled after #1327 but slimmer — no tag filtering, reuses existing dsd_uds cases.

🤖 Generated with Claude Code

jszwedko and others added 11 commits May 5, 2026 15:53
Stacks on PR #1459 (luke/configurable-aggregation). Reuses test/smp/regression/adp/
in-place: deletes OTLP + quality_gates cases, rewrites dsd_uds_* to drive a
Datadog Agent baseline vs converged Agent+ADP comparison with
aggregator_bucket_size_seconds: 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nine tasks covering case file rewrites, Dockerfile build args, GitLab
CI build jobs, run-benchmarks-adp repointing, and the draft PR open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The adp/ SMP target directory will be repurposed for an Agent-vs-Agent+ADP
1Hz dogstatsd comparison. Only the dsd_uds_* cases are kept; everything
else moves out of scope on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The converged Agent+ADP image boots ADP via the Agent's s6 supervisor
(not via target.command), so /etc/agent-data-plane/empty.yaml and the
IPC cert are no longer referenced. Remove them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f1e04a-full

Per user request, base off the converged Agent dev image carrying Luke's
configurable-aggregation patch on top of 15f1e04a, instead of the
upstream 15f1e04a-py3-jmx tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single source of truth for aggregator_bucket_size_seconds. Agent core
reads the file directly; ADP reads it via config-stream gRPC when
DD_DATA_PLANE_USE_NEW_CONFIG_STREAM_ENDPOINT=true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Targets the Agent's /bin/entrypoint.sh instead of the ADP binary directly,
so the same case files drive both the Datadog Agent baseline image and the
converged Agent+ADP comparison image. Memory allotment bumped from 2GiB to
3200MiB to fit Agent + ADP + JVM in one container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add 127.0.0.1:9092 blackhole for the process-agent's process_dd_url.
- Scrape Agent core telemetry on 127.0.0.1:5000 alongside ADP on :5102,
  tagging both with sub_agent so the SMP report can attribute metrics.
- Switch dogstatsd UDS path to /tmp/dsd.socket to match datadog.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the converged Agent+ADP image bake in the env vars ADP needs to boot
in non-standalone, config-stream-driven mode (REMOTE_AGENT_ENABLED=true,
USE_NEW_CONFIG_STREAM_ENDPOINT=true, DOGSTATSD_ENABLED=true). The build
job in .gitlab/benchmark.yml supplies the values for the SMP comparison
image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build-agent-adp-baseline-image retags the upstream Datadog Agent dev
image (luke-configurable-aggregation-15f1e04a-full, from datadog-agent#49676)
as the SMP baseline. build-agent-adp-comparison-image builds
Dockerfile.datadog-agent on top of the comparison ADP image with all
DD_DATA_PLANE_* knobs set to drive non-standalone, config-stream-driven
mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now consumes BASELINE_AGENT_IMG (vanilla Datadog Agent dev image) and
COMPARISON_AGENT_IMG (converged Agent+ADP image) from the new build
jobs. Same SMP target dir; the dsd_uds cases are now driven through the
Agent entrypoint with aggregator_bucket_size_seconds: 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dd-octo-sts (bot) added the area/ci (CI/CD, automated testing, etc.), area/docs (Reference documentation.), and area/test (All things testing: unit/integration, correctness, SMP regression, etc.) labels May 5, 2026

pr-commenter (bot) commented May 5, 2026

Binary Size Analysis (Agent Data Plane)

Target: dc8f4b9 (baseline) vs a25ec14 (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 37.11 MiB
Comparison Size: 37.00 MiB
Size Change: -111.49 KiB (-0.29%)
Pass/Fail Threshold: +5%
Result: PASSED ✅
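
The percentage and pass/fail result above can be reproduced from the reported sizes. The following is a minimal sketch, not the analysis tool's actual code: the function names are hypothetical, and it assumes the `+5%` threshold only fails the check when the binary grows by more than that amount.

```python
KIB_PER_MIB = 1024


def size_change_percent(baseline_mib: float, change_kib: float) -> float:
    """Relative size change, as a percentage of the baseline size."""
    return change_kib / (baseline_mib * KIB_PER_MIB) * 100.0


def passes_threshold(change_percent: float, threshold_percent: float = 5.0) -> bool:
    # Assumption: only growth beyond the threshold fails; shrinkage always passes.
    return change_percent <= threshold_percent


# Figures copied from the report: 37.11 MiB baseline, -111.49 KiB change.
pct = size_change_percent(baseline_mib=37.11, change_kib=-111.49)
print(f"{pct:+.2f}%", "PASSED" if passes_threshold(pct) else "FAILED")
```

With the reported figures this prints `-0.29% PASSED`, matching the summary.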

Changes by Module

| Module | File Size | Symbols |
|---|---|---|
| figment | -118.82 KiB | 125 |
| otlp_protos::otlp_include::opentelemetry | -41.79 KiB | 103 |
| hyper | +25.69 KiB | 76 |
| prost | +24.60 KiB | 66 |
| hyper_util | -14.05 KiB | 13 |
| hashbrown | +11.34 KiB | 72 |
| h2 | -8.86 KiB | 92 |
| [sections] | -8.52 KiB | 6 |
| tonic | -7.37 KiB | 33 |
| core | -7.09 KiB | 878 |
| serde_core | +6.51 KiB | 85 |
| serde | +6.39 KiB | 18 |
| tower | +5.15 KiB | 11 |
| async_compression | +4.62 KiB | 19 |
| tokio_util | +3.94 KiB | 14 |
| alloc | +3.76 KiB | 50 |
| saluki_components::sources::otlp | +3.75 KiB | 17 |
| tokio | +3.67 KiB | 124 |
| saluki_core::data_model::event | -3.40 KiB | 8 |
| futures_channel | -3.35 KiB | 7 |

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +18.5Ki  [NEW] +18.3Ki    saluki_components::transforms::apm_stats::span_concentrator::SpanConcentrator::flush::h4cec187aab531472
  [NEW] +16.5Ki  [NEW] +16.4Ki    saluki_components::transforms::apm_stats::span_concentrator::SpanConcentrator::add_span::hf95b8e429cc5bbe0
  [NEW] +16.1Ki  [NEW] +16.0Ki    saluki_components::transforms::apm_stats::span_concentrator::SpanConcentrator::new_stat_span_from_span::h831751497d326aef
  +283% +15.9Ki  +288% +15.9Ki    h2::proto::connection::DynConnection<B>::recv_frame::h9d7adeb5727e1522
  [NEW] +12.3Ki  [NEW] +12.1Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::hf9a832fc7767a946
  [NEW] +9.39Ki  [NEW] +9.23Ki    _<hyper::proto::h2::server::Server<T,S,B,E> as core::future::future::Future>::poll::h7fad89436d42473e
  +750% +6.74Ki  +828% +6.74Ki    prost::encoding::message::merge_repeated::hc52fda914c63fb75
  [NEW] +6.50Ki  [NEW] +6.28Ki    saluki_components::common::datadog::apm::_::_<impl serde_core::de::Deserialize for saluki_components::common::datadog::apm::ApmConfiguration>::deserialize::h2b55df90d15c8dc3
  +739% +6.39Ki  +819% +6.39Ki    prost::encoding::message::merge_repeated::h125609fe5afef278
  [DEL] -6.56Ki  [DEL] -6.41Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h14607bccbe25f0f5
 -24.0% -8.01Ki -24.1% -7.98Ki    _<saluki_components::transforms::apm_stats::ApmStats as saluki_core::components::transforms::Transform>::run::_{{closure}}::h63fe22416badd464
  [DEL] -8.13Ki  [DEL] -8.00Ki    figment::value::de::_<impl figment::value::value::Value>::deserialize_from::hc178e2144edf2db7
 -67.1% -9.24Ki -67.7% -9.24Ki    saluki_components::transforms::trace_obfuscation::sql::obfuscate_sql_string::hbc6c7c370aac7ff9
  [DEL] -9.49Ki  [DEL] -9.34Ki    _<figment::value::magic::Tagged<T> as figment::value::magic::Magic>::deserialize_from::h44f59c3078bb5bee
  [DEL] -9.74Ki  [DEL] -9.59Ki    _<figment::value::magic::RelativePathBuf as figment::value::magic::Magic>::deserialize_from::h795d206f112d7dfd
 -81.7% -10.1Ki -82.6% -10.1Ki    _<core::pin::Pin<P> as core::future::future::Future>::poll::h901e76ef802f2f4c
  [DEL] -11.6Ki  [DEL] -11.5Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::h262edabf8ad9b351
  [DEL] -15.4Ki  [DEL] -15.3Ki    _<figment::value::magic::Tagged<T> as figment::value::magic::Magic>::deserialize_from::hf4491bd4db3bec82
  [DEL] -15.9Ki  [DEL] -15.7Ki    _<figment::value::magic::RelativePathBuf as figment::value::magic::Magic>::deserialize_from::hf41c2cc956726b9d
  [DEL] -32.1Ki  [DEL] -32.0Ki    saluki_components::transforms::apm_stats::ApmStats::process_trace::h7d2f794f20a992a4
  -1.4% -83.5Ki  -1.4% -69.0Ki    [4675 Others]
  -0.3%  -111Ki  -0.3% -96.7Ki    TOTAL

The adp/ SMP target dir is generated from experiments.yaml via
generate_experiments.py, and CI's check-smp-experiments verifies
cases/ stays in sync. The hand-rewritten case files for the
Agent-vs-Agent+ADP 1Hz benchmark were drifting from that source of
truth — running the check would fail.

This commit makes experiments.yaml the source of truth again:

- Rewrites the global block to target the Datadog Agent entrypoint
  (name: datadog-agent, command: /bin/entrypoint.sh, files:
  datadog.yaml sourced from shared/datadog.yaml).
- Bumps memory_allotment to 3200MiB and trims env to DD_API_KEY +
  DD_HOSTNAME — the Agent reads the rest from datadog.yaml.
- Drops ADP-standalone-only env from the dsd_base template and
  switches the unix_datagram path to /tmp/dsd.socket.
- Drops OTLP and quality_gates experiments + templates (out of scope
  on this branch).
- Adds shared/datadog.yaml (the 1Hz config the Agent reads;
  aggregator_bucket_size_seconds: 1 reaches ADP via config-stream).
- Adds a `${EXPERIMENT_NAME}` placeholder substitution to
  generate_experiments.py so DD_INTERNAL_PROFILING_EXTRA_TAGS gets
  the per-case expanded name without duplicating each experiment
  three times.

After regen, `make check-smp-experiments` passes against the 15
dsd_uds_* cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
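
The `${EXPERIMENT_NAME}` substitution described in the commit above could look roughly like the sketch below. This is an illustration only, not the code in `generate_experiments.py`: the function name, the shape of the template dictionary, the `experiment:` tag prefix, and the assumption that each experiment expands into `cpu`, `memory`, and `throughput` cases (inferred from the case names) are all hypothetical.

```python
from typing import Any

PLACEHOLDER = "${EXPERIMENT_NAME}"


def substitute_experiment_name(node: Any, experiment_name: str) -> Any:
    """Return a copy of `node` with PLACEHOLDER replaced in every string value."""
    if isinstance(node, str):
        return node.replace(PLACEHOLDER, experiment_name)
    if isinstance(node, dict):
        return {k: substitute_experiment_name(v, experiment_name) for k, v in node.items()}
    if isinstance(node, list):
        return [substitute_experiment_name(v, experiment_name) for v in node]
    return node  # numbers, booleans, None: nothing to substitute


# Hypothetical template; the real experiments.yaml layout may differ.
template = {
    "environment": {
        "DD_INTERNAL_PROFILING_EXTRA_TAGS": "experiment:${EXPERIMENT_NAME}",
    }
}

# One template expands into one case per optimization goal.
expanded = [
    substitute_experiment_name(template, f"dsd_uds_10mb_3k_contexts_{goal}")
    for goal in ("cpu", "memory", "throughput")
]
for case in expanded:
    print(case["environment"]["DD_INTERNAL_PROFILING_EXTRA_TAGS"])
```

Because the helper returns a new structure rather than mutating its input, the same template can be reused for every expanded case.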

pr-commenter (bot) commented May 5, 2026

Regression Detector (Agent vs Agent+ADP, 1Hz dogstatsd)

Regression Detector Results

Run ID: 6e12d29a-e079-4e47-b394-db9d8f0fb254

Baseline: 15f1e04a
Comparison: a25ec14
Diff

Optimization Goals: ❌ Regression(s) detected

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| | dsd_uds_512kb_3k_contexts_memory | memory utilization | +5.96 | [+5.74, +6.18] | 1 | bounds checks dashboard |
| | dsd_uds_10mb_3k_contexts_memory | memory utilization | -15.26 | [-15.43, -15.09] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -21.53 | [-21.68, -21.39] | 1 | bounds checks dashboard |
| | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | -46.76 | [-70.35, -23.18] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -56.79 | [-65.52, -48.07] | 1 | bounds checks dashboard |
| | dsd_uds_100mb_3k_contexts_memory | memory utilization | -56.99 | [-57.14, -56.84] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_memory | memory utilization | -76.10 | [-76.21, -75.99] | 1 | bounds checks dashboard |

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| | dsd_uds_512kb_3k_contexts_memory | memory utilization | +5.96 | [+5.74, +6.18] | 1 | bounds checks dashboard |
| | dsd_uds_1mb_3k_contexts_memory | memory utilization | +3.90 | [+3.68, +4.12] | 1 | bounds checks dashboard |
| | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.06, +0.06] | 1 | bounds checks dashboard |
| | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.06, +0.06] | 1 | bounds checks dashboard |
| | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.06, +0.06] | 1 | bounds checks dashboard |
| | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | -0.05 | [-0.21, +0.10] | 1 | bounds checks dashboard |
| | dsd_uds_10mb_3k_contexts_memory | memory utilization | -15.26 | [-15.43, -15.09] | 1 | bounds checks dashboard |
| | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | -18.44 | [-79.85, +42.98] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -21.53 | [-21.68, -21.39] | 1 | bounds checks dashboard |
| | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | -26.11 | [-92.52, +40.30] | 1 | bounds checks dashboard |
| | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | -42.31 | [-106.56, +21.93] | 1 | bounds checks dashboard |
| | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | -46.76 | [-70.35, -23.18] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -56.79 | [-65.52, -48.07] | 1 | bounds checks dashboard |
| | dsd_uds_100mb_3k_contexts_memory | memory utilization | -56.99 | [-57.14, -56.84] | 1 | bounds checks dashboard |
| | dsd_uds_500mb_3k_contexts_memory | memory utilization | -76.10 | [-76.21, -75.99] | 1 | bounds checks dashboard |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
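
The three criteria above can be expressed as a small predicate. The sketch below is illustrative only and is not the Regression Detector's implementation: the `Result` type and `is_flagged` function are hypothetical names, and the `erratic` flag defaults to `False` because the report does not show it for these cases. The sample rows are copied from the detailed table in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    experiment: str
    delta_mean_pct: float
    ci_low: float
    ci_high: float
    erratic: bool = False  # assumption: not shown in the report, so default to False


def is_flagged(r: Result, tolerance_pct: float = 5.0) -> bool:
    """Apply the three criteria: effect size, CI excludes zero, not erratic."""
    big_enough = abs(r.delta_mean_pct) >= tolerance_pct
    ci_excludes_zero = r.ci_low > 0.0 or r.ci_high < 0.0
    return big_enough and ci_excludes_zero and not r.erratic


rows = [
    Result("dsd_uds_512kb_3k_contexts_memory", +5.96, +5.74, +6.18),
    Result("dsd_uds_1mb_3k_contexts_memory", +3.90, +3.68, +4.12),
    Result("dsd_uds_10mb_3k_contexts_cpu", -18.44, -79.85, +42.98),
    Result("dsd_uds_500mb_3k_contexts_throughput", -21.53, -21.68, -21.39),
]
flagged = [r.experiment for r in rows if is_flagged(r)]
print(flagged)
```

Of these four rows, only the first and last are flagged: `dsd_uds_1mb_3k_contexts_memory` falls below the 5.00% effect size tolerance, and the confidence interval for `dsd_uds_10mb_3k_contexts_cpu` contains zero. This is consistent with which experiments appear in the "Optimization Goals" table.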
