DADP-71 Add ADP point telemetry to Agent telemetry#50750
DADP-71 Add ADP point telemetry to Agent telemetry#50750gh-worker-dd-mergequeue-cf854d[bot] merged 8 commits into
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
Files inventory check summaryFile checks results against ancestor a0e32982: Results for datadog-agent_7.80.0~devel.git.796.ad45fbb.pipeline.113160850-1_amd64.deb:No change detected |
| - name: point.dropped | ||
| aggregate_tags: | ||
| - domain | ||
| - remote_agent |
There was a problem hiding this comment.
My understanding is these will change only the COAT version of the metrics to start including these two tags. Since COAT is internal, strict compatibility on the shape of these metrics is not required
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
12 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 6d04d3c Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.10 | [+1.12, +7.08] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +4.10 | [+1.12, +7.08] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | +0.56 | [+0.40, +0.73] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.56 | [+0.36, +0.76] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.40 | [-0.60, +1.40] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.28 | [+0.10, +0.46] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.23 | [+0.17, +0.29] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.12 | [+0.06, +0.17] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.10, +0.08] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.22, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.02 | [-0.42, +0.39] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.02 | [-0.46, +0.42] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.17, +0.13] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.04 | [-0.25, +0.17] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.09 | [-0.59, +0.41] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.10 | [-0.33, +0.14] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.11 | [-0.31, +0.09] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.13 | [-0.23, -0.03] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.13 | [-0.18, -0.08] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.21 | [-0.26, -0.17] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | -0.28 | [-0.33, -0.23] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.41 | [-0.57, -0.25] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -0.45 | [-0.70, -0.20] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | -0.53 | [-0.63, -0.43] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 684 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 247.30MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 727 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 144.69MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 481.02MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.81MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 343.77 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 366.86MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
|
🎯 Code Coverage (details) 🔗 Commit SHA: ad45fbb | Docs | Datadog PR Page | Give us feedback! |
| // do not prevent the customer-facing telemetry check from reporting Core Agent default telemetry values. | ||
| var regularMfs []*dto.MetricFamily | ||
| if gathered, err := c.telemetry.Gather(false); err != nil { | ||
| log.Warnf("failed to gather regular telemetry metrics for default telemetry merge: %v", err) |
There was a problem hiding this comment.
If this fails it could get pretty noisy as it would emit every 15 seconds, wondering if I should remove it or make it debug level?
There was a problem hiding this comment.
I would vouch for debug level
There was a problem hiding this comment.
I think if it is only every 15s it seems reasonable to me. The impact is that this telemetry is missing which is significant.
| // Remote Agent Registry telemetry lives in the regular registry. Gather it on a best-effort basis so failures there | ||
| // do not prevent the customer-facing telemetry check from reporting Core Agent default telemetry values. | ||
| var regularMfs []*dto.MetricFamily | ||
| if gathered, err := c.telemetry.Gather(false); err != nil { |
There was a problem hiding this comment.
Note that this will now call out to remote agents via RAR every 15 seconds (default interval on telemetry check). Per @tobz we do not expect this to be a significant runtime cost.
There was a problem hiding this comment.
But why is it important to change if it will be sent out only every 15m?
There was a problem hiding this comment.
COAT is emitted every 15m, but this PR also includes the telemetry in the point.sent / point.dropped that is sent to the customer's org where we want more frequent reporting.
jeremy-hanna
left a comment
There was a problem hiding this comment.
👍 for agent-runtime owned files
iglendd
left a comment
There was a problem hiding this comment.
Looks good from comp/core/agenttelemetry perspective. One thing worth to reconsider and/or explain (or reach out offline), is increased frequency of polling metrics from remote registry, since it will be emitted every 15m it will be wasteful.
| - name: point.dropped | ||
| aggregate_tags: | ||
| - domain | ||
| - remote_agent |
| // Remote Agent Registry telemetry lives in the regular registry. Gather it on a best-effort basis so failures there | ||
| // do not prevent the customer-facing telemetry check from reporting Core Agent default telemetry values. | ||
| var regularMfs []*dto.MetricFamily | ||
| if gathered, err := c.telemetry.Gather(false); err != nil { |
There was a problem hiding this comment.
But why is it important to change if it will be sent out only every 15m?
| // do not prevent the customer-facing telemetry check from reporting Core Agent default telemetry values. | ||
| var regularMfs []*dto.MetricFamily | ||
| if gathered, err := c.telemetry.Gather(false); err != nil { | ||
| log.Warnf("failed to gather regular telemetry metrics for default telemetry merge: %v", err) |
There was a problem hiding this comment.
I would vouch for debug level
|
|
||
| // coalesceMetricFamilies merges compatible metric families with the same name. | ||
| // | ||
| // The regular and default telemetry registries are gathered separately. Coalescing lets profile aggregation see all |
|
Tested this out locally and observed:
|
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
|
Human Summary
We are working with a handful of design partners to start using the Agent Data Plane (ADP) to handle DogStatsD metrics in customer orgs. This will shift custom metric payloads from the core agent to ADP, which will then forward them along to the Datadog backend. This will affect the
datadog.agent.point.sentanddatadog.agent.point.droppedmetrics which are currently sent both to the customer org and to Datadog via Cross-Org Agent Telemetry (COAT). These can be sourced from ADP via the Remote Agent Registry (RAR) and ADP's TelemetryProvider.However, the sources of these two metrics will not be entirely shifted to ADP. Core Agent will still submit some points on its own from checks and other internal metrics such as
datadog.agent.running. This presents a new functional requirement where we need to be able to merge both the Core Agent and ADP versions of these metrics before forwarding them along.This PR does three things:
Within the customer-facing
telemetrycheck flow, "regular" metrics are now gathered. This includes RAR-sourced metrics from remote agents. Selected regular metrics (currently just these two) are merged into the existing "default" metric set before being forwarded.Within the internal COAT flow (
agenttelemetry), we perform a similar merge. A big difference here is thatagenttelemetryis already sourcing "regular" metrics, but previously there were no cases where a single metric came from both the core agent and RAR. Now that's supported and metrics are merged.The COAT versions of these two metrics are updated to include their
domainandremote_agenttags. This will allow us to differentiate Core Agent traffic from ADP traffic for Datadog internal telemetry.See DADP-71 for further rationale.
Rationale on Required Labels and QA Actions
This is my first Agent PR so it's entirely possible I misunderstood some of the intention behind these, but here's my first pass:
changelog/no-changelog. This change is intended to maintain compatibility with existing behavior when enabling the Agent Data Plane, so user-facing functionality is unaffected. Datadog-internal COAT metrics have some tagging changes which does not seem changelog-worthy.qa/done. See this PR on Saluki, behavior was verified using differential tests comparing the Agent with and without Agent Data Plane enabled: feat(correctness): add agent telemetry correctness test saluki#1637need-change/agenttelemetry-governanceand commented on existing governance card ASUP-31 pinging owner @carlosromanAgentic Summary
What does this PR do?
Adds Core Agent support for ADP point telemetry from Remote Agent Registry in both Agent telemetry paths:
domainandremote_agenttags forpoint.sent/point.droppedby updating the defaultlogs-and-metricsprofile.point.sentand Core Agentpoint.sent, from overwriting each other in the Agent telemetry payload map.point.sentandpoint.dropped, grouped bydomain.datadog.agent.point.sent{domain:...}datadog.agent.point.dropped{domain:...}Motivation
DADP-71: when Agent Data Plane forwards metrics, Core Agent forwarder telemetry alone undercounts
point.sentandpoint.dropped. ADP exposes equivalent point counts through RAR; Core Agent needs to include them in customer-facing Agent telemetry while preserving customer metric compatibility, and in COAT while retainingremote_agentattribution.Describe how you validated your changes
dda inv install-toolsdda inv test --targets=./pkg/collector/corechecks/telemetrydda inv test --targets=./comp/core/agenttelemetry/impldda inv test --targets=./comp/core/agenttelemetry/impl --test-run-name TestRunTestCoalescesDefaultAndNoDefaultMetricFamiliesBeforeAggregationfails without the COAT metric-family coalescing change and passes with it.Additional Notes
The customer path intentionally keeps RAR/regular-registry metrics allowlisted. It does not expose arbitrary remote-agent telemetry to customer orgs.