enhancement(metrics): support for metrics v3 protocol#1175
Conversation
Binary Size Analysis (Agent Data Plane)Baseline: 1ad7cde · Comparison: f27ed86 · diff ✅ Binary size difference within thresholdChanges by Module
Detailed Symbol Changes |
Regression Detector (Agent Data Plane)Run ID: Optimization Goals: ✅ No significant changes detectedFine details of change detection per experiment (35)Experiments configured
Bounds Checks: ✅ Passed (5)
ExplanationA change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression ( |
|
This is temporarily blocked on there being a version of the Datadog Agent for us to test against in correctness tests that has up-to-date v3 metrics support. Currently, we're hitting an issue related to rate intervals being delta encoded when they shouldn't be. That bug is fixed in DataDog/datadog-agent#45825 but won't be released until 7.77: roughly 2 weeks from now before an RC is available to use. We can potentially do a hacky image build or something for keep going in the meantime and then switch back to a proper Agent version once available, we'll see. |
79cdda1 to
59636cd
Compare
|
We've temporarily handled the issue of correctness tests by using a "dev" container image ( We can't merge this as-is: we need to wait for at least an RC build of Datadog Agent 7.77 so we can pin to a non-development image. In the meantime, I'm going to work on making sure we've integrated all of the same small fixes/changes that have been steadily being made upstream in the Datadog Agent repository for V3 support. |
30ee642 to
898021d
Compare
be9a81c to
a9f5109
Compare
a9f5109 to
31b5f82
Compare
| }); | ||
| let v3_flushed = if let Some(v3_metrics) = maybe_v3_metrics { | ||
| if v2_flushed || v3_metrics.len() >= v3_endpoint_config.max_metrics_per_payload() { | ||
| encode_and_flush_v3_metrics(endpoint, &v3_endpoint_config, v3_metrics, &telemetry, &mut payloads_tx, batch_id.as_ref(), v3_payload_info).await?; |
There was a problem hiding this comment.
This doesn't seem to observe any intake payload size limits, or am I missing anything?
There was a problem hiding this comment.
That's correct.
Right now, we're either flushing with the V2 encoder determines it needs to flush (so that we generate an equivalent payload in terms of the contained metrics between the two) or if we exceed the configured maximum metrics per payload limit.
In V2/V3 mode, I suppose it's entirely possible to have the V3 payload exceed the payload limits, although it would be incredibly unlikely. In V3 only mode, it's obviously a much more likely risk.
My thought process was that we would improve this -- make V3 encoding aware of the payload limits -- at the same time we added incremental compression to match the behavior of the Agent... since back when this was originally written many weeks ago, it seemed like we'd have enough time between then and "V3 only in production / for customers" to do the follow-up work.
I guess the question I have is: do you feel like we still have that sort of time before we want to be running V3 only?
| Ok(encoded) => { | ||
| match create_v3_request("/api/intake/metrics/v3/series", encoded, ep_config.compression_scheme()).await { | ||
| Ok(request) => { | ||
| flush_payload(request, events, payloads_tx, batch_id, 0, 1, payload_info).await?; |
There was a problem hiding this comment.
Is it intentional that batch_seq and batch_len are hard-coded as 0 and 1?
There was a problem hiding this comment.
It is intentional, but only in the context of us not currently splitting V3 payloads: there can literally only ever be a single V3 payload in each batch.
(Mostly related to the question you left about not obeying intake payload size limits.)
There was a problem hiding this comment.
That doesn't sound right: there can be multiple v2 payloads constructed from even a single event buffer, so even if they mach one to one, there should be multiple v3 ones. Or are we using word batch differently here?
There was a problem hiding this comment.
There's probably some missing context/mismatched terminology here, yeah.
Every time we receive an "event buffer", we process all of the metrics in the event buffer, incrementally encoding them via the request builder. As we're doing that, we might reach the point where a payload is "full" (aka adding another metric would cause the final payload to exceed the (un)compressed size limits) and we have to flush it before encoding the next metric in the event buffer. That scenario is what I'm referring to here.
Unlike the Core Agent, we don't immediately emit partial payloads for the remainder of an event buffer: if we finish processing an event buffer, and there's still remaining metrics in the request builder, we wait for a period of time for additional event buffers to come in and eventually time out and flush that partial payload.
ca5ebc6 to
16afa2e
Compare
This comment has been minimized.
This comment has been minimized.
…v3-payload-support
## Summary <!-- Please provide a brief summary about what this PR does. This should help the reviewers give feedback faster and with higher quality. --> This pr adds metric units encoding into V3 series payloads to match the Datadog Agent V3 serializer. This change keeps sketch behavior unchanged: V3 sketches do not encode units, matching the Agent sketch path. ## Change Type - [x] Bug fix - [ ] New feature - [ ] Non-functional (chore, refactoring, docs) - [ ] Performance ## How did you test this PR? <!-- Please how you tested these changes here --> unit tests / ci ## References <!-- Please list any issues closed by this PR. --> <!-- - Closes: <issue link> --> <!-- Any other issues or PRs relevant to this PR? Feel free to list them here. -->
## Summary <!-- Please provide a brief summary about what this PR does. This should help the reviewers give feedback faster and with higher quality. --> Adds V3 metrics payload splitting so ADP no longer assumes each V3 flush produces a single request. The encoder now enforces V3 series/sketch payload limits, drops unsendable zero-point or oversized metrics, and emits correct `X-Metrics-Request-Seq` / `X-Metrics-Request-Len` headers when a flush produces multiple requests. ## Change Type - [ ] Bug fix - [x] New feature - [ ] Non-functional (chore, refactoring, docs) - [ ] Performance ## How did you test this PR? <!-- Please how you tested these changes here --> unit tests / ci ## References <!-- Please list any issues closed by this PR. --> <!-- - Closes: <issue link> --> <!-- Any other issues or PRs relevant to this PR? Feel free to list them here. -->
## Summary <!-- Please provide a brief summary about what this PR does. This should help the reviewers give feedback faster and with higher quality. --> Match the Datadog Agent’s V3 resource mapping behavior for series and sketches. For V3 series, this promotes resource-style tags into V3 resources: - `device:<value>` becomes a `device` resource and is removed from tags. - Valid `dd.internal.resource:<type>:<name>` tags become resources and are removed from tags. - Malformed `dd.internal.resource:*` tags are dropped. - Resource order matches the Agent: `host`, `device`, then promoted resources. For V3 sketches, this keeps `device:*` and `dd.internal.resource:*` as normal tags and only emits `host` as a resource, matching the Agent sketch path. ## Change Type - [x] Bug fix - [ ] New feature - [ ] Non-functional (chore, refactoring, docs) - [ ] Performance ## How did you test this PR? <!-- Please how you tested these changes here --> ## References <!-- Please list any issues closed by this PR. --> <!-- - Closes: <issue link> --> <!-- Any other issues or PRs relevant to this PR? Feel free to list them here. -->

Summary
Change Type
How did you test this PR?
References