Replace dogstatsd metrics with opentelemetry metrics#3874
Open
moskyb wants to merge 22 commits into
Open
Conversation
Remove deprecated CLI flags
Newline after meta data get
…vars Remove deprecated plugin env vars
…verse-for-tear-down-hooks fix: Reverse ordering for post- hooks
…`cancel-signal-timeout` and `cancel-cleanup-timeout` Previously, `cancel-grace-period` (default 10s) was the *total* time budget covering both process shutdown and agent-side cleanup (log uploads, artifact uploads, disconnects). `signal-grace-period-seconds` (default -1) controlled how much of that budget went to the process, using negative-relative arithmetic: -1 meant "`cancel-grace-period` minus 1", so the process got 9s and the agent got 1s. This made configuration confusing — the flag that *sounded* like the process's grace period (`cancel-grace-period`) was actually the total, and the actual process grace period required subtracting a negative number from it. Validation was also complex because the two values had to be checked against each other, and invalid combinations (e.g. `signal-grace-period-seconds` >= `cancel-grace-period`) returned errors. The new model uses two independent, positive durations: `--cancel-signal-timeout` (default 9s): how long the subprocess gets to handle the cancel signal before receiving SIGKILL. This is the value users actually think about when configuring cancellation. `--cancel-cleanup-timeout` (default 1s): how long the agent gets after the process exits or is killed to upload logs and artifacts. The total grace period is simply their sum. There is no validation logic because the values cannot conflict. Both flags accept Go duration syntax (e.g. "30s", "1m30s") via `cli.DurationFlag` instead of integer seconds, matching the convention used by other timeout flags in the agent (`wait-for-ec2-tags-timeout`, `kubernetes-container-start-timeout`,ó etc.). The defaults produce the same effective behaviour as before: 9s for the process, 1s for agent cleanup, 10s total.
Replace `cancel-grace-period` and `signal-grace-period-seconds` with `cancel-signal-timeout` and `cancel-cleanup-timeout`
Rip out opentracing tracing backend
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
c2357ca to
dccaacc
Compare
Flatten tracing config to be a single bool flag
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Since datadog metrics were added in (checks notes) 2018 (thanks @lox!), OpenTelemetry has emerged as a standard for metrics that's portable between vendors. Our customers work in all sorts of environments, and not all of them are datadog subscribers.
This in mind, let's pull out the datadog-specific metrics, and replace them with OpenTelemetry ones.
Context
I've wanted to do this since the day i joined Buildkite.
Changes
jobs.successandjobs.failedcounters withjobs.finished— failure or success can be inferred with theexit_statustag that's applied to the metricTesting
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)I've also tested this locally and confirmed that metrics are coming through!
Disclosures / Credits
It's another Amp jobbie.