
Replace dogstatsd metrics with opentelemetry metrics#3874

Open
moskyb wants to merge 22 commits into v4 from opentel-metrics


@moskyb moskyb commented Apr 30, 2026

Description

Since Datadog metrics were added in (checks notes) 2018 (thanks @lox!), OpenTelemetry has emerged as a vendor-portable standard for metrics. Our customers work in all sorts of environments, and not all of them are Datadog subscribers.

With this in mind, let's pull out the Datadog-specific metrics and replace them with OpenTelemetry ones.

Context

I've wanted to do this since the day I joined Buildkite.

Changes

  • Replaces Datadog metrics with OpenTelemetry ones
  • Replaces the jobs.success and jobs.failed counters with a single jobs.finished counter; success or failure can be inferred from the exit_status tag applied to the metric
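
To illustrate the second point, here is a minimal sketch of how the old success/failure split could be recovered from jobs.finished data. It uses only the standard library and does not reflect the code in this PR: the finishedPoint type and splitByOutcome function are hypothetical, and it assumes an exit_status of 0 means success and anything else means failure.

```go
package main

import "fmt"

// finishedPoint is a hypothetical stand-in for one increment of the
// jobs.finished counter, carrying the value of its exit_status tag.
type finishedPoint struct {
	ExitStatus int
}

// splitByOutcome derives the totals that jobs.success and jobs.failed used to
// report. Assumption: exit status 0 is success, anything else is failure.
func splitByOutcome(points []finishedPoint) (success, failed int) {
	for _, p := range points {
		if p.ExitStatus == 0 {
			success++
		} else {
			failed++
		}
	}
	return success, failed
}

func main() {
	points := []finishedPoint{{0}, {1}, {0}, {255}}
	success, failed := splitByOutcome(points)
	fmt.Printf("jobs.finished=%d success=%d failed=%d\n", len(points), success, failed)
}
```

In practice this grouping would normally happen in the metrics backend's query layer rather than in Go; the sketch only shows that no information is lost by merging the two counters.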

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

I've also tested this locally and confirmed that metrics are coming through!

Disclosures / Credits

It's another Amp jobbie.

DrJosh9000 and others added 19 commits April 29, 2026 12:41
…verse-for-tear-down-hooks

fix: Reverse ordering for post- hooks
…`cancel-signal-timeout` and `cancel-cleanup-timeout`

Previously, `cancel-grace-period` (default 10s) was the *total* time budget covering both process shutdown and agent-side cleanup (log uploads, artifact uploads, disconnects). `signal-grace-period-seconds` (default -1) controlled how much of that budget went to the process, using negative-relative arithmetic: -1 meant "`cancel-grace-period` minus 1", so the process got 9s and the agent got 1s. This made configuration confusing: the flag that *sounded* like the process's grace period (`cancel-grace-period`) was actually the total, and the actual process grace period required adding a negative number to it. Validation was also complex because the two values had to be checked against each other, and invalid combinations (e.g. `signal-grace-period-seconds` >= `cancel-grace-period`) returned errors.

The new model uses two independent, positive durations:

  `--cancel-signal-timeout` (default 9s): how long the subprocess gets to
  handle the cancel signal before receiving SIGKILL. This is the value
  users actually think about when configuring cancellation.

  `--cancel-cleanup-timeout` (default 1s): how long the agent gets after
  the process exits or is killed to upload logs and artifacts.

The total grace period is simply their sum. There is no validation logic because the values cannot conflict. Both flags accept Go duration syntax (e.g. "30s", "1m30s") via `cli.DurationFlag` instead of integer seconds, matching the convention used by other timeout flags in the agent (`wait-for-ec2-tags-timeout`, `kubernetes-container-start-timeout`, etc.).

The defaults produce the same effective behaviour as before: 9s for the process, 1s for agent cleanup, 10s total.
Replace `cancel-grace-period` and `signal-grace-period-seconds` with `cancel-signal-timeout` and `cancel-cleanup-timeout`
Rip out opentracing tracing backend
@moskyb moskyb requested review from a team as code owners April 30, 2026 07:55

@DrJosh9000 DrJosh9000 force-pushed the v4 branch 5 times, most recently from c2357ca to dccaacc Compare May 7, 2026 00:08
Flatten tracing config to be a single bool flag