feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup #1777

Open
jszwedko wants to merge 13 commits into main from jszwedko/forwarder-outdated-file

Conversation

@jszwedko
Collaborator

@jszwedko jszwedko commented May 29, 2026

Summary

Adds support for forwarder_outdated_file_in_days (default 10), matching the core Agent's startup behavior. When disk persistence is enabled, ADP now scans forwarder_storage_path at startup and removes retry-*.json files whose filesystem mtime exceeds the configured age, preventing unbounded disk growth from stale retry data after extended outages. Set to 0 to disable cleanup. The config key is moved from the unsupported registry to the supported forwarder registry.

Closes #1360

Test plan

  • cargo check --workspace && cargo check --workspace --tests passes
  • cargo test -p saluki-components --lib outdated passes (3 new tests)
  • cargo test -p saluki-components --lib config_registry passes
  • With forwarder_storage_max_size_in_bytes set and forwarder_outdated_file_in_days: 10, old retry files are removed at startup; recent files and non-retry files are untouched
  • With forwarder_outdated_file_in_days: 0, no files are removed

🤖 Generated with Claude Code

@dd-octo-sts dd-octo-sts Bot added the area/components (Sources, transforms, and destinations.) and area/docs (Reference documentation.) labels May 29, 2026
@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented May 29, 2026

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

Semantic PR Title Check | Check For Semantic PR Title   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Unknown scope 'forwarder' in pull request title. Valid scopes required: agent-data-plane, aggregate, airlock, etc.

DataDog/saluki | run-benchmarks-adp   View in Datadog   GitLab

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d24e244

// If the storage size is set, enable disk persistence for the retry queue.
if config.retry().storage_max_size_bytes() > 0 {
    // Remove stale retry files before opening the queue.
    remove_outdated_retry_files(config.retry().storage_path(), config.retry().outdated_file_in_days()).await;
Collaborator Author

This matches the Agent's behavior of only checking at start-up.

@jszwedko jszwedko changed the title feat(forwarder): support forwarder_outdated_file_in_days for stale retry file cleanup feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup May 29, 2026
@pr-commenter

pr-commenter Bot commented May 29, 2026

Binary Size Analysis (Agent Data Plane)

Baseline: b078b0d · Comparison: d24e244 · diff
Analysis Configuration: stripped binaries · Pass/Fail Threshold: +5%
Sizes: 37.93 MiB (baseline) vs 38.04 MiB (comparison)
Size Change: +111.26 KiB (+0.29%)

✅ Binary size difference within threshold
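
The reported percentage can be checked by hand from the sizes above. The following is a minimal Rust sketch of that arithmetic, not the size-analysis tooling itself: the function names are illustrative, and treating the threshold comparison as inclusive is an assumption.

```rust
// Relative size change in percent, given a baseline in MiB and a delta in KiB.
fn size_change_percent(baseline_mib: f64, delta_kib: f64) -> f64 {
    let baseline_kib = baseline_mib * 1024.0;
    delta_kib / baseline_kib * 100.0
}

// Assumption: growth equal to the threshold still passes.
fn within_threshold(change_percent: f64, threshold_percent: f64) -> bool {
    change_percent <= threshold_percent
}

fn main() {
    // Figures taken from the report: 37.93 MiB baseline, +111.26 KiB change.
    let pct = size_change_percent(37.93, 111.26);
    println!("size change: {:+.2}%", pct);
    println!("within +5% threshold: {}", within_threshold(pct, 5.0));
}
```

With the report's figures, 111.26 KiB over roughly 38,840 KiB comes to about 0.29%, well under the +5% limit.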

Changes by Module

| Module | File Size | Symbols |
| --- | --- | --- |
| anyhow | +27.99 KiB | 276 |
| piecemeal | +13.00 KiB | 15 |
| tokio | +12.78 KiB | 226 |
| figment | +10.09 KiB | 93 |
| saluki_components::common::datadog | +8.68 KiB | 32 |
| core | +6.79 KiB | 1317 |
| saluki_common::cache::CacheBuilder<K,V,W,H> | +6.52 KiB | 3 |
| saluki_components::encoders::datadog | -6.18 KiB | 19 |
| saluki_components::sources::otlp | -5.35 KiB | 19 |
| serde_json | +5.33 KiB | 13 |
| bytes | +5.32 KiB | 23 |
| &mut serde_json | -4.90 KiB | 13 |
| saluki_io::net::util | +4.71 KiB | 24 |
| serde | +4.57 KiB | 6 |
| hyper_util | -4.21 KiB | 13 |
| otlp_protos::otlp_include::opentelemetry | -4.12 KiB | 78 |
| h2 | +4.07 KiB | 79 |
| [sections] | -4.05 KiB | 8 |
| anon.d933a6d672e016e78db8d7364bd54640.56.llvm.2344071001195518020 | +3.99 KiB | 1 |
| anon.74ec343e7df4c5afe576d99fcc874370.67.llvm.13153431456681701250 | -3.99 KiB | 1 |

Detailed Symbol Changes
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +2.0%  +130Ki  +1.6% +80.6Ki    [6759 Others]
 +12e2% +6.92Ki  [ = ]       0    core::ptr::drop_in_place<http_body_util::combinators::map_err::MapErr<tonic::body::Body,axum_core::error::Error::new<tonic::status::Status>>>::h6b978571211c4eed
  [NEW] +6.88Ki  [NEW] +6.78Ki    h2::proto::connection::Connection<T,P,B>::poll::h44139f071de446d5
  [NEW] +6.63Ki  [NEW] +6.53Ki    h2::proto::connection::Connection<T,P,B>::poll::h5a36798488cd5681
  [NEW] +6.63Ki  [NEW] +6.53Ki    h2::proto::connection::Connection<T,P,B>::poll::h7f780e2c1564a908
  [NEW] +6.45Ki  [NEW]    +340    core::ptr::drop_in_place<http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_frame::MapFrame<tonic::body::Body,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>>::h57e6d59f419e6880
  +878% +5.68Ki +11e2% +5.68Ki    saluki_components::sources::dogstatsd::replay::capture::DogStatsDCaptureControl::start_capture::h80b33ec2e2c1b5d8
   +13% +5.62Ki   +13% +5.62Ki    saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::h4c31dc59c83ba86c
  [NEW] +4.68Ki  [NEW] +4.52Ki    _<core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::fold::h1210d20d3cee0b37
  [NEW] +4.47Ki  [NEW] +4.30Ki    _<saluki_io::deser::framing::NestedFramer<Inner,Outer> as saluki_io::deser::framing::Framer>::next_frame::hab58ad7434044a92
 -82.1% -4.41Ki -85.7% -4.41Ki    saluki_components::common::datadog::transaction::_::_<impl serde_core::ser::Serialize for saluki_components::common::datadog::transaction::Transaction<B>>::serialize::hdd0848d8474f34d4
  [DEL] -4.68Ki  [DEL] -4.52Ki    core::ops::function::impls::_<impl core::ops::function::FnMut<A> for &mut F>::call_mut::h6f18fe8453bdd4e1
 -86.3% -5.20Ki -88.0% -5.20Ki    saluki_components::sources::otlp::metrics::cache::PointsCache::from_config::hc0fcdd066e6ef178
  [DEL] -5.35Ki  [DEL] -5.22Ki    saluki_components::sources::dogstatsd::replay::writer::TrafficCaptureWriter::start_capture::h1c1a09bbc468b634
 -34.1% -5.60Ki -34.6% -5.60Ki    _<saluki_components::encoders::datadog::traces::TraceEndpointEncoder as saluki_components::common::datadog::request_builder::EndpointEncoder>::encode::h854bf035420c3910
  -3.7% -6.26Ki  -3.7% -6.26Ki    [section .rodata]
 -83.5% -6.71Ki -25.0%    -340    core::ptr::drop_in_place<tonic::body::Body>::h62f35fa9bcf1e4d2
  [DEL] -7.25Ki  [DEL]    -341    core::ptr::drop_in_place<http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_frame::MapFrame<tonic::body::Body,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>,tonic::status::Status::map_error<tonic::status::Status>>>::h45015946cfdf719a
  [DEL] -9.01Ki  [DEL] -8.92Ki    h2::server::Connection<T,B>::poll_closed::hae1db10ac11f7523
  [DEL] -9.11Ki  [DEL] -9.01Ki    h2::server::Connection<T,B>::poll_closed::h7fc5e50f37a1a1fe
  [DEL] -9.51Ki  [DEL] -9.42Ki    h2::server::Connection<T,B>::poll_closed::h3f3ed7a68096333d
  +0.3%  +111Ki  +0.2% +61.7Ki    TOTAL

@jszwedko jszwedko changed the title feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup feat(forwarder): support forwarder_outdated_file_in_days for stale retry file cleanup May 29, 2026
@jszwedko jszwedko changed the title feat(forwarder): support forwarder_outdated_file_in_days for stale retry file cleanup feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup May 29, 2026
@pr-commenter

pr-commenter Bot commented May 29, 2026

Regression Detector (Agent Data Plane)

Run ID: f0b37c5e-358c-4107-9c76-8ef694e90893
Baseline: b078b0db · Comparison: d24e244c · diff

Optimization Goals: ❌ 2 regressions detected

| experiment | goal | Δ mean % | links |
| --- | --- | --- | --- |
| otlp_ingest_metrics_5mb_memory | memory | 🔴 +8.18 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_cpu (erratic) | cpu | 🔴 +6.50 | metrics profiles logs |

Fine details of change detection per experiment (33)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

| experiment | goal | Δ mean % | links |
| --- | --- | --- | --- |
| dsd_uds_10mb_3k_contexts_cpu (erratic) | cpu | ⚪ +3.90 | metrics profiles logs |
| otlp_ingest_logs_5mb_memory (ignored) | memory | ⚪ +2.62 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) | cpu | ⚪ +1.91 | metrics profiles logs |
| otlp_ingest_traces_5mb_throughput | throughput | ⚪ -1.78 | metrics profiles logs |
| otlp_ingest_logs_5mb_cpu (ignored) | cpu | ⚪ +1.46 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_throughput | throughput | ⚪ -1.19 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_throughput | throughput | ⚪ -1.02 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_memory | memory | ⚪ +0.42 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_memory | memory | ⚪ +0.37 | metrics profiles logs |
| quality_gates_rss_dsd_heavy | memory | ⚪ +0.30 | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory | ⚪ +0.27 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_memory | memory | ⚪ +0.17 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_memory | memory | ⚪ +0.16 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_memory | memory | ⚪ +0.15 | metrics profiles logs |
| quality_gates_rss_dsd_low | memory | ⚪ +0.11 | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory | ⚪ +0.11 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_throughput | throughput | ⚪ -0.01 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_throughput | throughput | ⚪ +0.00 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_throughput | throughput | ⚪ +0.00 | metrics profiles logs |
| otlp_ingest_logs_5mb_throughput (ignored) | throughput | ⚪ +0.01 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_throughput | throughput | ⚪ +0.01 | metrics profiles logs |
| otlp_ingest_traces_5mb_memory | memory | ⚪ -0.02 | metrics profiles logs |
| otlp_ingest_metrics_5mb_throughput | throughput | ⚪ +0.03 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_memory | memory | ⚪ -0.03 | metrics profiles logs |
| quality_gates_rss_idle | memory | ⚪ -0.14 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_memory | memory | ⚪ -0.28 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_cpu (erratic) | cpu | ⚪ -0.44 | metrics profiles logs |
| otlp_ingest_traces_5mb_cpu (erratic) | cpu | ⚪ -0.64 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) | cpu | ⚪ -1.36 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_cpu (erratic) | cpu | ⚪ -2.35 | metrics profiles logs |
| otlp_ingest_metrics_5mb_cpu (erratic) | cpu | 🟢 -5.76 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_cpu (erratic) | cpu | 🟢 -6.88 | metrics profiles logs |

Bounds Checks: ✅ Passed (5)
| experiment | check | replicates | observed | links |
| --- | --- | --- | --- | --- |
| quality_gates_rss_dsd_heavy | memory_usage | 10/10 | ✅ 126 MiB ≤ 140 MiB | metrics profiles logs |
| quality_gates_rss_dsd_low | memory_usage | 10/10 | ✅ 40 MiB ≤ 50 MiB | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory_usage | 10/10 | ✅ 61.9 MiB ≤ 75 MiB | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | ✅ 182 MiB ≤ 200 MiB | metrics profiles logs |
| quality_gates_rss_idle | memory_usage | 10/10 | ✅ 26.7 MiB ≤ 40 MiB | metrics profiles logs |

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
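
As a rough illustration of the rule described above, the sketch below classifies a single experiment. It is not the detector's code: the type and function names are hypothetical, `smp_is_improvement` is an assumed counterpart to `is_regression`, and the SMP flag values passed in `main` are assumed because the report does not list them per row.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Goal {
    Cpu,
    Memory,
    Throughput,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Regression,
    Improvement,
    Neutral,
    Ignored,
}

const THRESHOLD_PCT: f64 = 5.0;

fn classify(
    goal: Goal,
    delta_mean_pct: f64,
    configured_erratic: bool,
    smp_is_regression: bool,
    smp_is_improvement: bool, // assumed flag, mirroring `is_regression`
) -> Verdict {
    // Experiments configured `erratic: true` are skipped outright.
    if configured_erratic {
        return Verdict::Ignored;
    }
    // The magnitude must exceed the threshold before direction matters.
    if delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    // More CPU or memory is the regressing direction; for throughput, less is.
    let regressing_direction = match goal {
        Goal::Cpu | Goal::Memory => delta_mean_pct > 0.0,
        Goal::Throughput => delta_mean_pct < 0.0,
    };
    if regressing_direction && smp_is_regression {
        Verdict::Regression
    } else if !regressing_direction && smp_is_improvement {
        Verdict::Improvement
    } else {
        Verdict::Neutral
    }
}

fn main() {
    // Deltas come from the tables above; the SMP flags are assumed.
    println!("{:?}", classify(Goal::Memory, 8.18, false, true, false));
    println!("{:?}", classify(Goal::Cpu, -6.88, false, false, true));
    println!("{:?}", classify(Goal::Throughput, -1.78, false, false, false));
}
```

Note that a large delta alone is not enough in this sketch: without the matching SMP flag the result stays neutral, which mirrors the "AND" in the explanation.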

@dd-octo-sts dd-octo-sts Bot added the area/io (General I/O and networking.) label May 29, 2026
Comment thread lib/saluki-components/src/common/datadog/retry.rs Outdated
Comment thread lib/saluki-io/src/net/util/retry/queue/mod.rs Outdated
Comment thread lib/saluki-io/src/net/util/retry/queue/persisted.rs Outdated
Comment thread lib/saluki-io/src/net/util/retry/queue/persisted.rs Outdated
jszwedko and others added 13 commits May 29, 2026 17:55
…try file cleanup

Adds forwarder_outdated_file_in_days (default 10) to RetryConfiguration.
When disk persistence is enabled, ADP now removes retry-*.json files
older than the configured number of days each time it starts, preventing
unbounded disk growth after long outages. Set to 0 to disable.
Matches the core Agent's behavior in
comp/forwarder/defaultforwarder/default_forwarder.go.

Closes #1360

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…le scope

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ile_in_days

Schema type: number maps to Float in smoke tests, causing injection of 1.5
which fails to deserialize into u32. Explicit Integer override makes the
smoke test inject 42 instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
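
The failure mode in the commit above (a fractional value offered to a `u32` field) can be reproduced in miniature with the standard library. This is only an analogy: it uses `str::parse` as a stand-in for the configuration deserializer, which is not shown here, and `parse_days` is a hypothetical helper.

```rust
// Hypothetical helper: accepts only values that fit a `u32`, the way an
// integer-typed config field would.
fn parse_days(raw: &str) -> Option<u32> {
    raw.trim().parse::<u32>().ok()
}

fn main() {
    // A Float-typed schema entry leads the smoke test to inject 1.5.
    println!("1.5 -> {:?}", parse_days("1.5"));
    // With the explicit Integer override it injects 42 instead.
    println!("42  -> {:?}", parse_days("42"));
}
```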
- Scan the per-queue subdirectory (storage_path/{queue_id}) not the root;
  retry files live at storage_path/{queue_id}/retry-*.json
- Use the filename-embedded creation timestamp via decode_timestamped_filename
  (now pub-exported from saluki-io) instead of filesystem mtime, which can
  be reset by backup/restore tools
- break (not continue) on next_entry() error to avoid potential infinite loop
  on macOS with persistent readdir errors
- Downgrade ENOENT on remove_file to debug; it indicates a concurrent sibling
  endpoint task already deleted the same file
- Update tests to use valid filename-encoded timestamps; remove filetime dep

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
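
The cleanup behavior listed in the commit above can be sketched with the standard library alone. This is a simplified, synchronous illustration under stated assumptions, not the code in this PR: the `retry-<unix_seconds>.json` layout and the `decode_timestamp` helper are hypothetical stand-ins for the real filename format and `decode_timestamped_filename`, whose details are not shown here.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

// Hypothetical layout: `retry-<unix_seconds>.json`.
fn decode_timestamp(file_name: &str) -> Option<u64> {
    file_name
        .strip_prefix("retry-")?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

// Removes retry files in `queue_dir` whose embedded timestamp is older than
// `cutoff_secs`, returning how many files were deleted.
fn remove_outdated(queue_dir: &Path, cutoff_secs: u64) -> usize {
    let entries = match fs::read_dir(queue_dir) {
        Ok(entries) => entries,
        Err(_) => return 0, // Missing or unreadable directory: nothing to do.
    };
    let mut removed = 0;
    for entry in entries {
        // Stop at the first listing error rather than skipping past it.
        let entry = match entry {
            Ok(entry) => entry,
            Err(_) => break,
        };
        let name = entry.file_name();
        // Files that don't match the retry naming scheme are left alone.
        let created = match name.to_str().and_then(decode_timestamp) {
            Some(secs) => secs,
            None => continue,
        };
        if created >= cutoff_secs {
            continue;
        }
        match fs::remove_file(entry.path()) {
            Ok(()) => removed += 1,
            // Already gone, e.g. deleted by a concurrent task: not an error.
            Err(err) if err.kind() == ErrorKind::NotFound => {}
            Err(err) => eprintln!("failed to remove {}: {}", entry.path().display(), err),
        }
    }
    removed
}

fn main() {
    let dir = std::env::temp_dir().join(format!("retry-cleanup-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir).unwrap();
    fs::write(dir.join("retry-100.json"), b"{}").unwrap();
    fs::write(dir.join("retry-2000.json"), b"{}").unwrap();
    println!("removed {} file(s)", remove_outdated(&dir, 1000));
    fs::remove_dir_all(&dir).unwrap();
}
```

Reading the age from the filename rather than from filesystem metadata keeps the decision stable even if a backup or restore tool rewrites modification times.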
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ueue::from_root_path

Cleanup now lives inside saluki-io alongside the filename format it depends
on. No cross-crate exports needed. RetryQueue::with_disk_persistence and
PersistedQueue::from_root_path each gain a max_age_days: u32 parameter;
io.rs passes outdated_file_in_days() at the single call site.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y files

The core Agent's FileRemovalPolicy with outdatedFileDayCount=0 sets the
cutoff to now, deleting all retry files. Remove the early-return guard
(which incorrectly documented 0 as "disable") to match that behavior.
Update the test and field doc accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
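
To see why a value of 0 removes every retry file, it helps to write the cutoff out explicitly. The sketch below is illustrative only: the function names are hypothetical, and working in whole Unix seconds with a strict "older than" comparison is an assumption rather than something the commit states.

```rust
const SECS_PER_DAY: u64 = 24 * 60 * 60;

// Cutoff in Unix seconds: files created before it count as outdated.
fn cutoff_secs(now_secs: u64, max_age_days: u32) -> u64 {
    now_secs.saturating_sub(u64::from(max_age_days) * SECS_PER_DAY)
}

// Assumption: "outdated" means strictly older than the cutoff.
fn is_outdated(created_secs: u64, now_secs: u64, max_age_days: u32) -> bool {
    created_secs < cutoff_secs(now_secs, max_age_days)
}

fn main() {
    let now = 1_000_000;
    // With 0 days the cutoff equals `now`, so any earlier file is outdated.
    println!("cutoff at 0 days:  {}", cutoff_secs(now, 0));
    println!("cutoff at 10 days: {}", cutoff_secs(now, 10));
    println!("1s-old file outdated at 0 days: {}", is_outdated(now - 1, now, 0));
}
```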
- Simplify doc last line to just 'Defaults to 10.'
- Use 10 (not 0) in with_disk_persistence test call for clarity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…suite

Remove the nested mod, match the style of storage_ratio_exceeded and other
existing tests (tempfile::tempdir, flat helpers, files_in_dir).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Test via the public API (from_root_path) rather than calling the private
remove_outdated_retry_files directly. Uses make_persisted_queue helper,
FakeData, and DiskUsageRetrieverImpl — consistent with storage_ratio_exceeded
and other tests in the module.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jszwedko jszwedko force-pushed the jszwedko/forwarder-outdated-file branch from 8dc55a2 to d24e244 Compare May 30, 2026 00:55
@jszwedko jszwedko changed the title feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup feat(forwarder): support forwarder_outdated_file_in_days for stale retry file cleanup May 30, 2026
@jszwedko jszwedko marked this pull request as ready for review May 30, 2026 00:56
@jszwedko jszwedko requested a review from a team as a code owner May 30, 2026 00:56
@jszwedko jszwedko changed the title feat(forwarder): support forwarder_outdated_file_in_days for stale retry file cleanup feat(datadog): support forwarder_outdated_file_in_days for stale retry file cleanup May 30, 2026
Labels

area/components Sources, transforms, and destinations. area/docs Reference documentation. area/io General I/O and networking.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support forwarder_outdated_file_in_days to clean up stale retry queue files.

1 participant