fix(cases): async case duration sync #2781

Open
daryllimyt wants to merge 12 commits into main from daryl/eng-1462-async-case-duration-sync
@daryllimyt daryllimyt commented May 29, 2026

Contributor

Summary

Fixes ENG-1462: https://linear.app/tracecat/issue/ENG-1462/investigate-slow-case-loads-from-synchronous-duration-sync

This removes case-duration recomputation from case read paths and moves normal mutation-driven duration materialization into an after-commit, coalesced Redis stream consumer.

Changes:

  • Make GET /cases/{case_id}/durations read-only; it now lists materialized rows without syncing first.
  • Add duration_sync="async" | "inline" | "none" to case event creation.
  • Keep create-case duration materialization inline so new cases have rows immediately.
  • Treat CASE_VIEWED as audit-only for durations and reject case_viewed as a duration anchor.
  • Enqueue mutation-driven case duration sync after commit.
  • Add a dedicated Redis stream consumer that coalesces by (workspace_id, case_id), skips irrelevant event types, and uses per-case PG advisory locks.
  • Fall back to inline sync when TRACECAT__CASE_DURATION_SYNC_ENABLED=false, so the flag is a safe async-worker kill switch.
  • Keep failed coalesced case jobs pending instead of letting transient sync errors stop the consumer task.
  • Generate updated frontend client types for the new service role and endpoint description.
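
The kill-switch fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `parse_bool` and `resolve_duration_sync` are hypothetical names, and the set of accepted truthy strings is an assumption. Only the three mode values and the "async falls back to inline when the worker is disabled" rule come from the description.

```python
from typing import Literal, Optional

DurationSync = Literal["async", "inline", "none"]

# Assumption: strings treated as "enabled". The real parser may accept others.
_TRUTHY = {"1", "true", "yes", "on"}


def parse_bool(raw: Optional[str], default: bool = True) -> bool:
    """Hypothetical env-flag parser: interpret the string, not its truthiness."""
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def resolve_duration_sync(requested: DurationSync, worker_enabled: bool) -> DurationSync:
    """Fall back to inline sync when async is requested but the worker is off."""
    if requested == "async" and not worker_enabled:
        return "inline"
    return requested


# Example: TRACECAT__CASE_DURATION_SYNC_ENABLED=false
enabled = parse_bool("false")
mode = resolve_duration_sync("async", enabled)
```

Parsing the string explicitly matters because a plain `bool("false")` in Python evaluates to `True`, which would leave the flag stuck on.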

Motivation

Opening a case could previously trigger writes from GET-time paths:

  • case detail view tracking created a CASE_VIEWED event,
  • event creation synchronously called sync_case_durations(), and
  • GET /durations also synced durations before listing.

When workflows mutated the same case concurrently, those synchronous recomputes amplified write contention around case_duration and held request transactions open. Other non-case pages remained snappy because they did not hit this write-on-read path.
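
The per-case PG advisory locks listed under Changes serialize concurrent recomputes for one case. One way such a lock key could be derived is sketched below. The hashing scheme, the function name `advisory_lock_key`, and the assumption that both IDs are UUIDs are illustrative and not taken from the PR; only PostgreSQL's `pg_try_advisory_xact_lock(bigint)` is a real function.

```python
import hashlib
import struct
import uuid


def advisory_lock_key(workspace_id: uuid.UUID, case_id: uuid.UUID) -> int:
    """Map a (workspace, case) pair to a signed 64-bit advisory lock key."""
    digest = hashlib.sha256(workspace_id.bytes + case_id.bytes).digest()
    # ">q" reads 8 bytes as a signed big-endian 64-bit integer,
    # which fits PostgreSQL's bigint argument type.
    (key,) = struct.unpack(">q", digest[:8])
    return key


# Hypothetical usage inside an open transaction:
#   SELECT pg_try_advisory_xact_lock(:key)
# The lock is released automatically when the transaction ends.
```

A transaction-scoped try-lock returns immediately instead of waiting, so a worker that loses the race can leave the job pending and retry later.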

Benchmarks

Hot-case profile for before/after comparison:

  • 1 case
  • 40 duration definitions
  • 300 history events
  • 4 mutators x 8 mutations
  • 12 case loads
  • 3 baseline case loads
  • 10 ms load interval

Hot-case old-path runs

| Mode | Case-load baseline | Case-load burst | Mutation latency | Notes |
| --- | --- | --- | --- | --- |
| update | p50 113.8 ms, max 159.0 ms | p50 184.0 ms, p95 222.2 ms, max 222.2 ms | p50 291.5 ms, p95 928.3 ms, max 942.2 ms | max lock waits 70, ungranted 3, case_duration ungranted 0 |
| event | p50 138.0 ms, max 170.5 ms | p50 122.6 ms, p95 320.3 ms, max 320.3 ms | p50 170.5 ms, p95 209.5 ms, max 237.9 ms | old-path variant |
| sync | p50 131.0 ms, max 189.3 ms | p50 124.7 ms, p95 292.5 ms, max 292.5 ms | p50 155.2 ms, p95 179.6 ms, max 206.0 ms | old-path variant |

Hot-case new async-worker runs

| Run | Case-load baseline | Case-load burst | Mutation latency | Errors |
| --- | --- | --- | --- | --- |
| worker-inclusive, isolated stream/db | p50 55.9 ms, p95 71.2 ms, max 71.2 ms | p50 121.2 ms, p95 181.0 ms, max 181.0 ms | p50 118.6 ms, p95 155.3 ms, max 189.0 ms | 0 |
| simplified implementation | p50 68.6 ms, p95 91.7 ms, max 91.7 ms | p50 124.6 ms, p95 235.2 ms, max 235.2 ms | p50 117.7 ms, p95 217.9 ms, max 221.9 ms | 0 |
| current PR commit | p50 71.4 ms, p95 118.3 ms, max 118.3 ms | p50 127.9 ms, p95 154.0 ms, max 154.0 ms | p50 118.1 ms, p95 149.9 ms, max 150.6 ms | 0 |

Main signal: the old hot update path had a mutation p95 of 928.3 ms; the current PR commit measures 149.9 ms on the same reduced hot-case profile.
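
For a sense of scale, the two p95 figures can be compared directly. The snippet below only does arithmetic on the numbers quoted in the tables; the variable names are illustrative.

```python
# Mutation p95 from the old "update" run and the current PR commit run.
old_p95_ms = 928.3
new_p95_ms = 149.9

speedup = old_p95_ms / new_p95_ms
reduction_pct = (1 - new_p95_ms / old_p95_ms) * 100

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% lower p95")
# prints "6.2x faster, 83.9% lower p95"
```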

Existing burst/health benchmark on new implementation

Profile:

  • 20 cases
  • 80 definitions
  • 600 history events per case
  • 1 update per case
  • 50 ms health interval
  • 1000 ms health timeout

| Metric | p50 | p95 | max | Samples |
| --- | --- | --- | --- | --- |
| update latency | 315.2 ms | 322.4 ms | 322.5 ms | 20 |
| health baseline | 2.0 ms | 2.3 ms | 2.3 ms | 4 |
| health burst | 5.3 ms | 142.3 ms | 142.3 ms | 14 |
| health cooldown | 1.6 ms | 2.3 ms | 2.3 ms | 6 |
| loop lag baseline | 2.9 ms | 3.4 ms | 3.4 ms | 4 |
| loop lag burst | 47.5 ms | 193.7 ms | 193.7 ms | 14 |
| loop lag cooldown | 2.7 ms | 3.4 ms | 3.4 ms | 6 |

Other values:

  • burst elapsed: 0.674 s
  • health errors: baseline 0, burst 0, cooldown 0
  • update errors: 0
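
Several rows report a p95 equal to the max (for example, health burst with 14 samples). The PR does not show how the benchmark computes percentiles; the sketch below uses the nearest-rank definition purely as an assumption, because it reproduces that pattern: with fewer than 20 samples the 95th-percentile rank is the last element. `nearest_rank` is a hypothetical helper name.

```python
from typing import Sequence


def nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


fourteen = [float(i) for i in range(1, 15)]
twenty = [float(i) for i in range(1, 21)]
p95_of_14 = nearest_rank(fourteen, 95)  # rank 14 of 14: same as the max
p95_of_20 = nearest_rank(twenty, 95)    # rank 19 of 20: one below the max
```

Under this definition, tail percentiles from such small sample counts are effectively single observations, so the p95 columns are best read as rough indicators.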

Verification

  • uv run pytest tests/unit/test_case_events_service.py tests/unit/test_cases_service.py tests/unit/test_case_duration_service.py tests/unit/test_case_duration_router.py tests/unit/test_case_duration_sync_consumer.py
    • 107 passed
  • uv run ruff check ...
    • passed
  • uv run ruff format --check ...
    • passed
  • uv run basedpyright ...
    • 0 errors, 0 warnings, 0 notes
  • Pre-commit hooks during commit:
    • ruff check/format passed
    • generated frontend client passed after installing frontend dependencies
    • python type check passed
    • frontend biome check passed
    • frontend type check passed
  • Current PR hot-case benchmark:
    • TRACECAT_RUN_CASE_DURATION_BENCHMARKS=1 ... uv run pytest tests/integration/test_case_duration_benchmarks.py -k hot_case -s
    • passed

Summary by cubic

Moves case-duration recomputation off reads and event writes to an async Redis-stream consumer to fix ENG-1462 slow case loads. The durations endpoint is read-only, and the background worker starts with the API and reads backlog for reliable sync.

  • New Features

    • Async duration materialization via Redis stream: coalesces per case, runs as an API background task with a kill switch, and uses service role tracecat-case-duration-sync.
    • Event creation supports duration_sync="async" | "inline" | "none"; create-case uses inline. Definition create/update enqueue cursor-paged backfills; when the worker is disabled, they backfill inline instead, streaming in small batches.
    • Treat CASE_VIEWED as audit-only and reject it as a duration anchor; UI hides it from anchor options and falls back to safe defaults when editing. GET /cases/{id}/durations lists materialized rows only.
  • Bug Fixes

    • Reliability: transaction-scoped per-case PG advisory locks, claim/retry idle pending jobs, read stream backlog on group creation, keep locked/failed jobs pending, and force unconditional sync when a backfill coalesces with case events.
    • Correctly match status-change aliases when deciding if a sync is needed (e.g., case_closed, case_reopened map to status_changed).
    • Parse TRACECAT__CASE_DURATION_SYNC_ENABLED with env_bool so the async worker flag behaves correctly across environments.
    • Back-compat and efficiency: CASE_VIEWED events still enqueue duration sync when legacy definitions reference them; metadata-only definition updates (name/description) no longer trigger backfills.
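
The coalescing rules summarized above (one job per case, irrelevant event types skipped, status aliases matched, and a forced sync when a backfill coalesces with case events) could look roughly like this. It is a simplified in-memory sketch: `SyncMessage`, `CoalescedJob`, `coalesce`, and the use of `event_type=None` to mark a backfill are all assumptions for illustration, and the alias table contains only the two aliases named in the summary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple

# Aliases named in the summary; the real table may contain more.
STATUS_ALIASES = {
    "case_closed": "status_changed",
    "case_reopened": "status_changed",
}


@dataclass(frozen=True)
class SyncMessage:
    workspace_id: str
    case_id: str
    event_type: Optional[str]  # Assumption: None marks a definition backfill.


@dataclass
class CoalescedJob:
    message_count: int = 0
    force: bool = False  # True means sync unconditionally.


def canonical(event_type: str) -> str:
    """Map status-change aliases onto their canonical event type."""
    return STATUS_ALIASES.get(event_type, event_type)


def coalesce(
    messages: Iterable[SyncMessage], relevant_types: Set[str]
) -> Dict[Tuple[str, str], CoalescedJob]:
    """Collapse a batch of stream messages into one job per (workspace, case)."""
    relevant = {canonical(t) for t in relevant_types}
    jobs: Dict[Tuple[str, str], CoalescedJob] = {}
    for msg in messages:
        is_backfill = msg.event_type is None
        if not is_backfill and canonical(msg.event_type) not in relevant:
            continue  # No duration definition references this event type.
        job = jobs.setdefault((msg.workspace_id, msg.case_id), CoalescedJob())
        job.message_count += 1
        job.force = job.force or is_backfill
    return jobs


batch = [
    SyncMessage("ws1", "c1", "case_closed"),
    SyncMessage("ws1", "c1", "comment_created"),
    SyncMessage("ws1", "c1", None),
    SyncMessage("ws1", "c2", "case_viewed"),
]
jobs = coalesce(batch, {"status_changed"})
```

In this batch only case `c1` yields a job, built from two messages and marked as forced because a backfill was among them.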

Written for commit 32576dc. Summary will update on new commits.

@daryllimyt added the tests (Changes to unit and integration tests), fix (Bug fix), performance (Changes that improve performance), and cases (Case management improvements and changes) labels on May 29, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ee692687f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tracecat/cases/durations/consumer.py Outdated

zeropath-ai Bot commented May 29, 2026

No security or compliance issues detected. Reviewed everything up to 32576dc.

Security Overview
Detected Code Changes
Change type: Enhancement
► frontend/src/client/schemas.gen.ts
    Add tracecat-case-duration-sync to service ID union
► frontend/src/client/services.gen.ts
    Update description for caseDurationsListCaseDuration
► frontend/src/client/types.gen.ts
    Add tracecat-case-duration-sync to service ID union
► frontend/src/components/cases/case-duration-dialog.tsx
    Use CASE_DURATION_EVENT_VALUES instead of CASE_EVENT_VALUES
    Use CASE_DURATION_EVENT_OPTIONS instead of CASE_EVENT_OPTIONS
► frontend/src/components/cases/case-duration-options.ts
    Introduce CaseDurationAnchorEventType and related types/values
    Filter out case_viewed from duration event options
► frontend/src/components/cases/update-case-duration-dialog.tsx
    Add CaseDurationAnchorEventType and isCaseDurationAnchorEventType
    Use fallback event types for start and end anchors

Change type: Refactor
► tests/integration/test_case_duration_benchmarks.py
    Add new benchmark for hot case load latency during async duration mutation burst
    Add helper functions for loading case pages and running bursts
► tests/unit/test_case_duration_definition_inline_backfill.py
    Add unit tests for inline definition backfill
► tests/unit/test_case_duration_router.py
    Add unit test for read-only nature of list_case_durations
► tests/unit/test_case_duration_service.py
    Add unit tests for create_definition and update_definition regarding backfill queuing and inline sync
    Test that compute_case_durations_from_events calls sync_case_durations
    Add test for CaseDurationEventAnchor rejecting case_viewed
► tests/unit/test_case_duration_sync_consumer.py
    Add unit tests for CaseDurationSyncConsumer covering various scenarios like coalescing, locking, failures, and backfills
► tests/unit/test_case_events_service.py
    Add unit tests for create_event queuing duration sync and performing inline sync
    Add stub for enqueue_case_duration_sync_after_commit and publish_case_event_payload

Change type: Other
► tests/integration/test_case_duration_benchmarks.py
    Add new environment variables for hot case benchmarks

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 18 files

Confidence score: 3/5

  • There is a concrete reliability risk in tracecat/cases/durations/consumer.py: failed/unacked jobs are only reclaimed on idle reads, so retries can be starved indefinitely when the stream stays busy.
  • Given the medium severity (6/10) with fairly high confidence (8/10) and direct user-facing impact on retry behavior, this carries some merge risk rather than being a minor housekeeping issue.
  • Pay close attention to tracecat/cases/durations/consumer.py - reclaim logic tied to idle reads may prevent timely recovery of failed/unacked jobs under sustained load.
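
The starvation concern can be illustrated with a small scheduling sketch in which the reclaim pass is driven by elapsed time rather than by idle reads. This is not the PR's consumer code: `ReclaimSchedule`, `plan_iteration`, and the step names are hypothetical, and the actual reclaim (for example via Redis `XAUTOCLAIM`) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReclaimSchedule:
    """Tracks when pending (unacked) jobs were last reclaimed."""

    interval_s: float
    last_reclaim_at: float = float("-inf")

    def due(self, now: float) -> bool:
        return now - self.last_reclaim_at >= self.interval_s

    def mark(self, now: float) -> None:
        self.last_reclaim_at = now


def plan_iteration(schedule: ReclaimSchedule, now: float, new_messages: int) -> List[str]:
    """Decide what one consumer loop iteration should do.

    Reclaiming is checked first and depends only on the clock, so a
    continuously busy stream cannot postpone retries indefinitely.
    """
    steps: List[str] = []
    if schedule.due(now):
        steps.append("reclaim_pending")
        schedule.mark(now)
    if new_messages:
        steps.append("process_new")
    return steps


schedule = ReclaimSchedule(interval_s=5.0)
first = plan_iteration(schedule, now=0.0, new_messages=3)
second = plan_iteration(schedule, now=1.0, new_messages=3)
third = plan_iteration(schedule, now=5.0, new_messages=3)
```

Even though every iteration here has new messages, the reclaim step still runs at `now=0.0` and again once five seconds have passed.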

Comment thread tracecat/cases/durations/consumer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 19974b485d

Comment thread tracecat/cases/durations/consumer.py Outdated
Comment thread tracecat/cases/durations/consumer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: b21e7f525d

Comment thread tracecat/cases/durations/consumer.py Outdated
Comment thread tracecat/cases/durations/service.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: e8d25a0a4b

Comment thread tracecat/cases/durations/schemas.py

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 5 files (changes from recent commits).

Comment thread tracecat/cases/durations/service.py Outdated

@daryllimyt force-pushed the daryl/eng-1462-async-case-duration-sync branch from e8d25a0 to db63b32 on June 19, 2026 14:39

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: db63b32fd8

Comment thread tracecat/config.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: bb491c2483

Comment thread tracecat/cases/service.py Outdated
Comment thread tracecat/cases/durations/service.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 32576dca0a

Comment on lines +37 to +39
fallbackEventType: CaseDurationAnchorEventType
): CaseDurationFormValues["start"] {
if (isCaseDropdownEventType(anchor.event_type)) {
const eventType = isCaseDurationAnchorEventType(anchor.event_type)

P2 Badge Avoid replacing legacy viewed anchors on save

When an existing duration definition still has a case_viewed anchor, this fallback initializes the edit form as case_created/case_closed. The submit handler always sends both anchors, so a user who only edits the name or description and saves will silently rewrite the legacy anchor and change the metric instead of preserving or explicitly migrating it. This affects workspaces with preexisting case_viewed duration definitions that are still returned by the backend compatibility path.
