
feat: add chaos testing module for fault injection #224

Open: ybdarrenwang wants to merge 7 commits into strands-agents:main from ybdarrenwang:feature/chaos-tool

@ybdarrenwang (Collaborator) commented May 12, 2026

Description

Introduces a chaos testing module for fault injection during agent evaluation. Enables systematic testing of agent resilience under tool failures and response corruption without modifying agent code.

Key capabilities:

  • Effect hierarchy: parameterized chaos effects split into pre-hook (cancel tool call with error) and post-hook (corrupt tool response) effects, namely ToolCallFailure, TruncateFields, RemoveFields, and CorruptValues
  • ChaosCase: extends Case with an effects field that carries the failure injection config. Provides ChaosCase.expand(cases, effect_maps) to generate the Cartesian product of base cases × named effect maps (a dict[str, dict[str, list[ChaosEffect]]] whose keys are short, human-readable condition names)
  • Plugin integration: ChaosPlugin hooks into Strands' native BeforeToolCallEvent / AfterToolCallEvent system and reads the active ChaosCase from a ContextVar (zero chaos concepts in user task code)
  • Experiment runner: ChaosExperiment composes the base Experiment to run ChaosCase objects, managing the ContextVar lifecycle per case for thread/async safety
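The expansion step can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `SketchCase`, `Effect`, the tool name `search_flights`, the `include_baseline` flag, and the `baseline` label are all hypothetical stand-ins. Only the `case|condition` naming and the "effect maps plus one baseline" count come from this PR.

```python
from dataclasses import dataclass, field
from itertools import product


# Hypothetical stand-ins for the PR's ChaosEffect and ChaosCase types.
@dataclass(frozen=True)
class Effect:
    kind: str


@dataclass
class SketchCase:
    name: str
    input: str
    # Maps tool name -> effects to inject for that tool.
    effects: dict = field(default_factory=dict)


def expand(cases, effect_maps, include_baseline=True):
    """Cross every base case with every named effect map."""
    conditions = dict(effect_maps)
    if include_baseline:
        # Assumption: the baseline label is illustrative, not taken from the PR.
        conditions = {"baseline": {}, **conditions}
    return [
        SketchCase(name=f"{case.name}|{label}", input=case.input, effects=effects)
        for case, (label, effects) in product(cases, conditions.items())
    ]


cases = [
    SketchCase("book_a_flight", "Book a flight to Tokyo"),
    SketchCase("cancel_booking", "Cancel my booking"),
]
effect_maps = {
    "search_timeout": {"search_flights": [Effect("tool_call_failure")]},
    "truncated_results": {"search_flights": [Effect("truncate_fields")]},
}
expanded = expand(cases, effect_maps)  # 2 cases x (2 effect maps + 1 baseline)
```

Because the condition label is folded into the case name, each row in a report identifies both the base case and the fault condition it ran under.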

Design principles:

  • Stateless plugin — concurrency-safe via ContextVar
  • User's task function contains zero chaos awareness; just pass plugins=[chaos] to the agent
  • ChaosCase extends Case (not modifies it) — stable extension point for future chaos-specific fields without breaking the base framework
  • Named effect maps replace the previous ChaosScenario class — simpler API surface with readable case names in reports (e.g., book_a_flight|search_timeout)
  • Composable with existing Strands Evals workflow (Case, Evaluator, EvaluationReport)

Related Issues

#114

Documentation PR

strands-agents/docs#836

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread src/strands_evals/chaos/effects.py
Comment thread src/strands_evals/chaos/plugin.py Outdated
Comment thread src/strands_evals/chaos/scenario.py Outdated
Comment thread src/strands_evals/chaos/plugin.py Outdated
Comment thread src/strands_evals/chaos/experiment.py Outdated
Comment thread src/strands_evals/chaos/plugin.py Outdated
Comment thread src/strands_evals/chaos/effects.py
Comment thread src/strands_evals/chaos/experiment.py Outdated
Comment thread src/strands_evals/chaos/plugin.py Outdated
@github-actions

Review Summary

Assessment: Request Changes

This PR introduces a well-structured chaos testing module with clean separation between effects, scenarios, plugin, and experiment orchestration. The ContextVar-based design for concurrency safety and the composition with the existing Experiment class are solid architectural decisions.

Review Categories
  • Unimplemented Feature: apply_rate is declared as a user-facing parameter but never checked anywhere in the execution path — this will confuse users who set it expecting probabilistic firing.
  • Documentation Accuracy: Multiple docstrings reference non-existent classes (Timeout, NetworkError) instead of the actual ToolCallFailure API. These should be corrected before merge to avoid misleading users.
  • Style Guide Compliance: All logger.info(f"...") calls should use lazy %s formatting per the project's STYLE_GUIDE.
  • Type Safety: The hasattr(effect, "max_length") duck-typing in the plugin should use an explicit isinstance check or a proper polymorphic dispatch.
  • API Review: This PR introduces new public abstractions (ChaosExperiment, ChaosPlugin, ChaosScenario, effect hierarchy) that customers will use directly. Consider adding the needs-api-review label per the API Bar Raising process.

The overall design is thoughtful and the test coverage is good. Fixing the apply_rate gap and the docstring inaccuracies would have the most impact.
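Three of the points above (probabilistic firing via apply_rate, lazy %s logging, and isinstance dispatch) can be illustrated together. This is a sketch under assumptions: `SketchEffect`, `SketchTruncate`, `should_fire`, and `apply_effect` are hypothetical, and the PR's actual gating logic may differ.

```python
import logging
import random
from dataclasses import dataclass

logger = logging.getLogger("chaos_sketch")


@dataclass
class SketchEffect:
    apply_rate: float = 1.0  # probability in [0, 1] that the effect fires


@dataclass
class SketchTruncate(SketchEffect):
    max_length: int = 10


def should_fire(effect: SketchEffect, rng: random.Random) -> bool:
    # rng.random() is in [0.0, 1.0), so 1.0 always fires and 0.0 never does.
    return rng.random() < effect.apply_rate


def apply_effect(effect: SketchEffect, value: str, rng: random.Random) -> str:
    if not should_fire(effect, rng):
        # Lazy %s formatting: the message is only built if INFO is enabled.
        logger.info("skipped %s (apply_rate=%s)", type(effect).__name__, effect.apply_rate)
        return value
    if isinstance(effect, SketchTruncate):  # explicit check instead of hasattr
        return value[: effect.max_length]
    return value


rng = random.Random(0)  # seeded so repeated runs draw the same sequence
always = apply_effect(SketchTruncate(apply_rate=1.0, max_length=4), "abcdefgh", rng)
never = apply_effect(SketchTruncate(apply_rate=0.0, max_length=4), "abcdefgh", rng)
```

Passing the random source in explicitly keeps the gate testable: a seeded `random.Random` makes a chaos run repeatable.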

Comment thread src/strands_evals/chaos/experiment.py Outdated
Comment thread src/strands_evals/chaos/experiment.py
@github-actions

Review Summary (Follow-up)

Assessment: Comment

All feedback from the previous review has been thoroughly addressed — apply_rate is wired up, docstrings are accurate, logging follows the style guide, type-safety is improved, and the task wrapper is deduplicated. Nice work on the quick turnaround.

Remaining Items
  • Async task support: _wrap_task returns a sync wrapper unconditionally, which breaks async tasks passed to run_evaluations_async. Needs an async-aware branch.
  • Input validation: run_evaluations should reject async tasks with a clear error (matching base Experiment behavior) rather than silently producing bad results.

These are straightforward fixes. Once addressed, this looks good to merge.

# Produces 6 ChaosCase objects: 2 cases × (2 effect maps + 1 baseline)
"""

effects: dict[str, list[ChaosEffect]] = Field(

Issue: The type annotation dict[str, list[ChaosEffect]] causes Pydantic to only serialize the base class fields during model_dump(). Concrete effect fields (error_type, max_length, remove_ratio, corrupt_ratio) are silently dropped:

c = ChaosCase(name='test', input='hi', effects={'tool': [ToolCallFailure(error_type='timeout')]})
c.model_dump()['effects']
# → {'tool': [{'apply_rate': 1.0}]}  ← error_type lost!

Deserialization also fails since ChaosEffect is abstract. This affects the base Experiment's error reporting path (case.model_dump() on line 535 of experiment.py) and any serialization/persistence scenario.

Suggestion: Use a Pydantic discriminated union keyed on a literal effect_type field:

from typing import Annotated, Generic, Literal, Union
from pydantic import Discriminator, Field, Tag

# Add a type discriminator to each concrete effect:
class ToolCallFailure(ToolEffect):
    effect_type: Literal["tool_call_failure"] = "tool_call_failure"
    ...

# Then annotate:
AnyEffect = Annotated[
    Union[
        Annotated[ToolCallFailure, Tag("tool_call_failure")],
        Annotated[TruncateFields, Tag("truncate_fields")],
        Annotated[RemoveFields, Tag("remove_fields")],
        Annotated[CorruptValues, Tag("corrupt_values")],
    ],
    Discriminator("effect_type"),
]

class ChaosCase(Case, Generic[InputT, OutputT]):
    effects: dict[str, list[AnyEffect]] = Field(default_factory=dict)

This ensures full round-trip serialization fidelity.
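For intuition about what the discriminator buys, here is a stdlib-only sketch of the same tagged-union idea. It is not Pydantic's mechanism, and `SketchFailure`, `SketchTruncate`, `REGISTRY`, `dump`, and `load` are hypothetical names. The tag stored with each serialized effect is what lets the loader pick the concrete class.

```python
from dataclasses import asdict, dataclass


@dataclass
class SketchFailure:
    error_type: str
    apply_rate: float = 1.0
    effect_type: str = "tool_call_failure"  # the discriminator tag


@dataclass
class SketchTruncate:
    max_length: int
    apply_rate: float = 1.0
    effect_type: str = "truncate_fields"


# The tag maps serialized data back to the concrete class.
REGISTRY = {
    "tool_call_failure": SketchFailure,
    "truncate_fields": SketchTruncate,
}


def dump(effect) -> dict:
    # Serializes the concrete instance, so subclass-specific fields survive.
    return asdict(effect)


def load(data: dict):
    cls = REGISTRY[data["effect_type"]]
    return cls(**data)


original = SketchFailure(error_type="timeout")
payload = dump(original)
restored = load(payload)
```

Without the `effect_type` entry in the payload, `load` would have no way to tell a failure effect from a truncation effect, which mirrors the deserialization failure described above.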

"""
import asyncio

if asyncio.iscoroutinefunction(task):

Issue: The _wrap_task async handling was added per previous review feedback, but there's no test exercising run_evaluations_async with an actual async task. Without it, regressions to this code path would go undetected.

Suggestion: Add a test like:

@pytest.mark.asyncio
async def test_run_evaluations_async_with_async_task(self, cases, effect_maps, evaluator):
    chaos_cases = ChaosCase.expand(cases, effect_maps)
    experiment = ChaosExperiment(cases=chaos_cases, evaluators=[evaluator])

    async def async_task(case: ChaosCase):
        active = _current_chaos_case.get()
        assert active is case
        return "async_output"

    reports = await experiment.run_evaluations_async(task=async_task, max_workers=2)
    assert len(reports) >= 1

@github-actions

Review Summary (Round 3)

Assessment: Request Changes

All prior feedback has been resolved. One important serialization issue remains that would cause data loss in error reporting and persistence scenarios.

Details
  • Serialization: ChaosCase.effects uses the abstract base type ChaosEffect in its annotation, causing Pydantic to drop concrete fields during model_dump() and fail on model_validate(). Needs a discriminated union.
  • Test coverage: The async task path in _wrap_task lacks a corresponding test.

The architecture, ContextVar design, plugin implementation, and test structure are all solid.

