
feat: benchmark-agnostic evaluation protocol (BenchmarkProtocol + registry) (#107)#129

Open
yinsong1986 wants to merge 1 commit into strands-labs:main from yinsong1986:feat/benchmark-protocol

Conversation

@yinsong1986

Summary

Implements #107: a benchmark-agnostic evaluation protocol so LIBERO, Meta-World,
RoboSuite, ManiSkill, and user-authored tasks can plug into the SimEngine eval
loop without committing the core to any single benchmark's conventions.

Adapters stay thin and ship in follow-up extras (#108 Meta-World, #109 RoboSuite,
#110 LIBERO). The existing success_fn path is kept working for backcompat.

What's in scope (from #107)

  1. BenchmarkProtocol ABC + StepInfo dataclass in strands_robots/simulation/benchmark.py
  2. PolicyRunner.evaluate widened to accept spec: BenchmarkProtocol + seed, alongside the existing success_fn path
  3. Tool actions + tool_spec.json entries: list_benchmarks, register_benchmark_from_file, evaluate_benchmark
  4. Named-predicate library in strands_robots/simulation/predicates.py (11 built-ins: body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, …)
  5. Declarative YAML/JSON loader (register_benchmark_from_file) restricted to the named-predicate DSL - no eval, no exec, safe for LLM-authored specs
  6. Reference adapter: DeclarativeBenchmark (the DSL-driven adapter) serves as the end-to-end reference. A full MetaWorldAdapter is tracked as a separate PR (Benchmark adapter: Meta-World (MetaWorldAdapter) #108) since it needs the metaworld env + real task validation.
  7. Tests: 110 new tests covering protocol contract, cumulative reward, per-episode seed reproducibility, DSL compile + validation, backcompat, and MuJoCo dispatch
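
The protocol surface described in items 1–2 can be pictured roughly as follows. This is an illustrative sketch only: the field and method names beyond `BenchmarkProtocol` and `StepInfo` are assumptions, not the PR's actual signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepInfo:
    """Per-step evaluation result (illustrative field set)."""
    is_success: bool = False
    is_failure: bool = False
    reward: float = 0.0
    info: dict[str, Any] = field(default_factory=dict)

class BenchmarkProtocol(ABC):
    """Contract the eval loop depends on (method names are assumptions)."""
    name: str = ""
    max_steps: int = 0
    supported_robots: tuple[str, ...] = ()
    default_robot: str = ""

    @abstractmethod
    def on_episode_start(self, sim: Any, rng: Any) -> None:
        """Reset/randomize the scene for one seeded episode."""

    @abstractmethod
    def step_info(self, sim: Any, step: int) -> StepInfo:
        """Score the current sim state into a StepInfo."""
```

A concrete benchmark then only subclasses this ABC and registers itself; the eval loop never needs to know which benchmark family it came from.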

Architecture notes

  • Robot compatibility is first-class metadata. Default on_episode_start validates loaded robot data_config against supported_robots and auto-loads default_robot when the sim is empty. Mismatches surface as structured error dicts.
  • Registry mirrors register_urdf. Module-level dict[str, BenchmarkProtocol] guarded by an RLock. Re-registration is idempotent-overwrite with a warning.
  • Predicate library is a closed registry. YAML/JSON specs can only reference predicates in PREDICATE_REGISTRY. No eval. Sandboxed by construction.
  • Seed reproducibility. evaluate(spec=…, seed=42) seeds a master RNG, derives a per-episode child RNG, and threads it through spec.on_episode_start(sim, rng).
  • Signature-driven dispatch stays intact. No router code changes; just three new enum entries and two new top-level properties (benchmark_name, spec_path).
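
The master-RNG/child-RNG scheme from the seed-reproducibility note can be sketched like this. The use of `numpy.random.SeedSequence` is an assumption about the mechanics, not necessarily the exact implementation:

```python
import numpy as np

def per_episode_rngs(seed: int, n_episodes: int) -> list[np.random.Generator]:
    """Derive one statistically independent child RNG per episode
    from a single master seed, so evaluate(spec=..., seed=42) is
    reproducible episode-by-episode."""
    master = np.random.SeedSequence(seed)
    return [np.random.default_rng(child) for child in master.spawn(n_episodes)]
```

Each child RNG would then be threaded through `spec.on_episode_start(sim, rng)`, so re-running with the same seed replays identical per-episode randomization.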

Verification

```bash
hatch run lint  # ruff + mypy, all clean
hatch run pytest tests/simulation/test_benchmark.py \
  tests/simulation/test_benchmark_predicates.py \
  tests/simulation/test_benchmark_dsl.py \
  tests/simulation/test_policy_runner_benchmark.py \
  tests/simulation/mujoco/test_benchmark_dispatch.py
# → 110 passed
```

Full suite: 1347 passed, 31 skipped. 3 pre-existing failures all require OpenGL/OSMesa (confirmed failing on `main` too).

Example

Python:

```python
sim.register_benchmark_from_file(benchmark_name='drawer-open', spec_path='specs/drawer.yaml')
sim.evaluate_benchmark(benchmark_name='drawer-open', robot_name='arm', policy_provider='mock', n_episodes=10, seed=42)
```

Spec (`specs/drawer.yaml`):

```yaml
name: drawer-open
max_steps: 300
supported_robots: [panda]
default_robot: panda
success:
  all:
    - {predicate: joint_above, joint: drawer_slide, value: 0.15}
failure:
  any:
    - {predicate: body_below_z, body: gripper, z: -0.1}
dense_reward:
  - {predicate: distance_neg, body_a: gripper, body_b: drawer_handle, weight: 1.0}
  - {predicate: joint_progress, joint: drawer_slide, target: 0.2, weight: 5.0}
```
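
Compiling such an `all`/`any` predicate tree without `eval` comes down to a small recursive walk over the spec, resolving predicate names against the closed registry. This is an illustrative sketch of the idea, not the actual loader code:

```python
from typing import Any, Callable

def compile_tree(node: dict, registry: dict[str, Callable]) -> Callable[[Any], bool]:
    """Compile an all/any predicate tree into a callable(sim) -> bool.

    Only predicate names present in `registry` resolve; anything else
    fails validation up front, so specs stay sandboxed by construction.
    """
    if "all" in node:
        children = [compile_tree(c, registry) for c in node["all"]]
        return lambda sim: all(fn(sim) for fn in children)
    if "any" in node:
        children = [compile_tree(c, registry) for c in node["any"]]
        return lambda sim: any(fn(sim) for fn in children)
    kwargs = dict(node)
    name = kwargs.pop("predicate")
    pred = registry[name]  # unknown predicate = hard validation failure
    return lambda sim: pred(sim, **kwargs)
```

Because the leaf lookup is a plain dict access into `PREDICATE_REGISTRY`, an LLM-authored spec can only ever combine the built-in predicates, never inject code.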

Out of scope (tracked as follow-ups)

  • Meta-World / RoboSuite / LIBERO adapters (#108, #109, #110)
  • BDDL parser, dense-reward curriculum tooling, RL training harness
  • Replacing the existing success_fn path (kept working)

Closes #107.

Commit message:

Introduce BenchmarkProtocol ABC + string-keyed registry so every standard
benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks)
can plug into the SimEngine eval loop without committing the core to any
single benchmark's conventions. Adapters stay thin and ship in follow-up
extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo
  dataclass + thread-safe string-keyed registry. Robot compatibility is
  first-class metadata; default on_episode_start auto-loads default_robot
  and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library
  (body_above_z, joint_above, distance_less_than, inside_region,
  contact_between, contact_any, distance_neg, joint_progress, constant, ...).
  Closed registry, no eval() - safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark +
  register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib,
  YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int
  alongside the legacy success_fn path. Spec path adds cumulative reward,
  per-episode seeded RNG, is_failure early termination, and structured
  compatibility errors. Legacy path unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks /
  register_benchmark_from_file facades, auto-dispatched via the existing
  _dispatch_action path. MuJoCo tool_spec.json gains three action enum
  entries plus benchmark_name / spec_path properties.

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed
  reproducibility, is_success/is_failure/done terminations, legacy
  backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.
@yinsong1986
Author

yinsong1986 commented May 11, 2026

Requesting review from @sundargthb, @awsarron, @cagataycali 🙏

