
feat: benchmark-agnostic evaluation protocol (BenchmarkProtocol + registry) (#107)#129

Open
yinsong1986 wants to merge 1 commit into strands-labs:main from yinsong1986:feat/benchmark-protocol

Conversation

@yinsong1986

Summary

Implements #107: a benchmark-agnostic evaluation protocol so LIBERO, Meta-World,
RoboSuite, ManiSkill, and user-authored tasks can plug into the SimEngine eval
loop without committing the core to any single benchmark's conventions.

Adapters stay thin and ship in follow-up extras (#108 Meta-World, #109 RoboSuite,
#110 LIBERO). The existing success_fn path is kept working for backcompat.

What's in scope (from #107)

  1. BenchmarkProtocol ABC + StepInfo dataclass in strands_robots/simulation/benchmark.py
  2. PolicyRunner.evaluate widened to accept spec: BenchmarkProtocol + seed, alongside the existing success_fn path
  3. Tool actions + tool_spec.json entries: list_benchmarks, register_benchmark_from_file, evaluate_benchmark
  4. Named-predicate library in strands_robots/simulation/predicates.py (11 built-ins: body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, …)
  5. Declarative YAML/JSON loader (register_benchmark_from_file) restricted to the named-predicate DSL - no eval, no exec, safe for LLM-authored specs
  6. Reference adapter: DeclarativeBenchmark (the DSL-driven adapter) serves as the end-to-end reference. A full MetaWorldAdapter is tracked as a separate PR (Benchmark adapter: Meta-World (MetaWorldAdapter) #108) since it needs the metaworld env + real task validation.
  7. Tests: 110 new tests covering protocol contract, cumulative reward, per-episode seed reproducibility, DSL compile + validation, backcompat, and MuJoCo dispatch
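
The protocol surface described in items 1–2 can be pictured roughly as follows. This is an illustrative sketch only: the field and method names beyond `BenchmarkProtocol` and `StepInfo` are assumptions, not the PR's actual signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepInfo:
    """Per-step evaluation result (illustrative field set)."""
    is_success: bool = False
    is_failure: bool = False
    reward: float = 0.0
    info: dict[str, Any] = field(default_factory=dict)

class BenchmarkProtocol(ABC):
    """Contract the eval loop depends on (method names are assumptions)."""
    name: str = ""
    max_steps: int = 0
    supported_robots: tuple[str, ...] = ()
    default_robot: str = ""

    @abstractmethod
    def on_episode_start(self, sim: Any, rng: Any) -> None:
        """Reset/randomize the scene for one seeded episode."""

    @abstractmethod
    def step_info(self, sim: Any, step: int) -> StepInfo:
        """Score the current sim state into a StepInfo."""
```

A concrete benchmark then only subclasses this ABC and registers itself; the eval loop never needs to know which benchmark family it came from.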

Architecture notes

  • Robot compatibility is first-class metadata. Default on_episode_start validates loaded robot data_config against supported_robots and auto-loads default_robot when the sim is empty. Mismatches surface as structured error dicts.
  • Registry mirrors register_urdf. Module-level dict[str, BenchmarkProtocol] guarded by an RLock. Re-registration is idempotent-overwrite with a warning.
  • Predicate library is a closed registry. YAML/JSON specs can only reference predicates in PREDICATE_REGISTRY. No eval. Sandboxed by construction.
  • Seed reproducibility. evaluate(spec=…, seed=42) seeds a master RNG, derives a per-episode child RNG, and threads it through spec.on_episode_start(sim, rng).
  • Signature-driven dispatch stays intact. No router code changes; just three new enum entries and two new top-level properties (benchmark_name, spec_path).
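
The master-RNG/child-RNG scheme from the seed-reproducibility note can be sketched like this. The use of `numpy.random.SeedSequence` is an assumption about the mechanics, not necessarily the exact implementation:

```python
import numpy as np

def per_episode_rngs(seed: int, n_episodes: int) -> list[np.random.Generator]:
    """Derive one statistically independent child RNG per episode
    from a single master seed, so evaluate(spec=..., seed=42) is
    reproducible episode-by-episode."""
    master = np.random.SeedSequence(seed)
    return [np.random.default_rng(child) for child in master.spawn(n_episodes)]
```

Each child RNG would then be threaded through `spec.on_episode_start(sim, rng)`, so re-running with the same seed replays identical per-episode randomization.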

Verification

```bash
hatch run lint  # ruff + mypy, all clean
hatch run pytest tests/simulation/test_benchmark.py \
  tests/simulation/test_benchmark_predicates.py \
  tests/simulation/test_benchmark_dsl.py \
  tests/simulation/test_policy_runner_benchmark.py \
  tests/simulation/mujoco/test_benchmark_dispatch.py
# → 110 passed
```

Full suite: 1347 passed, 31 skipped. 3 pre-existing failures all require OpenGL/OSMesa (confirmed failing on `main` too).

Example

Python:

```python
sim.register_benchmark_from_file(benchmark_name='drawer-open', spec_path='specs/drawer.yaml')
sim.evaluate_benchmark(benchmark_name='drawer-open', robot_name='arm', policy_provider='mock', n_episodes=10, seed=42)
```

Spec (`specs/drawer.yaml`):

```yaml
name: drawer-open
max_steps: 300
supported_robots: [panda]
default_robot: panda
success:
  all:
    - {predicate: joint_above, joint: drawer_slide, value: 0.15}
failure:
  any:
    - {predicate: body_below_z, body: gripper, z: -0.1}
dense_reward:
  - {predicate: distance_neg, body_a: gripper, body_b: drawer_handle, weight: 1.0}
  - {predicate: joint_progress, joint: drawer_slide, target: 0.2, weight: 5.0}
```
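
Compiling such an `all`/`any` predicate tree without `eval` comes down to a small recursive walk over the spec, resolving predicate names against the closed registry. This is an illustrative sketch of the idea, not the actual loader code:

```python
from typing import Any, Callable

def compile_tree(node: dict, registry: dict[str, Callable]) -> Callable[[Any], bool]:
    """Compile an all/any predicate tree into a callable(sim) -> bool.

    Only predicate names present in `registry` resolve; anything else
    fails validation up front, so specs stay sandboxed by construction.
    """
    if "all" in node:
        children = [compile_tree(c, registry) for c in node["all"]]
        return lambda sim: all(fn(sim) for fn in children)
    if "any" in node:
        children = [compile_tree(c, registry) for c in node["any"]]
        return lambda sim: any(fn(sim) for fn in children)
    kwargs = dict(node)
    name = kwargs.pop("predicate")
    pred = registry[name]  # unknown predicate = hard validation failure
    return lambda sim: pred(sim, **kwargs)
```

Because the leaf lookup is a plain dict access into `PREDICATE_REGISTRY`, an LLM-authored spec can only ever combine the built-in predicates, never inject code.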

Out of scope (tracked as follow-ups)

  • Meta-World / RoboSuite / LIBERO adapters (#108, #109, #110)
  • BDDL parser, dense-reward curriculum tooling, RL training harness
  • Replacing the existing success_fn path (kept working)

Closes #107.

Commit message:

Introduce BenchmarkProtocol ABC + string-keyed registry so every standard
benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks)
can plug into the SimEngine eval loop without committing the core to any
single benchmark's conventions. Adapters stay thin and ship in follow-up
extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo
  dataclass + thread-safe string-keyed registry. Robot compatibility is
  first-class metadata; default on_episode_start auto-loads default_robot
  and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library
  (body_above_z, joint_above, distance_less_than, inside_region,
  contact_between, contact_any, distance_neg, joint_progress, constant, ...).
  Closed registry, no eval() - safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark +
  register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib,
  YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int
  alongside the legacy success_fn path. Spec path adds cumulative reward,
  per-episode seeded RNG, is_failure early termination, and structured
  compatibility errors. Legacy path unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks /
  register_benchmark_from_file facades, auto-dispatched via the existing
  _dispatch_action path. MuJoCo tool_spec.json gains three action enum
  entries plus benchmark_name / spec_path properties.

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed
  reproducibility, is_success/is_failure/done terminations, legacy
  backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.
@yinsong1986
Author

yinsong1986 commented May 11, 2026

Requesting review from @sundargthb, @awsarron, @cagataycali 🙏

