feat: benchmark-agnostic evaluation protocol (BenchmarkProtocol + registry) (#107) #129

Open · yinsong1986 wants to merge 1 commit into
Conversation
Introduce a BenchmarkProtocol ABC + string-keyed registry so every standard benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks) can plug into the SimEngine eval loop without committing the core to any single benchmark's conventions. Adapters stay thin and ship in follow-up extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here

- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo dataclass + thread-safe string-keyed registry. Robot compatibility is first-class metadata; the default on_episode_start auto-loads default_robot and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library (body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, ...). Closed registry, no eval(); safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark + register_benchmark_from_file, loading YAML/JSON specs (JSON via stdlib, YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int alongside the legacy success_fn path. The spec path adds cumulative reward, a per-episode seeded RNG, is_failure early termination, and structured compatibility errors. The legacy path is unchanged for backcompat.
- The SimEngine base adds evaluate_benchmark / list_benchmarks / register_benchmark_from_file facades, auto-dispatched via the existing _dispatch_action path. MuJoCo tool_spec.json gains three action enum entries plus benchmark_name / spec_path properties.

Test coverage (110 new tests, all passing)

- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed reproducibility, is_success/is_failure/done terminations, legacy backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)

- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.
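As a rough illustration of the thread-safe string-keyed registry described above (a module-level dict behind an RLock, overwrite-with-warning on re-registration), a minimal sketch could look like this. The function names (`register_benchmark`, `get_benchmark`) are illustrative, not the actual strands_robots API:

```python
# Sketch of a thread-safe, string-keyed benchmark registry.
# Function names are illustrative placeholders, not the real API.
import threading
import warnings

_REGISTRY: dict[str, object] = {}
_LOCK = threading.RLock()


def register_benchmark(name: str, benchmark: object) -> None:
    """Register under a string key; re-registration overwrites with a warning."""
    with _LOCK:
        if name in _REGISTRY:
            warnings.warn(f"benchmark {name!r} re-registered; overwriting")
        _REGISTRY[name] = benchmark


def get_benchmark(name: str) -> object:
    with _LOCK:
        return _REGISTRY[name]


def list_benchmarks() -> list[str]:
    with _LOCK:
        return sorted(_REGISTRY)
```

An RLock (rather than a plain Lock) lets a registration helper call back into the registry from the same thread without deadlocking.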
Requesting review from @sundargthb, @awsarron, @cagataycali 🙏
Summary
Implements #107: a benchmark-agnostic evaluation protocol so LIBERO, Meta-World,
RoboSuite, ManiSkill, and user-authored tasks can plug into the SimEngine eval
loop without committing the core to any single benchmark's conventions.
Adapters stay thin and ship in follow-up extras (#108 Meta-World, #109 RoboSuite,
#110 LIBERO). The existing success_fn path is kept working for backcompat.

What's in scope (from #107)
- BenchmarkProtocol ABC + StepInfo dataclass in strands_robots/simulation/benchmark.py
- PolicyRunner.evaluate widened to accept spec: BenchmarkProtocol + seed, alongside the existing success_fn path
- tool_spec.json entries: list_benchmarks, register_benchmark_from_file, evaluate_benchmark
- strands_robots/simulation/predicates.py (11 built-ins: body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, …)
- Spec loading (register_benchmark_from_file) restricted to the named-predicate DSL: no eval, no exec, safe for LLM-authored specs
- DeclarativeBenchmark (the DSL-driven adapter) serves as the end-to-end reference. A full MetaWorldAdapter is tracked as a separate PR (Benchmark adapter: Meta-World (MetaWorldAdapter) #108) since it needs the metaworld env + real task validation.

Architecture notes
- on_episode_start validates the loaded robot's data_config against supported_robots and auto-loads default_robot when the sim is empty. Mismatches surface as structured error dicts.
- Registry: module-level dict[str, BenchmarkProtocol] guarded by an RLock, following the same pattern as register_urdf. Re-registration is idempotent-overwrite with a warning.
- Predicates live in a closed PREDICATE_REGISTRY. No eval. Sandboxed by construction.
- evaluate(spec=…, seed=42) seeds a master RNG, derives a per-episode child RNG, and threads it through spec.on_episode_start(sim, rng).
- MuJoCo tool_spec.json gains the new properties (benchmark_name, spec_path).
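The master/child RNG scheme in the notes above can be sketched with the stdlib random module; the real RNG type and derivation method are not specified in this PR, so treat this as one plausible implementation:

```python
# Sketch: one master RNG seeded from the user's seed, one derived
# child RNG per episode -- so runs with the same seed reproduce exactly.
import random


def episode_rngs(seed: int, n_episodes: int) -> list[random.Random]:
    master = random.Random(seed)
    # Draw an independent child seed per episode from the master stream.
    return [random.Random(master.getrandbits(64)) for _ in range(n_episodes)]
```

Because each child is seeded from the master's deterministic stream, `evaluate(spec=..., seed=42)` replays identical episode randomization, and episode k is unaffected by how much randomness earlier episodes consumed.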
```bash
hatch run lint # ruff + mypy, all clean
hatch run pytest tests/simulation/test_benchmark.py \
tests/simulation/test_benchmark_predicates.py \
tests/simulation/test_benchmark_dsl.py \
tests/simulation/test_policy_runner_benchmark.py \
tests/simulation/mujoco/test_benchmark_dispatch.py
→ 110 passed
```
Full suite: 1347 passed, 31 skipped. 3 pre-existing failures all require OpenGL/OSMesa (confirmed failing on `main` too).
Example
Python:
```python
sim.register_benchmark_from_file(benchmark_name='drawer-open', spec_path='specs/drawer.yaml')
sim.evaluate_benchmark(benchmark_name='drawer-open', robot_name='arm', policy_provider='mock', n_episodes=10, seed=42)
```
Spec (`specs/drawer.yaml`):
```yaml
name: drawer-open
max_steps: 300
supported_robots: [panda]
default_robot: panda
success:
  all:
    - {predicate: joint_above, joint: drawer_slide, value: 0.15}
failure:
  any:
    - {predicate: body_below_z, body: gripper, z: -0.1}
dense_reward:
```
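A closed-registry evaluator for the all/any predicate tree in a spec like the one above can be sketched as follows. PREDICATES holds only whitelisted callables, so an untrusted or LLM-authored spec can never reach eval()/exec(); the sim dict here is a stand-in for the real sim handle, and only two predicates are stubbed in:

```python
# Sketch: evaluate a spec's success/failure tree against a closed
# predicate registry -- no eval()/exec(), unknown names raise KeyError.
PREDICATES = {
    "joint_above": lambda sim, joint, value: sim["joints"][joint] > value,
    "body_below_z": lambda sim, body, z: sim["bodies"][body][2] < z,
}


def evaluate(node: dict, sim: dict) -> bool:
    """Recursively evaluate an all/any combinator tree of named predicates."""
    if "all" in node:
        return all(evaluate(child, sim) for child in node["all"])
    if "any" in node:
        return any(evaluate(child, sim) for child in node["any"])
    kwargs = dict(node)
    name = kwargs.pop("predicate")
    return PREDICATES[name](sim, **kwargs)  # KeyError on non-whitelisted name
```

Sandboxing falls out of the structure: the spec can only name registry keys and pass literal arguments, so the attack surface is the predicate implementations themselves.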
Out of scope (tracked as follow-ups)

- Meta-World / RoboSuite / LIBERO adapters (#108, #109, #110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Closes #107.