feat: LIBERO benchmark adapter + BDDL parser (#110) #130
Open
yinsong1986 wants to merge 2 commits into
Introduce BenchmarkProtocol ABC + string-keyed registry so every standard benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks) can plug into the SimEngine eval loop without committing the core to any single benchmark's conventions. Adapters stay thin and ship in follow-up extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo dataclass + thread-safe string-keyed registry. Robot compatibility is first-class metadata; the default on_episode_start auto-loads default_robot and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library (body_above_z, joint_above, distance_less_than, inside_region, contact_between, contact_any, distance_neg, joint_progress, constant, ...). Closed registry, no eval(), so it is safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark + register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib, YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int alongside the legacy success_fn path. The spec path adds cumulative reward, per-episode seeded RNG, is_failure early termination, and structured compatibility errors. The legacy path is unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks / register_benchmark_from_file facades, auto-dispatched via the existing _dispatch_action path. MuJoCo tool_spec.json gains three action enum entries plus benchmark_name / spec_path properties.

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed reproducibility, is_success/is_failure/done terminations, legacy backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.
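As a rough illustration of the "thread-safe string-keyed registry" and "closed registry, no eval()" ideas in this commit message, here is a minimal Python sketch. It does not use or reproduce the code in strands_robots: the `Registry` class, `compile_spec`, the spec layout (`{"predicate": ..., "args": ...}`), the `body_above_z(body, z)` signature, and the dict-based fake sim are all assumptions made for the example.

```python
import threading
from typing import Callable, Dict, List


class Registry:
    """String-keyed registry guarded by a lock (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._items: Dict[str, Callable] = {}
        self._lock = threading.Lock()

    def register(self, name: str, fn: Callable) -> None:
        with self._lock:
            if name in self._items:
                raise KeyError(f"{name!r} is already registered")
            self._items[name] = fn

    def get(self, name: str) -> Callable:
        with self._lock:
            try:
                return self._items[name]
            except KeyError:
                valid = ", ".join(sorted(self._items))
                raise KeyError(f"unknown name {name!r}; valid: {valid}") from None

    def names(self) -> List[str]:
        with self._lock:
            return sorted(self._items)


PREDICATES = Registry()


def body_above_z(body: str, z: float) -> Callable[[dict], bool]:
    """Assumed signature: factory returning a (sim) -> bool check."""
    return lambda sim: sim[body][2] > z


PREDICATES.register("body_above_z", body_above_z)
PREDICATES.register("constant", lambda value: (lambda sim: bool(value)))


def compile_spec(spec: dict) -> Callable[[dict], bool]:
    """Resolve a declarative spec by registry lookup only; nothing is eval'd."""
    factory = PREDICATES.get(spec["predicate"])
    return factory(**spec.get("args", {}))


# Fake sim for the example: body name -> (x, y, z)
fake_sim = {"cube": (0.0, 0.0, 0.30)}
check = compile_spec({"predicate": "body_above_z", "args": {"body": "cube", "z": 0.25}})
print(check(fake_sim))  # True
```

Because a spec can only name entries that were registered ahead of time, an untrusted spec cannot reach arbitrary Python; an unknown name fails with a `KeyError` that lists the valid ones.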
Follow-up to strands-labs#107 / strands-labs#129 (BenchmarkProtocol). Ships LiberoAdapter so the ~130 LIBERO tasks become runnable on the MuJoCo backend through:

    sim.evaluate_benchmark(benchmark_name="libero-spatial-pick-cube", policy_provider="mock", n_episodes=50, seed=42)

What lands
- strands_robots/benchmarks/libero/bddl_parser.py: pure-Python BDDL parser (tokenizer + s-expr parser + typed AST + goal compiler). Closed predicate vocabulary (on/near/inside/open/closed/grasped/upright + and/or/not); unknown predicates are rejected explicitly with the valid list in the error message. No eval(), no exec(), so it is safe for untrusted input.
- strands_robots/benchmarks/libero/adapter.py: LiberoAdapter (BenchmarkProtocol, Panda-only). Loads the scene before super().on_episode_start so the robot-compat check sees the scene's Panda, then applies per-episode RNG-seeded xy jitter to init-subject bodies. Sparse on_step (reward=0, done=False); is_success walks the compiled predicate tree.
- strands_robots/benchmarks/libero/suite.py: load_libero_suite() bulk-registers every task in a LIBERO suite under ``libero-<suite>-<task>`` keys. Accepts a bddl_dir= override so the adapter is usable without the libero pip package (LIBERO is only needed when discovering task files from the installed package). Malformed BDDL files are logged and skipped, not fatal.
- strands_robots/simulation/predicates.py: 4 new generic predicates (body_on, body_inside, body_upright, grasped) required by the BDDL vocabulary. They use only SimEngine.get_body_state / get_contacts so they remain backend-agnostic.
- pyproject.toml: new [benchmark-libero] extra (libero>=0.1.0,<1.0.0). NOT added to [all], because LIBERO is a heavy optional dep. Core tests pass without it installed; load_libero_suite without bddl_dir= fails with a clean ImportError naming the extra.

Test coverage (91 new tests, all passing)
- BDDL parser: tokenizer (comments/quotes/parens), s-expr parser, top-level (define ...) structure, :language/:init/:goal/:objects extraction, boolean combinators with short-circuit behaviour, rejection of unknown predicates/wrong arities, round-trip on a curated 5-task subset covering every predicate family.
- LiberoAdapter: construction (from_file/from_text/__init__), supported_robots=['panda'], instruction surfacing, is_success positive+negative+compound+negated, on_episode_start scene-load ordering + compat error on non-Panda, init jitter with deterministic seed, sparse on_step semantics, PolicyRunner.evaluate(spec=) + evaluate_benchmark integration, unknown-task → structured error.
- load_libero_suite: suite name normalisation (kebab/snake/case-insensitive), bulk registration with bddl_dir override, scene path resolution (existing-file + missing-file), malformed-BDDL skip, max_steps/init_jitter forwarding, unknown suite + nonexistent dir errors, empty directory is a no-op.
- End-to-end via MuJoCo: LiberoAdapter registered + evaluate_benchmark dispatched through _dispatch_action → PolicyRunner → is_success round-trips without crashing; non-Panda robot surfaces structured compat error.
- New predicate tests: body_on / body_inside / body_upright / grasped positive + negative + edge cases (missing body, no gripper contact, invalid tolerance).

Architecture notes
- **Panda-only by design.** LIBERO's scene MJCFs include Panda geometry and BDDL predicates reference Panda gripper body names (robot0_gripper_*). Retargeting to SO-100 et al. would require rewriting every BDDL predicate against different body names, which is out of scope.
- **No libero dep for the adapter or parser.** You can use LiberoAdapter.from_text(...) with your own BDDL strings and MJCFs; the libero package is only imported by load_libero_suite when discovering task files from the installed package.
- **Stacked on PR strands-labs#129.** The BenchmarkProtocol framework (strands-labs#107) must merge first. Once strands-labs#129 lands, this PR auto-rebases onto main.

Out of scope (per strands-labs#110)
- Dense-reward variants (LIBERO is sparse by design).
- Language-conditioning evaluation tooling (instruction passes through to the policy as-is via LiberoAdapter.instruction).
- Hosting LIBERO's MJCF assets (depend on upstream).

Closes strands-labs#110.
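The tokenizer → s-expression parser → goal compiler pipeline described in this commit message can be sketched in a few dozen lines of Python. This is an illustrative sketch only, not the code in bddl_parser.py: the function names, the per-predicate arities in `ARITY`, and the representation of world state as a set of fact tuples are assumptions, and quoted strings and the typed AST are left out for brevity.

```python
import re
from typing import Callable, List, Set, Tuple, Union

SExpr = Union[str, List["SExpr"]]
Facts = Set[Tuple[str, ...]]

# Closed vocabulary from the commit message; the arities are assumed here.
ARITY = {"on": 2, "near": 2, "inside": 2, "open": 1,
         "closed": 1, "grasped": 1, "upright": 1}


def tokenize(text: str) -> List[str]:
    """Drop ';' comments, then split into parentheses and atoms."""
    text = re.sub(r";[^\n]*", "", text)
    return re.findall(r"[()]|[^\s()]+", text)


def parse(tokens: List[str]) -> SExpr:
    """Parse exactly one s-expression into nested lists of atoms."""
    pos = 0

    def read() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[SExpr] = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume ')'
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    expr = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return expr


def compile_goal(expr: SExpr) -> Callable[[Facts], bool]:
    """Compile a goal expression into a single (facts) -> bool callable."""
    if not isinstance(expr, list) or not expr or not isinstance(expr[0], str):
        raise ValueError(f"expected (predicate args...), got {expr!r}")
    head, args = expr[0].lower(), expr[1:]
    if head in ("and", "or"):
        parts = [compile_goal(a) for a in args]
        combine = all if head == "and" else any  # both short-circuit on generators
        return lambda facts: combine(p(facts) for p in parts)
    if head == "not":
        if len(args) != 1:
            raise ValueError("not takes exactly one argument")
        inner = compile_goal(args[0])
        return lambda facts: not inner(facts)
    if head not in ARITY:
        raise ValueError(f"unknown predicate {head!r}; valid: {sorted(ARITY)}")
    if len(args) != ARITY[head] or not all(isinstance(a, str) for a in args):
        raise ValueError(f"{head} expects {ARITY[head]} atom argument(s)")
    fact = (head, *args)
    return lambda facts: fact in facts


goal = compile_goal(parse(tokenize("(and (on cube plate) (not (open drawer))) ; goal")))
print(goal({("on", "cube", "plate")}))                      # True
print(goal({("on", "cube", "plate"), ("open", "drawer")}))  # False
```

In a real adapter each leaf would presumably query the simulator through a predicate such as `body_on` rather than look up a fact tuple. The dispatch on a fixed vocabulary is what lets the parser reject unknown names without ever evaluating input as code.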
Requesting review from @sundargthb, @awsarron, @cagataycali 🙏
Summary
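Implements #110 as a follow-up to #129 (the `BenchmarkProtocol` framework from #107). Ships `LiberoAdapter` so the ~130 LIBERO tasks become runnable on the MuJoCo backend through the standard `evaluate_benchmark` surface, for example `sim.evaluate_benchmark(benchmark_name="libero-spatial-pick-cube", policy_provider="mock", n_episodes=50, seed=42)`.

What's in scope (from #110)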
- BDDL parser (`strands_robots/benchmarks/libero/bddl_parser.py`): pure-Python tokenizer + s-expression parser + typed AST + goal compiler. Handles `define`, `:domain`, `:objects`, `:init`, `:goal`, `:language`, boolean combinators (`and`/`or`/`not`), and the closed predicate vocabulary LIBERO uses: `on`, `near`, `inside`, `open`, `closed`, `grasped`, `upright`. Unknown predicates are rejected explicitly with the valid list. No `eval()`, no `exec()`, so it is safe for untrusted input.
- `LiberoAdapter.on_episode_start` loads the declared scene MJCF (if provided) before the base compatibility check, so the check sees the scene's Panda robot rather than reporting "sim empty → load default".
- `LiberoAdapter` class (`strands_robots/benchmarks/libero/adapter.py`): Panda-only `BenchmarkProtocol` subclass. Compiles the BDDL goal into a single `(sim) -> bool` callable. Sparse `on_step` (LIBERO has no dense reward). Applies per-episode RNG-seeded ±jitter to init-subject bodies.
- `load_libero_suite(suite_name)` bulk-registers every BDDL task in a suite under `libero-<suite>-<task>` keys. Accepts a `bddl_dir=` override so the adapter is fully usable without the `libero` pip package: you can ship your own BDDL files. LIBERO is only imported when auto-discovering task files from the installed package.
- New extra `strands-robots[benchmark-libero]` with `libero>=0.1.0,<1.0.0`. Not added to `[all]` (LIBERO is a heavy opt-in dep). Core tests pass without it installed.

New generic predicates
BDDL requires 4 predicates not yet in the core library; they're generic enough to live there rather than being LIBERO-specific:
- `body_on(body_a, body_b, z_offset=0.02, xy_tol=0.15)`
- `body_inside(body, container, xy_tol=0.15, z_tol=0.15)`
- `body_upright(body, tol=0.15)`: ... `tol` of world +Z
- `grasped(body, gripper_prefix)`: ... `gripper_prefix*`

All use only `SimEngine.get_body_state` / `get_contacts` so they remain backend-agnostic.

Architecture notes
- **Panda-only by design.** LIBERO's scene MJCFs `<include>` Panda geometry and BDDL predicates reference Panda gripper body names (`robot0_gripper_*`). Retargeting to SO-100 et al. would require rewriting every BDDL predicate against different body names, which is out of scope.
- `load_libero_suite` keeps loading the rest of the suite if one file is bad, so one upstream drift doesn't brick a whole suite.

Tests (91 new, all passing)
- `tests/benchmarks/libero/test_bddl_parser.py` (38): tokenizer / parser / vocabulary / combinators / round-trip on one example per predicate family.
- `tests/benchmarks/libero/test_libero_adapter.py` (19): construction, Panda-only compat, is_success cases, scene-load ordering, init jitter determinism, PolicyRunner / evaluate_benchmark integration, unknown-task structured error.
- `tests/benchmarks/libero/test_libero_suite.py` (15): suite name normalisation, bulk registration, scene resolution, malformed-BDDL skip, forwarding of `max_steps` / `init_jitter`, error cases.
- `tests/benchmarks/libero/test_libero_e2e.py` (2): end-to-end MuJoCo dispatch path.
- `tests/simulation/test_benchmark_predicates.py` (+17): new predicate tests (body_on / body_inside / body_upright / grasped).

Verification
Without LIBERO installed:
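A minimal sketch of the failure mode this check exercises, namely an optional dependency that is missing and an `ImportError` that names the extra to install. The helper name `require_extra`, its signature, and the deliberately non-existent module name are assumptions for illustration; only the extra name `strands-robots[benchmark-libero]` comes from this PR.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str,
                  package: str = "strands-robots") -> ModuleType:
    """Import an optional dependency, or raise an ImportError naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is not installed; install it with "
            f"`pip install '{package}[{extra}]'`"
        ) from exc


message = ""
try:
    # Hypothetical module name, chosen so the import fails on any machine.
    require_extra("libero_not_installed_example", "benchmark-libero")
except ImportError as err:
    message = str(err)
    print(message)
```

Deferring the import to the one code path that needs it is what keeps the core tests independent of the heavy dependency.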
Example: register + evaluate a LIBERO task
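The following toy sketch mirrors the shape of the register-then-evaluate flow without importing strands_robots, so it runs anywhere. Everything in it is a stand-in: `register`, `evaluate`, the dict-based sim, and the policies are hypothetical, and the `toy_body_on` semantics (within `xy_tol` horizontally and at least `z_offset` higher) are an assumed reading of the `body_on` signature listed in this PR, not its actual implementation.

```python
import math
import random
from typing import Callable, Dict, Tuple

Pose = Tuple[float, float, float]
Sim = Dict[str, Pose]  # toy stand-in for a simulator: body name -> (x, y, z)


def toy_body_on(a: str, b: str, z_offset: float = 0.02,
                xy_tol: float = 0.15) -> Callable[[Sim], bool]:
    """Assumed semantics: a is near b in xy and at least z_offset above it."""
    def check(sim: Sim) -> bool:
        ax, ay, az = sim[a]
        bx, by, bz = sim[b]
        return math.hypot(ax - bx, ay - by) <= xy_tol and (az - bz) >= z_offset
    return check


REGISTRY: Dict[str, dict] = {}


def register(name: str, init: Sim, goal: Callable[[Sim], bool],
             jitter: float = 0.0) -> None:
    REGISTRY[name] = {"init": init, "goal": goal, "jitter": jitter}


def evaluate(name: str, policy: Callable[[Sim], None],
             n_episodes: int, seed: int) -> float:
    """Return the success rate over n_episodes with per-episode seeded jitter."""
    task = REGISTRY[name]
    successes = 0
    for episode in range(n_episodes):
        rng = random.Random(seed + episode)  # one reproducible RNG per episode
        j = task["jitter"]
        sim = {body: (x + rng.uniform(-j, j), y + rng.uniform(-j, j), z)
               for body, (x, y, z) in task["init"].items()}
        policy(sim)                      # the policy mutates the toy sim
        successes += task["goal"](sim)   # sparse success check at episode end
    return successes / n_episodes


def place_policy(sim: Sim) -> None:
    """Scripted policy: put the cube 3 cm above the plate."""
    px, py, pz = sim["plate"]
    sim["cube"] = (px, py, pz + 0.03)


def idle_policy(sim: Sim) -> None:
    """Scripted policy that does nothing."""


register("toy-pick-cube",
         init={"cube": (0.3, 0.0, 0.0), "plate": (0.0, 0.0, 0.0)},
         goal=toy_body_on("cube", "plate"), jitter=0.02)

print(evaluate("toy-pick-cube", place_policy, n_episodes=5, seed=42))  # 1.0
print(evaluate("toy-pick-cube", idle_policy, n_episodes=5, seed=42))   # 0.0
```

Seeding a fresh RNG from `seed + episode` is one simple way to get the reproducible per-episode jitter this PR describes; the actual adapter may derive its per-episode seeds differently.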
Out of scope (per #110)
- Language-conditioning evaluation tooling (`adapter.instruction` surfaces the BDDL `:language` string; the policy consumes it as-is).

Closes #110.
Reviewers from #129 — would appreciate your eyes on this one too: @sundargthb @awsarron @cagataycali