
feat: LIBERO benchmark adapter + BDDL parser (#110) #130

Open
yinsong1986 wants to merge 2 commits into strands-labs:main from yinsong1986:feat/benchmark-libero

Conversation

@yinsong1986

Summary

Implements #110 as a follow-up to #129 (the BenchmarkProtocol framework from #107). Ships LiberoAdapter so the ~130 LIBERO tasks become runnable on the MuJoCo backend through the standard evaluate_benchmark surface:

sim.evaluate_benchmark(
    benchmark_name="libero-spatial-pick_up_the_red_cube",
    policy_provider="lerobot_local",
    n_episodes=50,
    seed=42,
)

Stacked on #129. #129 must merge first. The diff below will include #129's commits until then; once #129 merges this PR auto-rebases onto main.

What's in scope (from #110)

  1. BDDL parser (strands_robots/benchmarks/libero/bddl_parser.py) — pure-Python tokenizer + s-expression parser + typed AST + goal compiler. Handles define, :domain, :objects, :init, :goal, :language, boolean combinators (and / or / not), and the closed predicate vocabulary LIBERO uses: on, near, inside, open, closed, grasped, upright. Unknown predicates rejected explicitly with the valid list. No eval(), no exec() — safe for untrusted input.
  2. Scene loading — LiberoAdapter.on_episode_start loads the declared scene MJCF (if provided) before the base compatibility check, so the check sees the scene's Panda robot rather than reporting "sim empty → load default".
  3. LiberoAdapter class (strands_robots/benchmarks/libero/adapter.py) — Panda-only BenchmarkProtocol subclass. Compiles BDDL goal → single (sim) -> bool callable. Sparse on_step (LIBERO has no dense reward). Applies per-episode RNG-seeded ±jitter to init-subject bodies.
  4. Registry bootstrap — load_libero_suite(suite_name) bulk-registers every BDDL task in a suite under libero-<suite>-<task> keys. Accepts bddl_dir= override so the adapter is fully usable without the libero pip package — you can ship your own BDDL files. LIBERO is only imported when auto-discovering task files from the installed package.
  5. Packaging — new optional extra strands-robots[benchmark-libero] with libero>=0.1.0,<1.0.0. Not added to [all] (LIBERO is a heavy opt-in dep). Core tests pass without it installed.
  6. Tests — 91 new tests (see below) covering parser, adapter lifecycle, suite loader, and end-to-end MuJoCo dispatch.
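The tokenize → s-expression parse → vocabulary-check pipeline from item 1 can be sketched as follows. This is a minimal illustration, not the shipped parser: function names (`tokenize`, `parse_sexpr`, `check_goal`) are hypothetical, and quote/arity handling from the real bddl_parser.py is omitted.

```python
import re

def tokenize(text: str) -> list[str]:
    # Strip ;-comments, then emit parens and atoms as standalone tokens.
    text = re.sub(r";[^\n]*", "", text)
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse_sexpr(tokens: list[str], pos: int = 0):
    """Recursively build nested Python lists from a token stream."""
    if tokens[pos] != "(":
        return tokens[pos], pos + 1  # atom
    pos += 1
    node = []
    while tokens[pos] != ")":
        child, pos = parse_sexpr(tokens, pos)
        node.append(child)
    return node, pos + 1

# Closed predicate vocabulary — anything else is rejected with the valid list.
KNOWN_PREDICATES = {"on", "near", "inside", "open", "closed", "grasped", "upright"}

def check_goal(node) -> None:
    """Walk a goal AST; combinators recurse, leaves must be known predicates."""
    head = node[0]
    if head in ("and", "or", "not"):
        for child in node[1:]:
            check_goal(child)
    elif head not in KNOWN_PREDICATES:
        raise ValueError(f"unknown predicate {head!r}; valid: {sorted(KNOWN_PREDICATES)}")
```

Because the input is reduced to a closed token set and a closed predicate vocabulary, nothing in the text is ever executed — the safety property the PR claims for untrusted BDDL.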

New generic predicates

BDDL requires 4 predicates not yet in the core library; they're generic enough to live there rather than being LIBERO-specific:

| Predicate | Meaning |
| --- | --- |
| body_on(body_a, body_b, z_offset=0.02, xy_tol=0.15) | body_a is above body_b and horizontally close |
| body_inside(body, container, xy_tol=0.15, z_tol=0.15) | body is within the container's AABB |
| body_upright(body, tol=0.15) | body's local +Z axis within tol of world +Z |
| grasped(body, gripper_prefix) | body in contact with any geom matching gripper_prefix* |

All use only SimEngine.get_body_state / get_contacts so they remain backend-agnostic.
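To make the backend-agnostic claim concrete, here is a sketch of what body_on looks like when written against only the body-state API. The `.position` attribute shape is an assumption for illustration; the real predicate lives in strands_robots/simulation/predicates.py.

```python
import math

def body_on(sim, body_a: str, body_b: str,
            z_offset: float = 0.02, xy_tol: float = 0.15) -> bool:
    """True when body_a sits above body_b and is horizontally close.

    Assumes sim.get_body_state(name) returns an object exposing a
    .position tuple (x, y, z) — a hypothetical shape for this sketch.
    """
    pa = sim.get_body_state(body_a).position
    pb = sim.get_body_state(body_b).position
    above = pa[2] >= pb[2] + z_offset                      # vertical check
    close = math.hypot(pa[0] - pb[0], pa[1] - pb[1]) <= xy_tol  # horizontal check
    return above and close
```

Nothing here touches MuJoCo directly, which is why the same predicate can evaluate against any SimEngine backend (or a fake sim in tests).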

Architecture notes

  • Panda-only by design. LIBERO's scene MJCFs <include> Panda geometry and BDDL predicates reference Panda gripper body names (robot0_gripper_*). Retargeting to SO-100 et al. would require rewriting every BDDL predicate against different body names — out of scope.
  • BDDL parser never runs Python. Closed token set + closed predicate vocabulary. LLM-authored or agent-authored BDDL files are safe to parse.
  • Malformed BDDL is logged, not fatal. load_libero_suite keeps loading the rest of the suite if one file is bad — so one upstream drift doesn't brick a whole suite.
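The "log, don't die" suite-loading behaviour amounts to a skip-on-error loop. A minimal sketch (hypothetical helper name; `parse` and `register` stand in for the real parser and registry calls):

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)

def load_suite_files(bddl_dir: str, parse, register) -> int:
    """Register every parseable BDDL file; log and skip malformed ones."""
    loaded = 0
    for path in sorted(Path(bddl_dir).glob("*.bddl")):
        try:
            problem = parse(path.read_text())
        except ValueError as exc:
            # One bad upstream file must not brick the whole suite.
            log.warning("skipping malformed BDDL %s: %s", path.name, exc)
            continue
        register(problem)
        loaded += 1
    return loaded
```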

Tests (91 new, all passing)

  • tests/benchmarks/libero/test_bddl_parser.py (38) — tokenizer / parser / vocabulary / combinators / round-trip on one example per predicate family.
  • tests/benchmarks/libero/test_libero_adapter.py (19) — construction, Panda-only compat, is_success cases, scene-load ordering, init jitter determinism, PolicyRunner / evaluate_benchmark integration, unknown-task structured error.
  • tests/benchmarks/libero/test_libero_suite.py (15) — suite name normalisation, bulk registration, scene resolution, malformed-BDDL skip, forwarding of max_steps / init_jitter, error cases.
  • tests/benchmarks/libero/test_libero_e2e.py (2) — end-to-end MuJoCo dispatch path.
  • tests/simulation/test_benchmark_predicates.py (+17) — new predicate tests (body_on / body_inside / body_upright / grasped).

Verification

hatch run format                  # ruff check --fix + format
hatch run lint                    # all clean
hatch run pytest tests/           # 1438 passed, 31 skipped (env-dependent GL skips)

Without LIBERO installed:

python -c "
import strands_robots
from strands_robots.benchmarks.libero import LiberoAdapter, parse_bddl
a = LiberoAdapter.from_text('(define (problem t) (:goal (grasped cube)))')
print(a.problem.name, a.supported_robots)  # → t ['panda']
"
# → ok
# load_libero_suite without bddl_dir= raises a clean ImportError naming [benchmark-libero].

Example: register + evaluate a LIBERO task

from strands_robots.simulation import Simulation
from strands_robots.benchmarks.libero import load_libero_suite

sim = Simulation(tool_name="libero_sim", mesh=False)
sim.create_world()

# Bulk-register every task in libero_spatial (requires libero pip pkg OR bddl_dir=)
load_libero_suite("libero_spatial", scene_dir="/path/to/scenes")

result = sim.evaluate_benchmark(
    benchmark_name="libero-spatial-pick_up_the_red_cube",
    policy_provider="lerobot_local",
    policy_config={"pretrained_name_or_path": "lerobot/smolvla"},
    n_episodes=20,
    seed=42,
)

Out of scope (per #110)

  • Dense-reward variants of LIBERO tasks (LIBERO is sparse-success by design).
  • Language-conditioning evaluation tooling (adapter.instruction surfaces the BDDL :language string; the policy consumes it as-is).
  • Hosting LIBERO's MJCF assets (depend on upstream).

Closes #110.


Reviewers from #129 — would appreciate your eyes on this one too: @sundargthb @awsarron @cagataycali

Introduce BenchmarkProtocol ABC + string-keyed registry so every standard
benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks)
can plug into the SimEngine eval loop without committing the core to any
single benchmark's conventions. Adapters stay thin and ship in follow-up
extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo
  dataclass + thread-safe string-keyed registry. Robot compatibility is
  first-class metadata; default on_episode_start auto-loads default_robot
  and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library
  (body_above_z, joint_above, distance_less_than, inside_region,
  contact_between, contact_any, distance_neg, joint_progress, constant, ...).
  Closed registry, no eval() - safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark +
  register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib,
  YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int
  alongside the legacy success_fn path. Spec path adds cumulative reward,
  per-episode seeded RNG, is_failure early termination, and structured
  compatibility errors. Legacy path unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks /
  register_benchmark_from_file facades, auto-dispatched via the existing
  _dispatch_action path. MuJoCo tool_spec.json gains three action enum
  entries plus benchmark_name / spec_path properties.
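The thread-safe string-keyed registry described above follows a standard lock-guarded dict pattern. A sketch under assumed names (the class and method names here are illustrative, not the shipped API):

```python
import threading

class BenchmarkRegistry:
    """String-keyed benchmark registry guarded by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._benchmarks: dict[str, object] = {}

    def register(self, name: str, benchmark) -> None:
        with self._lock:
            if name in self._benchmarks:
                raise ValueError(f"benchmark {name!r} already registered")
            self._benchmarks[name] = benchmark

    def get(self, name: str):
        with self._lock:
            try:
                return self._benchmarks[name]
            except KeyError:
                # Structured error: name the unknown key and the known ones.
                raise KeyError(
                    f"unknown benchmark {name!r}; known: {sorted(self._benchmarks)}"
                ) from None

    def list_benchmarks(self) -> list[str]:
        with self._lock:
            return sorted(self._benchmarks)
```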

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed
  reproducibility, is_success/is_failure/done terminations, legacy
  backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.

Follow-up to strands-labs#107 / strands-labs#129 (BenchmarkProtocol). Ships LiberoAdapter so the
~130 LIBERO tasks become runnable on the MuJoCo backend through:

    sim.evaluate_benchmark(benchmark_name="libero-spatial-pick-cube",
                           policy_provider="mock", n_episodes=50, seed=42)

What lands
- strands_robots/benchmarks/libero/bddl_parser.py: pure-Python BDDL
  parser (tokenizer + s-expr parser + typed AST + goal compiler). Closed
  predicate vocabulary (on/near/inside/open/closed/grasped/upright +
  and/or/not); unknown predicates rejected explicitly with the valid
  list in the error message. No eval(), no exec() — safe for untrusted
  input.
- strands_robots/benchmarks/libero/adapter.py: LiberoAdapter
  (BenchmarkProtocol, Panda-only). Loads scene before super().
  on_episode_start so the robot-compat check sees the scene's Panda,
  then applies per-episode RNG-seeded xy jitter to init-subject bodies.
  Sparse on_step (reward=0, done=False); is_success walks the compiled
  predicate tree.
- strands_robots/benchmarks/libero/suite.py: load_libero_suite() —
  bulk-registers every task in a LIBERO suite under
  ``libero-<suite>-<task>`` keys. Accepts bddl_dir= override so the
  adapter is usable without the libero pip package (LIBERO is only
  needed when discovering task files from the installed package).
  Malformed BDDL files are logged and skipped, not fatal.
- strands_robots/simulation/predicates.py: 4 new generic predicates
  (body_on, body_inside, body_upright, grasped) required by the BDDL
  vocabulary. They use only SimEngine.get_body_state / get_contacts so
  they remain backend-agnostic.
- pyproject.toml: new [benchmark-libero] extra (libero>=0.1.0,<1.0.0).
  NOT added to [all] — LIBERO is a heavy optional dep. Core tests pass
  without it installed; load_libero_suite without bddl_dir= fails with
  a clean ImportError naming the extra.
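The sparse episode semantics described above (reward=0, done=False every step; success decided only by the compiled goal) reduce to a very small surface. A hypothetical sketch — not the real LiberoAdapter, and the dict-shaped step result stands in for the actual StepInfo:

```python
class SparseEpisodeSketch:
    """Illustrates sparse on_step + goal-driven is_success.

    goal_fn is the compiled BDDL goal: a single (sim) -> bool callable.
    """

    def __init__(self, goal_fn):
        self.goal_fn = goal_fn

    def on_step(self, sim):
        # LIBERO has no dense reward: every step is reward=0, not done.
        return {"reward": 0.0, "done": False}

    def is_success(self, sim) -> bool:
        # Success is decided solely by walking the compiled goal predicate.
        return self.goal_fn(sim)
```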

Test coverage (91 new tests, all passing)
- BDDL parser: tokenizer (comments/quotes/parens), s-expr parser,
  top-level (define ...) structure, :language/:init/:goal/:objects
  extraction, boolean combinators with short-circuit behaviour,
  rejection of unknown predicates/wrong arities, round-trip on a
  curated 5-task subset covering every predicate family.
- LiberoAdapter: construction (from_file/from_text/__init__),
  supported_robots=['panda'], instruction surfacing, is_success
  positive+negative+compound+negated, on_episode_start scene-load
  ordering + compat error on non-Panda, init jitter with deterministic
  seed, sparse on_step semantics, PolicyRunner.evaluate(spec=) +
  evaluate_benchmark integration, unknown-task → structured error.
- load_libero_suite: suite name normalisation (kebab/snake/case-
  insensitive), bulk registration with bddl_dir override, scene path
  resolution (existing-file + missing-file), malformed-BDDL skip,
  max_steps/init_jitter forwarding, unknown suite + nonexistent dir
  errors, empty directory is a no-op.
- End-to-end via MuJoCo: LiberoAdapter registered + evaluate_benchmark
  dispatched through _dispatch_action → PolicyRunner → is_success
  round-trips without crashing; non-Panda robot surfaces structured
  compat error.
- New predicate tests: body_on / body_inside / body_upright / grasped
  positive + negative + edge cases (missing body, no gripper contact,
  invalid tolerance).

Architecture notes
- **Panda-only by design.** LIBERO's scene MJCFs include Panda geometry
  and BDDL predicates reference Panda gripper body names
  (robot0_gripper_*). Retargeting to SO-100 et al. would require
  rewriting every BDDL predicate against different body names — out of
  scope.
- **No libero dep for the adapter or parser.** You can use
  LiberoAdapter.from_text(...) with your own BDDL strings and
  MJCFs; the libero package is only imported by load_libero_suite
  when discovering task files from the installed package.
- **Stacked on PR strands-labs#129.** The BenchmarkProtocol framework (strands-labs#107) must
  merge first. Once strands-labs#129 lands, this PR auto-rebases onto main.

Out of scope (per strands-labs#110)
- Dense-reward variants (LIBERO is sparse by design).
- Language-conditioning evaluation tooling (instruction passes through
  to the policy as-is via LiberoAdapter.instruction).
- Hosting LIBERO's MJCF assets (depend on upstream).

Closes strands-labs#110.
@yinsong1986
Author

Requesting review from @sundargthb, @awsarron, @cagataycali 🙏
