
feat: LIBERO benchmark adapter + BDDL parser (#110) #130

Open
yinsong1986 wants to merge 2 commits into strands-labs:main from yinsong1986:feat/benchmark-libero

Conversation

@yinsong1986

Summary

Implements #110 as a follow-up to #129 (the BenchmarkProtocol framework from #107). Ships LiberoAdapter so the ~130 LIBERO tasks become runnable on the MuJoCo backend through the standard evaluate_benchmark surface:

sim.evaluate_benchmark(
    benchmark_name="libero-spatial-pick_up_the_red_cube",
    policy_provider="lerobot_local",
    n_episodes=50,
    seed=42,
)

Stacked on #129. #129 must merge first. The diff below will include #129's commits until then; once #129 merges this PR auto-rebases onto main.

What's in scope (from #110)

  1. BDDL parser (strands_robots/benchmarks/libero/bddl_parser.py) — pure-Python tokenizer + s-expression parser + typed AST + goal compiler. Handles define, :domain, :objects, :init, :goal, :language, boolean combinators (and / or / not), and the closed predicate vocabulary LIBERO uses: on, near, inside, open, closed, grasped, upright. Unknown predicates rejected explicitly with the valid list. No eval(), no exec() — safe for untrusted input.
  2. Scene loading — LiberoAdapter.on_episode_start loads the declared scene MJCF (if provided) before the base compatibility check, so the check sees the scene's Panda robot rather than reporting "sim empty → load default".
  3. LiberoAdapter class (strands_robots/benchmarks/libero/adapter.py) — Panda-only BenchmarkProtocol subclass. Compiles BDDL goal → single (sim) -> bool callable. Sparse on_step (LIBERO has no dense reward). Applies per-episode RNG-seeded ±jitter to init-subject bodies.
  4. Registry bootstrap — load_libero_suite(suite_name) bulk-registers every BDDL task in a suite under libero-<suite>-<task> keys. Accepts bddl_dir= override so the adapter is fully usable without the libero pip package — you can ship your own BDDL files. LIBERO is only imported when auto-discovering task files from the installed package.
  5. Packaging — new optional extra strands-robots[benchmark-libero] with libero>=0.1.0,<1.0.0. Not added to [all] (LIBERO is a heavy opt-in dep). Core tests pass without it installed.
  6. Tests — 91 new tests (see below) covering parser, adapter lifecycle, suite loader, and end-to-end MuJoCo dispatch.
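The tokenize → s-expression parse → vocabulary-check pipeline from item 1 can be sketched as follows. This is a minimal illustration, not the shipped parser: function names (`tokenize`, `parse_sexpr`, `check_goal`) are hypothetical, and quote/arity handling from the real bddl_parser.py is omitted.

```python
import re

def tokenize(text: str) -> list[str]:
    # Strip ;-comments, then emit parens and atoms as standalone tokens.
    text = re.sub(r";[^\n]*", "", text)
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse_sexpr(tokens: list[str], pos: int = 0):
    """Recursively build nested Python lists from a token stream."""
    if tokens[pos] != "(":
        return tokens[pos], pos + 1  # atom
    pos += 1
    node = []
    while tokens[pos] != ")":
        child, pos = parse_sexpr(tokens, pos)
        node.append(child)
    return node, pos + 1

# Closed predicate vocabulary — anything else is rejected with the valid list.
KNOWN_PREDICATES = {"on", "near", "inside", "open", "closed", "grasped", "upright"}

def check_goal(node) -> None:
    """Walk a goal AST; combinators recurse, leaves must be known predicates."""
    head = node[0]
    if head in ("and", "or", "not"):
        for child in node[1:]:
            check_goal(child)
    elif head not in KNOWN_PREDICATES:
        raise ValueError(f"unknown predicate {head!r}; valid: {sorted(KNOWN_PREDICATES)}")
```

Because the input is reduced to a closed token set and a closed predicate vocabulary, nothing in the text is ever executed — the safety property the PR claims for untrusted BDDL.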

New generic predicates

BDDL requires 4 predicates not yet in the core library; they're generic enough to live there rather than being LIBERO-specific:

| Predicate | Meaning |
| --- | --- |
| body_on(body_a, body_b, z_offset=0.02, xy_tol=0.15) | body_a is above body_b and horizontally close |
| body_inside(body, container, xy_tol=0.15, z_tol=0.15) | body is within the container's AABB |
| body_upright(body, tol=0.15) | body's local +Z axis within tol of world +Z |
| grasped(body, gripper_prefix) | body in contact with any geom matching gripper_prefix* |

All use only SimEngine.get_body_state / get_contacts so they remain backend-agnostic.
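To make the backend-agnostic claim concrete, here is a sketch of what body_on looks like when written against only the body-state API. The `.position` attribute shape is an assumption for illustration; the real predicate lives in strands_robots/simulation/predicates.py.

```python
import math

def body_on(sim, body_a: str, body_b: str,
            z_offset: float = 0.02, xy_tol: float = 0.15) -> bool:
    """True when body_a sits above body_b and is horizontally close.

    Assumes sim.get_body_state(name) returns an object exposing a
    .position tuple (x, y, z) — a hypothetical shape for this sketch.
    """
    pa = sim.get_body_state(body_a).position
    pb = sim.get_body_state(body_b).position
    above = pa[2] >= pb[2] + z_offset                      # vertical check
    close = math.hypot(pa[0] - pb[0], pa[1] - pb[1]) <= xy_tol  # horizontal check
    return above and close
```

Nothing here touches MuJoCo directly, which is why the same predicate can evaluate against any SimEngine backend (or a fake sim in tests).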

Architecture notes

  • Panda-only by design. LIBERO's scene MJCFs <include> Panda geometry and BDDL predicates reference Panda gripper body names (robot0_gripper_*). Retargeting to SO-100 et al. would require rewriting every BDDL predicate against different body names — out of scope.
  • BDDL parser never runs Python. Closed token set + closed predicate vocabulary. LLM-authored or agent-authored BDDL files are safe to parse.
  • Malformed BDDL is logged, not fatal. load_libero_suite keeps loading the rest of the suite if one file is bad — so one upstream drift doesn't brick a whole suite.
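The "log, don't die" suite-loading behaviour amounts to a skip-on-error loop. A minimal sketch (hypothetical helper name; `parse` and `register` stand in for the real parser and registry calls):

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)

def load_suite_files(bddl_dir: str, parse, register) -> int:
    """Register every parseable BDDL file; log and skip malformed ones."""
    loaded = 0
    for path in sorted(Path(bddl_dir).glob("*.bddl")):
        try:
            problem = parse(path.read_text())
        except ValueError as exc:
            # One bad upstream file must not brick the whole suite.
            log.warning("skipping malformed BDDL %s: %s", path.name, exc)
            continue
        register(problem)
        loaded += 1
    return loaded
```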

Tests (91 new, all passing)

  • tests/benchmarks/libero/test_bddl_parser.py (38) — tokenizer / parser / vocabulary / combinators / round-trip on one example per predicate family.
  • tests/benchmarks/libero/test_libero_adapter.py (19) — construction, Panda-only compat, is_success cases, scene-load ordering, init jitter determinism, PolicyRunner / evaluate_benchmark integration, unknown-task structured error.
  • tests/benchmarks/libero/test_libero_suite.py (15) — suite name normalisation, bulk registration, scene resolution, malformed-BDDL skip, forwarding of max_steps / init_jitter, error cases.
  • tests/benchmarks/libero/test_libero_e2e.py (2) — end-to-end MuJoCo dispatch path.
  • tests/simulation/test_benchmark_predicates.py (+17) — new predicate tests (body_on / body_inside / body_upright / grasped).

Verification

hatch run format                  # ruff check --fix + format
hatch run lint                    # all clean
hatch run pytest tests/           # 1438 passed, 31 skipped (env-dependent GL skips)

Without LIBERO installed:

python -c "
import strands_robots
from strands_robots.benchmarks.libero import LiberoAdapter, parse_bddl
a = LiberoAdapter.from_text('(define (problem t) (:goal (grasped cube)))')
print(a.problem.name, a.supported_robots)  # → t ['panda']
"
# → ok
# load_libero_suite without bddl_dir= raises a clean ImportError naming [benchmark-libero].

Example: register + evaluate a LIBERO task

from strands_robots.simulation import Simulation
from strands_robots.benchmarks.libero import load_libero_suite

sim = Simulation(tool_name="libero_sim", mesh=False)
sim.create_world()

# Bulk-register every task in libero_spatial (requires libero pip pkg OR bddl_dir=)
load_libero_suite("libero_spatial", scene_dir="/path/to/scenes")

result = sim.evaluate_benchmark(
    benchmark_name="libero-spatial-pick_up_the_red_cube",
    policy_provider="lerobot_local",
    policy_config={"pretrained_name_or_path": "lerobot/smolvla"},
    n_episodes=20,
    seed=42,
)

Out of scope (per #110)

  • Dense-reward variants of LIBERO tasks (LIBERO is sparse-success by design).
  • Language-conditioning evaluation tooling (adapter.instruction surfaces the BDDL :language string; the policy consumes it as-is).
  • Hosting LIBERO's MJCF assets (depend on upstream).

Closes #110.


Reviewers from #129 — would appreciate your eyes on this one too: @sundargthb @awsarron @cagataycali

Introduce BenchmarkProtocol ABC + string-keyed registry so every standard
benchmark (LIBERO, Meta-World, RoboSuite, ManiSkill, user-authored tasks)
can plug into the SimEngine eval loop without committing the core to any
single benchmark's conventions. Adapters stay thin and ship in follow-up
extras (strands-labs#108 Meta-World, strands-labs#109 RoboSuite, strands-labs#110 LIBERO).

What lands here
- strands_robots/simulation/benchmark.py: BenchmarkProtocol ABC + StepInfo
  dataclass + thread-safe string-keyed registry. Robot compatibility is
  first-class metadata; default on_episode_start auto-loads default_robot
  and validates loaded robots against supported_robots.
- strands_robots/simulation/predicates.py: named-predicate library
  (body_above_z, joint_above, distance_less_than, inside_region,
  contact_between, contact_any, distance_neg, joint_progress, constant, ...).
  Closed registry, no eval() - safe for untrusted / LLM-authored specs.
- strands_robots/simulation/benchmark_spec.py: DeclarativeBenchmark +
  register_benchmark_from_file loading YAML/JSON specs (JSON via stdlib,
  YAML gated behind require_optional('pyyaml')).
- PolicyRunner.evaluate now accepts spec=BenchmarkProtocol and seed=int
  alongside the legacy success_fn path. Spec path adds cumulative reward,
  per-episode seeded RNG, is_failure early termination, and structured
  compatibility errors. Legacy path unchanged for backcompat.
- SimEngine base adds evaluate_benchmark / list_benchmarks /
  register_benchmark_from_file facades, auto-dispatched via the existing
  _dispatch_action path. MuJoCo tool_spec.json gains three action enum
  entries plus benchmark_name / spec_path properties.
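The thread-safe string-keyed registry described above follows a standard lock-guarded dict pattern. A sketch under assumed names (the class and method names here are illustrative, not the shipped API):

```python
import threading

class BenchmarkRegistry:
    """String-keyed benchmark registry guarded by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._benchmarks: dict[str, object] = {}

    def register(self, name: str, benchmark) -> None:
        with self._lock:
            if name in self._benchmarks:
                raise ValueError(f"benchmark {name!r} already registered")
            self._benchmarks[name] = benchmark

    def get(self, name: str):
        with self._lock:
            try:
                return self._benchmarks[name]
            except KeyError:
                # Structured error: name the unknown key and the known ones.
                raise KeyError(
                    f"unknown benchmark {name!r}; known: {sorted(self._benchmarks)}"
                ) from None

    def list_benchmarks(self) -> list[str]:
        with self._lock:
            return sorted(self._benchmarks)
```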

Test coverage (110 new tests, all passing)
- Protocol contract, registry ops, thread-safety, compatibility errors
- Every predicate against lightweight fake sims + the reward math
- DSL validation (good/bad specs), file loading (JSON/YAML), sandboxing
- PolicyRunner.evaluate(spec=...) for cumulative reward, seed
  reproducibility, is_success/is_failure/done terminations, legacy
  backcompat, evaluate_benchmark / list_benchmarks / register_... facades
- MuJoCo dispatch path for all three new actions

Out of scope (tracked as follow-ups)
- Meta-World / RoboSuite / LIBERO adapters (strands-labs#108, strands-labs#109, strands-labs#110)
- BDDL parser, dense-reward curriculum tooling, RL training harness
- Replacing the existing success_fn path (kept working)

Refs strands-labs#107.

Follow-up to strands-labs#107 / strands-labs#129 (BenchmarkProtocol). Ships LiberoAdapter so the
~130 LIBERO tasks become runnable on the MuJoCo backend through:

    sim.evaluate_benchmark(benchmark_name="libero-spatial-pick-cube",
                           policy_provider="mock", n_episodes=50, seed=42)

What lands
- strands_robots/benchmarks/libero/bddl_parser.py: pure-Python BDDL
  parser (tokenizer + s-expr parser + typed AST + goal compiler). Closed
  predicate vocabulary (on/near/inside/open/closed/grasped/upright +
  and/or/not); unknown predicates rejected explicitly with the valid
  list in the error message. No eval(), no exec() — safe for untrusted
  input.
- strands_robots/benchmarks/libero/adapter.py: LiberoAdapter
  (BenchmarkProtocol, Panda-only). Loads scene before super().
  on_episode_start so the robot-compat check sees the scene's Panda,
  then applies per-episode RNG-seeded xy jitter to init-subject bodies.
  Sparse on_step (reward=0, done=False); is_success walks the compiled
  predicate tree.
- strands_robots/benchmarks/libero/suite.py: load_libero_suite() —
  bulk-registers every task in a LIBERO suite under
  ``libero-<suite>-<task>`` keys. Accepts bddl_dir= override so the
  adapter is usable without the libero pip package (LIBERO is only
  needed when discovering task files from the installed package).
  Malformed BDDL files are logged and skipped, not fatal.
- strands_robots/simulation/predicates.py: 4 new generic predicates
  (body_on, body_inside, body_upright, grasped) required by the BDDL
  vocabulary. They use only SimEngine.get_body_state / get_contacts so
  they remain backend-agnostic.
- pyproject.toml: new [benchmark-libero] extra (libero>=0.1.0,<1.0.0).
  NOT added to [all] — LIBERO is a heavy optional dep. Core tests pass
  without it installed; load_libero_suite without bddl_dir= fails with
  a clean ImportError naming the extra.
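The sparse episode semantics described above (reward=0, done=False every step; success decided only by the compiled goal) reduce to a very small surface. A hypothetical sketch — not the real LiberoAdapter, and the dict-shaped step result stands in for the actual StepInfo:

```python
class SparseEpisodeSketch:
    """Illustrates sparse on_step + goal-driven is_success.

    goal_fn is the compiled BDDL goal: a single (sim) -> bool callable.
    """

    def __init__(self, goal_fn):
        self.goal_fn = goal_fn

    def on_step(self, sim):
        # LIBERO has no dense reward: every step is reward=0, not done.
        return {"reward": 0.0, "done": False}

    def is_success(self, sim) -> bool:
        # Success is decided solely by walking the compiled goal predicate.
        return self.goal_fn(sim)
```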

Test coverage (91 new tests, all passing)
- BDDL parser: tokenizer (comments/quotes/parens), s-expr parser,
  top-level (define ...) structure, :language/:init/:goal/:objects
  extraction, boolean combinators with short-circuit behaviour,
  rejection of unknown predicates/wrong arities, round-trip on a
  curated 5-task subset covering every predicate family.
- LiberoAdapter: construction (from_file/from_text/__init__),
  supported_robots=['panda'], instruction surfacing, is_success
  positive+negative+compound+negated, on_episode_start scene-load
  ordering + compat error on non-Panda, init jitter with deterministic
  seed, sparse on_step semantics, PolicyRunner.evaluate(spec=) +
  evaluate_benchmark integration, unknown-task → structured error.
- load_libero_suite: suite name normalisation (kebab/snake/case-
  insensitive), bulk registration with bddl_dir override, scene path
  resolution (existing-file + missing-file), malformed-BDDL skip,
  max_steps/init_jitter forwarding, unknown suite + nonexistent dir
  errors, empty directory is a no-op.
- End-to-end via MuJoCo: LiberoAdapter registered + evaluate_benchmark
  dispatched through _dispatch_action → PolicyRunner → is_success
  round-trips without crashing; non-Panda robot surfaces structured
  compat error.
- New predicate tests: body_on / body_inside / body_upright / grasped
  positive + negative + edge cases (missing body, no gripper contact,
  invalid tolerance).

Architecture notes
- **Panda-only by design.** LIBERO's scene MJCFs include Panda geometry
  and BDDL predicates reference Panda gripper body names
  (robot0_gripper_*). Retargeting to SO-100 et al. would require
  rewriting every BDDL predicate against different body names — out of
  scope.
- **No libero dep for the adapter or parser.** You can use
  LiberoAdapter.from_text(...) with your own BDDL strings and
  MJCFs; the libero package is only imported by load_libero_suite
  when discovering task files from the installed package.
- **Stacked on PR strands-labs#129.** The BenchmarkProtocol framework (strands-labs#107) must
  merge first. Once strands-labs#129 lands, this PR auto-rebases onto main.

Out of scope (per strands-labs#110)
- Dense-reward variants (LIBERO is sparse by design).
- Language-conditioning evaluation tooling (instruction passes through
  to the policy as-is via LiberoAdapter.instruction).
- Hosting LIBERO's MJCF assets (depend on upstream).

Closes strands-labs#110.
@yinsong1986
Author

Requesting review from @sundargthb, @awsarron, @cagataycali 🙏
