You are working in cube-standard, the protocol and base classes that benchmarks and
harnesses implement. This file is your map; it is deliberately short. Read the relevant
spec in openspec/specs/ before modifying any layer.
CUBE Standard defines the contract: how benchmarks expose tasks, how tools expose actions, how resources are provisioned. It does NOT run agents, record trajectories, or coordinate experiments — that lives in cube-harness.
| Layer | Module | Spec | What it does |
|---|---|---|---|
| 1. Core types | `cube.core` | core/spec.md | Action, Observation, Content, EnvironmentOutput, TypedBaseModel |
| 2. Tool | `cube.tool` | tool/spec.md | Tool, AsyncTool, @tool_action, ToolConfig, Toolbox |
| 3. Task | `cube.task` | task/spec.md | Task, TaskMetadata, TaskConfig, gym-style reset/step/evaluate |
| 4. Benchmark | `cube.benchmark` | benchmark/spec.md | Benchmark, BenchmarkMetadata, class-level registry |
| 5. Testing | `cube.testing` | testing/spec.md | run_debug_suite, assert_debug_tasks_reward_one |
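The gym-style reset/step/evaluate contract in the Task row can be sketched with a toy task. This is a hypothetical illustration, not cube's actual API — the real signatures (argument types, Observation objects, how reward is reported) live in task/spec.md:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the gym-style Task contract named in the table.
# Real Task subclasses return Observation/EnvironmentOutput objects; the
# plain strings and float reward here are simplifying assumptions.

@dataclass
class CounterTask:
    """Toy task: the agent must increment a counter to a target value."""
    target: int = 3
    count: int = field(default=0, init=False)

    def reset(self) -> str:
        """Reinitialize state and return the initial observation."""
        self.count = 0
        return f"counter=0, reach {self.target}"

    def step(self, action: str) -> str:
        """Apply one action and return the next observation."""
        if action == "increment":
            self.count += 1
        return f"counter={self.count}"

    def evaluate(self) -> float:
        """Score the episode: 1.0 on success, 0.0 otherwise."""
        return 1.0 if self.count == self.target else 0.0


task = CounterTask()
obs = task.reset()
for _ in range(3):
    obs = task.step("increment")
print(task.evaluate())  # 1.0 once the counter hits the target
```

The point of the shape: the harness only ever calls `reset`, `step`, and `evaluate`; everything else is task-internal.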
Cross-cutting:
- Resource lifecycle — resource/spec.md (L1 provisioned images, L2 benchmark-scoped, L3 task-scoped)
- Container — container/spec.md (single-container abstraction for tasks)
- Server — server/spec.md (JSON-RPC 2.0, MCP-compatible)
- CLI — cli/spec.md (`cube init`, `cube list`, `cube test`, `cube registry add`)
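The server layer's wire format is JSON-RPC 2.0, whose envelope is fixed by that spec. A minimal round trip looks like the following — note that the method name `task.reset` is invented for illustration; cube's actual method names are defined in server/spec.md:

```python
import json

# JSON-RPC 2.0 envelope: "jsonrpc", "method", "params", "id" on requests;
# "result" (or "error") plus the echoed "id" on responses. These field
# names come from the JSON-RPC 2.0 spec; the method name is hypothetical.

request = {
    "jsonrpc": "2.0",
    "method": "task.reset",              # hypothetical method name
    "params": {"task_id": "counter-1"},
    "id": 1,
}
wire = json.dumps(request)               # what actually crosses the socket

# A conforming server echoes the request id and returns a result object.
response = {
    "jsonrpc": "2.0",
    "result": {"observation": "counter=0"},
    "id": json.loads(wire)["id"],
}
assert response["id"] == request["id"]   # responses are matched by id
```

Matching responses to requests by `id` is what lets a single connection multiplex concurrent calls.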
- Read the spec first. Before touching any layer, read its spec in openspec/specs/. Specs are the authoritative design intent — but they can be stale or wrong; flag discrepancies rather than silently working around them.
- Fix in the right place. A quick local experiment to understand a problem is fine. But the committed fix must address the root cause in the correct layer — not a workaround scoped to a single call site or context.
- Understand before fixing. Many bad fixes come from acting too fast. Make sure you understand the broader design before proposing a change. A fix that misses the bigger picture is worse than no fix.
- Lean diffs. Make the minimal change that solves the problem. Avoid verbose additions, unnecessary abstractions, and duplicated logic that already exists elsewhere. If existing code can be reused or consolidated, do it. A hard-to-review diff is a liability.
- Think long-term. Every change should age well. Ask whether today's shortcut becomes tomorrow's debt — and whether the design could evolve cleanly if requirements change.
Sign your commits. Every commit needs a Signed-off-by line (`git commit -s`). DCO is enforced by CI — unsigned commits will be blocked.
PRs are reviewed with /code-review (plugin docs), which audits changes against these guidelines. Write PRs as if a reviewer will check each principle above against the diff.
- Find the relevant spec — which layer? Start there.
- Read the spec's "Invariants" and "Gotchas" sections — these are the traps.
- Check for an active change in openspec/changes/ — someone may already be working on this.
- For substantive changes to a spec's contract, write a delta spec in openspec/changes/<name>/deltas.md first (ADDED / MODIFIED / REMOVED requirements) before coding.
- For completed changes, move the folder to openspec/changes/archive/YYYY-MM-DD-<name>/ and apply deltas to the main spec.
src/cube/ Core framework
├── core.py tool.py task.py Layers 1–3
├── benchmark.py Layer 4
├── testing.py Debug suite
├── server.py JSON-RPC / FastAPI
├── cli.py `cube` command
├── resource.py L1/L2/L3 resource lifecycle
├── container.py Single-container abstraction
├── backends/ Docker, Modal, Daytona, Toolkit backends
├── tools/ Reference tool stubs (browser)
├── resources/ BrowserSession, ChatSession protocols
├── integrations/nemogym.py NemoGym interop
└── _template/ Scaffold used by `cube init`
cube-resources/ Optional resource packages (playwright, chat, infra-*)
cube-tools/ Optional tool packages (browser, computer, chat)
examples/ counter-cube (reference), toy_benchmark
tests/ Unit + integration + backends
- Serializable configs subclass `TypedBaseModel` — polymorphic via an injected `_type` field.
- ClassVar registries on `BenchmarkConfig`: `benchmark_metadata`, `task_metadata`, `task_config_class`, `benchmark_class` are class-level, not constructor params. Auto-loaded from files next to the module (metadata only).
- Config → Factory pattern: `XyzConfig.make()` returns a live `Xyz`. The config is serialized across process boundaries; the live object never is. `TaskConfig` is the serialization boundary — workers get a `TaskConfig` and call `.make()` locally. Task objects never cross processes.
- Credentials are resolved from env vars at runtime. Never fields on `InfraConfig` or `ContainerBackend` (they would be serialized).
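The Config → Factory convention above can be sketched with plain dataclasses standing in for `TypedBaseModel`. Everything here is illustrative — the `_type` injection, `make()` signature, and the `CUBE_API_KEY` env var name are assumptions, not cube's real implementation:

```python
import json
import os
from dataclasses import dataclass

@dataclass
class CounterTaskConfig:
    """Serializable: the only thing that crosses a process boundary."""
    target: int

    def to_json(self) -> str:
        # Sketch of the injected "_type" discriminator used for
        # polymorphic deserialization (TypedBaseModel does this for real).
        return json.dumps({"_type": "CounterTaskConfig", "target": self.target})

    @classmethod
    def from_json(cls, raw: str) -> "CounterTaskConfig":
        data = json.loads(raw)
        assert data.pop("_type") == "CounterTaskConfig"
        return cls(**data)

    def make(self) -> "CounterTask":
        """Factory: build the live object locally, after deserialization."""
        # Credentials come from the environment at make() time — never
        # from serialized config fields. Env var name is hypothetical.
        api_key = os.environ.get("CUBE_API_KEY", "")
        return CounterTask(target=self.target, api_key=api_key)


class CounterTask:
    """Live object: never serialized, never crosses processes."""
    def __init__(self, target: int, api_key: str):
        self.target = target
        self._api_key = api_key


# Orchestrator serializes the config; a worker deserializes and makes.
wire = CounterTaskConfig(target=3).to_json()
task = CounterTaskConfig.from_json(wire).make()
print(task.target)  # 3
```

The design payoff: secrets and unpicklable state (sockets, containers) live only on the live object, so configs stay safe to log, store, and ship between machines.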
Active proposals: openspec/changes/. Archived: openspec/changes/archive/.
`make lint` and `make test`. For the benchmark debug suite: `cube test <benchmark-name>`.
- cube-harness — runs experiments, agents, trajectories, XRay viewer
- cube-registry — metadata registry for published benchmarks (`cube registry add`)