Helm

The local operations layer for long-running AI agents.

Coding, ops, research, automation — any agent that runs for hours on the same workspace.
Profiles before commands · Checkpoints before risky work · Durable history after the chat is gone.

Quickstart

pip install helm-agent-ops
helm init --path ~/.helm/workspace
export HELM_WORKSPACE=~/.helm/workspace

Run your first inspection under a declared risk profile:

helm profile run inspect_local --task-name "first look" -- git status --short
helm status --brief
helm dashboard

The first command produces a guarded execution record. The second shows what just happened in plain English. The third lays out the workspace state on one page.

No PyPI? Use the bootstrap installer: curl -fsSL https://raw.githubusercontent.com/JDeun/Helm/main/install.sh | bash

Why Helm

Long-running AI agents drift. They forget prior decisions, execute risky actions before you can stop them, and leave behind a chat log nobody can audit a week later — regardless of whether the agent is editing code, running ops, organizing notes, browsing sites, or chaining tool calls.

Helm is a thin, file-backed operations layer that sits around your existing agent runtime. It does not replace your agent. It makes the agent's work boundable, recoverable, and reviewable.

The model proposes actions; the harness validates, authorizes, executes, records, and returns observations. Safety and completion claims should come from execution evidence, not from prompt advice or a compacted chat transcript.
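
The propose, validate, execute, record loop described above can be sketched in a few lines of Python. This is an illustration only, not Helm's code: `run_step`, `ALLOWED`, and the record fields are hypothetical names, and the allowlist stands in for a real validation and authorization policy.

```python
import subprocess
import sys
import time

# Hypothetical allowlist standing in for a real validation/authorization policy.
ALLOWED = {sys.executable}

def run_step(argv: list, ledger: list) -> dict:
    """Validate a proposed command, execute it if allowed, and record the evidence."""
    record = {"command": list(argv), "started_at": time.time()}
    if not argv or argv[0] not in ALLOWED:
        # Blocked commands are never executed, but the decision is still recorded.
        record.update(decision="blocked", exit_code=None, stdout="")
    else:
        done = subprocess.run(argv, capture_output=True, text=True)
        record.update(decision="allowed", exit_code=done.returncode,
                      stdout=done.stdout.strip())
    ledger.append(record)  # evidence exists whether or not the command ran
    return record          # the observation handed back to the model
```

A completion claim can then be checked against `record["exit_code"]` and `record["stdout"]` rather than against what the model says happened.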

| Without Helm | With Helm |
| --- | --- |
| Risky commands run as soon as the agent decides | Commands run under a declared execution profile with a guard check |
| Multi-step or multi-file changes leave you guessing what happened | Checkpoint created before the work; visible rollback point |
| "What did the agent do yesterday?" → scroll the chat | Local task ledger, command log, dashboard, markdown report |
| Context lives in the chat window | File-backed memory + ranked retrieval rehydrates the next session |
| Skill rules live in prompts | `SKILL.md` + `contract.json` enforce policy at run time |

If your agent only runs one-off demos, you do not need Helm. If you run it for hours on the same workspace — coding, ops, knowledge capture, or any mix — you do.

What Helm does

🛡️ Guard before execution

Execution profiles declare blast radius (inspect_local, workspace_edit, risky_edit, service_ops, remote_handoff)
Command guard blocks destructive or out-of-profile actions before they run
Tool-group grants restrict which capabilities each profile exposes
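
To make the idea of a profile-scoped guard concrete, here is a minimal sketch. The profile names come from the list above; everything else (`PROFILE_GRANTS`, `TOOL_GROUPS`, `GuardDecision`, `guard`) is hypothetical and deliberately crude, and it does not reflect how Helm actually classifies commands.

```python
from dataclasses import dataclass

# Hypothetical grants: which tool groups each profile exposes.
PROFILE_GRANTS = {
    "inspect_local": {"read"},
    "workspace_edit": {"read", "write"},
}

# Hypothetical mapping from an executable to the tool group it needs.
# A real guard would look at subcommands and arguments, not just argv[0].
TOOL_GROUPS = {"git": "read", "ls": "read", "ruff": "write", "rm": "destructive"}

@dataclass(frozen=True)
class GuardDecision:
    allowed: bool
    reason: str

def guard(profile: str, argv: list[str]) -> GuardDecision:
    """Decide before execution whether argv fits inside the declared profile."""
    grants = PROFILE_GRANTS.get(profile)
    if grants is None:
        return GuardDecision(False, f"unknown profile: {profile}")
    if not argv:
        return GuardDecision(False, "empty command")
    group = TOOL_GROUPS.get(argv[0])
    if group is None:
        return GuardDecision(False, f"no tool group known for {argv[0]}")
    if group not in grants:
        return GuardDecision(False, f"{group} is outside profile {profile}")
    return GuardDecision(True, f"{group} granted by {profile}")
```

The sketch denies by default: an unknown profile or an unmapped executable is blocked rather than waved through.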

💾 Recover after the fact

Checkpoints before broad edits give a clear rollback target
Task ledger & command log keep durable history independent of the chat, including tool grants and experience attribution
Browser & profile gates can pause runaway work and require evidence of cleanup
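
The checkpoint-then-rollback pattern can be illustrated with a plain directory copy. This is a sketch under simple assumptions (small workspace, full copy per checkpoint); `create_checkpoint` and `rollback` are hypothetical helpers and say nothing about how `helm checkpoint create` stores its snapshots.

```python
import shutil
import tempfile
from pathlib import Path

def create_checkpoint(workdir: Path, store: Path, label: str) -> Path:
    """Copy workdir into store/label and return the snapshot path."""
    snapshot = store / label
    shutil.copytree(workdir, snapshot)
    return snapshot

def rollback(workdir: Path, snapshot: Path) -> None:
    """Replace the contents of workdir with the snapshot."""
    shutil.rmtree(workdir)
    shutil.copytree(snapshot, workdir)

# Usage: snapshot, make a risky change, then roll back.
with tempfile.TemporaryDirectory() as tmp:
    work, store = Path(tmp) / "work", Path(tmp) / "checkpoints"
    work.mkdir()
    store.mkdir()
    (work / "notes.txt").write_text("original")
    snap = create_checkpoint(work, store, "before-risky-work")
    (work / "notes.txt").write_text("broken edit")
    rollback(work, snap)
    restored = (work / "notes.txt").read_text()
```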

🧭 Operate over time

File-backed memory with ranked retrieval (helm context --explain-ranking)
Skill lifecycle governs how skill rules are promoted or decay
Adaptive harness maps failure signatures to policy transitions
Shadow reports & promotion queues surface when guarded features or skill candidates are ready to enforce

A three-minute demo

helm profile run inspect_local --task-name "inspect current repository" -- git status --short
helm checkpoint create --label before-risky-work --include "$HELM_WORKSPACE"
helm report --format markdown
helm dashboard

Each command leaves a structured record on disk: task ledger, command log, checkpoint record, dashboard summary. None of it requires the agent to remember anything.

Workflows

Inspect the workspace

helm doctor
helm status --brief
helm dashboard

Run a command under a declared profile

helm profile run inspect_local --task-name "inspect repository state" -- git status --short
helm profile run workspace_edit --task-name "tighten typing in api/" -- ruff check api/

Adopt existing systems as context sources

helm survey
helm onboard --use-detected --dry-run
helm onboard --use-detected

Check rollback and recent state

helm checkpoint-recommend
helm checkpoint list
helm task list --status running
helm task doctor
helm report --format markdown

Query durable context with inspectable ranking

helm context --mode decisions --explain-ranking --json
helm context --mode timeline --since 2026-05-01
helm context --mode entity --entity project_helm
helm context --mode reflect-candidates
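
"Inspectable ranking" means each result carries the components of its score, not just the final number. The sketch below shows one way that could look. The scoring formula, the weights, and the note fields are assumptions made for illustration; they are not the ranking that `helm context --explain-ranking` uses.

```python
from datetime import date

def rank(notes, query, today):
    """Score each note and keep the per-component breakdown for inspection."""
    terms = set(query.lower().split())
    ranked = []
    for note in notes:
        words = set(note["text"].lower().split())
        overlap = len(terms & words) / len(terms) if terms else 0.0
        age_days = (today - note["date"]).days
        recency = 1.0 / (1.0 + age_days / 30)  # hypothetical 30-day decay scale
        score = 0.7 * overlap + 0.3 * recency  # hypothetical weights
        ranked.append({
            "id": note["id"],
            "score": round(score, 3),
            "explain": {"overlap": round(overlap, 3), "recency": round(recency, 3)},
        })
    return sorted(ranked, key=lambda r: r["score"], reverse=True)

notes = [
    {"id": "n1", "text": "decided to keep JSONL as source of truth", "date": date(2026, 5, 1)},
    {"id": "n2", "text": "lunch order", "date": date(2026, 5, 30)},
]
results = rank(notes, "jsonl source", today=date(2026, 5, 31))
```

Here the older note still ranks first because it matches both query terms, and the `explain` field shows why.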

Run a privacy boundary preflight

helm privacy scan --text "Contact alice@example.com" --json
helm privacy tokenize --scope task-123 --text "Contact alice@example.com"
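
As a rough picture of what a scan and a scoped tokenization step involve, consider this sketch. It handles only email addresses, and `EMAIL`, `scan`, `tokenize`, and the token format are all hypothetical; Helm's detectors and output shape may differ.

```python
import hashlib
import re

# Simple email pattern for illustration; real detectors cover more categories.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scan(text: str) -> list[dict]:
    """Report where sensitive spans occur without echoing their values."""
    return [{"kind": "email", "start": m.start(), "end": m.end()}
            for m in EMAIL.finditer(text)]

def tokenize(text: str, scope: str) -> str:
    """Replace each email with a token that is stable within one scope."""
    def replace(match: re.Match) -> str:
        digest = hashlib.sha256(f"{scope}:{match.group(0)}".encode()).hexdigest()[:8]
        return f"<EMAIL_{digest}>"
    return EMAIL.sub(replace, text)
```

Because the scope is mixed into the hash, the same address maps to the same token inside one task and to a different token in another. An unkeyed hash like this can be reversed by guessing common values, so a production scheme would likely need a secret key.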

Review stale skill claims

helm skill-lifecycle negative-claims --persist
helm skill-lifecycle revalidation-due
helm skill-lifecycle revalidate-claim \
  --skill old-skill \
  --claim-id sha256:abc123 \
  --status resolved \
  --note "command now exists"

Review run contracts and improvement candidates

helm run-contract --json
helm capability-diff --json
helm skill-promotion digest --json
helm shadow-report --since 14 --format md --with-recommendations

Probe model health

helm health state --json
helm health select --json

Every command also accepts --path /custom/workspace if you do not want to use $HELM_WORKSPACE. The demo workspace at examples/demo-workspace is safe to point at.
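
If you wrap Helm in your own scripts, the same flag-or-environment convention is easy to mirror. The sketch below assumes that an explicit `--path` takes precedence over `HELM_WORKSPACE`; that precedence and the `resolve_workspace` helper are assumptions for illustration, not documented Helm behavior.

```python
import argparse
import os
from pathlib import Path

def resolve_workspace(argv, env=os.environ) -> Path:
    """Return the workspace from --path, falling back to HELM_WORKSPACE."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--path", default=None)
    args, _ = parser.parse_known_args(argv)  # ignore flags this sketch does not know
    chosen = args.path or env.get("HELM_WORKSPACE")
    if not chosen:
        raise SystemExit("no workspace: pass --path or set HELM_WORKSPACE")
    return Path(chosen).expanduser()
```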

v0.10.2 — loop and skill-intake primitives

Current release: v0.10.2 — released 2026-06-24. This patch adds read-only loop validation and conservative external skill-intake classification.

`helm loops validate` and `helm loops inspect` both validate reusable workflow contracts.
Completion-evidence and docs-sweep loop examples define evidence and stop conditions before runner work.
`helm skill-intake classify` and `helm skill-intake validate` provide a conservative review path for external skill candidates.

See the full v0.10.2 notes.

v0.10.1 — ledger attribution patch

Released 2026-06-20. This patch keeps task-ledger attribution inspectable across profiled runs and chat memory captures.

Completed, blocked, and guard-audit ledger rows now record experience_attribution.
helm memory capture-chat keeps queued / running rows free of final-only memory and attribution payloads.
Chat capture rows preserve conversation as the selected tool for attribution.

See the full v0.10.1 notes.

v0.10.0 — harness-engineering layer

Released 2026-05-22. Everything new ships in shadow mode by default — decisions are logged but not enforced until you opt in.

Failure signature classification — every failure event normalizes to {component, tool, profile, error_class, target, fingerprint} so the same failure is recognizable across runs.
Profile → tool-group grants — each execution profile exposes only the tools it should; runner records the grant in every ledger row.
Repeated-failure policy transitions — same-fingerprint, patch-failed, same-skill, and credential-invalid-grant patterns automatically pick a next action (stop / decompose / repair / re-auth).
Patch-first edit policy + validation gates — file edits prefer patch operations; per-extension validation commands run after writes.
Task-state control container — Forge's "Control Flow Is Not Memory" principle: required-steps, completed-steps, blockers, approvals, and recovered messages live as structured state, not transcript content.
Trace recorder → trace replay → skill candidate — every run produces a JSON trace; recurring success patterns surface as skill drafts; recurring failures surface as repair candidates.
Profile pause / resume — secret-token-gated hard stop per profile, gated by OPENCLAW_PAUSE_GATE.
Browser work verifier — pre-flight decision (allow_single_session, block_mutation, require_user_login, require_confirmation, pause_profile, require_cleanup_evidence) with a runner-side enforcement gate.
Model repair + synthetic respond hooks — library entry points for small-model fallback proxies; gated by HELM_MODEL_REPAIR and HELM_SYNTHETIC_RESPOND.
Shadow-mode reporter — helm shadow-report --since 14 --with-recommendations aggregates 14 days of signals and emits ready_to_enforce / needs_more_data / caution / no_signal per feature.
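
Two of the items above, failure signatures and repeated-failure transitions, fit together naturally, and a small sketch shows how. The field names come from the list; the normalization rules, the fingerprint length, the threshold, the `continue` fallback, and the mapping from pattern to action are assumptions, not Helm's actual policy.

```python
import hashlib
import json
from collections import Counter

FIELDS = ("component", "tool", "profile", "error_class", "target")

def signature(event: dict) -> dict:
    """Normalize a failure event and attach a stable fingerprint."""
    sig = {name: str(event.get(name, "")).strip().lower() for name in FIELDS}
    payload = json.dumps(sig, sort_keys=True).encode()
    sig["fingerprint"] = hashlib.sha256(payload).hexdigest()[:16]
    return sig

def next_action(history: list[dict], threshold: int = 3) -> str:
    """Pick a next action from the failure history (hypothetical rules)."""
    if not history:
        return "continue"
    latest = history[-1]
    if latest["error_class"] == "credential_invalid_grant":
        return "re-auth"
    counts = Counter(sig["fingerprint"] for sig in history)
    if counts[latest["fingerprint"]] >= threshold:
        return "stop"  # the same failure keeps recurring
    if latest["error_class"] == "patch_failed":
        return "repair"
    return "continue"
```

Normalizing before hashing is what lets two runs that differ only in casing or whitespace be recognized as the same failure.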

See the full v0.10.0 notes and the 13-document docs/harness-engineering/ directory for the design.
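
Shadow mode only pays off if the logged decisions are eventually turned into a recommendation. The sketch below uses the four labels named above, but the event fields (`decision`, `human_override`) and both thresholds are hypothetical and chosen only to make the logic visible.

```python
def classify_feature(events: list[dict], min_samples: int = 20,
                     max_false_block_rate: float = 0.05) -> str:
    """Label a shadow-mode feature from its logged decisions."""
    if not events:
        return "no_signal"
    if len(events) < min_samples:
        return "needs_more_data"
    # A "false block" is a would-be block that a human later judged harmless.
    false_blocks = [e for e in events
                    if e["decision"] == "would_block" and e.get("human_override")]
    rate = len(false_blocks) / len(events)
    return "ready_to_enforce" if rate <= max_false_block_rate else "caution"
```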

Workspace model

Helm runs in a dedicated workspace, treating existing systems as read-only context sources first.

Helm state lives under .helm/ inside the workspace.
Profiles, notes, policies, and skill rules stay as explicit files.
OpenClaw, Hermes, and notes vaults can be adopted instead of overwritten.
JSONL is the append-only source of truth; SQLite is a query index.
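
The split between an append-only JSONL log and a rebuildable SQLite index can be sketched with the standard library. The file name, table layout, and row fields here are hypothetical and do not describe Helm's on-disk schema.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

def append_row(ledger: Path, row: dict) -> None:
    """Append one JSON object per line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, sort_keys=True) + "\n")

def rebuild_index(ledger: Path, db: sqlite3.Connection) -> None:
    """Drop and rebuild the query index from the JSONL file."""
    db.execute("DROP TABLE IF EXISTS tasks")
    db.execute("CREATE TABLE tasks (task_id TEXT, status TEXT, raw TEXT)")
    with ledger.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                row = json.loads(line)
                db.execute("INSERT INTO tasks VALUES (?, ?, ?)",
                           (row["task_id"], row["status"], line.strip()))
    db.commit()

with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "tasks.jsonl"
    append_row(ledger, {"task_id": "t1", "status": "running"})
    append_row(ledger, {"task_id": "t1", "status": "completed"})
    db = sqlite3.connect(":memory:")
    rebuild_index(ledger, db)
    running = db.execute(
        "SELECT COUNT(*) FROM tasks WHERE status = 'running'").fetchone()[0]
    total = db.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
    db.close()
```

Because the index is derived, it can be deleted and rebuilt at any time without losing history; only the JSONL file needs to be backed up.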

How Helm compares

| Category | Better for | Helm adds |
| --- | --- | --- |
| Agent frameworks (LangChain, AutoGen, etc.) | prompts, planners, tool loops, agent graphs | profiles, guard decisions, checkpoints, task ledgers |
| Observability (Langfuse, Helicone, etc.) | hosted traces, service metrics | pre-execution policy + local recovery state |
| Evaluation (DeepEval, Phoenix, etc.) | scoring model output | operational history around repeated human-agent work |
| Shell wrappers (cmd helpers) | command convenience | workspace state, memory capture, reports, recovery discipline |

See deeper comparisons in docs/comparisons/.

Documentation

Get started: Three-minute demo, First run, Onboarding, Demos, OpenClaw integration, OpenHands integration, Existing agent workspace
Core concepts: Execution profiles, Privacy boundary, Task state, Task finalization, Loops, Action governance, Proactive discovery, Memory operations policy, Ops memory query, Adaptive harness, Skill quality & policy
Advanced: Harness engineering (index), Control Flow Is Not Memory, Helm vs Forge, Skill self-improvement, HITL decision patterns, Evidence label convention, Helm dogfooding reference, Research background

Research background

Helm's design follows the findings in "Harness Design Determines Operational Stability in Small Language Models", which experimentally studies how planning, verification, and recovery harnesses affect operational stability. Its adaptive-harness direction is also informed by "It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers", which shows that harness strictness should be selected by model type and failure mode rather than applied uniformly.

Cite Helm:

@software{helm_2026,
  title  = {Helm: A stability-first operations layer for long-lived agent workspaces},
  author = {Cho, Yong Eun},
  year   = {2026},
  url    = {https://github.com/JDeun/Helm},
  version = {0.10.2}
}

See CITATION.cff for the machine-readable form.

Contributing

Issues and pull requests welcome.

Read CONTRIBUTING.md before opening a PR.
Run the test suite: python -m pytest -q (currently 1,432 tests).
Run the release checks: python scripts/release_version_check.py --version <next>.
Security reports: see SECURITY.md.

Release history

Latest: v0.10.2 — loop and skill-intake primitives (2026-06-24)
Previous: v0.10.1, v0.10.0, v0.9.6
Full changelog: CHANGELOG.md · older release notes

What Helm does NOT include

Helm ships only the public operations layer. It does not include:

Private memory contents
Personal agent overlays
Credentials or secrets
Raw task content from any specific workspace
Live connector tokens

The repository is safe to fork, clone, and inspect.
