8 changes: 8 additions & 0 deletions hindsight-tools/mission-sandbox/.gitignore
@@ -0,0 +1,8 @@
node_modules
dist
.next
.next-*
standalone
next-env.d.ts
*.tsbuildinfo
projects
130 changes: 130 additions & 0 deletions hindsight-tools/mission-sandbox/README.md
@@ -0,0 +1,130 @@
# @vectorize-io/hindsight-mission-sandbox

Tune Hindsight's **retain (extraction)** and **observation (consolidation)** missions against your
own task, then verify with an **external validator** (a benchmark like LOCOMO, or your app's eval).

The tool is deliberately small and opinionated:

- **You bring** the documents and a way to score success (the validator). The tool does **not**
measure accuracy or label facts — task success is decoupled from the tool.
- **You refine a mission with feedback.** After looking at validator results, you hand the tool
_feedback_ (and optional failing examples); it rewrites the current mission. No good/bad labeling.
- **Retain iterates across versioned banks** (`<project>-v1`, `-v2`, …) so you can point the
validator at any version and compare. **Observations iterate in place** (clear + re-consolidate),
since they're re-derived from the same facts.
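
As a rough illustration of the versioned-bank naming above, here is a minimal sketch. The helper names (`bankId`, `nextVersion`) and the `VersionRecord` shape are hypothetical, and it assumes the next version is simply one past the highest recorded one; the tool's actual logic may differ.

```typescript
// Hypothetical helpers, not the tool's actual code: derive retain bank ids of
// the form `<project>-vN`.
interface VersionRecord {
  n: number;
  bank: string;
}

function bankId(project: string, n: number): string {
  return `${project}-v${n}`;
}

// Assumption: the next retain version is one past the highest recorded
// version, or 1 when no version exists yet.
function nextVersion(versions: VersionRecord[]): number {
  return versions.reduce((max, v) => Math.max(max, v.n), 0) + 1;
}

const recorded: VersionRecord[] = [
  { n: 1, bank: bankId("demo", 1) },
  { n: 2, bank: bankId("demo", 2) },
];
const nextBank = bankId("demo", nextVersion(recorded));
```

Because each retain version keeps its own bank id, an external validator can be pointed at `demo-v1` and `demo-v2` separately to compare them.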

## The loop

```
init (bind docs)
  └─ retain mission (feedback + examples → refine retain mission)
    └─ retain apply (ingest docs into a NEW bank <project>-vN)
      └─ VALIDATE EXTERNALLY against <project>-vN ─┐
┌──────────────────────────────────────────────────┘ failures become the next feedback
retain mission (feedback) → retain apply → validate → …

observe mission (feedback → refine obs mission) → observe apply (clear obs + re-consolidate on current bank) → validate
```

The validator is never inside the tool. A typical round: run your eval against `<project>-vN`,
read what failed, then `retain mission <project> --feedback "<what to fix>" --example "<failing case>"`
→ `retain apply` (new version) → re-validate.
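
One way to script that round is to build the CLI arguments as an array, so feedback and failing examples need no shell quoting. This is a sketch only: `retainMissionArgs` is a hypothetical helper, and it relies solely on the `--feedback` and repeatable `--example` flags listed under Commands.

```typescript
// Sketch only: assemble the argv for one `retain mission` round.
// The function name and signature are illustrative assumptions.
function retainMissionArgs(
  project: string,
  feedback: string,
  examples: string[] = [],
): string[] {
  const args = ["retain", "mission", project, "--feedback", feedback];
  // Each failing case becomes its own `--example <text>` pair.
  for (const example of examples) {
    args.push("--example", example);
  }
  return args;
}

const roundArgs = retainMissionArgs(
  "demo",
  "Keep exact dates for events",
  ["Q: When did the trip happen? Expected a date, got 'last summer'"],
);
```

The resulting array could be handed to a process spawner (for example Node's `child_process.spawn("mission-sandbox", roundArgs)`), followed by a `retain apply` call and another validation run.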

## Commands

```bash
# bind a project to its documents (no ingest yet)
mission-sandbox init <project> --documents <path> [--api-url URL]

# RETAIN loop — iterates across versioned banks
mission-sandbox retain mission <project> --feedback "<what to change>" [--example "<failing case>" ...]
mission-sandbox retain apply <project> # ingest docs → new bank <project>-vN, prints the bank id

# OBSERVE loop — iterates in place on the current bank
mission-sandbox observe mission <project> --feedback "<what to change>" [--example "<...>" ...]
mission-sandbox observe apply <project> # clear observations on current bank + re-consolidate

mission-sandbox status <project> # bound docs, current missions, versions (+ bank ids)
mission-sandbox ui <projects-dir> # minimal UI: project status + versions
```

- `retain mission` / `observe mission` refine the **current** mission from your feedback (+ examples);
the first call (no prior mission) treats the feedback as the initial spec. The LLM sees the current
mission + feedback + examples — nothing else, no labels.
- `retain apply` always creates the **next** version bank and ingests into it. Point your validator
at that version (e.g. via LOCOMO's `--template`, or via the bank id).
- `--model` overrides the Gemini model used for mission refinement (default `gemini-2.5-flash`, or
`HINDSIGHT_API_LLM_MODEL`). Mission refinement is the **only** LLM call the tool makes; ingestion +
consolidation run on the Hindsight deployment.
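
The model choice described in the last bullet can be pictured as a small fallback chain. The sketch below is an assumption about precedence (`--model` flag first, then `HINDSIGHT_API_LLM_MODEL`, then the default); `resolveModel` is a hypothetical name and the tool may resolve the model differently.

```typescript
const DEFAULT_MODEL = "gemini-2.5-flash";

// Assumed precedence (not confirmed by the tool's source):
// 1. the --model flag, 2. HINDSIGHT_API_LLM_MODEL, 3. the default model.
function resolveModel(
  flag: string | undefined,
  env: Record<string, string | undefined>,
): string {
  const fromFlag = flag ? flag.trim() : "";
  const fromEnv = (env.HINDSIGHT_API_LLM_MODEL || "").trim();
  return fromFlag || fromEnv || DEFAULT_MODEL;
}
```

Taking the environment as a parameter, rather than reading it directly, keeps the function easy to exercise in a unit test.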

## Verifying with LOCOMO (example external validator)

The LOCOMO runner is unchanged and is the **only** thing that measures accuracy. Build a template
from a version's missions and point the runner at it (default mode — **no `--use-reflect`**):

```bash
# representative subset: trim the runner's input to N per category (data only — restore after)
cd hindsight-dev/benchmarks/locomo/datasets && cp locomo10.json locomo10.full.json
N=5 # widen to 10+ once a mission looks good, to confirm it generalises and surface weak categories
python3 - "$N" <<'PY'
import json, sys
n=int(sys.argv[1]); d=json.load(open("locomo10.json"))
for s in d:
    if s["sample_id"]!="<id>": continue
    s["qa"]=[q for c in (1,2,3,4) for q in [x for x in s["qa"] if x.get("category")==c and x.get("answer")][:n]]
json.dump(d,open("locomo10.json","w"))
PY

# verify a version's missions
python3 -c "import json;p=json.load(open('<project>/project.json'));v=p['versions'][-1]; \
json.dump({'version':'1','bank':{'retain_mission':v['retainMission'],'observations_mission':v.get('observeMission')}}, \
open('<project>/template.json','w'))"
set -a; source hindsight-api-slim/.env; set +a; export HINDSIGHT_API_LLM_MODEL=gemini-2.5-flash
uv run --project hindsight-dev python hindsight-dev/benchmarks/locomo/locomo_benchmark.py \
--conversation <id> --wait-consolidation --template <project>/template.json
# results: hindsight-dev/benchmarks/locomo/results/benchmark_results.json (by-category is_correct)
mv hindsight-dev/benchmarks/locomo/datasets/locomo10.full.json hindsight-dev/benchmarks/locomo/datasets/locomo10.json
```

Read accuracy **by category**; a weak category is your next `--feedback`. Two notes from real runs:
single-question swings between runs are **recall variance** (each apply re-ingests), so watch
category trends rather than individual questions; and verify a "failure" against the transcript
before chasing it, because some benchmark gold answers are wrong.
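
To make "read accuracy by category" concrete, here is a small sketch that aggregates per-question results and picks the weakest category. The `ScoredQuestion` shape is hypothetical: it assumes each result carries a `category` and an `is_correct` flag, and the actual layout of `benchmark_results.json` may differ.

```typescript
// Hypothetical per-question record; the real results file may be shaped differently.
interface ScoredQuestion {
  category: number;
  is_correct: boolean;
}

// Fraction of correct answers per category.
function accuracyByCategory(rows: ScoredQuestion[]): Record<number, number> {
  const correct: Record<number, number> = {};
  const total: Record<number, number> = {};
  for (const row of rows) {
    total[row.category] = (total[row.category] || 0) + 1;
    correct[row.category] = (correct[row.category] || 0) + (row.is_correct ? 1 : 0);
  }
  const accuracy: Record<number, number> = {};
  for (const key of Object.keys(total)) {
    const category = Number(key);
    accuracy[category] = correct[category] / total[category];
  }
  return accuracy;
}

// Category with the lowest accuracy, or undefined when there are no results.
function weakestCategory(accuracy: Record<number, number>): number | undefined {
  let weakest: number | undefined;
  for (const key of Object.keys(accuracy)) {
    const category = Number(key);
    if (weakest === undefined || accuracy[category] < accuracy[weakest]) {
      weakest = category;
    }
  }
  return weakest;
}

const sampleRows: ScoredQuestion[] = [
  { category: 1, is_correct: true },
  { category: 1, is_correct: false },
  { category: 2, is_correct: true },
  { category: 2, is_correct: true },
];
const sampleAccuracy = accuracyByCategory(sampleRows);
```

With only a handful of questions per category, a single flipped answer moves the ratio a lot, which is why the trend across runs matters more than any one number.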

## Project model (`project.json`)

```jsonc
{
  "documents": "/path/to/docs", // bound at init
  "apiUrl": "http://localhost:8888",
  "retain": { "mission": "…", "feedback": ["…"] },
  "observe": { "mission": "…", "feedback": ["…"] },
  "versions": [
    {
      "n": 1,
      "bank": "<project>-v1",
      "retainMission": "…",
      "observeMission": "…",
      "createdAt": "…",
    },
  ],
  "currentVersion": 1,
}
```
```
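
For readers who prefer types, the shape above can be written down roughly as follows, together with a TypeScript counterpart of the Python one-liner from the LOCOMO section that turns the latest version into a template. The interface names, the optionality of `observeMission`, and `buildTemplate` itself are assumptions inferred from the example, not the package's exported API.

```typescript
// Assumed types, inferred from the project.json example above.
interface MissionState {
  mission: string;
  feedback: string[];
}

interface ProjectVersion {
  n: number;
  bank: string;
  retainMission: string;
  observeMission?: string | null; // assumed optional, mirroring v.get(...) in the Python snippet
  createdAt: string;
}

interface Project {
  documents: string;
  apiUrl: string;
  retain: MissionState;
  observe: MissionState;
  versions: ProjectVersion[];
  currentVersion: number;
}

interface BankTemplate {
  version: string;
  bank: { retain_mission: string; observations_mission: string | null };
}

// Hypothetical helper: build the template object from the most recent version.
function buildTemplate(project: Project): BankTemplate {
  const latest = project.versions[project.versions.length - 1];
  if (!latest) {
    throw new Error("project has no versions yet; run `retain apply` first");
  }
  return {
    version: "1",
    bank: {
      retain_mission: latest.retainMission,
      observations_mission: latest.observeMission == null ? null : latest.observeMission,
    },
  };
}

const sampleProject: Project = {
  documents: "/path/to/docs",
  apiUrl: "http://localhost:8888",
  retain: { mission: "Extract user preferences.", feedback: [] },
  observe: { mission: "", feedback: [] },
  versions: [
    { n: 1, bank: "demo-v1", retainMission: "Extract user preferences.", createdAt: "2025-01-01T00:00:00Z" },
  ],
  currentVersion: 1,
};
const sampleTemplate = buildTemplate(sampleProject);
```

Serializing `sampleTemplate` with `JSON.stringify` would give a file of the same shape the Python snippet writes to `<project>/template.json`.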

## Setup

```bash
npm install
npm run build --workspace @vectorize-io/hindsight-mission-sandbox
export GEMINI_API_KEY=... # or GOOGLE_API_KEY, or a Gemini HINDSIGHT_API_LLM_* in your .env
```

## Development

```bash
npm run test # vitest unit tests for core
npm run typecheck # tsc for the lib + the Next app
npm run build # build the core lib (dist) + the minimal Next UI
```
13 changes: 13 additions & 0 deletions hindsight-tools/mission-sandbox/bin/cli.js
@@ -0,0 +1,13 @@
#!/usr/bin/env node
// Thin launcher: delegate to the compiled CLI. Run `npm run build:lib` (or `npm run build`)
// to produce dist/. For development without a build, use `npm run cli -- <args>` (tsx).
import("../dist/cli/index.js").catch((err) => {
  if (err && err.code === "ERR_MODULE_NOT_FOUND") {
    console.error(
      "mission-sandbox: build output missing. Run `npm run build` in the package first."
    );
  } else {
    console.error(err);
  }
  process.exit(1);
});
9 changes: 9 additions & 0 deletions hindsight-tools/mission-sandbox/next.config.ts
@@ -0,0 +1,9 @@
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  output: "standalone",
  // core is consumed as a built package (dist); its heavy runtime deps stay external.
  serverExternalPackages: ["@google/genai", "@vectorize-io/hindsight-client"],
};

export default nextConfig;
74 changes: 74 additions & 0 deletions hindsight-tools/mission-sandbox/package.json
@@ -0,0 +1,74 @@
{
"name": "@vectorize-io/hindsight-mission-sandbox",
"version": "0.1.0",
"description": "Iterate on Hindsight retain and observation missions with a fast feedback loop (CLI + Next.js UI)",
"type": "module",
"main": "./dist/core/index.js",
"types": "./dist/core/index.d.ts",
"exports": {
".": {
"types": "./dist/core/index.d.ts",
"import": "./dist/core/index.js"
},
"./core": {
"types": "./dist/core/index.d.ts",
"import": "./dist/core/index.js"
}
},
"bin": {
"mission-sandbox": "bin/cli.js"
},
"files": [
"bin",
"dist",
"standalone",
"public"
],
"scripts": {
"build": "npm run build:lib && npm run build:ui",
"build:lib": "tsc -p tsconfig.lib.json",
"build:ui": "NODE_ENV=production next build && npm run build:standalone",
"build:standalone": "rm -rf standalone && SERVER_JS=$(find .next/standalone -path '*/node_modules' -prune -o -name 'server.js' -print | head -1) && test -n \"$SERVER_JS\" || (echo 'Error: server.js not found in .next/standalone - standalone build failed' && exit 1) && STANDALONE_ROOT=$(dirname \"$SERVER_JS\") && cp -r \"$STANDALONE_ROOT\" standalone && cp -r .next/standalone/node_modules standalone/node_modules && mkdir -p standalone/.next && cp -r .next/static standalone/.next/static && mkdir -p standalone/public && (cp -r public/* standalone/public/ 2>/dev/null || true)",
"dev": "npm run build:lib && next dev -p ${PORT:-7777}",
"cli": "tsx src/cli/index.ts",
"start": "next start -p ${PORT:-7777}",
"lint": "next lint",
"typecheck": "tsc -p tsconfig.lib.json --noEmit && tsc --noEmit",
"test": "vitest run",
"test:watch": "vitest",
"prepublishOnly": "npm run build"
},
"keywords": [
"hindsight",
"memory",
"observations",
"mission",
"prompt-optimization"
],
"license": "MIT",
"repository": {
"type": "git",
"url": "https://github.com/vectorize-io/hindsight.git",
"directory": "hindsight-tools/mission-sandbox"
},
"dependencies": {
"@google/genai": "^2.7.0",
"@vectorize-io/hindsight-client": "^0.7.0",
"commander": "^14.0.0",
"next": "^16.2.6",
"react": "^19.2.0",
"react-dom": "^19.2.0"
},
"devDependencies": {
"@tailwindcss/postcss": "^4.1.17",
"@types/node": "^24.10.0",
"@types/react": "^19.2.2",
"@types/react-dom": "^19.2.2",
"eslint": "^9.39.1",
"eslint-config-next": "^16.0.1",
"tailwindcss": "^4.1.17",
"tsx": "^4.19.2",
"typescript": "^5.9.3",
"vitest": "^4.1.2"
}
}
5 changes: 5 additions & 0 deletions hindsight-tools/mission-sandbox/postcss.config.mjs
@@ -0,0 +1,5 @@
const config = {
  plugins: ["@tailwindcss/postcss"],
};

export default config;
28 changes: 28 additions & 0 deletions hindsight-tools/mission-sandbox/src/app/api/extract/route.ts
@@ -0,0 +1,28 @@
import { runExtractPreview } from "@vectorize-io/hindsight-mission-sandbox/core";

import { projectDir } from "@/app/lib/project-context";

export const runtime = "nodejs";
export const dynamic = "force-dynamic";

/** Dry-run extraction preview: what does this mission extract from the given text? (no ingest) */
export async function POST(req: Request) {
  const body = (await req.json().catch(() => ({}))) as {
    project?: string;
    content?: string;
    retainMission?: string | null;
  };
  if (!body.project || !body.content) {
    return Response.json({ error: "project and content are required" }, { status: 400 });
  }
  try {
    const facts = await runExtractPreview({
      projectDir: projectDir(body.project),
      content: body.content,
      retainMission: body.retainMission,
    });
    return Response.json({ facts });
  } catch (e) {
    return Response.json({ error: e instanceof Error ? e.message : String(e) }, { status: 500 });
  }
}
106 changes: 106 additions & 0 deletions hindsight-tools/mission-sandbox/src/app/components/ExtractPanel.tsx
@@ -0,0 +1,106 @@
"use client";

import { useState } from "react";

interface PreviewFact {
  text: string;
  factType: string;
  occurredStart: string | null;
  occurredEnd: string | null;
  entities: string[];
}

/**
* Dry-run extraction preview: paste text + an optional mission, see what the retain step would
* extract — with no ingestion, no persistence. Backed by the /memories/extract API.
*/
export function ExtractPanel({
  project,
  defaultMission,
}: {
  project: string;
  defaultMission: string | null;
}) {
  const [content, setContent] = useState("");
  const [mission, setMission] = useState(defaultMission ?? "");
  const [facts, setFacts] = useState<PreviewFact[] | null>(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);

  async function run() {
    setLoading(true);
    setError(null);
    setFacts(null);
    try {
      const res = await fetch("/api/extract", {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ project, content, retainMission: mission || null }),
      });
      const data = await res.json();
      if (!res.ok) throw new Error(data.error || `HTTP ${res.status}`);
      setFacts(data.facts as PreviewFact[]);
    } catch (e) {
      setError(e instanceof Error ? e.message : String(e));
    } finally {
      setLoading(false);
    }
  }

  return (
    <details className="mt-5 rounded-lg border border-[var(--border)] p-4">
      <summary className="cursor-pointer text-sm font-semibold uppercase tracking-wide text-[var(--muted)]">
        Dry-run extraction (preview a mission, no ingest)
      </summary>

      <label className="mt-3 block text-xs uppercase tracking-wider text-[var(--muted)]">
        text
      </label>
      <textarea
        className="mt-1 h-28 w-full rounded-md border border-[var(--border)] bg-[var(--surface-2)] p-2 text-sm"
        placeholder="Paste a document / chunk to extract facts from…"
        value={content}
        onChange={(e) => setContent(e.target.value)}
      />

      <label className="mt-3 block text-xs uppercase tracking-wider text-[var(--muted)]">
        retain mission (override — defaults to the project&apos;s current mission)
      </label>
      <textarea
        className="mt-1 h-20 w-full rounded-md border border-[var(--border)] bg-[var(--surface-2)] p-2 text-xs"
        value={mission}
        onChange={(e) => setMission(e.target.value)}
      />

      <button
        className="mt-3 rounded-md border border-[var(--accent)] px-3 py-1.5 text-sm text-[var(--accent)] disabled:opacity-50"
        onClick={run}
        disabled={loading || !content.trim()}
      >
        {loading ? "Extracting…" : "Extract (dry-run)"}
      </button>

      {error ? <p className="mt-2 text-sm text-[var(--bad)]">{error}</p> : null}

      {facts ? (
        <div className="mt-3">
          <div className="text-xs uppercase tracking-wider text-[var(--muted)]">
            {facts.length} fact{facts.length === 1 ? "" : "s"} extracted
          </div>
          <ul className="mt-1 space-y-1">
            {facts.map((f, i) => (
              <li key={i} className="border-l-2 border-[var(--border)] pl-3 text-sm">
                {f.text}
                <span className="ml-1 text-xs text-[var(--muted)]">
                  [{f.factType}
                  {f.occurredStart ? ` · ${f.occurredStart.slice(0, 10)}` : ""}
                  {f.entities.length ? ` · ${f.entities.join(", ")}` : ""}]
                </span>
              </li>
            ))}
          </ul>
        </div>
      ) : null}
    </details>
  );
}