chihacknight · sartaj · May 22, 2026
diff --git a/AGENT.md b/AGENT.md
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -4,6 +4,48 @@ This file provides senior engineering-level guidance for Claude Code when workin
 
 ## Project Overview
 
+**govbot is a 4-tool stack for civic-data publishing**, built so an
+activist crew can run a credible news-bot at nearly-free cost on commodity
+infrastructure (GitHub Actions + a laptop with local models). The stack
+exists to clear one bar: the first user, the **climate-activist** userland
+repo, must be able to ship Bluesky posts that are "worth reading" while
+staying "nearly free to run/improve". Every architectural choice in this
+repo should be checked against that bar.
+
+The 4 tools, with the honest state of each:
+
+1. **Select real gov data** — `govbot pull` over 55 OCD dataset git repos
+   (every US state + DC + territories + federal Congress), content-
+   addressed in `~/.govbot/cache/`. `govbot doctor` validates. Today
+   `govbot source --select docs` ships bill text + subjects; **sponsors
+   and voting records are captured in metadata but not yet projected
+   into `--select docs`** — a recall gap for sponsor-pattern signals.
+2. **Filter / transform** — fastclass tagging is the shipped transform
+   (Wave A). The planned **`summarize` transform** (local-LLM digests
+   of grouped bills, emitted with model id + source bill ids + prompt
+   revision so the digest is reproducible) **does not exist** —
+   userland holds a `summarizer/prompt.md` stub.
+3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky
+   posting bot ship today. **X is not built. AI digest publishing is
+   not built.** **"Receipts" as defined in the vision** — a GitHub
+   Pages artifact carrying the deterministic provenance behind every
+   AI digest (model used, source bill ids, fastclass scores +
+   reasoning, regen command) — **is a new capability that does not yet
+   exist**. The current classification evidence chains carry most of
+   the data a receipt would need; they are not yet packaged into a
+   public artifact.
+4. **Coding-agent-native dev experience** — `AGENT.md` provides the
+   make/manage/update flow that a fresh Claude Code session can follow
+   without other onboarding. The fastclass plugin
+   (`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`,
+   `/fastclass:install-model`) handles the classifier loop. `govbot
+   doctor` validates installations. This is the one tool that is
+   already shipping its vision.
+
+Operators: keep the gap map above honest as features land. The README's
+Roadmap section is the public version of this list; this CLAUDE.md is the
+internal version, biased toward what the code actually does today.
+
 This is **govbot** - a monorepo for distributed data analysis of government updates. Git repos function as datasets, including legislation from 47+ states/jurisdictions. The `actions/` folder contains self-contained modules that can run as shell scripts or GitHub Actions.
 
 ## Senior Engineering Prompts
@@ -42,7 +84,7 @@ Use these meta-prompts to guide architectural decisions and code quality.
 
 ### Performance & Scale
 
-- **"What happens with 10x the data?"** - Current scale is ~47 jurisdictions. Consider: What if we add counties? Cities? Federal agencies?
+- **"What happens with 10x the data?"** - Current scale is ~55 dataset repos (all US state/territory legislatures + federal). The runtime registry (`registry.json`) is what makes 10x feasible — adding counties, cities, or agencies is a data change, not a recompile.
 
 - **"Can this be parallelized?"** - State-level operations are inherently parallel. Pipelines should support concurrent execution.
 
@@ -80,14 +122,54 @@ scripts/        # Repository-level utility scripts
 ## Common Commands
 
 ```bash
-govbot init          # Create govbot.yml config
-govbot clone all     # Download all state legislation datasets
-govbot clone wy il   # Download specific states
-govbot logs          # Stream legislative activity as JSON Lines
-govbot logs | govbot tag  # Process and tag data
+govbot               # Scaffold govbot.yml (interactive wizard), then run the pipeline
+govbot pull all      # Download all state legislation datasets
+govbot pull wy il    # Download specific states
+govbot source        # Stream legislative activity as JSON Lines
+govbot logs          # Deprecated alias for `govbot source` (default mode); back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot logs > bills.jsonl`
+govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply
 govbot load          # Load bill metadata into DuckDB
-govbot build         # Generate RSS feeds
+govbot publish       # Run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky)
+govbot run           # Run the full pipeline: pull -> classify -> apply -> publish
+```
+
+## govbot source — streaming legislative activity
+
+`govbot source` walks every linked dataset and emits one JSON record per
+bill log entry. It is the **source** stage of the stream protocol — the
+records `govbot publish` and `fastclass classify` consume.
+
+### The `--filter default` policy
+
+`--filter` defaults to `default`, which applies the per-dataset filter under
+`actions/govbot/src/filters/<dataset>/default.rs`. Each dataset's `default.rs`
+implements an **action-based** rule that drops *routine* log entries —
+introductions, committee referrals, "Bill Number Assigned", "Placed on
+General File", boilerplate "President Signed" lines, prefiling, status
+updates — so the stream emits only **substantive** events (passage votes,
+executive signatures, amendments, defeats, committee reports with content).
+
+This is not a recency cut. A bill whose only log entries are routine
+actions — e.g. a freshly-filed bill with just an "Introduction" log —
+emits **zero records** under `--filter default` until a substantive event
+lands. The bill itself is not deleted; it simply produces no stream rows
+yet. Once a substantive log appears (e.g. a passage vote later in the
+session), the bill flows through.
+
+If a bill is unexpectedly missing from `source` output:
+```bash
+govbot source --filter none --repos <dataset>   # confirm it's the filter
 ```
+If `--filter none` shows the bill and `--filter default` does not, the
+fix is to add a substantive log entry, not to change the filter.
+
+### The `--select docs` projection
+
+`--select docs` collapses each surviving entry to the
+`{"id","text","kind":"docs"}` document the stream protocol defines
+(`schemas/STREAM_PROTOCOL.md` §1) — the record `fastclass classify -`
+consumes. The default `--select default` keeps the full joined record
+for `govbot publish` and ad-hoc analysis.
 
 ## DuckDB Integration
 
@@ -103,20 +185,111 @@ The `govbot load` command loads bill metadata into a DuckDB database for SQL ana
 
 **Usage**:
 ```bash
-govbot clone all                    # First, get the data
+govbot pull all                     # First, get the data
 govbot load                         # Load into DuckDB
 govbot load --memory-limit 32GB     # For large datasets
 duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI
 ```
 
 See `actions/govbot/DUCKDB.md` for query examples and schema documentation.
 
+## Classifying with fastclass
+
+Classification is a **pipe** of two tools composed over a process
+boundary. Here govbot streams the data, **fastclass** (a standalone,
+self-improving text classifier) classifies it, and govbot persists the result:
+
+```bash
+govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply
+```
+
+- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}`
+  document per bill carrying the **full bill text** from `metadata.json`; the
+  `id` is the bill's dataset path, which routes the result back.
+- **`fastclass classify -`** scores each document against a **classifier
+  bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` +
+  `eval/`). govbot passes only the bundle path; it never reads the bundle.
+- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag
+  `.tag.json` files under `<project>/tags/<dataset>/country:.../sessions/<id>/`
+  — the files `govbot publish` turns into feeds. It classifies nothing
+  itself; it is purely the persistence sink.
+
+### Project layout — `tags/` vs `.govbot/` vs `dist/`
+
+A govbot project has four top-level tool-managed dirs, each with a
+distinct role; do not conflate them:
+
+- **`.govbot/`** — the tool's **cache**, the `node_modules/` equivalent.
+  Cloned datasets, content-addressed sync state, an optional registry
+  override. Fully regenerable; safe to `rm -rf` to start fresh.
+  **Never edited by hand, never written to by `apply`.** It does NOT
+  hold user-meaningful state — the bluesky publisher's posted-state
+  ledger lives under `state/`, not here.
+- **`tags/`** — **classification output**, written by `govbot apply`. The
+  layout mirrors the source path with a dataset prefix:
+  `tags/<dataset>/country:.../state:.../sessions/<id>/<tag>.tag.json`.
+  Regenerated by every classify run; the dataset prefix is what isolates
+  same-named tag files across jurisdictions in a multi-dataset project.
+- **`state/`** — **publisher state**, written by `govbot publish`. The
+  bluesky publisher's posted-state ledger lives at
+  `state/bluesky-<name>.ledger`. Regenerable-but-operational: deleting
+  it makes the next run double-post. Peer of `tags/` and `dist/`.
+- **`dist/`** — **publisher output**, written by `govbot publish` (RSS /
+  HTML / JSON feeds).
+
+### Publishers dedup by bill, not by action log
+
+Every publisher (`bluesky`, `rss`, `html`, `json`) emits **one item per
+(jurisdiction, bill_id)** — not one per action-log file. A single bill
+typically emits many records to the source stream (one per committee
+referral, hearing, vote event, …); the publishers collapse them to a
+single representative so an activist sees one post per bill in their
+feed. Before this fix, the climate-tracker feed posted NV AB1 six
+times under six different action logs.
+
+The dedup key is the **bill-level GUID** (`rss::bill_guid`), of the form
+`<dataset>/.../sessions/<id>/bills/<bill_id>`. For `bluesky` the
+**ledger key** is also this bill-level GUID: future log additions for an
+already-posted bill do **not** trigger a re-post. The publisher reads
+legacy per-log ledger entries on upgrade — entries written under the
+per-bill-log layout collapse to the new bill key cleanly; entries
+written under the session-level-log layout (the OCD-files common case)
+incur a one-time re-post per previously-posted bill, after which the
+ledger holds the new bill-level GUID and the bill never re-posts again.
+
+**`govbot.yml` is NOT the classifier — it is a manifest.** It declares
+`datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:`
+block**. The tag taxonomy lives in a separate **fastclass classifier bundle**
+that the manifest's `transforms.<name>.classifier` field references by path.
+The two configs change at different cadences and are read by different tools:
+`govbot.yml` answers *"what data, what transforms, what publishers"*; the
+classifier bundle's `classifier.yml` answers *"what's relevant"*.
+
+To run the self-improving loop, work inside the classifier bundle directory and
+use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`)
+and the fastclass `compile evaluate` / `compile backtest` / `compile ratify` primitives. The
+`fastclass --propose` flag has been retired. For activists who have
+ratified one proposal end-to-end, `/fastclass:improve autonomous` becomes the
+ongoing default: constitution-passing proposals apply as usual, while coverage-gap
+proposals re-test against the rolling eval set and land only if that re-test proves
+them safe (`generated_by: autonomous-coverage-gap` in `fastclass.lock`).
+AGENT.md §2 carries the activist-facing framing.
+
+**Prerequisite**: the `fastclass` binary must be resolvable on `PATH`,
+`~/.cargo/bin`, or `~/.govbot/bin` (`cargo install --path <fastclass repo>`).
+`govbot run`'s transform stage resolves transform binaries the same way.
+
+To improve tag quality, read **`AGENTS.md` in the fastclass repo** — the
+operational playbook for the classify → eval → propose → backtest → promote
+loop. Its one hard rule: never show the frozen `eval/constitution.yml` gold set
+to an LLM.
+
 ## Testing with Mock Data
 
 Mock legislative data is available for offline development:
 - Location: `actions/govbot/mocks/.govbot/repos/`
 - Contains: Wyoming (wy) and Guam (gu) sample data
-- Usage: `govbot logs --govbot-dir ./actions/govbot/mocks/.govbot`
+- Usage: `govbot source --govbot-dir ./actions/govbot/mocks/.govbot`
 
 ## govbot Development
 
@@ -125,7 +298,7 @@ cd actions/govbot
 just setup           # Install Rust toolchain and dependencies
 just test            # Run snapshot tests
 just review          # Review snapshot changes (insta)
-just govbot logs     # Run CLI in dev mode (uses mocks/.govbot)
+just govbot source   # Run CLI in dev mode (uses mocks/.govbot)
 just mocks wy il     # Update mock data for testing
 ```
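
The "Publishers dedup by bill, not by action log" section added in the CLAUDE.md hunk above describes collapsing many per-log stream records into one item per bill, keyed on a bill-level GUID of the form `<dataset>/.../sessions/<id>/bills/<bill_id>`, with a ledger that stops re-posts. A minimal Python sketch of that idea follows. It is not the Rust implementation behind `rss::bill_guid`: the record ids, the `logs/` suffix, the dataset name `example-dataset`, and the helper names `bill_guid` and `dedup_by_bill` are all hypothetical, and keeping the first record seen as the representative is an assumption the diff does not confirm.

```python
from typing import Dict, Iterable, Iterator, List, Set


def bill_guid(record_id: str) -> str:
    """Truncate a record path to its bill-level prefix (.../bills/<bill_id>).

    Assumes (hypothetically) that the path contains a "bills" segment and
    that the segment right after it is the bill id.
    """
    parts = record_id.split("/")
    idx = parts.index("bills")  # raises ValueError if there is no "bills" segment
    return "/".join(parts[: idx + 2])


def dedup_by_bill(records: Iterable[Dict[str, str]], ledger: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield at most one record per bill, skipping bills already in the ledger.

    The first record seen for a bill is kept as its representative (an
    assumption), and its GUID is added to the ledger so later log entries
    for the same bill are dropped.
    """
    for record in records:
        guid = bill_guid(record["id"])
        if guid in ledger:
            continue
        ledger.add(guid)
        yield record


# Hypothetical stream: three action logs for one bill, one log for another.
PREFIX = "example-dataset/country:us/state:nv/sessions/83/bills"
records: List[Dict[str, str]] = [
    {"id": f"{PREFIX}/AB1/logs/0001.json"},
    {"id": f"{PREFIX}/AB1/logs/0002.json"},
    {"id": f"{PREFIX}/SB7/logs/0001.json"},
    {"id": f"{PREFIX}/AB1/logs/0003.json"},
]

ledger: Set[str] = set()
first_run = list(dedup_by_bill(records, ledger))
second_run = list(dedup_by_bill(records, ledger))  # same ledger: nothing re-posts
```

In this sketch, persisting `ledger` between runs plays the role the diff assigns to `state/bluesky-<name>.ledger`: once a bill's GUID is recorded, new log entries for that bill produce no further items.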