Open
32 commits
1342ce2
Layer govbot: manifest cutover, transform DAG, package-manager, AGENT.md
sartaj May 22, 2026
267f820
Tidy: cargo fmt, rename stale 'just govbot logs' refs in filter boile…
sartaj May 22, 2026
4571909
Polish CLI help, wizard, and ls behavior
sartaj May 22, 2026
d79480c
Docs: align README / AGENT.md / CLAUDE.md with the post-refactor stack
sartaj May 22, 2026
80617ac
apply: route tag files back to source dataset by default
sartaj May 23, 2026
877367e
run: --dry-run flag; bluesky skips with WARN when creds absent
sartaj May 23, 2026
29b9703
run: skip cache write when project has a local dataset seed
sartaj May 23, 2026
c989129
bluesky: render {link} via publisher base_url, fall back to bill source
sartaj May 23, 2026
2798dd1
source: document the default --filter policy
sartaj May 23, 2026
6cbb12e
source: read tags from dataset, fall back to cwd-rooted layout
sartaj May 23, 2026
24fb563
apply/source: move tag files out of .govbot/ to project tags/
sartaj May 23, 2026
49d5d56
bluesky: route {link} to companion html publisher's landing page
sartaj May 23, 2026
a908227
publish: one publisher type, one artifact (rss != html)
sartaj May 23, 2026
7eec08b
hygiene: rename src/embeddings.rs to src/tagfile.rs
sartaj May 23, 2026
7592418
source: id docs by bill, not session, for symlinked OCD log layouts
sartaj May 23, 2026
5ab6d3c
source: use canonical on-disk bill dir for docs id, not log.bill_id
sartaj May 23, 2026
02ef80e
doctor: corpus-level smoke test for pulled cache integrity
sartaj May 23, 2026
86030e9
AGENT.md: install the semantic Tier-2 model in the make flow
sartaj May 23, 2026
cb3d80f
source: --select docs surfaces OCD subjects as optional structured si…
sartaj May 23, 2026
b2c2220
docs: document the optional `model` block in describe §3
sartaj May 24, 2026
21f1fc1
docs: document the optional `model_rerank` block in describe §3
sartaj May 24, 2026
99bb7ea
docs: reframe README/AGENT/CLAUDE/cli around the 4-tool govbot stack
sartaj May 24, 2026
6649acb
bluesky: move ledger out of .govbot/ into state/ (publisher state, no…
sartaj May 24, 2026
3682760
docs: name /fastclass:improve autonomous as activist post-ratify default
sartaj May 24, 2026
6fd614f
docs: switch fastclass CLI examples to compile umbrella
sartaj May 24, 2026
5679c02
run: scope pipeline source step by manifest datasets (--repos)
sartaj May 25, 2026
ddb6679
publishers: dedup by bill, ledger by bill-id
sartaj May 25, 2026
4056de9
docs: credit CHN-Bluesky-Govbot framework lineage
sartaj May 26, 2026
0e63562
init: --from-frankie-config scaffolds project from CHN topic config
sartaj May 26, 2026
4d8e4f8
cli: govbot logs alias for back-compat with chn-bluesky-govbot
sartaj May 26, 2026
24fbf92
cli: govbot logs defaults to --filter none for true back-compat
sartaj May 26, 2026
1e60731
docs: AGENT.md playbook for the data-catalog (pipeline-manager) layer
sartaj May 26, 2026
728 changes: 728 additions & 0 deletions AGENT.md

Large diffs are not rendered by default.

193 changes: 183 additions & 10 deletions CLAUDE.md
@@ -4,6 +4,48 @@ This file provides senior engineering-level guidance for Claude Code when workin

## Project Overview

**govbot is a 4-tool stack for civic-data publishing**, built so an
activist crew can run a credible news-bot at nearly-free cost on commodity
infrastructure (GitHub Actions + a laptop with local models). The stack
exists to clear one bar: the first user, the **climate-activist** userland
repo, must be able to ship Bluesky posts that are "worth reading" at
"nearly free to run/improve". Every architectural choice in this repo
should be checked against that.

The 4 tools, with the honest state of each:

1. **Select real gov data** — `govbot pull` over 55 OCD dataset git repos
(every US state + DC + territories + federal Congress), content-
addressed in `~/.govbot/cache/`. `govbot doctor` validates. Today
`govbot source --select docs` ships bill text + subjects; **sponsors
and voting records are captured in metadata but not yet projected
into `--select docs`** — a recall gap for sponsor-pattern signals.
2. **Filter / transform** — fastclass tagging is the shipped transform
(Wave A). The planned **`summarize` transform** (local-LLM digests
of grouped bills, emitted with model id + source bill ids + prompt
revision so the digest is reproducible) **does not exist** —
userland holds a `summarizer/prompt.md` stub.
3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky
posting bot ship today. **X is not built. AI digest publishing is
not built.** **"Receipts" as defined in the vision** — a GitHub
Pages artifact carrying the deterministic provenance behind every
AI digest (model used, source bill ids, fastclass scores +
reasoning, regen command) — **is a new capability that does not yet
exist**. The current classification evidence chains carry most of
the data a receipt would need; they are not yet packaged into a
public artifact.
4. **Coding-agent-native dev experience** — `AGENT.md` provides the
make/manage/update flow that a fresh Claude Code session can follow
without other onboarding. The fastclass plugin
(`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`,
`/fastclass:install-model`) handles the classifier loop. `govbot
doctor` validates installations. This is the one tool that is
already shipping its vision.

Operators: keep the gap map above honest as features land. The README's
Roadmap section is the public version of this list; this CLAUDE.md is the
internal version, biased toward what the code actually does today.

This is **govbot** - a monorepo for distributed data analysis of government updates. Git repos function as datasets, including legislation from 47+ states/jurisdictions. The `actions/` folder contains self-contained modules that can run as shell scripts or GitHub Actions.

## Senior Engineering Prompts
@@ -42,7 +84,7 @@ Use these meta-prompts to guide architectural decisions and code quality.

### Performance & Scale

- **"What happens with 10x the data?"** - Current scale is ~47 jurisdictions. Consider: What if we add counties? Cities? Federal agencies?
- **"What happens with 10x the data?"** - Current scale is ~55 dataset repos (all US state/territory legislatures + federal). The runtime registry (`registry.json`) is what makes 10x feasible — adding counties, cities, or agencies is a data change, not a recompile.

- **"Can this be parallelized?"** - State-level operations are inherently parallel. Pipelines should support concurrent execution.

@@ -80,14 +122,54 @@ scripts/ # Repository-level utility scripts
## Common Commands

```bash
govbot init # Create govbot.yml config
govbot clone all # Download all state legislation datasets
govbot clone wy il # Download specific states
govbot logs # Stream legislative activity as JSON Lines
govbot logs | govbot tag # Process and tag data
govbot # Scaffold govbot.yml (interactive wizard), then run the pipeline
govbot pull all # Download all state legislation datasets
govbot pull wy il # Download specific states
govbot source # Stream legislative activity as JSON Lines
govbot logs # Deprecated alias for `govbot source` (default mode); back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot logs > bills.jsonl`
govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply
govbot load # Load bill metadata into DuckDB
govbot build # Generate RSS feeds
govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky)
govbot run # Run the full pipeline: pull -> classify -> apply -> publish
```

## govbot source — streaming legislative activity

`govbot source` walks every linked dataset and emits one JSON record per
bill log entry. It is the **source** stage of the stream protocol — the
records `govbot publish` and `fastclass classify` consume.

### The `--filter default` policy

`--filter` defaults to `default`, which applies the per-dataset filter under
`actions/govbot/src/filters/<dataset>/default.rs`. Each dataset's `default.rs`
implements an **action-based** rule that drops *routine* log entries —
introductions, committee referrals, "Bill Number Assigned", "Placed on
General File", boilerplate "President Signed" lines, prefiling, status
updates — so the stream emits only **substantive** events (passage votes,
executive signatures, amendments, defeats, committee reports with content).

This is not a recency cut. A bill whose only log entries are routine
actions — e.g. a freshly-filed bill with just an "Introduction" log —
emits **zero records** under `--filter default` until a substantive event
lands. The bill itself is not deleted; it simply produces no stream rows
yet. Once a substantive log appears (e.g. a passage vote later in the
session), the bill flows through.

If a bill is unexpectedly missing from `source` output:
```bash
govbot source --filter none --repos <dataset> # confirm it's the filter
```
If `--filter none` shows the bill and `--filter default` does not, the
fix is to add a substantive log entry, not to change the filter.
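
The action-based rule described above can be pictured with a small shell sketch. This is only an illustration: the shipped rules are per-dataset Rust code under `actions/govbot/src/filters/`, and the function names `is_routine` and `filter_default`, along with the pattern list, are hypothetical and cover only a few of the routine actions named above.

```bash
# Hypothetical sketch of an action-based filter. The patterns are illustrative,
# not the shipped per-dataset rules.
is_routine() {
  # Lower-case the action so matching is case-insensitive.
  local action
  action=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$action" in
    *introduc*|*"referred to"*|*"bill number assigned"*|*"placed on general file"*|*prefil*)
      return 0 ;;  # routine: would be dropped under --filter default
    *)
      return 1 ;;  # anything else is treated as substantive and kept
  esac
}

# Read one action per line on stdin and print only the substantive ones.
filter_default() {
  local line
  while IFS= read -r line; do
    is_routine "$line" || printf '%s\n' "$line"
  done
}

# A bill with two routine entries and one passage vote emits a single row.
printf '%s\n' \
  "Introduction" \
  "Referred to Committee on Energy" \
  "Passed third reading 40-20" | filter_default
```

Feeding the sketch a bill whose only entry is "Introduction" prints nothing, which mirrors the zero-records behavior described above.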

### The `--select docs` projection

`--select docs` collapses each surviving entry to the
`{"id","text","kind":"docs"}` document the stream protocol defines
(`schemas/STREAM_PROTOCOL.md` §1) — the record `fastclass classify -`
consumes. The default `--select default` keeps the full joined record
for `govbot publish` and ad-hoc analysis.

## DuckDB Integration

@@ -103,20 +185,111 @@ The `govbot load` command loads bill metadata into a DuckDB database for SQL ana

**Usage**:
```bash
govbot clone all # First, get the data
govbot pull all # First, get the data
govbot load # Load into DuckDB
govbot load --memory-limit 32GB # For large datasets
duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI
```

See `actions/govbot/DUCKDB.md` for query examples and schema documentation.

## Classifying with fastclass

Classification is a **pipe** of two tools composed over a
process boundary — govbot streams the data, **fastclass** (a standalone,
self-improving text classifier) classifies it, govbot persists the result:

```bash
govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply
```

- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}`
document per bill carrying the **full bill text** from `metadata.json`; the
`id` is the bill's dataset path, which routes the result back.
- **`fastclass classify -`** scores each document against a **classifier
bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` +
`eval/`). govbot passes only the bundle path; it never reads the bundle.
- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag
`.tag.json` files under `<project>/tags/<dataset>/country:.../sessions/<id>/`
— the files `govbot publish` turns into feeds. It classifies nothing
itself; it is purely the persistence sink.

### Project layout — `tags/` vs `.govbot/` vs `dist/`

A govbot project has four top-level tool-managed dirs, each with a
distinct role; do not conflate them:

- **`.govbot/`** — the tool's **cache**, the `node_modules/` equivalent.
Cloned datasets, content-addressed sync state, an optional registry
override. Fully regenerable; safe to `rm -rf` to start fresh.
**Never edited by hand, never written to by `apply`.** It does NOT
hold user-meaningful state — the bluesky publisher's posted-state
ledger lives under `state/`, not here.
- **`tags/`** — **classification output**, written by `govbot apply`. The
layout mirrors the source path with a dataset prefix:
`tags/<dataset>/country:.../state:.../sessions/<id>/<tag>.tag.json`.
Regenerated by every classify run; the dataset prefix is what isolates
same-named tag files across jurisdictions in a multi-dataset project.
- **`state/`** — **publisher state**, written by `govbot publish`. The
bluesky publisher's posted-state ledger lives at
`state/bluesky-<name>.ledger`. Regenerable-but-operational: deleting
it makes the next run double-post. Peer of `tags/` and `dist/`.
- **`dist/`** — **publisher output**, written by `govbot publish` (RSS /
HTML / JSON feeds).

### Publishers dedup by bill, not by action log

Every publisher (`bluesky`, `rss`, `html`, `json`) emits **one item per
(jurisdiction, bill_id)** — not one per action-log file. A single bill
typically emits many records to the source stream (one per committee
referral, hearing, vote event, …); the publishers collapse them to a
single representative so an activist sees one post per bill in their
feed. Before this fix, the climate-tracker feed posted NV AB1 six
times under six different action logs.

The dedup key is the **bill-level GUID** (`rss::bill_guid`), of the form
`<dataset>/.../sessions/<id>/bills/<bill_id>`. For `bluesky` the
**ledger key** is also this bill-level GUID: future log additions for an
already-posted bill do **not** trigger a re-post. The publisher reads
legacy per-log ledger entries on upgrade — entries written under the
per-bill-log layout collapse to the new bill key cleanly; entries
written under the session-level-log layout (the OCD-files common case)
incur a one-time re-post per previously-posted bill, after which the
ledger holds the new bill-level GUID and the bill never re-posts again.
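
As a rough illustration of the dedup described above, the sketch below collapses per-log paths to a bill-level key and keeps one row per bill. It assumes a hypothetical log layout of `.../bills/<bill_id>/logs/<file>` and picks the first record seen as the representative; the real key comes from `rss::bill_guid`, and the names `bill_key` and `dedup_by_bill` and the sample paths are invented for the example.

```bash
# Hypothetical: reduce a record path to a bill-level key by cutting everything
# after the `bills/<bill_id>` segment.
bill_key() {
  printf '%s\n' "$1" | sed -E 's#(/bills/[^/]+).*#\1#'
}

# Read record paths on stdin and print the first path seen for each bill key.
dedup_by_bill() {
  local path key seen=""
  while IFS= read -r path; do
    key=$(bill_key "$path")
    case "$seen" in
      *"|$key|"*) ;;  # this bill was already emitted: skip the extra log
      *) seen="$seen|$key|"; printf '%s\n' "$path" ;;
    esac
  done
}

# Two action logs for AB1 and one for AB2 collapse to two rows.
printf '%s\n' \
  "nv-legislation/sessions/83/bills/AB1/logs/001.json" \
  "nv-legislation/sessions/83/bills/AB1/logs/002.json" \
  "nv-legislation/sessions/83/bills/AB2/logs/001.json" | dedup_by_bill
```

A ledger keyed the same way would skip a bill as soon as its key is recorded, which is why later log additions for an already-posted bill do not trigger a re-post.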

**`govbot.yml` is NOT the classifier — it is a manifest.** It declares
`datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:`
block**. The tag taxonomy lives in a separate **fastclass classifier bundle**
that the manifest's `transforms.<name>.classifier` field references by path.
The two configs change at different cadences and are read by different tools:
`govbot.yml` answers *"what data, what transforms, what publishers"*; the
classifier bundle's `classifier.yml` answers *"what's relevant"*.

To run the self-improving loop, work inside the classifier bundle directory and
use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`)
and the fastclass `compile evaluate` / `compile backtest` / `compile ratify` primitives. The
retired `fastclass --propose` flag no longer exists. For activists who have
ratified one proposal end-to-end, `/fastclass:improve autonomous` becomes the
ongoing default — constitution-passing proposals apply as usual, coverage-gap
proposals re-test against the rolling eval set and land only if rolling proves
them safe (`generated_by: autonomous-coverage-gap` in `fastclass.lock`).
AGENT.md §2 carries the activist-facing framing.

**Prerequisite**: the `fastclass` binary must be resolvable on `PATH`,
`~/.cargo/bin`, or `~/.govbot/bin` (`cargo install --path <fastclass repo>`).
`govbot run`'s transform stage resolves transform binaries the same way.
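
To check the prerequisite by hand, a lookup along these lines may help. It is a minimal sketch: `resolve_bin` is a hypothetical helper, and the search order (`PATH` first, then `~/.cargo/bin`, then `~/.govbot/bin`) is an assumption rather than a statement of how `govbot run` actually resolves binaries.

```bash
# Hypothetical helper: print the location of the first executable named $1,
# looking on PATH, then in ~/.cargo/bin, then in ~/.govbot/bin.
# The search order is an assumption.
resolve_bin() {
  local name="$1" dir
  if command -v "$name" >/dev/null 2>&1; then
    command -v "$name"
    return 0
  fi
  for dir in "$HOME/.cargo/bin" "$HOME/.govbot/bin"; do
    if [ -x "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1  # not found in any of the three locations
}

if bin=$(resolve_bin fastclass); then
  echo "fastclass found at $bin"
else
  echo "fastclass not found; install it with the cargo command above" >&2
fi
```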

To improve tag quality, read **`AGENTS.md` in the fastclass repo** — the
operational playbook for the classify → eval → propose → backtest → promote
loop. Its one hard rule: never show the frozen `eval/constitution.yml` gold set
to an LLM.

## Testing with Mock Data

Mock legislative data is available for offline development:
- Location: `actions/govbot/mocks/.govbot/repos/`
- Contains: Wyoming (wy) and Guam (gu) sample data
- Usage: `govbot logs --govbot-dir ./actions/govbot/mocks/.govbot`
- Usage: `govbot source --govbot-dir ./actions/govbot/mocks/.govbot`

## govbot Development

@@ -125,7 +298,7 @@ cd actions/govbot
just setup # Install Rust toolchain and dependencies
just test # Run snapshot tests
just review # Review snapshot changes (insta)
just govbot logs # Run CLI in dev mode (uses mocks/.govbot)
just govbot source # Run CLI in dev mode (uses mocks/.govbot)
just mocks wy il # Update mock data for testing
```
