Govbot 2: Upgrade Tagging + Decoupling #35

Open
sartaj wants to merge 32 commits into main from govbot-stack-refactor

sartaj commented May 26, 2026

Summary

Govbot 2 is the next-generation successor to govbot v1. Two product-shape changes drive the rename; backwards compatibility is preserved alongside them:

  • Upgrade & Decouple Tagging — classification is now separated. govbot source emits newline-JSON; an external transform (fastclass classify - today, summarize next) reads/writes the same stream; govbot apply persists results under tags/. Analyzer tools are language-agnostic processes speaking the stream protocol — not govbot subcommands.
  • One runner over a Unix pipeline: govbot run (and bare govbot) orchestrates source | <transforms> | apply, then publish reads tags/ and ships dist/ + state/. Bare govbot falls back to govbot init when no govbot.yml is present.
  • Backwards compatibility — preserved in this same PR so it doesn't break frankies2727/CHN-Bluesky-Govbot-Main.

sartaj and others added 30 commits May 22, 2026 18:18
This commit lands the govbot-stack refactor in five waves, reshaping the
repo from a single-purpose climate-bill action into a layered, data-driven
newsbot framework. See ../govbot-stack-architecture.md for the design and
schemas/STREAM_PROTOCOL.md for the cross-domain wire contract this work
freezes.

Wave 1/2 - schema and manifest cutover, transform DAG, subcommand rename.
schemas/govbot.schema.json drops the legacy `tags:` block, renames `repos:`
to `datasets:` (now required), and introduces `transforms`, `publish`, and
`pipelines` sections so a manifest describes a DAG instead of three
hard-coded stages. src/config.rs Manifest mirrors that shape; the runtime
CLI filter state Config.repos is intentionally kept separate from the
manifest `datasets:` field and documented inline. src/pipeline.rs is
rewritten as a data-driven transform runner so classify, summarize, and
publish stages compose from manifest entries rather than being baked in.
The classify stage now takes an explicit `classifier=<bundle>` path
instead of relying on cwd, eliminating an implicit dependency. Subcommands
are hard-cutover renamed (Clone -> Pull, Tag -> Apply, Build -> Publish,
Logs -> Source) with matching tape/example/snapshot updates.
src/selectors.rs emits full bill text from metadata.json so downstream
stages do not have to re-fetch sources. CLAUDE.md is rewritten to explain
the new fastclass layering.

Wave 3 - publishers, Bluesky posting bot, AGENT.md, extraction source.
Publish surface is split out so each channel (rss, html, json, duckdb,
bluesky) lives as its own module. The Bluesky publisher is a real posting
bot speaking AT Protocol XRPC, reading credentials only from env, keeping
a posted-state ledger to avoid duplicates, and supporting --dry-run. A new
root AGENT.md ships a newsbot-builder playbook intended to be loaded by
URL (no plugin or marketplace machinery). README.md picks up a bootstrap
line pointing newcomers at AGENT.md. The actions/govbot/examples/
climate-activist/ prototype extraction source that previously lived in
this repo is removed in Wave 5 below; its standalone successor lives in
the climate-activist-gov-news-bot repo.

Wave 4 - package-manager layer, WorkingLocale deletion.
The 52-variant compile-time WorkingLocale enum and its generator binary
are deleted (src/locale_generated.rs, src/bin/generate-locale-enum.rs):
jurisdictions are now resolved at runtime through a small package
manager. New src/registry.rs plus data/registry.json catalog 55
datasets across the seed jurisdictions (documented in REGISTRY.md). New
src/lock.rs writes a govbot.lock that pins dataset SHAs for
reproducibility. New src/cache.rs provides a content-addressed cache
under ~/.govbot/cache/ keyed by URL + channel. src/git.rs is rewritten to
resolve the OpenStates / per-jurisdiction index at runtime. New
commands init, add, remove, pull, ls, search, and run wire the package
manager up to the existing pipeline.

Wave 5 - prototype cleanup.
actions/govbot/examples/climate-activist/ is deleted now that the
standalone climate-activist-gov-news-bot repo is the canonical extraction
source for the AGENT.md playbook.

Build-blocker - ureq 2 -> 3 migration.
actions/govbot/Cargo.toml bumps ureq from 2.12 to 3 because 2.12.1 was
undownloadable from crates.io in the build environment; ureq 3.3.0 is
available from the local cache. The two callers were migrated to the 3.x
API: src/bluesky.rs uses .header() in place of .set(), switches from
Error::Status(code, r) matching to .config().http_status_as_error(false)
plus an explicit status check, and replaces .into_json() with
.into_body().read_json(); src/registry.rs replaces .into_string() with
.into_body().read_to_string().

Tests - 30 tests pass (16 unit + 3 api_snaps + 1 cli_example_snaps + 10
wizard_tests). The cli_example_snaps source_basic snapshot is regenerated
against the wy mock so it contains a real govbot source activity stream
(5 bills) instead of the empty placeholder a previous subagent left
behind.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rplate

Two mechanical sweeps after the layering refactor landed:

1. cargo fmt — applies rustfmt's preferences uniformly; no behavioral
   changes. Touches bluesky.rs, cache.rs, git.rs, lock.rs, main.rs,
   pipeline.rs, processor.rs, publish.rs, registry.rs, wizard.rs.
2. In the 55 per-jurisdiction filter boilerplate files under
   src/filters/, replace 'just govbot logs --repos=...' with
   'just govbot source --repos=...'. The 'logs' subcommand was renamed
   to 'source' in the refactor; the filter-update LLM prompt embedded
   in each file still referenced the old name.

Tests: 30/30 pass, snapshots unchanged (the diff is purely in comments
and rustfmt cosmetics).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLI help (main.rs):
- Replace the legacy top-level 'about' ('Process pipeline log files
  with type-safe reactive streams') with a description of what govbot
  actually does today.
- Rewrite every subcommand's help to reflect current behavior — drop
  retired 'locales' terminology in 'delete', expand 'load' / 'update'
  / 'publish' / 'run' from one-liners into actually-useful paragraphs.
- 'source --repos' now has visible_alias 'datasets' so the flag name
  lines up with the manifest field that was renamed in the refactor.
  The old '--repos' continues to work.

Wizard (.gitignore):
- Generate a richer .gitignore that covers the publisher output dirs
  ('dist/', 'docs/') and the credential file ('.env'), not just
  '.govbot/'. A fresh project from 'govbot init' is now safer to push
  to GitHub without accidentally committing generated artifacts or
  credentials.
- The writer is now idempotent: re-running on an existing .gitignore
  only adds the missing entries.
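
A minimal sketch of what an idempotent `.gitignore` writer could look like, assuming entries are compared line by line after trimming. The function name `merge_gitignore` is hypothetical and is not taken from the wizard's source.

```rust
/// Hypothetical helper: append only the entries that are not already present.
/// Existing lines are left untouched, so running it twice changes nothing.
fn merge_gitignore(existing: &str, wanted: &[&str]) -> String {
    let mut out = existing.to_string();
    for entry in wanted {
        // Assumption: an entry counts as present if any line equals it after trimming.
        let present = out.lines().any(|line| line.trim() == *entry);
        if !present {
            // Make sure the appended entry starts on its own line.
            if !out.is_empty() && !out.ends_with('\n') {
                out.push('\n');
            }
            out.push_str(entry);
            out.push('\n');
        }
    }
    out
}
```

Calling it with the entries named above (`.govbot/`, `dist/`, `docs/`, `.env`) on a file that already lists `.govbot/` appends only the other three.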

ls behavior:
- 'govbot ls' help promises 'with no manifest, lists every dataset in
  the registry'. It didn't — now it does. Discovery in a bare directory
  works as advertised.

Cleanup (clippy):
- Drop a redundant 'use serde_json;' and a couple of small lints
  ('DESC | _' wildcard, an 'if !x.is_empty() {""} else {""}' dead
  branch in the HTML index footer, a stylistic 'return' before a
  cfg-gated block ending).

Snapshots: updated 'cli_example_snaps@govbot_help' + the three wizard
session snapshots to match. 30/30 tests still green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
README.md:
- Drop the '47 states' boilerplate (the bundled registry now lists 55
  datasets — every state legislature + DC + the territories + federal
  Congress).
- Fix the wizard description: it no longer 'guides you through creating
  tags' — that moved into the fastclass classifier bundle when the
  layers split.
- Rewrite 'Other Commands' to cover the new registry-backed verbs
  (search/add/remove/ls) and call out 'publish --dry-run' for Bluesky.
- Add a 'Classifying with fastclass' section pointing at AGENT.md and
  STREAM_PROTOCOL.md so the README explains the two-tool composition
  instead of just listing flags.
- Replace the 'WIP / GitHub Actionable' note with a real description of
  the data catalog: coverage + GOVBOT_REGISTRY_URL override.

AGENT.md:
- Be honest about the fastclass install: '<fastclass repo>' was a
  placeholder that would have stopped a fresh session cold. Now the
  playbook tells the agent to ask the user for their fastclass checkout
  path (the public remote is still an architecture open question) and
  to STOP rather than scaffold a broken project if the user has no
  fastclass install.

CLAUDE.md:
- Fix the publish help-line which only listed RSS/HTML/JSON/DuckDB and
  omitted Bluesky.
- Update the '~47 jurisdictions' 10x-data line to ~55 dataset repos and
  point at the registry as what makes 10x feasible.

registry.rs:
- Update the '52-jurisdiction seed' comment + test message to reflect
  the catalog's actual size; switch the floor assertion from '>= 52' to
  '>= 55' and include the actual count in the failure message.

All 30 tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: with a project manifest at the repo root, `govbot apply`
materialized `country:us/state:wy/sessions/<year>/tags/<tag>.tag.json`
at the project root, not inside the dataset's own
`.govbot/repos/<short>/` directory. Userland had to patch its
`.gitignore` to suppress the misplaced files, and the dataset's
on-disk layout no longer mirrored where the bill metadata came from.

Root cause: `run_apply_command` defaulted its base output dir to the
current working dir, ignoring the dataset's identity which is already
carried in the result's `doc` field
(`<dataset>/country:.../sessions/.../bills/...`).

Fix: `parse_doc_route` now also extracts the leading `<dataset>`
segment (the dataset's `short_name`); when `--output-dir` is unset
`apply` routes each tag file under
`<project>/.govbot/repos/<dataset>/country:.../sessions/.../tags/` so
the file lands alongside the bill's `metadata.json`. An explicit
`--output-dir` preserves current behaviour (verbatim root). A
prefix-less doc id (non-govbot source) still falls back to the project
dir rather than dropping the record.
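
The routing described in this fix can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: the `DocRoute` struct, its field names, and the `tag_dir` helper are hypothetical, and only the name `parse_doc_route` comes from the commit.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical parsed form of a result's `doc` id.
#[derive(Debug, PartialEq)]
struct DocRoute {
    /// Leading dataset short name; `None` for a prefix-less (non-govbot) id.
    dataset: Option<String>,
    /// Segments from `country:...` up to (not including) `bills`.
    session_path: String,
    bill_id: String,
}

/// Sketch: split `<dataset>/country:.../sessions/.../bills/<id>` into its parts.
/// Returns `None` when there is no `country:` or `bills/<id>` segment.
fn parse_doc_route(doc: &str) -> Option<DocRoute> {
    let parts: Vec<&str> = doc.split('/').filter(|s| !s.is_empty()).collect();
    let country = parts.iter().position(|s| s.starts_with("country:"))?;
    let bills = parts.iter().position(|s| *s == "bills")?;
    if bills < country {
        return None;
    }
    let bill_id = parts.get(bills + 1)?.to_string();
    let dataset = if country > 0 { Some(parts[0].to_string()) } else { None };
    Some(DocRoute {
        dataset,
        session_path: parts[country..bills].join("/"),
        bill_id,
    })
}

/// Sketch of the default destination when `--output-dir` is unset.
fn tag_dir(project: &Path, route: &DocRoute) -> PathBuf {
    match &route.dataset {
        Some(ds) => project
            .join(".govbot")
            .join("repos")
            .join(ds)
            .join(&route.session_path)
            .join("tags"),
        // Prefix-less ids fall back to the project dir instead of being dropped.
        None => project.join(&route.session_path).join("tags"),
    }
}
```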

Added unit tests for `parse_doc_route`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: a first-time `govbot run` against a manifest with a
`bluesky` publisher dies red — even after classify / apply / rss /
html all succeeded — with "the bluesky publisher needs the
BLUESKY_HANDLE environment variable". Activists setting up the bot
before they have an app password see a hostile error and no way to
preview what the pipeline produces.

Root cause: the bluesky publisher unconditionally authenticated as
soon as it had pending records, and `govbot run` had no flag to
propagate `--dry-run` to the publishers it spawned. The existing
`govbot publish --dry-run` only covered the standalone-publish flow.

Fix (both, the bug report asked for either):
  - `govbot run --dry-run`: a new flag on the `Run` subcommand that
    propagates to every publisher (passes `--dry-run` through to
    `govbot publish`). Recommended first invocation: every publisher
    renders, nothing emits.
  - `bluesky` publisher: when `BLUESKY_HANDLE` /
    `BLUESKY_APP_PASSWORD` are not set and the publisher is not in
    dry-run, log a `WARN` and skip rather than bailing. The rest of
    the pipeline (rss / html / json / duckdb) keeps running.
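
The combined behaviour of the two fixes can be read as a small decision table. The sketch below is an assumption-laden illustration: the enum `BlueskyAction` and the function `plan_bluesky` are hypothetical names, and the environment lookup is passed in as a closure so the logic can be exercised without mutating the process environment.

```rust
/// Hypothetical outcome of the bluesky publisher for one run.
#[derive(Debug, PartialEq)]
enum BlueskyAction {
    /// `--dry-run`: render posts, emit nothing, no credentials needed.
    Render,
    /// Credentials present and not in dry-run: authenticate and post.
    Post,
    /// Credentials missing and not in dry-run: log a WARN and skip.
    SkipWithWarning,
}

/// True when both variables named in the commit are set.
fn creds_present(get: impl Fn(&str) -> Option<String>) -> bool {
    ["BLUESKY_HANDLE", "BLUESKY_APP_PASSWORD"]
        .iter()
        .all(|key| get(*key).is_some())
}

/// Sketch of the decision; dry-run takes precedence over the credential check.
fn plan_bluesky(dry_run: bool, get: impl Fn(&str) -> Option<String>) -> BlueskyAction {
    if dry_run {
        BlueskyAction::Render
    } else if creds_present(get) {
        BlueskyAction::Post
    } else {
        BlueskyAction::SkipWithWarning
    }
}
```

In a real binary the closure would typically be `|k| std::env::var(k).ok()`.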

AGENT.md §1.4 and §2.3 and README.md updated to mention both behaviours.

Added `creds_present_reflects_env` unit test (and an in-test
ENV_LOCK mutex so the env mutation is safe under parallel tests).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: with `.govbot/repos/wy-legislation/` already populated
(seeded locally rather than `govbot pull`-ed), `govbot run` still
ran `govbot pull` which tried to clone-or-pull through the shared
`~/.govbot/cache/` cache. In a sandboxed environment (read-only
HOME) that surfaced as `❌ wy IO error: Operation not permitted`
inside an otherwise-green pipeline.

Root cause: `pipeline::run_pipeline` always shelled out to
`govbot pull`, regardless of whether the project already had every
declared dataset sitting in `.govbot/repos/`.

Fix: before step 1, classify each manifest dataset against the
project-local `repos/` directory. If every dataset has a non-empty
`<repos>/<seed_dir_name>/` directory, the pull substep is reduced
to one `📂 using local seed: <path>` line per dataset and the
subprocess is not spawned at all. When *any* dataset is missing
locally, the pull subprocess still runs as today (cache failures
during a real pull stay loud — they are only silenced when the seed
is what `pull` would have synced to).
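
One plausible shape for the local-seed check, using only the standard library. `is_local_seed` echoes the name in the unit test mentioned below; `should_skip_pull` and both signatures are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// A directory counts as a local seed when it exists and has at least one entry.
/// A missing or unreadable directory is treated as "not seeded".
fn is_local_seed(dir: &Path) -> bool {
    fs::read_dir(dir)
        .map(|mut entries| entries.next().is_some())
        .unwrap_or(false)
}

/// Sketch: skip the `govbot pull` subprocess only when every declared dataset
/// already has a populated directory under the project-local repos dir.
fn should_skip_pull(repos: &Path, seed_dir_names: &[&str]) -> bool {
    seed_dir_names
        .iter()
        .all(|name| is_local_seed(&repos.join(name)))
}
```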

Added `is_local_seed_detects_populated_dir` unit test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: with the userland manifest's
`post_template: "{title}\n\n{tags} · {link}"`, a dry-run rendered
`clean_energy ·` with nothing after the bullet — the `{link}`
placeholder was always empty.

Root cause: `bluesky::render_post` called `rss::extract_link(entry,
None)` — `extract_link`'s `base_url` argument is what prefixes the
record's source-relative path, so without it the rss/html path was
never assembled and the empty string fell back through.

Fix: thread the publisher's `base_url` field — same shape as the
rss/html publishers — into `render_post`, then into
`extract_link`. When unset, `extract_link`'s existing fallback to
`bill.sources[0].url` kicks in so manifests without a base_url still
get a sensible link (or, last-resort, an empty string).

`schemas/govbot.schema.json`: extended the `base_url` description to
mention the bluesky `{link}` placeholder semantics. (The field was
already declared on every publisher; this is documentation.)

AGENT.md §1.3 manifest template and §2.2 bluesky publisher table
updated. **Userland note:** the next AGENT.md make-flow should add
`base_url:` to the bluesky publisher in
climate-activist-gov-news-bot's `govbot.yml`
(e.g. the same URL as the rss/html publishers'). Without it,
`{link}` will still resolve via `bill.sources[0].url` when
available, so it is not strictly required.

Added `render_link_uses_publisher_base_url` and
`render_link_falls_back_to_bill_source_url` unit tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: a wy-legislation bill seeded with only a 2025-01 log was
silently absent from `govbot source` output until a 2025-12 log was
added. The `--filter default` rule wasn't documented anywhere and
its behaviour was opaque: users assumed the cut was date-based when
it is actually action-based.

Root cause: `--filter default` runs the per-dataset `default.rs`
under `actions/govbot/src/filters/<dataset>/`. Those filters drop
routine log actions ("introduction", "referral-committee", "Bill
Number Assigned", "Placed on General File", etc.) so the stream
emits only substantive events. A freshly-filed bill whose only logs
are routine actions emits zero records until a substantive event
lands. Nothing about this was documented.

Fix: documentation only — preserving current behaviour. Per the bug
brief, prefer documentation + an explicit `--filter none` opt-out
over silently widening the cut (a behaviour change).

  - `govbot source --help`'s `--filter` description now spells out
    what `default` drops and notes that `--filter none` is the
    "is this a filter problem?" troubleshooting flag.
  - CLAUDE.md gains a "govbot source" section under Common Commands
    covering the `--filter default` policy and the
    `--select docs` projection.

`--filter none` already exists as the opt-out — no new
`--all-logs` flag is introduced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: after Wave 6 (80617ac) moved tag files into the dataset
(`.govbot/repos/<dataset>/.../tags/`), `govbot source --join tags`
returned 0 tag joins and the publishers reported
`Generating RSS feed with 0 entries...` even when tag files were
present at the new location.

Root cause: the `--join tags` branch built `tags_dir` from
`std::env::current_dir()` joined with `country:.../state:.../sessions/<id>/tags`
— the *old* project-root layout. The apply side moved; the consumer
side did not.

Fix: derive the tags dir from the dataset path the source iterator is
already walking. `resolve_tags_dir` walks up from the log file path
until it finds an ancestor whose immediate child is `bills/` and
returns the sibling `tags/`. If the dataset-rooted lookup yields no
matching tags, fall back to the legacy cwd-rooted construction so any
pre-existing project-root layouts (and any explicit `--output-dir`
overrides that landed there) still resolve.

Tag-matching is pulled into `match_tags_in_dir` so both lookups share
one implementation. Unit tests cover the resolver (canonical layout,
loose path outside the layout) and the matcher (hit, miss, absent dir).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`.govbot/` is the tool's cache — the equivalent of `node_modules/`,
`target/`, or `.venv/`. It holds cloned datasets, sync state, and an
optional registry override. It must stay tool-owned and user-edit-free
so a `rm -rf .govbot/` always restores a clean cache without losing the
bot's actual work.

Bug 1's fix (and Bug 6's read-side follow-up `6cbb12e`) moved tag files
*into* the cache. That solved the dirty-project-root complaint but it
was the wrong answer: tag files are derived classification **outputs**,
not cache contents. They are produced by `apply`, consumed by
publishers, and represent the bot's state of "which bills have been
classified, and how". They belong in their own dedicated location.

What changed:

- `govbot apply` now writes per-tag `.tag.json` files under
  `<project>/tags/<dataset>/country:.../state:.../sessions/<id>/<tag>.tag.json`.
  `tags/` is a project-rooted classification-output dir, peer to `dist/`
  (publisher output) and distinct from `.govbot/` (the cache). The
  `<dataset>` short-name prefix is what isolates same-named tag files
  across jurisdictions in a multi-dataset project.

- `govbot source --join tags` resolves tag dirs in a three-stage
  fallback chain (first non-empty wins, silent on miss):
    1. Primary: `<project>/tags/<dataset>/country:.../sessions/<id>/`
       — the new layout above.
    2. Fallback A: `<session_dir>/tags/` inside `.govbot/repos/...` —
       the Bug 6 location, kept read-only for working trees mid-
       migration. `apply` never writes here.
    3. Fallback B: the pre-Bug-1 cwd-rooted `country:.../sessions/<id>/
       tags/` — kept for layouts that pre-date the dataset-rooted move,
       and for explicit `--output-dir` overrides that landed there.

- Wizard `.gitignore` now adds `tags/` with an inline comment
  documenting the trade-off (file count grows with the catalog and
  most bots regenerate from raw data — git-ignore by default; remove
  the line to commit classification provenance).

- `--output-dir` help text updated to say the default is now
  `<project>/tags/` (was: the directory containing `govbot.yml`).
  Explicit overrides remain a verbatim root — the dataset prefix is
  dropped, as before.

- Docs (CLAUDE.md, AGENT.md, README.md) gained an explicit
  three-dir layout section: `.govbot/` = cache, `tags/` = classification
  output, `dist/` = publisher output.
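
The three-stage lookup for `--join tags` could be sketched as an ordered candidate list plus a "first non-empty wins" pick. This is illustrative only: the function names and signatures are hypothetical, and the cache-side path assumes the dataset directory under `.govbot/repos/` is named after the dataset short name.

```rust
use std::path::{Path, PathBuf};

/// Sketch: ordered tag-dir candidates for one session.
/// `session_path` is the `country:.../sessions/<id>` portion of the doc id.
fn tags_dir_candidates(
    project: &Path,
    dataset: &str,
    session_path: &str,
    cwd: &Path,
) -> Vec<PathBuf> {
    vec![
        // 1. Primary: project-rooted classification output.
        project.join("tags").join(dataset).join(session_path),
        // 2. Fallback A: read-only legacy location inside the cache.
        project
            .join(".govbot")
            .join("repos")
            .join(dataset)
            .join(session_path)
            .join("tags"),
        // 3. Fallback B: pre-dataset cwd-rooted layout.
        cwd.join(session_path).join("tags"),
    ]
}

/// First candidate for which the caller-supplied check reports content.
/// Silent on miss: returns `None` rather than an error.
fn first_non_empty<'a>(
    candidates: &'a [PathBuf],
    is_non_empty: impl Fn(&Path) -> bool,
) -> Option<&'a PathBuf> {
    candidates.iter().find(|p| is_non_empty(p.as_path()))
}
```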

Tests:
- `parse_doc_route_*` comments updated for the new destination.
- `resolve_tags_dir` replaced with `resolve_tags_dir_candidates` returning
  the ordered candidate list; new regression pins the project-rooted
  primary over the cache-rooted fallback.
- New `tag_paths_are_dataset_isolated` asserts two datasets sharing a
  country/state/session route same-named tag files to distinct files
  under `tags/<short>/...`, never into `.govbot/`.
- 40 → 41 tests passing offline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: with a configured bluesky publisher whose base_url was set
(e.g. https://example.org/climate-tracker), the rendered post's {link}
placeholder resolved to the raw metadata.json path under that base_url:
  https://example.org/climate-tracker/wy-legislation/.../HB9999/metadata.json
which sends an activist's reader to a JSON file, not the
human-readable HTML index the manifest's `site` (html) publisher
already produces.

Root cause: render_post passed only the bluesky publisher's own
base_url to rss::extract_link, which always appends the bill's
dataset sources.bill path (a metadata.json path). The bluesky
publisher had no awareness of the manifest's other publishers.

Fix (Option A — couple via the manifest, not new fields): the
publish-command flow finds the manifest's `type: html` publisher
once, takes its base_url, and threads it into every PublishJob as
`html_entry_url`. The bluesky publisher's {link} resolves in this
priority:
  1. html_entry_url — the html publisher's landing page;
  2. base_url joined to sources.bill (the historical default).
Falls back to the historical shape when no html publisher exists in
the manifest, so this is purely additive. No new manifest fields.
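
A rough sketch of the `{link}` priority, assuming plain string joining. `resolve_link` and its parameters are hypothetical; the final fallbacks (the bill's first source URL, then an empty string) follow the earlier `extract_link` description rather than anything new in this commit.

```rust
/// Sketch of the `{link}` resolution order:
///   1. the html publisher's landing page, if the manifest has one;
///   2. the bluesky publisher's own base_url joined to the bill path;
///   3. the bill's first source URL;
///   4. an empty string as the last resort.
fn resolve_link(
    html_entry_url: Option<&str>,
    base_url: Option<&str>,
    bill_path: &str,
    source_url: Option<&str>,
) -> String {
    if let Some(url) = html_entry_url {
        return url.to_string();
    }
    if let Some(base) = base_url {
        // Avoid a doubled or missing slash at the join point.
        return format!(
            "{}/{}",
            base.trim_end_matches('/'),
            bill_path.trim_start_matches('/')
        );
    }
    source_url.unwrap_or("").to_string()
}
```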

Why not Option B (entry_path_template) or C (always-prefer-html in
extract_link): B adds new manifest surface every project has to
learn; C entangles the html publisher's own output behavior with
bluesky's link resolution. Option A is the smallest change that
gives activists the URL they actually want, defaulting from data
already in the manifest.

Reproduction (in climate-activist-gov-news-bot, after seeding wy
mocks + a synthetic HB9999 with climate text and a 2025-12 passage
log, then `govbot apply` against a synthetic fastclass result):

  before: clean_energy · https://example.org/climate-tracker/wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999/metadata.json
  after:  clean_energy · https://example.org/climate-tracker

AGENT.md §2.2 and the publisher schema are updated to describe the
new default. New regression test
render_link_prefers_html_publisher_landing_page locks the priority
in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Symptom: a manifest declaring both an rss publisher and an html
publisher into the same output_dir silently overwrote files. The
userland e2e produced:

  === Publisher 'feed' (Rss) ... ===
  ✓ Generated RSS feed: dist/climate-feed.xml
  ✓ Generated HTML index: dist/index.html    ← rss publisher wrote HTML

  === Publisher 'site' (Html) ... ===
  ✓ Generated RSS feed: dist/feed.xml        ← html publisher wrote RSS
  ✓ Generated HTML index: dist/index.html    ← COLLISION, last writer wins

Root cause: `emit_rss_html` was one function handling both kinds —
each call wrote both feed.xml and index.html regardless of `type`.
The second publisher to run (sorted by name) silently overwrote
the first's index.html.

Fix: split into two functions.
  - `type: rss`  -> emit_rss  writes only <output_dir>/feed.xml
  - `type: html` -> emit_html writes only <output_dir>/index.html
Each publisher's `output_file` defaults to its own kind's
filename. Declaring both gets both, side-by-side, no collision.
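
The "one type, one artifact" rule can be illustrated with a small mapping from publisher kind to default filename. The enum and function names here are hypothetical; only the filenames `feed.xml` and `index.html` and the `output_file` override come from the commit.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical publisher kinds for the two split emitters.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PublisherKind {
    Rss,
    Html,
}

/// Each kind defaults to its own filename, so two publishers sharing
/// an output_dir never write the same file by default.
fn default_output_file(kind: PublisherKind) -> &'static str {
    match kind {
        PublisherKind::Rss => "feed.xml",
        PublisherKind::Html => "index.html",
    }
}

/// An explicit `output_file` wins; otherwise the per-kind default applies.
fn output_path(output_dir: &Path, kind: PublisherKind, output_file: Option<&str>) -> PathBuf {
    output_dir.join(output_file.unwrap_or(default_output_file(kind)))
}
```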

Reproduction (climate-activist-gov-news-bot, after seeding wy
mocks + a synthetic HB9999 + apply'ing a synthetic fastclass
result):

  before: dist/ contained climate-feed.xml AND feed.xml (the
          rss publisher's output_file vs the html publisher's
          cross-emitted feed.xml), and index.html was whichever
          publisher ran last.
  after:  dist/ contains climate-feed.xml + index.html, written
          by their respective publishers — no cross-emission, no
          collision.

Side effects:
- The wizard now scaffolds BOTH a `feed` (type: rss) and a `site`
  (type: html) publisher so a fresh project still gets both
  artifacts by default. Snapshots regenerated.
- AGENT.md §1.3 now documents both publisher kinds in the
  template and calls out the "one type, one artifact" rule.
- schemas/govbot.schema.json describes the per-kind output_file
  defaults.

Userland breaking change: a manifest that relied on the
double-emission (e.g. declaring only `type: rss` and expecting
index.html to also appear) will need to add a peer `type: html`
publisher. The climate-activist userland already declares both
under `feed` and `site`, so it picks up the fix transparently —
its `dist/feed.xml` cross-emission (which collided with the rss
output_file `climate-feed.xml`) goes away.

Three regression tests in src/publish.rs lock the split in:
- rss_publisher_writes_only_feed_xml
- html_publisher_writes_only_index_html
- rss_and_html_publishers_coexist_in_one_output_dir

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The module no longer has anything to do with embeddings — the ONNX
machinery moved out when classification was delegated to fastclass.
What remained was a misnomer: the in-process apply sink for
fastclass result JSON. The on-disk artifact this module models is
the per-tag `.tag.json` file `govbot apply` writes; `tagfile` is
what it actually does.

Pure rename — no behavior change:
- src/embeddings.rs -> src/tagfile.rs (git mv keeps history).
- lib.rs swaps `pub mod embeddings;` -> `pub mod tagfile;` and the
  re-export `pub use embeddings::{...}` -> `pub use tagfile::{...}`.
- Module-level docstring rewritten to drop the "what used to be
  here" framing and describe what the module currently does;
  retains a single historical sentence for archaeology.

All 45 tests stay green; no `use govbot::embeddings::*` sites
elsewhere needed updating (lib.rs was the only consumer; main.rs
already imports through the `govbot::*` re-export).

A trailing `grep -rn '\bembeddings\b' actions/govbot/src actions/govbot/tests`
now matches only the historical sentence in tagfile.rs's
module-level doc comment — no live code references the old name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`govbot source --select docs` was emitting session-level ids when the
on-disk log lived at `<dataset>/.../sessions/<id>/logs/<file>.json`
(the common windycivi-cloned shape across most states). The walker
reports the symlink path, not its canonical per-bill target, so the
old `sources.log.split("/logs/").next()` builder stopped at the
session and dropped the `/bills/<bill_id>` segment from the id.

Real-world impact: `govbot pull all` across 55 jurisdictions produced
4916 records compressed to 97 unique ids — every bill in a session
collapsed onto one id. Downstream, `parse_doc_route` refuses ids
without a `bills` segment, so `govbot apply` skipped every collapsed
record entirely. STREAM_PROTOCOL §1 mandates `id` be the bill's
dataset path; the projection was silently violating the contract.

Fix: when the path stripped of `/logs/...` doesn't already end in
`/bills/<bill_id>`, append `/bills/<bill_id>` from `log.bill_id`.
This normalises both layouts to the same `<dataset>/country:<c>/
state:<s>/sessions/<id>/bills/<bill_id>` shape, restoring the
contract `apply` and the bluesky publisher rely on.
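
A minimal sketch of the normalisation, assuming string-level handling of the log path. `doc_id_from_log` is a hypothetical name, and the example paths in the comments are illustrative.

```rust
/// Sketch: build a bill-level doc id from a log path and the log's bill id.
/// Per-bill layout:      <dataset>/.../sessions/<id>/bills/<bill>/logs/<file>
/// Session-level layout: <dataset>/.../sessions/<id>/logs/<file>
fn doc_id_from_log(log_path: &str, bill_id: &str) -> String {
    // Strip `/logs/...`; `split` always yields at least one piece.
    let stem = log_path.split("/logs/").next().unwrap_or(log_path);
    let suffix = format!("/bills/{}", bill_id);
    if stem.ends_with(&suffix) {
        // Already a bill-level path: keep as is.
        stem.to_string()
    } else {
        // Session-level path: append the missing bill segment.
        format!("{}{}", stem, suffix)
    }
}
```

Both layouts then produce the same `<dataset>/country:<c>/state:<s>/sessions/<id>/bills/<bill_id>` shape, so sibling bills in one session no longer share an id.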

Why the mocks missed it: `actions/govbot/mocks/.govbot/repos/wy-
legislation/` ships its logs directly under each `bills/<id>/logs/`
dir (no symlinks). The buggy `split("/logs/").next()` already
produced the correct bill path for that layout, so no existing test
exercised the session-level-symlink shape real `govbot pull` writes.

Three new unit tests pin the contract: the per-bill layout
round-trips through `parse_doc_route`, the session-level layout
gets its bill_id appended, and 4 sibling bills in one session
hash to 4 distinct ids. Reproduced end-to-end against the cached
55-state corpus: 4916/97 → 4916/3989 (remaining collisions are
legitimate per-bill log re-emissions, not session-wide collapse).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Real-data bug surfaced by the 55-state corpus: MI/WV/ND/PA logs ship
`bill_id` as a *display* form with a space — "HB 5077", "SB 0001",
"HR 0163" — even though the on-disk directory is `bills/HB5077/`,
`bills/SB0001/`, `bills/HR0163/`. The pre-fix `ocd_entry_to_doc`
appended `log.bill_id` verbatim, so the doc id became
`mi-legislation/.../bills/SB 0001` (with a space). Any downstream
sibling-file lookup — `os.path.join(REPOS, doc, "metadata.json")` —
404'd because no such directory exists on disk; the architect saw
"(no metadata.json)" for ~30% of bills across the affected states.

Fix: source the `/bills/<dir>` segment from `sources.bill` (the
parent dir of the resolved `metadata.json`), which is the
authoritative on-disk dir name — the bill join already canonicalized
the symlinked log path to reach it. Fall back to `log.bill_id` only
when `--join bill` wasn't requested.

The Layout-1 detector (when `sources.log` already ends in
`/bills/<dir>` because the walker landed on the per-bill log
directly) must also consider the canonical dir name; otherwise
states whose `log.bill_id` has whitespace would fail the Layout-1
check and double-append, producing `.../bills/HR0163/bills/HR0163`.
Sample over the corpus: ~50% of mi/nd/pa records exhibited the
doubled-bills id before this fix.
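
The canonical-directory lookup might look something like this, using `std::path::Path` to take the parent directory name of the resolved `metadata.json`. Both function names are hypothetical.

```rust
use std::path::Path;

/// Sketch: prefer the on-disk directory name (parent of `metadata.json`)
/// and fall back to the log's display `bill_id` only when no bill path is known.
fn canonical_bill_dir(sources_bill: Option<&str>, log_bill_id: &str) -> String {
    sources_bill
        .and_then(|p| Path::new(p).parent())
        .and_then(|dir| dir.file_name())
        .and_then(|name| name.to_str())
        .map(|name| name.to_string())
        .unwrap_or_else(|| log_bill_id.to_string())
}

/// Layout-1 check: does the stripped log path already end in `/bills/<dir>`?
/// Comparing against the display form ("HR 0163") would miss and double-append.
fn is_per_bill_layout(stem: &str, dir: &str) -> bool {
    stem.ends_with(&format!("/bills/{}", dir))
}
```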

Validation across the full 55-state pull (`govbot source --select
docs --filter none --limit none`, 2,377,146 records):

  - 0 ids with whitespace in the bill segment (was: hundreds of
    thousands)
  - 0 doubled `bills/<dir>/bills/<dir>` patterns (was: ~50% of
    mi/nd/pa)
  - 0 of 1038 sampled records missed `metadata.json` (20 per state
    across 52 states; was: ~30% missing for mi/wv/nd/pa)

Test count: 52 -> 57 (4 new regression tests + 1 helper unit test).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two real-data bugs shipped this stack because the only test harness was
the mock dataset, which happened to fit one happy-path layout:

  7592418 — `source --select docs` emitted session-level ids when the
            on-disk log lived at `.../sessions/<id>/logs/<file>` (the
            symlinked OCD layout most states use). 4916 records
            collapsed onto 97 distinct ids; downstream `apply` and
            `bluesky` silently dropped almost everything.

  5ab6d3c — `source --select docs` id used `log.bill_id` (e.g.
            "HB 4027", with whitespace) instead of the canonical
            on-disk dir name (`HB4027`) on MI/WV/ND/PA. The mismatched
            id then resolved to a non-existent `metadata.json` path —
            ~30% missing-metadata hits in title lookup.

Both would have been caught by a five-line check over a real pulled
cache. `govbot doctor` is that check, wired to a CLI verb activists
can run after `pull all` to confirm the project is coherent before
flipping `bluesky` off `--dry-run`.

What it checks, per dataset:
  - coverage          — at least one record emitted (WARN, not FAIL,
                        when `--filter default` legitimately drops
                        every routine log)
  - id_distinctness   — distinct-id / record-count ratio ≥ 0.03; the
                        bug 7592418 signature was 97/4916 = 0.02
  - metadata_sampleable — N sampled ids each resolve to a present,
                        parseable `metadata.json` with at least a
                        `title` or `identifier` (catches 5ab6d3c)
  - text_non_empty    — sampled `text` ≥ 50 chars (catches collapsed
                        bill joins)

And globally:
  - dataset_links     — every `*-legislation` entry in `.govbot/repos/`
                        resolves to a real directory (catches dangling
                        symlinks `get_local_datasets` filters out)
  - routable_ids      — every emitted id has a recognisable
                        `<dataset>/country:.../bills/<id>` shape
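
The `id_distinctness` check reduces to a ratio test. The sketch below assumes ids arrive as string slices; the function names and the constant name are hypothetical, while the 0.03 floor and the 97/4916 signature come from the description above.

```rust
use std::collections::HashSet;

/// Floor from the check description: distinct ids / records must be at least 0.03.
const MIN_DISTINCT_RATIO: f64 = 0.03;

/// Distinct-id to record-count ratio; `None` when no records were emitted
/// (that case belongs to the separate coverage check).
fn distinct_ratio<'a, I>(ids: I) -> Option<f64>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut total = 0usize;
    let mut seen = HashSet::new();
    for id in ids {
        total += 1;
        seen.insert(id);
    }
    if total == 0 {
        None
    } else {
        Some(seen.len() as f64 / total as f64)
    }
}

/// True when the ratio clears the floor.
fn id_distinctness_passes(ratio: f64) -> bool {
    ratio >= MIN_DISTINCT_RATIO
}
```

With the collapsed corpus, 97 distinct ids over 4916 records gives roughly 0.0197, which falls under the floor; 3989 over 4916 gives roughly 0.81, which clears it.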

Skips cleanly when `.govbot/repos/` is absent or empty — this is a
smoke test against pulled data, not a unit test against mocks.

Form chosen: **C (user-facing `govbot doctor` subcommand).** The
A (cargo-test + env var) and B (just recipe) variants would have
been smaller diffs, but they hide the check from end users. A real
verb activists can run answers the right question — "is my project
coherent?" — without making them learn the test harness. Exit code
is non-zero on failure so it drops straight into a CI step; output
defaults to a human summary, with `--output json` for machine
consumers.

How to invoke:
  govbot doctor                          # text summary
  govbot doctor --output json            # machine-readable
  govbot doctor --sample 50 --limit 200  # more thorough sweep

Acceptance: against the architect's pulled cache at
`/Users/sartaj/Git/climate-activist-gov-news-bot/`, doctor finds
55 datasets, drains 4916 records to 3958 distinct ids in 16s
(<60s budget), and exits PASS with three WARN states (az/gu/va:
the filter dropped every routine log). The MI HB 4027 case, which
resolves correctly post-5ab6d3c, passes. Regression demo: delete every
metadata.json under a tmp-cloned `mi-legislation/`; doctor exits 1
with `metadata_sampleable: FAIL` and prints the missing-file paths,
including the exact pre-5ab6d3c `HB 4027` (with-whitespace) id
format. Real cache untouched.

`cargo test --offline`: 28+19+3+1+10 = 61 green (was 57); three new
unit tests pin the bug-5ab6d3c metadata-resolve leg, one pins the
`dataset_short_name` prefix/suffix bridge, and one pins the
`metadata.json has neither title nor identifier` rejection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A scaffolded classifier bundle has the taxonomy and the fusion config but
no embedding model — so the cascade in fusion.yml's `uncertainty_band`
silently degrades to lexical-only matchers and the bot misses paraphrases
and euphemisms ("energy diversity" never matches `clean_energy`). Real-data
audits show this as a 10–15 point recall gap.

Add an "Install the semantic Tier-2 model" step to §1.3 of the make flow,
right after the fastclass plugin is wired into `.claude/settings.json`.
Tells the activist to run `/fastclass:install-model` (the new plugin
command in the fastclass repo), explains why, and shows the
`fastclass describe` verification that the install worked.

The plugin command does the actual work — fetches the recommended
sentence-transformers/all-MiniLM-L6-v2 (~22 MB) into the shared
`~/.govbot/models/<sha-prefix>/` cache and links it into
`classifier/model/`. The install path is idempotent: re-running on a warm
cache hits the cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gnal

OCD-files `metadata.json` carries a high-quality `subject:` array assigned
by human OCD scrapers (e.g. ["ENERGY", "ENVIRONMENT", "TAXATION"]). Today
the `--select docs` projection drops these on the floor; a Wave-A
`concept_match` matcher in fastclass will read them as a direct, gold-
standard classification input rather than re-deriving topic signals
from the bill text.

Schema change (additive, optional):
  Before: {"id": ..., "kind": "docs", "text": ...}
  After:  {"id": ..., "kind": "docs", "text": ..., "subjects": ["..."]}

`subjects` is **omitted entirely** when the bill has no `subject:`, when
the array is empty (`[]`), or when every element is blank. Empty equals
absent — emitting `"subjects": []` would conflate "no signal" with
"explicitly no subjects" and force every consumer to handle two
identical cases. Bare log records (no `--join bill`) also omit it.
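
The "empty equals absent" rule can be sketched as a small normalising function. This is an illustration under assumptions: the function name is hypothetical, and dropping blank elements from a mixed array (rather than keeping them) is a guess the message does not confirm.

```rust
/// Returns `None` when `subjects` should be omitted from the record:
/// the key is missing, the array is empty, or every element is blank.
/// Assumption: blank elements inside an otherwise non-empty array are dropped.
pub fn normalize_subjects(raw: Option<&[String]>) -> Option<Vec<String>> {
    let kept: Vec<String> = raw?
        .iter()
        .filter(|s| !s.trim().is_empty())
        .cloned()
        .collect();
    if kept.is_empty() {
        None
    } else {
        Some(kept)
    }
}
```

A serializer would then write the `subjects` key only when this returns `Some`, so consumers never see `"subjects": []`.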

STREAM_PROTOCOL.md §1 now documents the new field's shape, source,
optionality, and the contract that downstream transforms unaware of
`subjects` ignore it (the contract is additive).

Real-data intel from the 185k-bill, 55-jurisdiction corpus at
`<project>/.govbot/repos/`:
- 45.4% of bills have a non-empty `subject:` (84,196 / 185,335)
- Coverage is bimodal: 31 states have ≥50% coverage; 24 states have 0%.
- High-coverage examples: HI 100% / 7264 distinct subjects, CA 95% /
  5945 distinct, TN 100% / 409 distinct, MN 88% / 245 distinct.
- Zero-coverage examples: NY (19407 bills), USA-federal (11317),
  OK (9262), MA (8975), IL (8648), RI (5495), PA (3578), WA (3411).
- 24,854 distinct subject strings corpus-wide. Vocabulary is not
  unified across states (HI alone uses 7264 distinct; many are
  state-local terms like "Cities and Towns-Specific" or "SCH BOARDS").
- Climate-relevant subject hits: ENERGY 1608, ENVIRONMENT 1834,
  CLIMATE 153, EMISSIONS 57, RENEWABLE 152, SOLAR 105, CARBON 113,
  POLLUTION 335, CONSERVATION 431, WATER 1798.

Tests (+5, 61 -> 66 total): subjects present, subject key absent,
subject `[]` (explicit empty), all-blank elements, and bare-log /
no-bill-join. cargo fmt clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-domain pickup from fastclass's govbot-stack-refactor branch
(commits 34b0038 + c12dc79): `fastclass model fetch` installs a
Tier-2 embedding model under `<bundle>/model/`, and `fastclass
describe` now surfaces that install as an optional `model: {name?,
sha256_prefix}` block in its JSON output. STREAM_PROTOCOL §3 is the
contract govbot type-checks transforms against, so the field needs
to live here too — patterned after §1's treatment of `subjects`
(additive, omitted when absent, byte-identical legacy output).

Doc-only — no code touched, all 66 govbot tests still pass offline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the `model` paragraph for the reranker peer added in fastclass A3.
Also surfaces `model-rerank/` in the §4 bundle layout for symmetry with
`model/`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
govbot is a 4-tool stack for civic-data publishing — pull real legislative
data, filter by what you care about, publish with receipts, all from a
coding-agent-native dev experience designed to run at nearly-free cost on
commodity infrastructure (GitHub Actions + laptop + local models). The
first user is climate-activist; the success bar is "Bluesky posts worth
reading, nearly free to run/improve".

This commit re-leads every doc surface that carries govbot's positioning
around that 4-tool framing, with an explicit honest gap map so the docs
don't oversell what is not yet built. No code logic changes — README,
AGENT, CLAUDE, Cargo.toml description, the main.rs module rustdoc and
the clap `about` for `govbot --help`, plus the regenerated `govbot
--help` snapshot.

The honest gap map (named in every surface):

- Select real gov data: sponsors + voting records are captured in
  metadata but not yet projected into `--select docs`; the "<1 minute"
  pull is the warm-cache case (cold ~3 min).
- Filter/transform: fastclass tagging ships; the planned `summarize`
  transform (local-LLM digests of grouped bills with a deterministic
  trace) does not yet exist — userland holds a prompt stub.
- Publish with receipts: RSS/HTML/JSON/DuckDB/Bluesky ship; X does not;
  AI digest publishing does not; the receipts artifact (GitHub Pages
  page carrying model id + source bills + fastclass reasoning + regen
  command behind every AI digest) is a new capability not yet built.
- Coding-agent-native dev experience: AGENT.md make/manage/update flow
  + fastclass plugin commands + `govbot doctor` already ship; this is
  the one tool already shipping its vision.

AGENT.md §2 (manage) introduces fastclass's `--autonomous` flag as the
activist-default after first ratification — the no-ratify apply path
that lets the crew run hands-off between ratifications while keeping
the audit trail via `generated_by: autonomous-coverage-gap` in
`fastclass.lock`. (Cross-domain pickup from fastclass commit 5f2b9c6.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t cache)

The Bluesky publisher's posted-state ledger was living under
`.govbot/bluesky-<name>.ledger` — inside the tool's cache directory. But
the user's framing (the 'node_modules-equivalent' insight that already
moved tag files out to `tags/`) makes the right home obvious: `.govbot/`
is regenerable cache, the ledger is user-meaningful operational state.
An activist who runs `rm -rf .govbot/` to reset the cache shouldn't lose
their post history and start double-posting on the next run.

This moves the default ledger destination to
`<project>/state/bluesky-<name>.ledger`. The legacy `.govbot/`-rooted
path is consulted as a *read-only fallback* on every run so an upgrading
project doesn't double-post records it logged under the old path; writes
only ever land at the new path. After a full re-run the legacy file
becomes harmless and can be removed.
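
A minimal sketch of the read-fallback, write-new behaviour described above. The function names and the one-id-per-line ledger format are assumptions made for illustration; only the two path shapes come from the message.

```rust
use std::collections::HashSet;
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// New home: `<project>/state/bluesky-<name>.ledger`.
pub fn ledger_path(project: &Path, name: &str) -> PathBuf {
    project.join("state").join(format!("bluesky-{name}.ledger"))
}

/// Legacy home inside the cache dir; only ever read after the move.
pub fn legacy_ledger_path(project: &Path, name: &str) -> PathBuf {
    project.join(".govbot").join(format!("bluesky-{name}.ledger"))
}

/// Reads one ledger file; a missing file counts as an empty ledger.
/// Assumption: the ledger holds one posted id per line.
fn read_ids(path: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(path) {
        Ok(body) => Ok(body
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(str::to_owned)
            .collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(e) => Err(e),
    }
}

/// Posted ids are the union of the new ledger and the legacy fallback.
pub fn load_posted(project: &Path, name: &str) -> io::Result<HashSet<String>> {
    let mut ids = read_ids(&ledger_path(project, name))?;
    ids.extend(read_ids(&legacy_ledger_path(project, name))?);
    Ok(ids)
}

/// Writes only ever land at the new path.
pub fn record_posted(project: &Path, name: &str, id: &str) -> io::Result<()> {
    let path = ledger_path(project, name);
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    let mut file = fs::OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{id}")
}
```

Because the legacy file is never written, deleting it after a full re-run changes nothing as long as every previously posted id has been re-recorded at the new path.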

govbot now has 4 tool-managed top-level dirs with distinct roles:

  .govbot/  cache (regenerable, safe to rm -rf, never edited)
  tags/     classification output (govbot apply writes here)
  state/    publisher state (govbot publish writes here)
  dist/     publisher output (RSS/HTML/JSON artifacts)

Same shape as the tags/-out-of-.govbot/ refactor. Schema bumped, wizard
generates the new layout, AGENT.md / CLAUDE.md / README all reflect the
4-dir model. Tests grew (66 vs 62) covering both the new path and the
legacy fallback. Backward-compatible: any existing project keeps working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fastclass shipped autonomous mode (the no-ratify apply path gated by
constitution-passes + rolling-coverage-gap proofs). The cross-domain
pickup for govbot: name it as the activist-facing default after the
first ratification pass, in the activist-facing slash-command form
(`/fastclass:improve autonomous`), not the raw CLI form
(`fastclass classify --promote ... --autonomous`).

- AGENT.md §2 callout: lead with the slash-command form, keep the gate
  semantics (constitution sovereign; rolling re-test for coverage gaps;
  precision-regression always refuses), and tell the reader when to
  drop back to the reviewed path (new tag, ratification sweep after a
  flurry of autonomous lands).
- AGENT.md §3 step 3: show both `/fastclass:improve` and
  `/fastclass:improve autonomous` side by side so the activist knows
  which form to run for each situation.
- AGENT.md §"three jobs" nav blurb: switch from "fastclass's
  `--autonomous` mode" to the slash-command form for consistency.
- CLAUDE.md "Classifying with fastclass": one-sentence pointer to the
  autonomous form + the `generated_by: autonomous-coverage-gap` lock
  marker, so a senior engineer reading the contributor guide knows the
  loop has a beginner-default path.

README.md untouched — its #2 and "Classifying with fastclass" sections
defer the improvement loop to AGENT.md; adding an autonomous-mode
mention there would surface a concept the README doesn't introduce.

A fresh activist reading AGENT.md §2 today can answer: "I ratified
once, what do I run now?" — `/fastclass:improve autonomous`, with the
constitution still sovereign and the audit trail intact.

cargo test --offline: 66 passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The fastclass `compile evaluate|backtest|ratify` sub-subcommands replaced
the old `classify --eval|--backtest|--promote` flag forms. Deprecation
aliases keep the old shape working, but all govbot docs should teach the
new shape going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Scenario A — the pipeline silently widened the source walk. Empirically,
`govbot run` with `datasets: [wy]` in a project whose `.govbot/repos/` had
52 datasets cached classified ~4900 records across every state instead of
the ~100 the manifest declared.

`run_publish_command` already passed `--repos` to its publish-time source
spawn (main.rs:2492-2497), and `run_source_command` itself scopes by
`--repos` correctly (verified: `--repos wy` → 100, `--repos wy il ca` →
300, no `--repos` → 4916). The defect was in the classify pipeline's own
source spawn — `pipeline::run_transform_dag` invoked `govbot source
--select docs` and never appended `--repos`, so the manifest's
`datasets:` was load-bearing for `pull` but ignored downstream at
classify.

Fix: translate `manifest.datasets` to a `--repos` argv (`[all]` → flag
omitted, matching source's "every linked dataset" sentinel; any other
list passed verbatim) and thread it through `run_transform_dag`. The
manifest's `datasets:` mental model is now coherent end-to-end: pull,
classify, and publish all honour the same scope.
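
The translation can be sketched like this. `source_repos_from_manifest` is the name the tests below mention, but its body and the `source_argv` helper are assumptions written for illustration, not the code in `pipeline.rs`.

```rust
/// Translates the manifest's `datasets:` list into `--repos` values.
/// `[all]` is the "every linked dataset" sentinel and maps to no values.
pub fn source_repos_from_manifest(datasets: &[String]) -> Vec<String> {
    if datasets.len() == 1 && datasets[0] == "all" {
        return Vec::new();
    }
    datasets.to_vec()
}

/// Builds the argv for the classify pipeline's source spawn (hypothetical helper).
/// The `--repos` flag is omitted entirely when there are no values to pass.
pub fn source_argv(datasets: &[String]) -> Vec<String> {
    let mut argv: Vec<String> = vec!["source".into(), "--select".into(), "docs".into()];
    let repos = source_repos_from_manifest(datasets);
    if !repos.is_empty() {
        argv.push("--repos".into());
        argv.extend(repos);
    }
    argv
}
```

With `datasets: [wy]` this yields `source --select docs --repos wy`; with `datasets: [all]` the flag is left off and the source walk visits every linked dataset.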

Tests (69 total, +3):
  * unit (`pipeline.rs`): `source_repos_from_manifest` translates `[all]`
    to empty and any other list verbatim.
  * integration (`tests/run_repos_scope.rs`): against a throwaway
    two-dataset corpus, `govbot source --select docs --repos wy` emits
    only `wy-legislation/...` ids, and the no-`--repos` walk visits both
    datasets — pinning the source-side invariants the pipeline relies on.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A publisher's result stream emits **one record per action-log file**
(committee referral, hearing, vote event …), not one per bill. The
climate-tracker bluesky-pending list under `datasets: [all]` showed
the cost: NV AB1 posted 6×, AK HB53 4×, plus AL SB124, IL SB2456,
WY SF0015, CT HB07174 doubled — ~12 of 96 posts were the same bill.
For an activist who wants a daily digest, the same bill posted six
times in a row destroys credibility.

This commit collapses every publisher (bluesky, rss, html, json) to
**one item per (jurisdiction, bill_id)**. The collapsing key is the
**bill_guid** — the canonical `<dataset>/.../bills/<bill_id>` path,
derived from `sources.log` (strip `/logs/...`, append `/bills/<id>`
when the OCD-files session-level layout omits it) or `sources.bill`
(strip `/metadata.json`).
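
A sketch of the bill_guid derivation, assuming plain string paths. The function names and the use of the last `/logs/` segment as the cut point are assumptions; the real code may parse paths differently.

```rust
/// Derives the bill-level GUID from a `sources.log` path.
/// Strips `/logs/...`; when the remaining prefix has no `/bills/` segment
/// (session-level layout), appends `/bills/<bill_id>`.
pub fn bill_guid_from_log(log_path: &str, bill_id: &str) -> Option<String> {
    let cut = log_path.rfind("/logs/")?;
    let prefix = &log_path[..cut];
    if prefix.contains("/bills/") {
        Some(prefix.to_string())
    } else {
        Some(format!("{prefix}/bills/{bill_id}"))
    }
}

/// Derives the bill-level GUID from a `sources.bill` path
/// by stripping the trailing `/metadata.json`.
pub fn bill_guid_from_bill(bill_path: &str) -> Option<String> {
    bill_path.strip_suffix("/metadata.json").map(str::to_owned)
}
```

Both layouts then converge on the same key for a given bill, which is what lets every publisher collapse on it.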

The `bluesky` publisher's selection is now dedup-then-filter, by
score:

  1. group every entry by `bill_guid` (no score filter yet);
  2. within each group, keep only logs that clear `min_score` for a
     selected tag;
  3. pick the **highest-scoring** qualifying log as the representative
     — the post we render is the strongest log for the bill, not an
     arbitrary newest one.
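
The three steps above can be sketched as a single function. The `Entry` shape is hypothetical, "clears `min_score`" is read as `>=`, and ties go to whichever log `Iterator::max_by` returns (the last of the equal maxima); the commit does not state its tie rule.

```rust
use std::collections::BTreeMap;

/// One classified log record (hypothetical shape for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub bill_guid: String,
    pub log_id: String,
    pub tag: String,
    pub score: f64,
}

/// Picks at most one representative log per bill.
pub fn pick_representatives(entries: &[Entry], selected: &[&str], min_score: f64) -> Vec<Entry> {
    // 1. Group every entry by bill_guid, with no score filter yet.
    let mut groups: BTreeMap<&str, Vec<&Entry>> = BTreeMap::new();
    for entry in entries {
        groups.entry(entry.bill_guid.as_str()).or_default().push(entry);
    }
    groups
        .into_values()
        .filter_map(|logs| {
            logs.into_iter()
                // 2. Keep only logs that clear min_score for a selected tag.
                .filter(|e| e.score >= min_score && selected.contains(&e.tag.as_str()))
                // 3. The highest-scoring qualifying log represents the bill.
                .max_by(|a, b| a.score.total_cmp(&b.score))
                .cloned()
        })
        .collect()
}
```

A bill whose logs all fall below `min_score` produces no post at all, which is why the grouping has to happen before the filter.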

This means bluesky needs to see every log per bill (not a pre-dedup'd
stream), so `run_publish_command` now skips the global dedup and the
default `--limit 100` for bluesky publishers. RSS / HTML / JSON
publishers still get the bill-level global dedup; their result is one
feed item per bill.

**Ledger migration.** The bluesky ledger key is now the bill-level
GUID, so future action logs for an already-posted bill don't trigger
a re-post. Pre-fix ledgers held per-log GUIDs; on read each entry is
collapsed via `ledger_id_to_bill_key`:

  - **Per-bill-log layout** entries already carry `/bills/<id>` before
    `/logs/`, so stripping yields the new bill key cleanly. Bills
    posted under that layout don't re-post.
  - **Session-level-log layout** entries (the OCD-files common case)
    end at `/sessions/<id>/logs/<file>`; stripping yields the session
    prefix, which doesn't match the bill key. These bills re-post
    **once** on the first post-upgrade run, after which the new
    bill-level GUID lands in the ledger and never re-posts again.

The session-level case is the honest migration cost — recovering the
bill_id from a session-level log path alone would be wrong as often
as right (filenames don't reliably encode the bill).
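
The two migration cases can be illustrated with a sketch of the collapse step. `ledger_id_to_bill_key` is the name used above, but this body is an assumption (a plain strip at the last `/logs/` segment), and the example paths in the usage note are made up.

```rust
/// Collapses a legacy per-log ledger id to a bill-level key by
/// stripping the `/logs/...` suffix. Ids without a `/logs/` segment
/// are assumed to be bill-level already and pass through unchanged.
pub fn ledger_id_to_bill_key(ledger_id: &str) -> &str {
    match ledger_id.rfind("/logs/") {
        Some(cut) => &ledger_id[..cut],
        None => ledger_id,
    }
}
```

For a per-bill-log id such as `.../sessions/2025/bills/HB53/logs/<file>` the result ends in `/bills/HB53` and matches the new key. For a session-level id such as `.../sessions/2025/logs/<file>` the result is only the session prefix, so the lookup misses and the bill posts once more.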

Real-data check on the climate-tracker feed: pre-fix the bluesky
dry-run emitted 76 log-records carrying duplicates (e.g. KY SB89 5×,
WY SF0015 6×); post-fix it emits 72 unique bills with zero duplicates.

Tests: +9 — 5 in `bluesky` (per-bill rep selection, score-tie pick,
ledger bill-level dedup, legacy per-log GUID compat, bill_guid sanity)
and 4 in `publish` (deduplicate_entries collapses N→1 per bill,
keeps distinct bills distinct, rss publisher emits one `<item>` per
bill, html publisher emits one `<article>` per bill).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Govbot's civic-tech application — driving per-topic Bluesky bots from
state-legislative data — was first proven by Frankie Vegliante's
CHN-Bluesky-Govbot-Main framework. The per-topic config + GitHub
Actions cron + per-topic state ledger + shared posting pipeline across
13 issue areas is the pattern that govbot's 4-tool architecture
generalises. Credit it in README and in AGENT.md's playbook intro
before any upstream PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a `govbot init --from-frankie-config <path> [--into <dir>]` migration
tool for the 13 existing CHN-Bluesky-Govbot topics (transportation,
immigration, housing, education, …) so a Frankie topic maintainer can
move to the govbot+fastclass stack without rebuilding the keyword list
from scratch. Purely local — no network calls.

Field-to-field mapping the scaffold applies:

  name           → classifier tag name + bluesky publisher `select: [<name>]`
  display_name   → tag description framing + README header (falls back to
                   title-cased `name`)
  default_emoji  → README header + summarizer prompt voice
  keywords       → classifier/classifier.yml tags.<name>.include_keywords
                   (verbatim)
  emoji_map      → classifier.yml header comment listing keyword→emoji,
                   ready to fold into a post template later
  digest_title   → publish.feed.title + publish.site.title (falls back to
                   "<display_name> Bills Weekly Digest")
  topic          → tag description + summarizer prompt subject
                   (falls back to `name`)
  (any extras)   → absorbed via #[serde(flatten)] so Frankie configs that
                   carry schedule/timezone/jurisdictions still parse
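
The fallback rules in the mapping can be sketched without any parsing dependency. The struct and method names are hypothetical, and the title-casing rule (split on `_`, `-`, or whitespace, capitalise each word) is an assumption; the real scaffold deserialises with serde and may title-case differently.

```rust
/// The subset of a Frankie topic config that has fallbacks (hypothetical shape).
pub struct FrankieTopic {
    pub name: String,
    pub display_name: Option<String>,
    pub digest_title: Option<String>,
    pub topic: Option<String>,
}

/// Assumed title-casing: split on `_`, `-`, or whitespace; capitalise each word.
fn title_case(name: &str) -> String {
    name.split(|c: char| c == '_' || c == '-' || c.is_whitespace())
        .filter(|word| !word.is_empty())
        .map(|word| {
            let mut chars = word.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().chain(chars).collect::<String>(),
                None => String::new(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

impl FrankieTopic {
    /// `display_name` falls back to the title-cased `name`.
    pub fn resolved_display_name(&self) -> String {
        self.display_name.clone().unwrap_or_else(|| title_case(&self.name))
    }

    /// `digest_title` falls back to "<display_name> Bills Weekly Digest".
    pub fn resolved_digest_title(&self) -> String {
        self.digest_title
            .clone()
            .unwrap_or_else(|| format!("{} Bills Weekly Digest", self.resolved_display_name()))
    }

    /// `topic` falls back to `name`.
    pub fn resolved_topic(&self) -> String {
        self.topic.clone().unwrap_or_else(|| self.name.clone())
    }
}
```

A config that sets only `name: housing` would resolve to the display name "Housing" and the digest title "Housing Bills Weekly Digest" under these assumptions.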

Skeleton written into `<into>`:
  govbot.yml      manifest (datasets:[all], classify transform, rss/html/
                  bluesky publishers each selecting [<topic.name>])
  classifier/
    classifier.yml   one tag = the topic name, include_keywords = Frankie's
                     keyword list verbatim, examples: [], threshold: 0.3
    fusion.yml       declares the portable `models:` block (encoder +
                     reranker) so `fastclass model fetch --bundle` works
    eval/
      constitution.yml  two PLACEHOLDER items (one positive, one negative),
                        clearly marked
      rolling.yml       items: []
    proposals/.gitkeep  empty
  summarizer/prompt.md   stub folding in the topic focus
  README.md       activist-facing migration story + next-steps
  .gitignore      .govbot/ tags/ state/ dist/ classifier/model[-rerank]/
                  fastclass.lock govbot.lock .env

Pre-flight guard: refuses to overwrite if `<into>/govbot.yml` already
exists, with an explicit error message pointing the user at --into.

After scaffolding, stdout prints the 5-step next-steps recipe (install
Tier-2 model, seed gold, dry-run, /fastclass:improve, set BLUESKY_*
env vars) so a CHN topic maintainer can land on the new stack and keep
moving without re-reading the docs.

Tests (+4): unit tests for the parser (minimal-config-with-extras,
display-fallback, empty-name-rejected) and an integration test that
runs the binary, parses the scaffolded manifest + classifier.yml +
fusion.yml, asserts the keyword list survives verbatim, and confirms
the overwrite guard. Total 78 → 82.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The chihacknight/govbot upgrade renames `govbot logs` to `govbot source`
(commit 1342ce2). It was a hard cutover — no alias. The most important
downstream consumer, frankies2727/CHN-Bluesky-Govbot-Main, runs ~13 civic-
issue Bluesky bots whose cron drives them with `govbot logs > bills.jsonl`.
Without a back-compat alias, pushing the refactor to chihacknight/govbot
main would break every bot on the next cron run.

Restore `govbot logs` as a thin alias that:
  - mirrors `govbot source`'s flag surface (--datasets/--repos, --limit,
    --join, --select, --filter, --sort, --govbot-dir) verbatim,
  - prints a one-line deprecation warning to STDERR (stdout is the
    bills.jsonl payload — leaking to stdout would corrupt it),
  - delegates to the same `run_source_command` the canonical `Source`
    arm calls, with the args forwarded as-is.

Deprecation policy: the alias is documented as deprecated and will be
removed in a future major version. Until then any invocation that worked
pre-rename keeps working.

Pin the bills.jsonl shape contract with a new integration test
(`tests/bills_jsonl_compat.rs`) that asserts every field path Frankie's
`scripts/post_to_bluesky.py` parser reads is present on the source
stream, that `\bstate:([a-z]{2})\b` state detection still works, and
that the dedup_key Frankie composes is non-empty and stable across two
consecutive invocations on the same mock corpus. Anyone who breaks the
shape gets a red test before Frankie's bots see a broken cron.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sartaj and others added 2 commits May 25, 2026 23:10
The 4d8e4f8 alias defaulted to --filter default (action-based, drops
routine introductions / committee referrals / "Bill Number Assigned"
lines), producing ~5 records against the wy+gu mock corpus. Frankie's
scripts/post_to_bluesky.py was written against the pre-Source-rename
govbot logs output which did NOT filter; under that contract ~20
records flow. Default flipped to --filter none to preserve the older
behavior; opt into the action filter with --filter default.

Also generalizes extract_timestamp_from_path to accept either `_` or
`.` between the timestamp and the action slug — OCD-files emit both
shapes (e.g. `20250129T022703Z_bill_number_assigned.json` and
`20250131T030931Z.classification.introduction.lower.json`). The action
filter happens to drop every `.classification.*` entry, so the
`_`-only extractor was sufficient under --filter default; under
--filter none those entries flow through and need their timestamp
projected too (Frankie's parser reads record["timestamp"]).
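
A sketch of the generalised extractor, assuming the timestamp is always the `YYYYMMDDTHHMMSSZ` prefix of the file name. `extract_timestamp_from_path` is the function named above; this body, including the shape validation, is an illustration rather than the committed code.

```rust
/// Extracts the leading timestamp from a log file path, accepting either
/// `_` or `.` as the separator before the action slug.
/// Returns `None` when the prefix is not shaped like `20250129T022703Z`.
pub fn extract_timestamp_from_path(path: &str) -> Option<&str> {
    // Work on the file name only, so dots in directory names are ignored.
    let file = path.rsplit('/').next()?;
    let end = file.find(|c: char| c == '_' || c == '.')?;
    let ts = &file[..end];
    let bytes = ts.as_bytes();
    // 8 date digits, 'T', 6 time digits, 'Z'.
    let well_formed = bytes.len() == 16
        && bytes[..8].iter().all(u8::is_ascii_digit)
        && bytes[8] == b'T'
        && bytes[9..15].iter().all(u8::is_ascii_digit)
        && bytes[15] == b'Z';
    well_formed.then_some(ts)
}
```

Both example file names quoted above yield their timestamp prefix, so records that only flow under `--filter none` still carry a `timestamp` field.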

bills_jsonl_compat now passes against the new default; the full
84-test suite is green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Scoped agent-facing playbook at actions/pipeline-manager/AGENT.md so
future Claude Code sessions can update the Python data-catalog layer
(chn-openstates-{scrape,files}.yml + render.py / apply.py) without
re-deriving how it works. Covers: the declarative-repo-factory mental
model, the two side-by-side configs, read-order for the Python
orchestration, the four common change shapes (add state, mark working,
change template, add new dataset family), the dry-run verification
loop, and the cross-tool sync gotcha with actions/govbot/data/registry.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sartaj sartaj changed the title Refactor: 4-tool govbot stack, CHN-Bluesky-Govbot back-compat Govbot 2: Upgrade Tagging + Decoupling May 26, 2026