From 1342ce2f2903f9830c87908912598dde9666c9e9 Mon Sep 17 00:00:00 2001
From: Sartaj
Date: Fri, 22 May 2026 18:18:18 -0500
Subject: [PATCH 01/32] Layer govbot: manifest cutover, transform DAG, package-manager, AGENT.md

This commit lands the govbot-stack refactor in five waves, reshaping the
repo from a single-purpose climate-bill action into a layered,
data-driven newsbot framework. See ../govbot-stack-architecture.md for
the design and schemas/STREAM_PROTOCOL.md for the cross-domain wire
contract this work freezes.

Wave 1/2 - schema and manifest cutover, transform DAG, subcommand rename.

schemas/govbot.schema.json drops the legacy `tags:` block, renames
`repos:` to `datasets:` (now required), and introduces `transforms`,
`publish`, and `pipelines` sections so a manifest describes a DAG instead
of three hard-coded stages. src/config.rs Manifest mirrors that shape;
the runtime CLI filter state Config.repos is intentionally kept separate
from the manifest `datasets:` field and documented inline.
src/pipeline.rs is rewritten as a data-driven transform runner so
classify, summarize, and publish stages compose from manifest entries
rather than being baked in. The classify stage now takes an explicit
`classifier=` path instead of relying on cwd, eliminating an implicit
dependency. Subcommands are hard-cutover renamed (Clone -> Pull,
Tag -> Apply, Build -> Publish, Logs -> Source) with matching
tape/example/snapshot updates. src/selectors.rs emits full bill text from
metadata.json so downstream stages do not have to re-fetch sources.
CLAUDE.md is rewritten to explain the new fastclass layering.

Wave 3 - publishers, Bluesky posting bot, AGENT.md, extraction source.

Publish surface is split out so each channel (rss, html, json, duckdb,
bluesky) lives as its own module. The Bluesky publisher is a real posting
bot speaking AT Protocol XRPC, reading credentials only from env, keeping
a posted-state ledger to avoid duplicates, and supporting --dry-run.
A new root AGENT.md ships a newsbot-builder playbook intended to be
loaded by URL (no plugin or marketplace machinery). README.md picks up a
bootstrap line pointing newcomers at AGENT.md. The
actions/govbot/examples/climate-activist/ prototype extraction source
that previously lived in this repo is removed in Wave 5 below; its
standalone successor lives in the climate-activist-gov-news-bot repo.

Wave 4 - package-manager layer, WorkingLocale deletion.

The 52-variant compile-time WorkingLocale enum and its generator binary
are deleted (src/locale_generated.rs, src/bin/generate-locale-enum.rs):
jurisdictions are now resolved at runtime through a small package
manager. New src/registry.rs plus data/registry.json catalog 55 datasets
across the seed jurisdictions (documented in REGISTRY.md). New
src/lock.rs writes a govbot.lock that pins dataset SHAs for
reproducibility. New src/cache.rs provides a content-addressed cache
under ~/.govbot/cache/ keyed by URL + channel. src/git.rs is rewritten to
resolve the OpenStates / per-jurisdiction index at runtime. New commands
init, add, remove, pull, ls, search, and run wire the package manager up
to the existing pipeline.

Wave 5 - prototype cleanup.

actions/govbot/examples/climate-activist/ is deleted now that the
standalone climate-activist-gov-news-bot repo is the canonical extraction
source for the AGENT.md playbook.

Build-blocker - ureq 2 -> 3 migration.

actions/govbot/Cargo.toml bumps ureq from 2.12 to 3 because 2.12.1 was
undownloadable from crates.io in the build environment; ureq 3.3.0 is
available from the local cache. The two callers were migrated to the 3.x
API: src/bluesky.rs uses .header() in place of .set(), switches from
Error::Status(code, r) matching to .config().http_status_as_error(false)
plus an explicit status check, and replaces .into_json() with
.into_body().read_json(); src/registry.rs replaces .into_string() with
.into_body().read_to_string().
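
As a rough illustration of the Wave 4 cache keying (a directory under
~/.govbot/cache/ keyed by URL + channel), the idea can be sketched as
below. This is a minimal sketch under stated assumptions, not the code in
src/cache.rs: the `cache_dir` helper, the example URL, the cache root,
and the channel names are all hypothetical, and `DefaultHasher` merely
stands in for whatever digest the real cache uses (its output is not
guaranteed to be stable across Rust releases).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical helper: derive a cache directory from a dataset's repo URL
/// and channel, so the same (URL, channel) pair always maps to the same
/// directory and different channels never share one.
fn cache_dir(root: &Path, url: &str, channel: &str) -> PathBuf {
    // Assumption: a std hasher is enough for a sketch. A real cache would
    // more likely use a digest with a stable, documented output.
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    channel.hash(&mut hasher);
    root.join(format!("{:016x}", hasher.finish()))
}

fn main() {
    // Illustrative root, URL, and channel names only.
    let root = Path::new("/home/user/.govbot/cache");
    let stable = cache_dir(root, "https://example.org/wy.git", "main");
    let nightly = cache_dir(root, "https://example.org/wy.git", "nightly");
    println!("{}", stable.display());
    println!("{}", nightly.display());
}
```

Keying on both values is what would let a second pull of the same
dataset reuse the existing directory while a different channel of the
same repo gets its own.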

Tests - 30 tests pass (16 unit + 3 api_snaps + 1 cli_example_snaps +
10 wizard_tests). The cli_example_snaps source_basic snapshot is
regenerated against the wy mock so it contains a real govbot source
activity stream (5 bills) instead of the empty placeholder a previous
subagent left behind.

Co-Authored-By: Claude Opus 4.7
---
 AGENT.md | 556 +++++
 CLAUDE.md | 61 +-
 README.md | 24 +-
 actions/govbot/Cargo.lock | 1221 ++--------
 actions/govbot/Cargo.toml | 20 +-
 actions/govbot/DUCKDB.md | 4 +-
 actions/govbot/README.md | 138 +-
 actions/govbot/REGISTRY.md | 94 +
 actions/govbot/TAGGING.md | 132 +-
 actions/govbot/action.yml | 119 +-
 actions/govbot/data/registry.json | 62 +
 actions/govbot/examples/govbot-clone-list.sh | 1 -
 actions/govbot/examples/govbot-pull-list.sh | 1 +
 actions/govbot/examples/logs-basic.sh | 1 -
 actions/govbot/examples/source-basic.sh | 1 +
 actions/govbot/justfile | 35 +-
 .../govbot/src/bin/generate-locale-enum.rs | 204 --
 actions/govbot/src/bluesky.rs | 539 +++++
 actions/govbot/src/cache.rs | 155 ++
 actions/govbot/src/config.rs | 181 +-
 actions/govbot/src/embeddings.rs | 454 +---
 actions/govbot/src/git.rs | 778 +++---
 actions/govbot/src/lib.rs | 18 +-
 actions/govbot/src/locale_generated.rs | 302 ---
 actions/govbot/src/lock.rs | 175 ++
 actions/govbot/src/main.rs | 1864 ++++++++---------
 actions/govbot/src/pipeline.rs | 330 ++-
 actions/govbot/src/processor.rs | 11 +-
 actions/govbot/src/publish.rs | 207 +-
 actions/govbot/src/registry.rs | 323 +++
 actions/govbot/src/selectors.rs | 218 +-
 actions/govbot/src/wizard.rs | 261 +--
 ...-clone-list.tape => govbot-pull-list.tape} | 6 +-
 actions/govbot/tapes/logs-basic.tape | 12 -
 actions/govbot/tapes/source-basic.tape | 12 +
 actions/govbot/tests/cli_example_snaps.rs | 10 +-
 ...i_example_snaps__snapshot@govbot_help.snap | 22 +-
 ...ple_snaps__snapshot@govbot_pull_list.snap} | 10 +-
 ...li_example_snaps__snapshot@logs_basic.snap | 8 -
 ..._example_snaps__snapshot@source_basic.snap | 14 +
 .../snapshots/wizard_tests__wizard_all.snap | 30 +
 .../wizard_tests__wizard_all_no_tag.snap | 24 -
 .../wizard_tests__wizard_all_with_tag.snap | 36 -
 .../wizard_tests__wizard_session_all.snap | 90 +
 ...rd_tests__wizard_session_all_own_tags.snap | 122 --
 ...rd_tests__wizard_session_all_with_tag.snap | 111 -
 ...rd_tests__wizard_session_single_state.snap | 97 +-
 ...wizard_tests__wizard_session_specific.snap | 96 +
 ...sts__wizard_session_specific_own_tags.snap | 134 --
 ...sts__wizard_session_specific_with_tag.snap | 123 --
 .../wizard_tests__wizard_single.snap | 30 +
 .../wizard_tests__wizard_single_with_tag.snap | 36 -
 .../wizard_tests__wizard_specific.snap | 32 +
 .../wizard_tests__wizard_specific_no_tag.snap | 26 -
 actions/govbot/tests/wizard_tests.rs | 175 +-
 schemas/README.md | 22 +-
 schemas/STREAM_PROTOCOL.md | 116 +
 schemas/govbot.schema.json | 169 +-
 58 files changed, 5083 insertions(+), 4970 deletions(-)
 create mode 100644 AGENT.md
 create mode 100644 actions/govbot/REGISTRY.md
 create mode 100644 actions/govbot/data/registry.json
 delete mode 100644 actions/govbot/examples/govbot-clone-list.sh
 create mode 100644 actions/govbot/examples/govbot-pull-list.sh
 delete mode 100644 actions/govbot/examples/logs-basic.sh
 create mode 100644 actions/govbot/examples/source-basic.sh
 delete mode 100644 actions/govbot/src/bin/generate-locale-enum.rs
 create mode 100644 actions/govbot/src/bluesky.rs
 create mode 100644 actions/govbot/src/cache.rs
 delete mode 100644 actions/govbot/src/locale_generated.rs
 create mode 100644 actions/govbot/src/lock.rs
 create mode 100644 actions/govbot/src/registry.rs
 rename actions/govbot/tapes/{govbot-clone-list.tape => govbot-pull-list.tape} (50%)
 delete mode 100644 actions/govbot/tapes/logs-basic.tape
 create mode 100644 actions/govbot/tapes/source-basic.tape
 rename actions/govbot/tests/snapshots/{cli_example_snaps__snapshot@govbot_clone_list.snap => cli_example_snaps__snapshot@govbot_pull_list.snap} (80%)
 delete mode 100644 actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap
 create mode 100644 actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap
 create mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap
 create mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap
 create mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap
 create mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap
 create mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap
 delete mode 100644 actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap
 create mode 100644 schemas/STREAM_PROTOCOL.md

diff --git a/AGENT.md b/AGENT.md
new file mode 100644
index 00000000..ad6fcbe4
--- /dev/null
+++ b/AGENT.md
@@ -0,0 +1,556 @@
+# AGENT.md — build a government-news bot with govbot
+
+You are a Claude Code session helping a user stand up, operate, or evolve a
+**govbot newsbot** — a project that pulls government legislation, classifies
+the bills relevant to an issue the user cares about, and publishes the matches
+(today, to a Bluesky account).
+
+This file is the **end-user playbook**.
A fresh session loads it by URL: + +> Read github.com/chihacknight/govbot/AGENT.md and follow it to set up a +> govbot project here. + +There is no plugin, no marketplace, no slash command to install for govbot +itself — this document *is* the bootstrap. (You will, near the end, add the +**fastclass** plugin to the new project so its classifier can be tuned.) + +> This is NOT `CLAUDE.md`. `CLAUDE.md` in the govbot repo is a contributor +> guide for engineers working *on* govbot. `AGENT.md` (this file) is for +> *end users* building a bot *with* govbot. Do not conflate them. + +govbot is **issue-agnostic**. Climate legislation is the first use case, not +the only one — transportation, housing, AI/data-center policy, education, and +any other topic work the same way. Interview the user for their issue; never +assume climate. + +--- + +## The three jobs + +A user comes to you for one of three things. Identify which, then jump to that +section. + +| Job | The user says… | Section | +|---|---|---| +| **make** | "set up a govbot project / newsbot here" | [§1](#1-make--scaffold-a-new-newsbot) | +| **manage** | "set up / run the Bluesky bot", "schedule it" | [§2](#2-manage--operate-the-bluesky-bot) | +| **update** | "add a dataset", "the classifier misses bills" | [§3](#3-update--evolve-an-existing-project) | + +--- + +## The model — read this before doing anything + +govbot is a **CLI** plus two companion concepts. Keep them straight: + +- **`govbot`** — the gov-data tool. Pulls datasets (git repos of legislation), + runs transforms over them, and runs publishers. Its config is `govbot.yml`, + a **manifest** (`datasets` / `transforms` / `publish` / `pipelines`). +- **`fastclass`** — a separate text-classifier CLI. govbot streams bills into + it; it scores each against a **classifier bundle** (a directory: + `classifier.yml` + `fusion.yml` + `eval/`). govbot only passes the bundle's + *path* — it never reads the taxonomy itself. 
+- **The userland project** (what you scaffold) — a directory holding + `govbot.yml`, the `classifier/` bundle, and a few support files. It owns + **no code**; everything is reconstructed by running the tools. + +The real CLI verbs — use these exact names, they are current: + +``` +govbot init # scaffold a govbot.yml (the setup wizard) +govbot search # search the dataset registry +govbot add # add datasets to govbot.yml's datasets: list +govbot remove # remove datasets from govbot.yml +govbot ls # list manifest + locally-cached datasets +govbot pull # clone/update datasets (git repos) into the cache +govbot source # stream legislative activity as JSON Lines +govbot apply # persist fastclass results into the dataset +govbot publish # run the manifest's publishers +govbot run # the full pipeline: pull -> source|classify|apply -> publish +fastclass classify - # score a JSON-Lines doc stream from stdin +fastclass describe classifier= # print a bundle's tags + interface +fastclass classify --eval / --backtest / --promote # the tuning primitives +``` + +Datasets are resolved at runtime through a **dataset registry** — an +index mapping a dataset id to its git repo. A bare jurisdiction code (`wy`) +and a namespaced id (`us-legislation/wy`) both resolve. `govbot search` +queries the registry; `govbot add` validates an id against it before writing +it into `govbot.yml`. `govbot pull` clones each dataset once into a shared +machine-wide cache (`~/.govbot/cache/`) and records the exact commit in +`govbot.lock` for reproducible runs. + +The classify step is a Unix pipe across the two tools: + +``` +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +`govbot run` wires that pipe (plus pull and publish) automatically from +`govbot.yml`. + +--- + +## 1. make — scaffold a new newsbot + +### 1.1 Verify the tools are installed + +Both `govbot` and `fastclass` must be resolvable. 
govbot resolves binaries in +this order: **`$PATH` → `~/.cargo/bin` → `~/.govbot/bin`**. Check: + +```bash +command -v govbot || ls ~/.cargo/bin/govbot ~/.govbot/bin/govbot 2>/dev/null +command -v fastclass || ls ~/.cargo/bin/fastclass 2>/dev/null +``` + +If `govbot` is missing, install the nightly: + +```bash +sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh)" +``` + +If `fastclass` is missing, build it from source (it is a separate repo): + +```bash +git clone && cd fastclass && just install # -> ~/.cargo/bin/fastclass +``` + +Ensure `~/.cargo/bin` and `~/.govbot/bin` are on `PATH`: + +```bash +export PATH="$HOME/.cargo/bin:$HOME/.govbot/bin:$PATH" +``` + +Do not proceed until both `govbot --help` and `fastclass --help` run. + +### 1.2 Interview the user + +Ask, and record the answers — they drive every file you generate: + +1. **Issue area.** What topic should the bot track? (climate, transit, + housing, AI/data centers, education, …) Get 2–5 specific sub-themes — these + become the classifier **tags**. +2. **Jurisdictions / datasets.** All jurisdictions (`all`), or a subset? + Don't guess the codes — query the registry: `govbot search` lists every + dataset, `govbot search wyoming` narrows it. Dataset ids are short + (`wy`, `il`, `ca`, `ny`, …). When unsure, start with 1–3 for a fast first + run. +3. **What to publish.** A Bluesky feed? An RSS feed / HTML index? Both? For + Bluesky, what handle will the bot post from? + +### 1.3 Generate the project + +Create these files in the **current directory**. Adapt every name and tag to +the user's issue — the examples below use a transit bot; do not copy them +verbatim for a climate user. + +#### `govbot.yml` — the manifest (NO `tags:` block) + +```yaml +# govbot.yml — project manifest. Declares datasets, transforms, publishers, +# and pipelines. 
It is NOT the classifier: the tag taxonomy lives in +# classifier/classifier.yml, referenced here only by path. +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ny + # - all # uncomment to track every jurisdiction + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier + +publish: + bluesky: + type: bluesky + select: [transit_funding, transit_safety] # tag names from classifier.yml + min_score: 0.6 # calibrated final_score threshold; 0..1 + post_template: "{title}\n\n{tags} · {link}" + # ledger: .govbot/bluesky-bluesky.ledger # default; tracks posted bills + + feed: + type: rss + select: [transit_funding, transit_safety] + base_url: "https://.github.io/" + output_dir: docs + +pipelines: + default: [classify, bluesky, feed] +``` + +Notes: +- **No `tags:` key.** It is retired; a manifest carrying it fails to parse. +- `publish..select` lists tag names — they must exist in the classifier + bundle. Validate later with `fastclass describe`. +- Drop the `feed` publisher if the user only wants Bluesky, and vice versa. +- Prefer `govbot add ` over hand-editing the `datasets:` list — it + validates each id against the registry first. Use `govbot init` to scaffold + the whole `govbot.yml` interactively. + +#### `classifier/` — the fastclass bundle + +``` +classifier/ + classifier.yml the taxonomy (tags) — REQUIRED + fusion.yml matcher fusion weights + cascade band + eval/ + constitution.yml frozen gold set — NEVER shown to an LLM + rolling.yml refreshable working eval set + proposals/ improvement-proposal history (starts empty) +``` + +`classifier/classifier.yml` — seed one tag per sub-theme from the interview: + +```yaml +# classifier.yml — the taxonomy. Owned by fastclass; govbot only references +# the directory by path. Tune it with the fastclass /fastclass:improve loop. 
+tags: + transit_funding: + description: >- + Bills funding public transit — operating subsidies, capital programs, + fare policy, and dedicated transit revenue. + include_keywords: + - public transit + - bus rapid transit + - rail funding + - transit operating + - farebox + exclude_keywords: + - highway fund + threshold: 0.3 + transit_safety: + description: >- + Bills addressing transit rider and worker safety — assaults on operators, + platform safety, grade-crossing safety. + include_keywords: + - transit safety + - operator assault + - grade crossing + - platform screen + threshold: 0.3 +``` + +`classifier/fusion.yml` — start minimal; fastclass applies defaults if absent: + +```yaml +# fusion.yml — fusion weights + the cascade uncertainty band. +version: fusion-v1 +``` + +`classifier/eval/constitution.yml` — the **frozen** gold set. Seed 2–3 bills +per tag from the user's knowledge. This set is the final judge of classifier +quality and is never shown to an LLM: + +```yaml +# constitution.yml — FROZEN gold standard. Curate by hand; never edit it to +# make a number go up. Never show it to an LLM. +items: + - id: tf-capital + text: >- + AN ACT appropriating funds for a regional rail capital program and + dedicated transit operating subsidies. + expected_tags: [transit_funding] + - id: ts-operator + text: >- + A BILL increasing penalties for assault on a transit bus operator and + funding platform safety improvements. + expected_tags: [transit_safety] +``` + +`classifier/eval/rolling.yml` — the refreshable working set the improvement +loop learns from. Start with the same shape; grow it as you find misses: + +```yaml +# rolling.yml — refreshable working eval set. Add bills the classifier gets +# wrong here; closing them is what /fastclass:improve does. +items: + - id: roll-tf-fare + text: >- + A BILL establishing a reduced-fare transit program for low-income riders. 
+ expected_tags: [transit_funding] +``` + +Leave `classifier/proposals/` an empty directory (add a `.gitkeep`). + +#### `summarizer/prompt.md` — framing prompt for a future summarize stage + +```markdown +# Summarizer prompt + +A future govbot `summarize` transform will use this prompt to turn a matched +bill into publish-ready framing for the audience. + +Frame each bill in 1–2 sentences for a reader: what the bill +does, why it matters to the issue, and what stage it is at. Neutral, factual, +no hyperbole. +``` + +#### `.env.example` — credential template + +```bash +# Copy to .env and fill in. .env is git-ignored — never commit real values. +# Bluesky credentials for the `bluesky` publisher. +# Create an APP PASSWORD at: Bluesky -> Settings -> App Passwords. +# NEVER use your main account password. +BLUESKY_HANDLE=yourbot.bsky.social +BLUESKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx +# Optional — defaults to https://bsky.social +# BLUESKY_SERVICE=https://bsky.social +``` + +#### `.gitignore` + +```gitignore +# Generated by the tools — reconstructed on every run. +.govbot/ +dist/ +docs/ +# Secrets — never commit. +.env +``` + +#### `README.md` + +A short project README: what the bot tracks, the datasets, how to run it +(`govbot run`), and a pointer to this AGENT.md. + +#### `CLAUDE.md` — make every later session govbot-aware + +Write this into the **new project** so any Claude Code session opened here +loads the playbook without the user re-pasting the prompt: + +```markdown +# CLAUDE.md + +This is a **govbot newsbot** project. Before doing govbot work in this repo, +read the govbot end-user playbook and follow it: + + Read github.com/chihacknight/govbot/AGENT.md and follow it. 
+ +Project layout: +- `govbot.yml` — the manifest (datasets / transforms / publish / pipelines) +- `classifier/` — the fastclass classifier bundle (the tag taxonomy) +- `summarizer/` — framing prompt for a future summarize stage +- `.env` — Bluesky credentials (git-ignored; see `.env.example`) + +To tune the classifier, use the fastclass plugin: `/fastclass:improve`. +Generated dirs (`.govbot/`, `dist/`, `docs/`) are git-ignored. +``` + +#### `.claude/settings.json` — import the fastclass plugin + +So the user can run `/fastclass:improve` to tune the classifier: + +```json +{ + "plugins": { + "fastclass": { + "source": "/plugins/fastclass" + } + } +} +``` + +Confirm the exact plugin-source syntax against the fastclass repo's README +(`plugins/fastclass/`); adjust if the user's fastclass checkout lives +elsewhere. + +### 1.4 First run + +```bash +govbot pull il ny # clone the datasets (or: govbot pull all) +govbot run # pull -> classify -> apply -> publish +``` + +`govbot pull` clones each dataset once into the shared `~/.govbot/cache/` and +writes `govbot.lock` pinning the exact commit each resolved to. Commit +`govbot.lock` to the project repo — it makes runs reproducible. A second +`pull` (here or in any other project) reuses the cache instead of re-cloning. + +For the Bluesky publisher, **always dry-run first** — see §2. + +--- + +## 2. manage — operate the Bluesky bot + +The `bluesky` publisher is a **posting bot**: it posts to a normal Bluesky +account via the AT Protocol and runs to completion (no server). It is +idempotent — a posted-state ledger keeps re-runs from double-posting. + +### 2.1 Create the app password + +1. In the Bluesky app: **Settings → App Passwords → Add App Password**. +2. Copy the generated password (format `xxxx-xxxx-xxxx-xxxx`). +3. 
Put credentials in the environment — **never in `govbot.yml`**: + +```bash +cp .env.example .env +# edit .env: +# BLUESKY_HANDLE=yourbot.bsky.social +# BLUESKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx +``` + +Load it before running: `set -a; source .env; set +a`. + +### 2.2 The publisher config + +Under `govbot.yml: publish:` (see the template in §1.3): + +| Field | Meaning | +|---|---| +| `type: bluesky` | selects the Bluesky publisher | +| `select` | tag names to post — must exist in the classifier bundle | +| `min_score` | minimum calibrated `final_score` (0..1) to post; default `0.6` | +| `post_template` | post text; placeholders `{title} {tags} {link} {identifier} {session} {score}`; truncated to 300 chars | +| `ledger` | posted-state ledger path; default `.govbot/bluesky-.ledger` | + +Credentials are **never** config fields — they are env-only. + +### 2.3 Dry-run first — always + +```bash +govbot publish --publisher bluesky --dry-run +``` + +`--dry-run` renders the posts that *would* be sent and **touches no network +and no ledger**. Review the rendered text with the user — check the template, +the 300-char truncation, and that `min_score` is neither too loose (spam) nor +too tight (silence). Adjust `post_template` / `min_score` and re-dry-run. + +### 2.4 Go live + +```bash +set -a; source .env; set +a +govbot publish --publisher bluesky +``` + +The publisher authenticates (`com.atproto.server.createSession`), posts each +matching bill not already in the ledger (`com.atproto.repo.createRecord`), and +appends each posted bill's id to the ledger. Re-running posts only new +matches. + +### 2.5 Schedule it + +The bot runs from cron/CI — no always-on server. + +**cron** (every 6 hours): + +```cron +0 */6 * * * cd /path/to/project && set -a && . 
./.env && set +a && govbot run >> .govbot/run.log 2>&1 +``` + +**GitHub Actions** (`.github/workflows/newsbot.yml`): + +```yaml +name: newsbot +on: + schedule: [{ cron: "0 */6 * * *" }] + workflow_dispatch: +jobs: + run: + runs-on: ubuntu-latest + env: + BLUESKY_HANDLE: ${{ secrets.BLUESKY_HANDLE }} + BLUESKY_APP_PASSWORD: ${{ secrets.BLUESKY_APP_PASSWORD }} + steps: + - uses: actions/checkout@v4 + - name: Install govbot + fastclass + run: | + sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh)" + # install fastclass per its repo's instructions + echo "$HOME/.govbot/bin:$HOME/.cargo/bin" >> "$GITHUB_PATH" + - name: Run the newsbot + run: govbot run + # Commit the ledger back so re-runs stay idempotent across CI runs: + - name: Persist the posted-state ledger + run: | + git add -f .govbot/*.ledger || true + git commit -m "newsbot: update posted-state ledger" || true + git push || true +``` + +In CI the `.govbot/` ledger is ephemeral unless persisted — commit the +`*.ledger` file back (as above) or store it in a cache/artifact, or the bot +will re-post on every run. + +--- + +## 3. update — evolve an existing project + +Open the project, read its `govbot.yml` and `classifier/classifier.yml`, then: + +### Add or remove a dataset + +Use the registry-backed commands rather than hand-editing `govbot.yml`: + +```bash +govbot search # find the dataset id in the registry +govbot add # validate it and add it to govbot.yml datasets: +govbot pull # clone it (updates govbot.lock) +govbot run +``` + +To drop a dataset: `govbot remove `. `govbot ls` shows the manifest's +datasets and which are cached locally. + +### Add or remove a publisher / change what gets posted + +Edit `govbot.yml: publish:` — add a publisher block, or change a `select` +list or `min_score`. 
Validate that every `select` tag exists in the bundle: + +```bash +fastclass describe classifier=./classifier # prints the bundle's tag list +``` + +Then dry-run any Bluesky publisher before going live (§2.3). + +### Widen or narrow the classifier scope + +The taxonomy lives in `classifier/classifier.yml`. **Do not hand-tune it by +guessing keywords** — delegate to the fastclass improvement loop, which proves +each change against the frozen gold set. + +1. **Measure** where the classifier stands: + ```bash + fastclass classify --eval constitution classifier=./classifier + fastclass classify --eval rolling classifier=./classifier + ``` +2. **Find misses.** Add bills the classifier gets wrong to + `classifier/eval/rolling.yml` with their correct `expected_tags`. To widen + scope, add a new tag to `classifier.yml` plus gold examples for it in both + eval sets. +3. **Improve.** Run the fastclass plugin — it studies the rolling failures, + drafts a proposal under `classifier/proposals/`, and is the supported way + to tune the bundle: + ``` + /fastclass:improve + ``` +4. **Backtest** the proposal — proves it against the frozen constitution: + ```bash + fastclass classify --backtest classifier/proposals/prop-0001.yml classifier=./classifier + ``` +5. **Promote** a passing proposal into the bundle: + ```bash + fastclass classify --promote classifier/proposals/prop-0001.yml classifier=./classifier + ``` +6. **Re-run** the bot: `govbot run`. + +Hard rule, inherited from fastclass: **never show `classifier/eval/ +constitution.yml` to an LLM.** It is the frozen judge; seeing it would corrupt +the eval. The improvement loop only ever reads `rolling.yml`. + +--- + +## Conventions + +- Ground every command in the real CLI above. If a verb is not in the + reference list, it does not exist — check `govbot --help` / `fastclass --help`. +- `govbot.yml` never has `tags:`. The taxonomy is the classifier bundle. +- Credentials are environment-only. 
Never write a secret into `govbot.yml`, + `.env.example`, or any committed file. +- Bluesky: dry-run before every first live run after a config change. +- Generated dirs (`.govbot/`, `dist/`, `docs/`) are git-ignored; the project + is a dozen small text files plus tool artifacts. diff --git a/CLAUDE.md b/CLAUDE.md index 93f80a61..10ee566a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -80,13 +80,14 @@ scripts/ # Repository-level utility scripts ## Common Commands ```bash -govbot init # Create govbot.yml config -govbot clone all # Download all state legislation datasets -govbot clone wy il # Download specific states -govbot logs # Stream legislative activity as JSON Lines -govbot logs | govbot tag # Process and tag data +govbot # Scaffold govbot.yml (interactive wizard), then run the pipeline +govbot pull all # Download all state legislation datasets +govbot pull wy il # Download specific states +govbot source # Stream legislative activity as JSON Lines +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply govbot load # Load bill metadata into DuckDB -govbot build # Generate RSS feeds +govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB) +govbot run # Run the full pipeline: pull -> classify -> apply -> publish ``` ## DuckDB Integration @@ -103,7 +104,7 @@ The `govbot load` command loads bill metadata into a DuckDB database for SQL ana **Usage**: ```bash -govbot clone all # First, get the data +govbot pull all # First, get the data govbot load # Load into DuckDB govbot load --memory-limit 32GB # For large datasets duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI @@ -111,12 +112,54 @@ duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI See `actions/govbot/DUCKDB.md` for query examples and schema documentation. 
+## Classifying with fastclass + +Classification is a **pipe** of two composable tools that compose over a +process boundary — govbot streams the data, **fastclass** (a standalone, +self-improving text classifier) classifies it, govbot persists the result: + +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}` + document per bill carrying the **full bill text** from `metadata.json`; the + `id` is the bill's dataset path, which routes the result back. +- **`fastclass classify -`** scores each document against a **classifier + bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` + + `eval/`). govbot passes only the bundle path; it never reads the bundle. +- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag + `.tag.json` files into the dataset — the files `govbot publish` turns into + feeds. It classifies nothing itself; it is purely the persistence sink. + +**`govbot.yml` is NOT the classifier — it is a manifest.** It declares +`datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:` +block**. The tag taxonomy lives in a separate **fastclass classifier bundle** +that the manifest's `transforms..classifier` field references by path. +The two configs change at different cadences and are read by different tools: +`govbot.yml` answers *"what data, what transforms, what publishers"*; the +classifier bundle's `classifier.yml` answers *"what's relevant"*. + +To run the self-improving loop, work inside the classifier bundle directory and +use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`) +and the fastclass `classify --eval` / `--backtest` / `--promote` primitives. The +retired `fastclass --propose` flag no longer exists. 
+ +**Prerequisite**: the `fastclass` binary must be resolvable on `PATH`, +`~/.cargo/bin`, or `~/.govbot/bin` (`cargo install --path `). +`govbot run`'s transform stage resolves transform binaries the same way. + +To improve tag quality, read **`AGENTS.md` in the fastclass repo** — the +operational playbook for the classify → eval → propose → backtest → promote +loop. Its one hard rule: never show the frozen `eval/constitution.yml` gold set +to an LLM. + ## Testing with Mock Data Mock legislative data is available for offline development: - Location: `actions/govbot/mocks/.govbot/repos/` - Contains: Wyoming (wy) and Guam (gu) sample data -- Usage: `govbot logs --govbot-dir ./actions/govbot/mocks/.govbot` +- Usage: `govbot source --govbot-dir ./actions/govbot/mocks/.govbot` ## govbot Development @@ -125,7 +168,7 @@ cd actions/govbot just setup # Install Rust toolchain and dependencies just test # Run snapshot tests just review # Review snapshot changes (insta) -just govbot logs # Run CLI in dev mode (uses mocks/.govbot) +just govbot source # Run CLI in dev mode (uses mocks/.govbot) just mocks wy il # Update mock data for testing ``` diff --git a/README.md b/README.md index 82b1a27f..95a60ef5 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,19 @@ `govbot` enables distributed data anaylsis of government updates via a friendly terminal interface. Git repos function as datasets, including the legislation of all 47 states/jurisdictions. +## 🤖 Build a newsbot with Claude Code + +The fastest way to stand up a govbot project — a classified, auto-publishing +legislation feed (e.g. a Bluesky bot) — is to let Claude Code drive it. 
+Open a Claude Code session in an empty directory and paste: + +> **Read github.com/chihacknight/govbot/AGENT.md and follow it to set up a govbot project here.** + +[`AGENT.md`](AGENT.md) is a self-contained playbook: Claude verifies the +tools, interviews you about the issue you want to track, scaffolds the +`govbot.yml` manifest + a `fastclass` classifier bundle, and walks you through +running and scheduling the bot. No plugin or marketplace install needed. + ## Example Projects - [Transportation Legislation Bluesky Bot](https://bsky.app/profile/govbottransport.bsky.social) @@ -48,11 +61,12 @@ With a `govbot.yml` in your directory, running `govbot` executes the full pipeli ### Other Commands ```bash -govbot clone all # download all state legislation datasets -govbot clone il ca ny # download specific states -govbot logs # stream legislative activity as JSON Lines -govbot logs | govbot tag # process and tag data -govbot build # generate RSS feeds +govbot pull all # download all state legislation datasets +govbot pull il ca ny # download specific states +govbot source # stream legislative activity as JSON Lines +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +govbot publish # run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky) +govbot run # the full pipeline: pull -> classify -> apply -> publish govbot load # load bill metadata into DuckDB govbot delete all # remove all downloaded data govbot update # update govbot to latest version diff --git a/actions/govbot/Cargo.lock b/actions/govbot/Cargo.lock index e6b5fc7f..62a3a2cc 100644 --- a/actions/govbot/Cargo.lock +++ b/actions/govbot/Cargo.lock @@ -123,18 +123,6 @@ version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" -[[package]] -name = "base64" -version = "0.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"9e1b586273c5702936fe7b7d6896644d8be71e6314cfe09d3167c95f712589e8" - -[[package]] -name = "base64" -version = "0.21.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9d297deb1925b89f2ccc13d7635fa0714f12c87adce1c75356b39ca9b7178567" - [[package]] name = "base64" version = "0.22.1" @@ -143,15 +131,9 @@ checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" [[package]] name = "base64ct" -version = "1.8.1" +version = "1.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0e050f626429857a27ddccb31e0aca21356bfa709c04041aefddac081a8f068a" - -[[package]] -name = "bitflags" -version = "1.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" +checksum = "2af50177e190e07a26ab74f8b1efbfe2ef87da2116221318cb1c2e82baf7de06" [[package]] name = "bitflags" @@ -174,12 +156,6 @@ version = "3.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "46c5e41b57b8bba42a04676d81cb89e9ee8e859a1a66f80a5a72e1cb76b34d43" -[[package]] -name = "byteorder" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" - [[package]] name = "bytes" version = "1.11.0" @@ -277,6 +253,35 @@ dependencies = [ "windows-sys 0.59.0", ] +[[package]] +name = "cookie" +version = "0.18.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4ddef33a339a91ea89fb53151bd0a4689cfce27055c291dfa69945475d22c747" +dependencies = [ + "percent-encoding", + "time", + "version_check", +] + +[[package]] +name = "cookie_store" +version = "0.22.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "15b2c103cf610ec6cae3da84a766285b42fd16aad564758459e6ecf128c75206" +dependencies = [ + "cookie", + "document-features", + "idna", + "indexmap", + "log", + "serde", + 
"serde_derive", + "serde_json", + "time", + "url", +] + [[package]] name = "core-foundation" version = "0.9.4" @@ -414,14 +419,23 @@ dependencies = [ [[package]] name = "der" -version = "0.7.10" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb" +checksum = "71fd89660b2dc699704064e59e9dba0147b903e85319429e131620d022be411b" dependencies = [ "pem-rfc7468", "zeroize", ] +[[package]] +name = "deranged" +version = "0.5.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7cd812cc2bc1d69d4764bd80df88b4317eaef9e773c75226407d9bc0876b211c" +dependencies = [ + "powerfmt", +] + [[package]] name = "derive_builder" version = "0.20.2" @@ -496,6 +510,15 @@ dependencies = [ "syn", ] +[[package]] +name = "document-features" +version = "0.2.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d4b8a88685455ed29a21542a33abd9cb6510b6b129abadabdcef0f4c55bc8f61" +dependencies = [ + "litrs", +] + [[package]] name = "either" version = "1.15.0" @@ -533,33 +556,12 @@ dependencies = [ "windows-sys 0.61.2", ] -[[package]] -name = "esaxx-rs" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d817e038c30374a4bcb22f94d0a8a0e216958d4c3dcde369b1439fec4bdda6e6" -dependencies = [ - "cc", -] - [[package]] name = "fastrand" version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be" -[[package]] -name = "filetime" -version = "0.2.26" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bc0505cd1b6fa6580283f6bdf70a73fcf4aba1184038c90902b92b3dd0df63ed" -dependencies = [ - "cfg-if", - "libc", - "libredox", - "windows-sys 0.60.2", -] - [[package]] name = "find-msvc-tools" version = "0.1.5" @@ -705,17 +707,6 @@ dependencies = [ "version_check", ] -[[package]] -name = 
"getrandom" -version = "0.2.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "335ff9f135e4384c8150d6f27c6daed433577f86b4750418338c01a1a2528592" -dependencies = [ - "cfg-if", - "libc", - "wasi", -] - [[package]] name = "getrandom" version = "0.3.4" @@ -734,7 +725,7 @@ version = "0.18.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "232e6a7bfe35766bf715e55a88b39a700596c0ccfd88cd3680b4cdb40d66ef70" dependencies = [ - "bitflags 2.10.0", + "bitflags", "libc", "libgit2-sys", "log", @@ -756,11 +747,8 @@ dependencies = [ "git2", "insta", "jwalk", - "ndarray 0.15.6", - "ort", "pathdiff", "regex", - "reqwest", "rss", "serde", "serde_json", @@ -768,29 +756,9 @@ dependencies = [ "sha2", "tempfile", "thiserror", - "tokenizers", "tokio", "tokio-test", - "toml", -] - -[[package]] -name = "h2" -version = "0.3.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0beca50380b1fc32983fc1cb4587bfa4bb9e78fc259aad4a0032d2080309222d" -dependencies = [ - "bytes", - "fnv", - "futures-core", - "futures-sink", - "futures-util", - "http 0.2.12", - "indexmap", - "slab", - "tokio", - "tokio-util", - "tracing", + "ureq", ] [[package]] @@ -805,17 +773,6 @@ version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" -[[package]] -name = "http" -version = "0.2.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "601cbb57e577e2f5ef5be8e7b83f0f63994f25aa94d673e54a92d5c516d101f1" -dependencies = [ - "bytes", - "fnv", - "itoa", -] - [[package]] name = "http" version = "1.4.0" @@ -826,66 +783,12 @@ dependencies = [ "itoa", ] -[[package]] -name = "http-body" -version = "0.4.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7ceab25649e9960c0311ea418d17bee82c0dcec1bd053b5f9a66e265a693bed2" -dependencies = [ - "bytes", - "http 0.2.12", - 
"pin-project-lite", -] - [[package]] name = "httparse" version = "1.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" -[[package]] -name = "httpdate" -version = "1.0.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9" - -[[package]] -name = "hyper" -version = "0.14.32" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41dfc780fdec9373c01bae43289ea34c972e40ee3c9f6b3c8801a35f35586ce7" -dependencies = [ - "bytes", - "futures-channel", - "futures-core", - "futures-util", - "h2", - "http 0.2.12", - "http-body", - "httparse", - "httpdate", - "itoa", - "pin-project-lite", - "socket2 0.5.10", - "tokio", - "tower-service", - "tracing", - "want", -] - -[[package]] -name = "hyper-tls" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d6183ddfa99b85da61a140bea0efc93fdf56ceaa041b37d553518030827f9905" -dependencies = [ - "bytes", - "hyper", - "native-tls", - "tokio", - "tokio-native-tls", -] - [[package]] name = "iana-time-zone" version = "0.1.64" @@ -946,7 +849,7 @@ dependencies = [ "icu_normalizer_data", "icu_properties", "icu_provider", - "smallvec 1.15.1", + "smallvec", "zerovec", ] @@ -1004,7 +907,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3b0875f23caa03898994f6ddc501886a45c7d3d62d04d2d90788d47be1b1e4de" dependencies = [ "idna_adapter", - "smallvec 1.15.1", + "smallvec", "utf8_iter", ] @@ -1028,19 +931,6 @@ dependencies = [ "hashbrown", ] -[[package]] -name = "indicatif" -version = "0.17.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "183b3088984b400f4cfac3620d5e076c84da5364016b4f49473de574b2586235" -dependencies = [ - "console", - "number_prefix", - "portable-atomic", - "unicode-width", - "web-time", -] - [[package]] name = "insta" 
version = "1.44.3" @@ -1053,36 +943,12 @@ dependencies = [ "similar", ] -[[package]] -name = "ipnet" -version = "2.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "469fb0b9cefa57e3ef31275ee7cacb78f2fdca44e4765491884a2b119d4eb130" - [[package]] name = "is_terminal_polyfill" version = "1.70.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a6cb138bb79a146c1bd460005623e142ef0181e3d0219cb493e02f7d08a35695" -[[package]] -name = "itertools" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b1c173a5686ce8bfa551b3563d0c2170bf24ca44da99c7ca4bfdab5418c3fe57" -dependencies = [ - "either", -] - -[[package]] -name = "itertools" -version = "0.12.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba291022dbbd398a455acf126c1e341954079855bc60dfdda641363bd6922569" -dependencies = [ - "either", -] - [[package]] name = "itoa" version = "1.0.15" @@ -1095,7 +961,7 @@ version = "0.1.34" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9afb3de4395d6b3e67a780b6de64b51c978ecf11cb9a462c66be7d4ca9039d33" dependencies = [ - "getrandom 0.3.4", + "getrandom", "libc", ] @@ -1119,12 +985,6 @@ dependencies = [ "rayon", ] -[[package]] -name = "lazy_static" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe" - [[package]] name = "libc" version = "0.2.178" @@ -1145,17 +1005,6 @@ dependencies = [ "pkg-config", ] -[[package]] -name = "libredox" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "416f7e718bdb06000964960ffa43b4335ad4012ae8b99060261aa4a8088d5ccb" -dependencies = [ - "bitflags 2.10.0", - "libc", - "redox_syscall", -] - [[package]] name = "libssh2-sys" version = "0.3.1" @@ -1195,36 +1044,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index" 
checksum = "6373607a59f0be73a39b6fe456b8192fcc3585f602af20751600e974dd455e77" [[package]] -name = "log" -version = "0.4.29" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" - -[[package]] -name = "macro_rules_attribute" -version = "0.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "65049d7923698040cd0b1ddcced9b0eb14dd22c5f86ae59c3740eab64a676520" -dependencies = [ - "macro_rules_attribute-proc_macro", - "paste", -] - -[[package]] -name = "macro_rules_attribute-proc_macro" -version = "0.2.2" +name = "litrs" +version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "670fdfda89751bc4a84ac13eaa63e205cf0fd22b4c9a5fbfa085b63c1f1d3a30" +checksum = "11d3d7f243d5c5a8b9bb5d6dd2b1602c0cb0b9db1621bafc7ed66e35ff9fe092" [[package]] -name = "matrixmultiply" -version = "0.3.10" +name = "log" +version = "0.4.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a06de3016e9fae57a36fd14dba131fccf49f74b40b7fbdb472f96e361ec71a08" -dependencies = [ - "autocfg", - "rawpointer", -] +checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" [[package]] name = "memchr" @@ -1232,18 +1061,6 @@ version = "2.7.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f52b00d39961fc5b2736ea853c9cc86238e165017a493d1d5c8eac6bdc4cc273" -[[package]] -name = "mime" -version = "0.3.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a" - -[[package]] -name = "minimal-lexical" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" - [[package]] name = "miniz_oxide" version = "0.8.9" @@ -1254,39 +1071,6 @@ dependencies = [ "simd-adler32", ] -[[package]] -name = "mio" 
-version = "1.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a69bcab0ad47271a0234d9422b131806bf3968021e5dc9328caf2d4cd58557fc" -dependencies = [ - "libc", - "wasi", - "windows-sys 0.61.2", -] - -[[package]] -name = "monostate" -version = "0.1.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3341a273f6c9d5bef1908f17b7267bbab0e95c9bf69a0d4dcf8e9e1b2c76ef67" -dependencies = [ - "monostate-impl", - "serde", - "serde_core", -] - -[[package]] -name = "monostate-impl" -version = "0.1.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e4db6d5580af57bf992f59068d4ea26fd518574ff48d7639b255a36f9de6e7e9" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "native-tls" version = "0.2.14" @@ -1304,34 +1088,6 @@ dependencies = [ "tempfile", ] -[[package]] -name = "ndarray" -version = "0.15.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "adb12d4e967ec485a5f71c6311fe28158e9d6f4bc4a447b474184d0f91a8fa32" -dependencies = [ - "matrixmultiply", - "num-complex", - "num-integer", - "num-traits", - "rawpointer", -] - -[[package]] -name = "ndarray" -version = "0.16.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "882ed72dce9365842bf196bdeedf5055305f11fc8c03dee7bb0194a6cad34841" -dependencies = [ - "matrixmultiply", - "num-complex", - "num-integer", - "num-traits", - "portable-atomic", - "portable-atomic-util", - "rawpointer", -] - [[package]] name = "never" version = "0.1.0" @@ -1339,32 +1095,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c96aba5aa877601bb3f6dd6a63a969e1f82e60646e81e71b14496995e9853c91" [[package]] -name = "nom" -version = "7.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d273983c5a657a70a3e8f2a01329822f3b8c8172b73826411a55751e404a0a4a" -dependencies = [ - "memchr", - "minimal-lexical", -] - -[[package]] 
-name = "num-complex" -version = "0.4.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495" -dependencies = [ - "num-traits", -] - -[[package]] -name = "num-integer" -version = "0.1.46" +name = "num-conv" +version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f" -dependencies = [ - "num-traits", -] +checksum = "521739c6d2bac4aa25192232afe6841231376b2b26d4d9fae5ecf8ca5772e441" [[package]] name = "num-traits" @@ -1375,12 +1109,6 @@ dependencies = [ "autocfg", ] -[[package]] -name = "number_prefix" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "830b246a0e5f20af87141b25c173cd1b609bd7779a4617d6ec582abaf90870f3" - [[package]] name = "once_cell" version = "1.21.3" @@ -1393,35 +1121,13 @@ version = "1.70.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe" -[[package]] -name = "onig" -version = "6.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "336b9c63443aceef14bea841b899035ae3abe89b7c486aaf4c5bd8aafedac3f0" -dependencies = [ - "bitflags 2.10.0", - "libc", - "once_cell", - "onig_sys", -] - -[[package]] -name = "onig_sys" -version = "69.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c7f86c6eef3d6df15f23bcfb6af487cbd2fed4e5581d58d5bf1f5f8b7f6727dc" -dependencies = [ - "cc", - "pkg-config", -] - [[package]] name = "openssl" version = "0.10.75" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "08838db121398ad17ab8531ce9de97b244589089e290a384c900cb9ff7434328" dependencies = [ - "bitflags 2.10.0", + "bitflags", "cfg-if", "foreign-types", "libc", @@ -1460,47 +1166,16 @@ dependencies = [ ] [[package]] -name = "ort" -version = "2.0.0-rc.10" 
-source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fa7e49bd669d32d7bc2a15ec540a527e7764aec722a45467814005725bcd721" -dependencies = [ - "ndarray 0.16.1", - "ort-sys", - "smallvec 2.0.0-alpha.10", - "tracing", -] - -[[package]] -name = "ort-sys" -version = "2.0.0-rc.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e2aba9f5c7c479925205799216e7e5d07cc1d4fa76ea8058c60a9a30f6a4e890" -dependencies = [ - "flate2", - "pkg-config", - "sha2", - "tar", - "ureq", -] - -[[package]] -name = "paste" -version = "1.0.15" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a" - -[[package]] -name = "pathdiff" -version = "0.2.3" +name = "pathdiff" +version = "0.2.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "df94ce210e5bc13cb6651479fa48d14f601d9858cfe0467f43ae157023b938d3" [[package]] name = "pem-rfc7468" -version = "0.7.0" +version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "88b39c9bfcfc231068454382784bb460aae594343fb030d46e9f50a645418412" +checksum = "a6305423e0e7738146434843d1694d621cce767262b2a86910beab705e4493d9" dependencies = [ "base64ct", ] @@ -1529,21 +1204,6 @@ version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c" -[[package]] -name = "portable-atomic" -version = "1.11.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f84267b20a16ea918e43c6a88433c2d54fa145c92a811b5b047ccbe153674483" - -[[package]] -name = "portable-atomic-util" -version = "0.2.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d8a2f0d8d040d7848a709caf78912debcc3f33ee4b3cac47d73d1e1069e83507" -dependencies = [ - "portable-atomic", -] - [[package]] name = "potential_utf" version = "0.1.4" @@ -1554,13 +1214,10 @@ 
dependencies = [ ] [[package]] -name = "ppv-lite86" -version = "0.2.21" +name = "powerfmt" +version = "0.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" -dependencies = [ - "zerocopy", -] +checksum = "439ee305def115ba05938db6eb1644ff94165c5ab5e9420d1c1bcedbba909391" [[package]] name = "proc-macro2" @@ -1596,42 +1253,6 @@ version = "5.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" -[[package]] -name = "rand" -version = "0.8.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404" -dependencies = [ - "libc", - "rand_chacha", - "rand_core", -] - -[[package]] -name = "rand_chacha" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" -dependencies = [ - "ppv-lite86", - "rand_core", -] - -[[package]] -name = "rand_core" -version = "0.6.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" -dependencies = [ - "getrandom 0.2.16", -] - -[[package]] -name = "rawpointer" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "60a357793950651c4ed0f3f52338f53b2f809f32d83a07f72909fa13e4c6c1e3" - [[package]] name = "rayon" version = "1.11.0" @@ -1642,17 +1263,6 @@ dependencies = [ "rayon-core", ] -[[package]] -name = "rayon-cond" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "059f538b55efd2309c9794130bc149c6a553db90e9d99c2030785c82f0bd7df9" -dependencies = [ - "either", - "itertools 0.11.0", - "rayon", -] - [[package]] name = "rayon-core" version = "1.13.0" @@ -1663,15 +1273,6 @@ dependencies = [ 
"crossbeam-utils", ] -[[package]] -name = "redox_syscall" -version = "0.5.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed2bf2547551a7053d6fdfafda3f938979645c44812fbfcda098faae3f1a362d" -dependencies = [ - "bitflags 2.10.0", -] - [[package]] name = "regex" version = "1.12.2" @@ -1701,46 +1302,6 @@ version = "0.8.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7a2d987857b319362043e95f5353c0535c1f58eec5336fdfcf626430af7def58" -[[package]] -name = "reqwest" -version = "0.11.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dd67538700a17451e7cba03ac727fb961abb7607553461627b97de0b89cf4a62" -dependencies = [ - "base64 0.21.7", - "bytes", - "encoding_rs", - "futures-core", - "futures-util", - "h2", - "http 0.2.12", - "http-body", - "hyper", - "hyper-tls", - "ipnet", - "js-sys", - "log", - "mime", - "native-tls", - "once_cell", - "percent-encoding", - "pin-project-lite", - "rustls-pemfile", - "serde", - "serde_json", - "serde_urlencoded", - "sync_wrapper", - "system-configuration", - "tokio", - "tokio-native-tls", - "tower-service", - "url", - "wasm-bindgen", - "wasm-bindgen-futures", - "web-sys", - "winreg", -] - [[package]] name = "rss" version = "2.0.12" @@ -1759,27 +1320,18 @@ version = "1.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "cd15f8a2c5551a84d56efdc1cd049089e409ac19a3072d5037a17fd70719ff3e" dependencies = [ - "bitflags 2.10.0", + "bitflags", "errno", "libc", "linux-raw-sys", "windows-sys 0.61.2", ] -[[package]] -name = "rustls-pemfile" -version = "1.0.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1c74cae0a4cf6ccbbf5f359f08efdf8ee7e1dc532573bf0db71968cb56b1448c" -dependencies = [ - "base64 0.21.7", -] - [[package]] name = "rustls-pki-types" -version = "1.13.1" +version = "1.14.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"708c0f9d5f54ba0272468c1d306a52c495b31fa155e91bc25371e6df7996908c" +checksum = "30a7197ae7eb376e574fe940d068c30fe0462554a3ddbe4eca7838e049c937a9" dependencies = [ "zeroize", ] @@ -1798,9 +1350,9 @@ checksum = "28d3b2b1366ec20994f1fd18c3c594f05c5dd4bc44d8bb0c1c632c8d6829481f" [[package]] name = "schannel" -version = "0.1.28" +version = "0.1.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "891d81b926048e76efe18581bf793546b4c0eaf8448d72be8de2bbee5fd166e1" +checksum = "91c1b7e4904c873ef0710c1f407dde2e6287de2bebc1bbbf7d430bb7cbffd939" dependencies = [ "windows-sys 0.61.2", ] @@ -1811,7 +1363,7 @@ version = "2.11.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "897b2245f0b511c87893af39b033e5ca9cce68824c4d7e7630b5a1d339658d02" dependencies = [ - "bitflags 2.10.0", + "bitflags", "core-foundation", "core-foundation-sys", "libc", @@ -1820,9 +1372,9 @@ dependencies = [ [[package]] name = "security-framework-sys" -version = "2.15.0" +version = "2.17.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cc1f0cbffaac4852523ce30d8bd3c5cdc873501d96ff467ca09b6767bb8cd5c0" +checksum = "6ce2691df843ecc5d231c0b14ece2acc3efb62c0a398c7e1d875f3983ce020e3" dependencies = [ "core-foundation-sys", "libc", @@ -1871,27 +1423,6 @@ dependencies = [ "serde_core", ] -[[package]] -name = "serde_spanned" -version = "0.6.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bf41e0cfaf7226dca15e8197172c295a782857fcb97fad1808a166870dee75a3" -dependencies = [ - "serde", -] - -[[package]] -name = "serde_urlencoded" -version = "0.7.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d3491c14715ca2294c4d6a88f15e84739788c1d030eed8c110436aafdaa2f3fd" -dependencies = [ - "form_urlencoded", - "itoa", - "ryu", - "serde", -] - [[package]] name = "serde_yaml" version = "0.9.34+deprecated" @@ -1952,55 +1483,6 @@ version = "1.15.1" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03" -[[package]] -name = "smallvec" -version = "2.0.0-alpha.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "51d44cfb396c3caf6fbfd0ab422af02631b69ddd96d2eff0b0f0724f9024051b" - -[[package]] -name = "socket2" -version = "0.5.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e22376abed350d73dd1cd119b57ffccad95b4e585a7cda43e286245ce23c0678" -dependencies = [ - "libc", - "windows-sys 0.52.0", -] - -[[package]] -name = "socket2" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "17129e116933cf371d018bb80ae557e889637989d8638274fb25622827b03881" -dependencies = [ - "libc", - "windows-sys 0.60.2", -] - -[[package]] -name = "socks" -version = "0.3.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f0c3dbbd9ae980613c6dd8e28a9407b50509d3803b57624d5dfe8315218cd58b" -dependencies = [ - "byteorder", - "libc", - "winapi", -] - -[[package]] -name = "spm_precompiled" -version = "0.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5851699c4033c63636f7ea4cf7b7c1f1bf06d0cc03cfb42e711de5a5c46cf326" -dependencies = [ - "base64 0.13.1", - "nom", - "serde", - "unicode-segmentation", -] - [[package]] name = "stable_deref_trait" version = "1.2.1" @@ -2024,12 +1506,6 @@ dependencies = [ "unicode-ident", ] -[[package]] -name = "sync_wrapper" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2047c6ded9c721764247e62cd3b03c09ffc529b2ba5b10ec482ae507a4a70160" - [[package]] name = "synstructure" version = "0.13.2" @@ -2041,38 +1517,6 @@ dependencies = [ "syn", ] -[[package]] -name = "system-configuration" -version = "0.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"ba3a3adc5c275d719af8cb4272ea1c4a6d668a777f37e115f6d11ddbc1c8e0e7" -dependencies = [ - "bitflags 1.3.2", - "core-foundation", - "system-configuration-sys", -] - -[[package]] -name = "system-configuration-sys" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a75fb188eb626b924683e3b95e3a48e63551fcfb51949de2f06a9d91dbee93c9" -dependencies = [ - "core-foundation-sys", - "libc", -] - -[[package]] -name = "tar" -version = "0.4.44" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1d863878d212c87a19c1a610eb53bb01fe12951c0501cf5a0d65f724914a667a" -dependencies = [ - "filetime", - "libc", - "xattr", -] - [[package]] name = "tempfile" version = "3.23.0" @@ -2080,7 +1524,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2d31c77bdf42a745371d260a26ca7163f1e0924b64afa0b688e61b5a9fa02f16" dependencies = [ "fastrand", - "getrandom 0.3.4", + "getrandom", "once_cell", "rustix", "windows-sys 0.61.2", @@ -2107,45 +1551,44 @@ dependencies = [ ] [[package]] -name = "tinystr" -version = "0.8.2" +name = "time" +version = "0.3.47" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "42d3e9c45c09de15d06dd8acf5f4e0e399e85927b7f00711024eb7ae10fa4869" +checksum = "743bd48c283afc0388f9b8827b976905fb217ad9e647fae3a379a9283c4def2c" dependencies = [ - "displaydoc", - "zerovec", + "deranged", + "itoa", + "num-conv", + "powerfmt", + "serde_core", + "time-core", + "time-macros", ] [[package]] -name = "tokenizers" -version = "0.19.1" +name = "time-core" +version = "0.1.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e500fad1dd3af3d626327e6a3fe5050e664a6eaa4708b8ca92f1794aaf73e6fd" +checksum = "7694e1cfe791f8d31026952abf09c69ca6f6fa4e1a1229e18988f06a04a12dca" + +[[package]] +name = "time-macros" +version = "0.2.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"2e70e4c5a0e0a8a4823ad65dfe1a6930e4f4d756dcd9dd7939022b5e8c501215" dependencies = [ - "aho-corasick", - "derive_builder", - "esaxx-rs", - "getrandom 0.2.16", - "indicatif", - "itertools 0.12.1", - "lazy_static", - "log", - "macro_rules_attribute", - "monostate", - "onig", - "paste", - "rand", - "rayon", - "rayon-cond", - "regex", - "regex-syntax", - "serde", - "serde_json", - "spm_precompiled", - "thiserror", - "unicode-normalization-alignments", - "unicode-segmentation", - "unicode_categories", + "num-conv", + "time-core", +] + +[[package]] +name = "tinystr" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "42d3e9c45c09de15d06dd8acf5f4e0e399e85927b7f00711024eb7ae10fa4869" +dependencies = [ + "displaydoc", + "zerovec", ] [[package]] @@ -2154,13 +1597,8 @@ version = "1.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ff360e02eab121e0bc37a2d3b4d4dc622e6eda3a8e5253d5435ecf5bd4c68408" dependencies = [ - "bytes", - "libc", - "mio", "pin-project-lite", - "socket2 0.6.1", "tokio-macros", - "windows-sys 0.61.2", ] [[package]] @@ -2174,16 +1612,6 @@ dependencies = [ "syn", ] -[[package]] -name = "tokio-native-tls" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bbae76ab933c85776efabc971569dd6119c580d8f5d448769dec1764bf796ef2" -dependencies = [ - "native-tls", - "tokio", -] - [[package]] name = "tokio-stream" version = "0.1.17" @@ -2208,91 +1636,6 @@ dependencies = [ "tokio-stream", ] -[[package]] -name = "tokio-util" -version = "0.7.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2efa149fe76073d6e8fd97ef4f4eca7b67f599660115591483572e406e165594" -dependencies = [ - "bytes", - "futures-core", - "futures-sink", - "pin-project-lite", - "tokio", -] - -[[package]] -name = "toml" -version = "0.8.23" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"dc1beb996b9d83529a9e75c17a1686767d148d70663143c7854d8b4a09ced362" -dependencies = [ - "serde", - "serde_spanned", - "toml_datetime", - "toml_edit", -] - -[[package]] -name = "toml_datetime" -version = "0.6.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c" -dependencies = [ - "serde", -] - -[[package]] -name = "toml_edit" -version = "0.22.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a" -dependencies = [ - "indexmap", - "serde", - "serde_spanned", - "toml_datetime", - "toml_write", - "winnow", -] - -[[package]] -name = "toml_write" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5d99f8c9a7727884afe522e9bd5edbfc91a3312b36a77b5fb8926e4c31a41801" - -[[package]] -name = "tower-service" -version = "0.3.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3" - -[[package]] -name = "tracing" -version = "0.1.43" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2d15d90a0b5c19378952d479dc858407149d7bb45a14de0142f6c534b16fc647" -dependencies = [ - "pin-project-lite", - "tracing-core", -] - -[[package]] -name = "tracing-core" -version = "0.1.35" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a04e24fab5c89c6a36eb8558c9656f30d81de51dfa4d3b45f26b21d61fa0a6c" -dependencies = [ - "once_cell", -] - -[[package]] -name = "try-lock" -version = "0.2.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b" - [[package]] name = "typenum" version = "1.19.0" @@ -2305,33 +1648,12 @@ version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"9312f7c4f6ff9069b165498234ce8be658059c6728633667c526e27dc2cf1df5" -[[package]] -name = "unicode-normalization-alignments" -version = "0.1.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "43f613e4fa046e69818dd287fdc4bc78175ff20331479dab6e1b0f98d57062de" -dependencies = [ - "smallvec 1.15.1", -] - -[[package]] -name = "unicode-segmentation" -version = "1.12.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f6ccf251212114b54433ec949fd6a7841275f9ada20dddd2f29e9ceea4501493" - [[package]] name = "unicode-width" version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254" -[[package]] -name = "unicode_categories" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "39ec24b3121d976906ece63c9daad25b85969647682eee313cb5779fdd69e14e" - [[package]] name = "unsafe-libyaml" version = "0.2.11" @@ -2340,30 +1662,33 @@ checksum = "673aac59facbab8a9007c7f6108d11f63b603f7cabff99fabf650fea5c32b861" [[package]] name = "ureq" -version = "3.1.4" +version = "3.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d39cb1dbab692d82a977c0392ffac19e188bd9186a9f32806f0aaa859d75585a" +checksum = "dea7109cdcd5864d4eeb1b58a1648dc9bf520360d7af16ec26d0a9354bafcfc0" dependencies = [ - "base64 0.22.1", + "base64", + "cookie_store", "der", + "flate2", "log", "native-tls", "percent-encoding", "rustls-pki-types", - "socks", + "serde", + "serde_json", "ureq-proto", - "utf-8", + "utf8-zero", "webpki-root-certs", ] [[package]] name = "ureq-proto" -version = "0.5.3" +version = "0.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d81f9efa9df032be5934a46a068815a10a042b494b6a58cb0a1a97bb5467ed6f" +checksum = "e994ba84b0bd1b1b0cf92878b7ef898a5c1760108fe7b6010327e274917a808c" dependencies = [ - "base64 0.22.1", - "http 1.4.0", + "base64", + 
"http", "httparse", "log", ] @@ -2381,10 +1706,10 @@ dependencies = [ ] [[package]] -name = "utf-8" -version = "0.7.6" +name = "utf8-zero" +version = "0.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "09cc8ee72d2a9becf2f2febe0205bbed8fc6615b7cb429ad062dc7b7ddd036a9" +checksum = "b8c0a043c9540bae7c578c88f91dda8bd82e59ae27c21baca69c8b191aaf5a6e" [[package]] name = "utf8_iter" @@ -2410,21 +1735,6 @@ version = "0.9.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" -[[package]] -name = "want" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e" -dependencies = [ - "try-lock", -] - -[[package]] -name = "wasi" -version = "0.11.1+wasi-snapshot-preview1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" - [[package]] name = "wasip2" version = "1.0.1+wasi-0.2.4" @@ -2447,19 +1757,6 @@ dependencies = [ "wasm-bindgen-shared", ] -[[package]] -name = "wasm-bindgen-futures" -version = "0.4.56" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "836d9622d604feee9e5de25ac10e3ea5f2d65b41eac0d9ce72eb5deae707ce7c" -dependencies = [ - "cfg-if", - "js-sys", - "once_cell", - "wasm-bindgen", - "web-sys", -] - [[package]] name = "wasm-bindgen-macro" version = "0.2.106" @@ -2492,57 +1789,15 @@ dependencies = [ "unicode-ident", ] -[[package]] -name = "web-sys" -version = "0.3.83" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9b32828d774c412041098d182a8b38b16ea816958e07cf40eec2bc080ae137ac" -dependencies = [ - "js-sys", - "wasm-bindgen", -] - -[[package]] -name = "web-time" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" -dependencies = [ - "js-sys", - "wasm-bindgen", -] - [[package]] name = "webpki-root-certs" -version = "1.0.4" +version = "1.0.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ee3e3b5f5e80bc89f30ce8d0343bf4e5f12341c51f3e26cbeecbc7c85443e85b" +checksum = "f31141ce3fc3e300ae89b78c0dd67f9708061d1d2eda54b8209346fd6be9a92c" dependencies = [ "rustls-pki-types", ] -[[package]] -name = "winapi" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" -dependencies = [ - "winapi-i686-pc-windows-gnu", - "winapi-x86_64-pc-windows-gnu", -] - -[[package]] -name = "winapi-i686-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" - -[[package]] -name = "winapi-x86_64-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" - [[package]] name = "windows-core" version = "0.62.2" @@ -2602,40 +1857,13 @@ dependencies = [ "windows-link", ] -[[package]] -name = "windows-sys" -version = "0.48.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" -dependencies = [ - "windows-targets 0.48.5", -] - -[[package]] -name = "windows-sys" -version = "0.52.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" -dependencies = [ - "windows-targets 0.52.6", -] - [[package]] name = "windows-sys" version = "0.59.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" dependencies = 
[ - "windows-targets 0.52.6", -] - -[[package]] -name = "windows-sys" -version = "0.60.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb" -dependencies = [ - "windows-targets 0.53.5", + "windows-targets", ] [[package]] @@ -2647,211 +1875,70 @@ dependencies = [ "windows-link", ] -[[package]] -name = "windows-targets" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" -dependencies = [ - "windows_aarch64_gnullvm 0.48.5", - "windows_aarch64_msvc 0.48.5", - "windows_i686_gnu 0.48.5", - "windows_i686_msvc 0.48.5", - "windows_x86_64_gnu 0.48.5", - "windows_x86_64_gnullvm 0.48.5", - "windows_x86_64_msvc 0.48.5", -] - [[package]] name = "windows-targets" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" dependencies = [ - "windows_aarch64_gnullvm 0.52.6", - "windows_aarch64_msvc 0.52.6", - "windows_i686_gnu 0.52.6", - "windows_i686_gnullvm 0.52.6", - "windows_i686_msvc 0.52.6", - "windows_x86_64_gnu 0.52.6", - "windows_x86_64_gnullvm 0.52.6", - "windows_x86_64_msvc 0.52.6", -] - -[[package]] -name = "windows-targets" -version = "0.53.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4945f9f551b88e0d65f3db0bc25c33b8acea4d9e41163edf90dcd0b19f9069f3" -dependencies = [ - "windows-link", - "windows_aarch64_gnullvm 0.53.1", - "windows_aarch64_msvc 0.53.1", - "windows_i686_gnu 0.53.1", - "windows_i686_gnullvm 0.53.1", - "windows_i686_msvc 0.53.1", - "windows_x86_64_gnu 0.53.1", - "windows_x86_64_gnullvm 0.53.1", - "windows_x86_64_msvc 0.53.1", + "windows_aarch64_gnullvm", + "windows_aarch64_msvc", + "windows_i686_gnu", + "windows_i686_gnullvm", + "windows_i686_msvc", + "windows_x86_64_gnu", + "windows_x86_64_gnullvm", + 
"windows_x86_64_msvc", ] -[[package]] -name = "windows_aarch64_gnullvm" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2b38e32f0abccf9987a4e3079dfb67dcd799fb61361e53e2882c3cbaf0d905d8" - [[package]] name = "windows_aarch64_gnullvm" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" -[[package]] -name = "windows_aarch64_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a9d8416fa8b42f5c947f8482c43e7d89e73a173cead56d044f6a56104a6d1b53" - -[[package]] -name = "windows_aarch64_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dc35310971f3b2dbbf3f0690a219f40e2d9afcf64f9ab7cc1be722937c26b4bc" - [[package]] name = "windows_aarch64_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" -[[package]] -name = "windows_aarch64_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9d782e804c2f632e395708e99a94275910eb9100b2114651e04744e9b125006" - -[[package]] -name = "windows_i686_gnu" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a75915e7def60c94dcef72200b9a8e58e5091744960da64ec734a6c6e9b3743e" - [[package]] name = "windows_i686_gnu" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" -[[package]] -name = "windows_i686_gnu" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "960e6da069d81e09becb0ca57a65220ddff016ff2d6af6a223cf372a506593a3" - [[package]] name = "windows_i686_gnullvm" version = "0.52.6" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" -[[package]] -name = "windows_i686_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fa7359d10048f68ab8b09fa71c3daccfb0e9b559aed648a8f95469c27057180c" - -[[package]] -name = "windows_i686_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8f55c233f70c4b27f66c523580f78f1004e8b5a8b659e05a4eb49d4166cca406" - [[package]] name = "windows_i686_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" -[[package]] -name = "windows_i686_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1e7ac75179f18232fe9c285163565a57ef8d3c89254a30685b57d83a38d326c2" - -[[package]] -name = "windows_x86_64_gnu" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "53d40abd2583d23e4718fddf1ebec84dbff8381c07cae67ff7768bbf19c6718e" - [[package]] name = "windows_x86_64_gnu" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" -[[package]] -name = "windows_x86_64_gnu" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9c3842cdd74a865a8066ab39c8a7a473c0778a3f29370b5fd6b4b9aa7df4a499" - -[[package]] -name = "windows_x86_64_gnullvm" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0b7b52767868a23d5bab768e390dc5f5c55825b6d30b86c844ff2dc7414044cc" - [[package]] name = "windows_x86_64_gnullvm" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" -[[package]] -name = "windows_x86_64_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0ffa179e2d07eee8ad8f57493436566c7cc30ac536a3379fdf008f47f6bb7ae1" - -[[package]] -name = "windows_x86_64_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed94fce61571a4006852b7389a063ab983c02eb1bb37b47f8272ce92d06d9538" - [[package]] name = "windows_x86_64_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" -[[package]] -name = "windows_x86_64_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d6bbff5f0aada427a1e5a6da5f1f98158182f26556f345ac9e04d36d0ebed650" - -[[package]] -name = "winnow" -version = "0.7.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5a5364e9d77fcdeeaa6062ced926ee3381faa2ee02d3eb83a5c27a8825540829" -dependencies = [ - "memchr", -] - -[[package]] -name = "winreg" -version = "0.50.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "524e57b2c537c0f9b1e69f1965311ec12182b4122e45035b1508cd24d2adadb1" -dependencies = [ - "cfg-if", - "windows-sys 0.48.0", -] - [[package]] name = "wit-bindgen" version = "0.46.0" @@ -2864,16 +1951,6 @@ version = "0.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9edde0db4769d2dc68579893f2306b26c6ecfbe0ef499b013d731b7b9247e0b9" -[[package]] -name = "xattr" -version = "1.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "32e45ad4206f6d2479085147f02bc2ef834ac85886624a23575ae137c8aa8156" -dependencies = [ - "libc", - "rustix", -] - [[package]] name = "yoke" version = "0.8.1" @@ -2897,26 +1974,6 @@ dependencies = [ "synstructure", ] -[[package]] -name = "zerocopy" 
-version = "0.8.31" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd74ec98b9250adb3ca554bdde269adf631549f51d8a8f8f0a10b50f1cb298c3" -dependencies = [ - "zerocopy-derive", -] - -[[package]] -name = "zerocopy-derive" -version = "0.8.31" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d8a8d209fdf45cf5138cbb5a506f6b52522a25afccc534d1475dad8e31105c6a" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "zerofrom" version = "0.1.6" diff --git a/actions/govbot/Cargo.toml b/actions/govbot/Cargo.toml index 2c2720aa..6eb86717 100644 --- a/actions/govbot/Cargo.toml +++ b/actions/govbot/Cargo.toml @@ -40,29 +40,23 @@ pathdiff = "0.2" # Git operations git2 = { version = "0.18" } -# Text similarity and embeddings (lightweight, no external models) -# Using ONNX Runtime + tokenizers for semantic embeddings -ort = { version = "2.0.0-rc.10", default-features = true, features = ["ndarray"] } -tokenizers = "0.19" -ndarray = "0.15" -toml = "0.8" -# HTTP client for downloading models -reqwest = { version = "0.11", features = ["blocking"] } -# Hashing for text deduplication +# Hashing for text deduplication (.tag.json text hashes) sha2 = "0.10" # Timestamps chrono = { version = "0.4", features = ["serde"] } # RSS feed generation rss = "2.0" +# HTTP client for the Bluesky publisher's AT Protocol XRPC calls. +# `ureq` is a small, synchronous, blocking HTTP client — it suits a CLI that +# posts from cron/CI (no async runtime needed on the publish path) and keeps +# the dependency tree light. `native-tls` uses the platform TLS stack +# (Secure Transport / SChannel / OpenSSL) — no extra vendored crypto crate. 
+ureq = { version = "3", default-features = false, features = ["json", "native-tls", "gzip"] } [[bin]] name = "govbot" path = "src/main.rs" -[[bin]] -name = "generate-locale-enum" -path = "src/bin/generate-locale-enum.rs" - [dev-dependencies] tokio-test = "0.4" insta = { version = "1.39", features = ["json"] } diff --git a/actions/govbot/DUCKDB.md b/actions/govbot/DUCKDB.md index 0c63c4c8..42fe67f8 100644 --- a/actions/govbot/DUCKDB.md +++ b/actions/govbot/DUCKDB.md @@ -22,8 +22,8 @@ duckdb --version ## Quick Start ```bash -# 1. Clone repositories first -govbot clone all +# 1. Pull datasets first +govbot pull all # 2. Load into DuckDB govbot load diff --git a/actions/govbot/README.md b/actions/govbot/README.md index 17a7ec85..ec6f5d2d 100644 --- a/actions/govbot/README.md +++ b/actions/govbot/README.md @@ -18,11 +18,12 @@ govbot That's it. If no `govbot.yml` exists, an interactive wizard walks you through setup: -1. **Sources** - Choose all 47 states or pick specific ones -2. **Tags** - Start with an example tag, or get an AI prompt to create your own -3. **Publishing** - RSS feeds configured automatically +1. **Datasets** - Choose all 47 states or pick specific ones +2. **Classification** - Point the manifest at a fastclass classifier bundle +3. **Publishing** - An RSS feed publisher configured automatically -The wizard creates `govbot.yml`, `.gitignore`, and a GitHub Actions workflow. +The wizard creates `govbot.yml` (a project manifest: `datasets` / `transforms` / +`publish` / `pipelines`), `.gitignore`, and a GitHub Actions workflow. ### 3. Run the pipeline @@ -32,18 +33,19 @@ govbot With `govbot.yml` present, running `govbot` executes the full pipeline: -1. Clones/updates legislation repositories (smart: only clones on first run, pulls after) -2. Tags bills based on your tag definitions -3. Generates RSS feeds in the `docs/` directory +1. Pulls/updates legislation datasets (smart: only clones on first run, pulls after) +2. 
Classifies bills with fastclass and applies the results into the dataset +3. Runs the manifest's publishers (RSS feeds into the `docs/` directory) ### Other Commands ```bash -govbot clone all # download all state legislation datasets -govbot clone il ca ny # download specific states -govbot logs # stream legislative activity as JSON Lines -govbot logs | govbot tag # process and tag data -govbot build # generate RSS feeds +govbot pull all # download all state legislation datasets +govbot pull il ca ny # download specific states +govbot source # stream legislative activity as JSON Lines +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +govbot publish # run the manifest's publishers (RSS / HTML / JSON / DuckDB) +govbot run # run the full pipeline govbot load # load bill metadata into DuckDB govbot delete all # remove all downloaded data govbot update # update govbot to latest version @@ -74,22 +76,30 @@ We build snapshots off `examples`. Add examples to make a test. ## Advanced +Datasets are resolved at runtime through the **dataset registry** (see +[`REGISTRY.md`](./REGISTRY.md)). To point govbot at a custom registry: + ```bash -GOVBOT_REPO_URL_TEMPLATE="https://gitsite.com/org/{locale}.git" govbot ... +# An http(s):// URL or a local file path. +GOVBOT_REGISTRY_URL="https://example.com/registry.json" govbot pull all ``` -## Working with Logs +A project-local `.govbot/registry.json` is also honored. `govbot search` +queries the registry; `govbot pull` clones datasets once into the shared +`~/.govbot/cache/` and pins resolved commits in `govbot.lock`. + +## Working with the Record Stream -The `govbot logs` command outputs JSON Lines (JSONL) format, making it easy to pipe to tools like `jq`, `yq`, and `jl` for filtering, transformation, and pretty-printing, and even sending to AI CLI tools like `claude`. 
+The `govbot source` command outputs JSON Lines (JSONL) format, making it easy to pipe to tools like `jq`, `yq`, and `jl` for filtering, transformation, and pretty-printing, and even sending to AI CLI tools like `claude`. ### Basic Usage ```bash # Easiest way with smart defaults -govbot logs +govbot source # Get more args and their help -govbot logs --help +govbot source --help ``` ### modular CLI Examples @@ -100,10 +110,10 @@ Convert JSON Lines to prettified YAML: ```bash # Output prettified yaml -just govbot logs | yq -p=json -o=yaml '.' +just govbot source | yq -p=json -o=yaml '.' # Multiple documents (separated by ---) -govbot logs --repos="il" --limit=10 --filter=default | yq -p json -P +govbot source --repos="il" --limit=10 --filter=default | yq -p json -P ``` #### Filtering with `jq` @@ -112,16 +122,16 @@ Filter and transform JSON Lines: ```bash # Filter by specific fields -govbot logs| jq 'select(.log.action.classification[] == "passage")' +govbot source| jq 'select(.log.action.classification[] == "passage")' # Extract specific fields -govbot logs | jq '{bill_id: .log.bill_id, date: .log.action.date, description: .log.action.description}' +govbot source | jq '{bill_id: .log.bill_id, date: .log.action.date, description: .log.action.description}' # Count by bill -govbot logs | jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, count: length})' +govbot source | jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, count: length})' # Filter by date range -govbot logs | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' +govbot source | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' ``` #### Using `jl` (JSON Lines processor) @@ -130,10 +140,10 @@ govbot logs | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' ```bash # Pretty print JSON Lines -govbot logs | jl +govbot source | jl # Filter with jl -govbot logs | jl 'select(.log.action.classification[] == "passage")' +govbot source | jl 
'select(.log.action.classification[] == "passage")' ``` ### Combining Tools @@ -142,17 +152,17 @@ Chain multiple tools for powerful data processing: ```bash # Filter with jq, then convert to YAML -govbot logs --repos="il" --limit=100 | \ +govbot source --repos="il" --limit=100 | \ jq 'select(.log.action.classification[] == "passage")' | \ yq -p json -P # Extract and format specific fields, then output as YAML -govbot logs --repos="il" --limit=10 | \ +govbot source --repos="il" --limit=10 | \ jq '{bill: .log.bill_id, action: .log.action.description, date: .log.action.date}' | \ yq -p json -P # Aggregate data with jq, then format as YAML array -govbot logs --repos="il" --limit=100 | \ +govbot source --repos="il" --limit=100 | \ jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, actions: length})' | \ yq -P ``` @@ -161,16 +171,16 @@ govbot logs --repos="il" --limit=100 | \ ```bash # Find all bills with multiple actions in a single day -govbot logs --repos="il" --limit=1000 | \ +govbot source --repos="il" --limit=1000 | \ jq -s 'group_by(.log.bill_id + .timestamp) | map(select(length > 1)) | flatten' # Extract action classifications and count them -govbot logs --repos="il" --limit=1000 | \ +govbot source --repos="il" --limit=1000 | \ jq -r '.log.action.classification[]?' | \ sort | uniq -c | sort -rn # Join with bill metadata and filter by title -govbot logs --repos="il" --limit=10 --join=bill | \ +govbot source --repos="il" --limit=10 --join=bill | \ jq 'select(.bill.title | contains("Education"))' | \ yq -p json -P ``` @@ -181,61 +191,69 @@ Generate RSS feeds using the `govbot publish` command, which reads from `govbot. **Note:** The Python scripts have been replaced by a Rust implementation. Use `govbot publish` instead. -## Publishing RSS Feeds +## Publishing -Generate RSS feeds for each tag defined in `govbot.yml` using the declarative publishing system. +Publishers consume the classified result stream and emit artifacts. 
RSS, HTML, +JSON, and DuckDB are built-in publishers, declared in the manifest's `publish:` +map. ### Quick Start -1. **Configure `govbot.yml`** with your tags and publish settings: +1. **Configure `govbot.yml`** with your datasets, transforms, and publishers. + The tag taxonomy is NOT in `govbot.yml` — it lives in a separate fastclass + classifier bundle that `transforms.classify.classifier` references by path: ```yaml - repos: + datasets: - all - tags: - lgbtq: - description: "Legislation related to LGBTQ+ issues..." + transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier publish: - base_url: "https://yourusername.github.io/repo-name" - output_dir: "feeds" + lgbtq-feed: + type: rss + select: [lgbtq] # tag names from the classifier bundle + base_url: "https://yourusername.github.io/repo-name" + output_dir: "feeds" + pipelines: + default: + - classify + - lgbtq-feed ``` -2. **Generate RSS feed:** +2. **Run all publishers:** ```bash govbot publish ``` -3. **Generate feed for specific tags:** +3. **Run a specific publisher:** ```bash - govbot publish --tags lgbtq education + govbot publish --publisher lgbtq-feed ``` 4. **Customize output:** + ```bash govbot publish --output-dir ./feeds --limit 100 ``` -### Configuration - -The `publish:` section in `govbot.yml` supports: +### Publisher configuration -- `base_url`: Base URL for RSS feed links (required for GitHub Pages) -- `output_dir`: Directory where RSS feeds are generated (default: `feeds`) -- `limit`: Maximum entries per feed (optional) +Each entry in `publish:` declares a `type` (`rss` / `html` / `json` / `duckdb`) +plus type-specific keys: -### Per-Tag Customization - -Tags can override default RSS feed settings: - -```yaml -tags: - lgbtq: - description: "..." 
- rss_title: "LGBTQ+ Legislation Updates" # Optional - rss_description: "Custom description" # Optional -``` +- `select`: tag names to include — only records carrying one of these tags are + published. Tag names must exist in the classifier bundle. +- `base_url`: base URL for generated links (required for `rss`/`html`). +- `output_dir`: directory the publisher writes into (default: `docs`). +- `output_file`: the primary artifact filename. +- `title` / `description`: custom feed/index metadata. +- `limit`: maximum entries (`"none"` for unlimited). ## Using DuckDB diff --git a/actions/govbot/REGISTRY.md b/actions/govbot/REGISTRY.md new file mode 100644 index 00000000..2a35f7fa --- /dev/null +++ b/actions/govbot/REGISTRY.md @@ -0,0 +1,94 @@ +# The govbot dataset registry + +govbot resolves datasets at **runtime** through a registry — an index that +maps a dataset identifier to the git repo holding its data. This is the +"npm/docker for government data" layer: it replaces the old compiled +52-variant `WorkingLocale` enum, so adding counties, cities, or agencies is a +data change, not a recompile. + +## Identifier scheme + +A canonical identifier is `namespace/name[@channel]`: + +| Part | Meaning | +|---|---| +| `namespace` | a grouping — `us-legislation`, a county set, an agency set | +| `name` | the dataset within the namespace — `wy`, `il`, … | +| `@channel` | optional release channel / git branch (defaults to the repo's default branch) | + +**Plain jurisdiction codes stay valid.** A bare identifier with no `/` (e.g. +`wy`) is resolved against the registry's `default_namespace`, so an existing +`govbot.yml` with `datasets: [wy]` keeps working unchanged. `all` is a +reserved alias meaning "every dataset in the registry." 
+ +Examples — all valid in `govbot.yml` / `govbot add` / `govbot pull`: + +``` +wy # bare code -> us-legislation/wy +us-legislation/wy # canonical +us-legislation/wy@main # pinned to a channel/branch +all # every dataset +``` + +## File format + +The registry is a JSON file. The bundled default lives at +`actions/govbot/data/registry.json` and is **compiled into the binary** via +`include_str!`, so a fresh install resolves the seed jurisdictions with zero +network access. + +```json +{ + "$schema_version": "govbot-registry-1", + "description": "…", + "default_namespace": "us-legislation", + "datasets": { + "us-legislation/wy": { + "git_url": "https://github.com/chn-openstates-files/wy-legislation.git", + "schema": "ocdfiles", + "path_pattern": "**/logs/*.json", + "name": "Wyoming" + } + } +} +``` + +Per-dataset fields: + +| Field | Required | Meaning | +|---|---|---| +| `git_url` | yes | the git repo the dataset's data is cloned from | +| `schema` | no | the data schema the dataset follows (e.g. `ocdfiles`) | +| `path_pattern` | no | a glob, relative to the repo root, locating the dataset's records | +| `name` | no | a human-readable display name | + +## Where the registry comes from / how it is fetched + +`Registry::load` resolves the active registry in priority order: + +1. **`GOVBOT_REGISTRY_URL`** — an `http(s)://` URL (fetched over HTTP) or a + local file path. A fetched registry is cached at `~/.govbot/registry.json`. +2. **`/.govbot/registry.json`** — a project-local registry file. +3. **The bundled default** compiled into the binary. + +This makes the registry both a shipped default and a fetchable/overridable +catalog — an open, PR-based registry repo or a hosted catalog can both be +pointed at via `GOVBOT_REGISTRY_URL`. + +## `govbot.lock` — the dataset lockfile + +`govbot.yml` declares *which* datasets a project wants; `govbot.lock` records +the *exact git commit* each resolved to. It is govbot's `Cargo.lock`. 
+ +- **Written/updated** by `govbot pull` and `govbot run`, next to `govbot.yml`. +- **Format** — JSON; see `src/lock.rs`. Each entry pins `git_url`, `channel`, + `commit`, `cache_key`, and `resolved_at`. +- **Commit it** to the project repo for reproducible runs. + +## The shared content-addressed cache + +A dataset is cloned **once per machine** into `~/.govbot/cache/`, where +`` is `-`. A project's +`.govbot/repos/` is a symlink into that cache. A second `pull` — in this +or any other project — finds the cache populated and only fetches deltas. See +`src/cache.rs`. diff --git a/actions/govbot/TAGGING.md b/actions/govbot/TAGGING.md index 604cec8f..3c9dadbe 100644 --- a/actions/govbot/TAGGING.md +++ b/actions/govbot/TAGGING.md @@ -1,93 +1,65 @@ -# Tagging Bills with Semantic Similarity +# Classifying and tagging bills -The `govbot tag` command can automatically tag legislative logs using semantic similarity matching. +govbot does **not** classify bills itself. Classification is delegated to +**fastclass**, a standalone, self-improving text classifier that runs as an +external transform. govbot streams documents in, fastclass classifies them, and +`govbot apply` persists the results. -### How tagging works +## The pipe -- **Primary mode (embeddings)**: Uses a sentence-transformer model (`model.onnx` + `tokenizer.json`) to embed logs and tags, combining: - - **Base similarity** between the log text and each tag’s description/examples - - **Example similarity** to individual positive examples - - **Keyword boosts** from `include_keywords` / `exclude_keywords` - - **Negative examples** penalties via `negative_examples` -- **Fallback mode (keywords only)**: If the embedding model or tokenizer cannot be loaded, govbot falls back to **keyword-based tagging** using `include_keywords` / `exclude_keywords` from the tag definitions. - -In both modes, each tag has a **`threshold`** and a structured **score breakdown** is stored in per-tag `.tag.json` files. 
- -## Quick Start - -1. **Place required files in your working directory:** - - - `govbot.yml` – Tag definitions (see below) - - `model.onnx` – ONNX sentence transformer model (e.g., all-MiniLM-L6-v2) - - `tokenizer.json` – Tokenizer file for the model - -2. **Run the command:** - - ```bash - just govbot logs --repos il --limit 10 | just govbot tag - ``` - -govbot will: - -- Require `govbot.yml` -- Try to use **embedding mode** (`model.onnx` + `tokenizer.json`) -- If embeddings are unavailable or fail to initialize, automatically **fall back to keyword-based matching** (using `include_keywords` / `exclude_keywords`). +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` -## Tag Configuration (`govbot.yml`) +- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}` + document per bill carrying the **full bill text** from `metadata.json`. The + `id` is the bill's dataset path, which routes the result back to the right + place. +- **`fastclass classify -`** scores each document against a **classifier + bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` + + `eval/`). govbot passes only the bundle path; it never reads the bundle. +- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag + `.tag.json` files into the dataset. It classifies nothing — it is purely the + persistence sink. `govbot publish` later turns those files into feeds. -Each tag defines (YAML schema): +`govbot run` (or bare `govbot`) orchestrates this whole pipe automatically from +the manifest's `transforms:`/`pipelines:`. 
-- `name`: Tag identifier (key name in `tags:` map) -- `description`: Semantic description of what the tag represents -- `threshold`: Minimum similarity score (0.0–1.0) to match -- `examples`: Optional positive example phrases (improves embeddings) -- `include_keywords`: Phrases whose presence should strongly favor this tag -- `exclude_keywords`: Phrases that should block this tag -- `negative_examples`: Texts that should **not** match this tag (used as embedding negatives) +## The manifest declares the transform — not the taxonomy -Example: +`govbot.yml` is a project **manifest**. It has **no `tags:` block**. The +classify transform is declared under `transforms:` and points at a fastclass +classifier bundle by path: ```yaml -tags: - education: - description: > - Legislation related to schools, education funding, curriculum standards, - teacher certification, higher education policy, student loans, charter schools - threshold: 0.6 - examples: - - School funding bill - - Teacher certification requirements - include_keywords: - - education - - school funding - - curriculum - exclude_keywords: - - driver education - negative_examples: - - Resolution honoring local high school sports teams +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier # path to the fastclass bundle (classifier.yml) ``` -## Getting the Model Files - -To use embedding mode, you need: +The tag taxonomy — descriptions, examples, keywords, thresholds, fusion +weights — lives entirely inside the fastclass classifier bundle's +`classifier.yml`, owned and versioned separately. See the fastclass docs and +its Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`) for building +and improving a bundle. -1. 
**ONNX Model**: Convert a sentence transformer model to ONNX +## Prerequisite - ```bash - # Using optimum-cli (requires Python) - pip install optimum[onnxruntime] - optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 minilm-l6-v2-onnx/ - ``` +The `fastclass` binary must be resolvable on `PATH`, `~/.cargo/bin`, or +`~/.govbot/bin`: -2. **Tokenizer**: The `tokenizer.json` file is included in the exported model directory. - -3. **Copy files**: Place `model.onnx` and `tokenizer.json` in your working directory (or in the directory pointed to by `--govbot-dir` / `GOVBOT_DIR`). +```bash +cd && cargo install --path . +``` -If either file is missing or cannot be loaded, govbot will **still run** using the keyword-based fallback described above. +govbot's transform runner resolves transform binaries the same way. ## Output -Tagged results are written to per-tag files under the session’s `tags/` directory: +`govbot apply` writes per-tag files under each session's `tags/` directory: ```text country:us/state:{state}/sessions/{session_id}/tags/{tag_name}.tag.json @@ -95,15 +67,9 @@ country:us/state:{state}/sessions/{session_id}/tags/{tag_name}.tag.json Each `{tag_name}.tag.json` file contains: -- `metadata`: Model info, last run timestamp, hash of the tag config -- `tag_config`: The tag definition as used on the last run -- `text_cache`: Deduplicated bill/log texts keyed by content hash -- `bills`: Map of bill identifiers to their `ScoreBreakdown` - -`ScoreBreakdown` includes: +- `metadata`: classifier info, last-run timestamp, tag-config hash +- `tag_config`: a stub tag definition (the real taxonomy lives in the bundle) +- `text_cache`: deduplicated bill texts keyed by content hash +- `bills`: a map of bill identifiers to their `ScoreBreakdown` -- `final_score`: Final score used for threshold comparison -- `base_embedding`: Base embedding similarity (if embeddings were used) -- `example_similarity`: Max similarity to positive examples -- `keyword_match`: 
Whether include_keywords matched -- `negative_penalty`: Penalty applied from negative examples (if any) +`ScoreBreakdown.final_score` is fastclass's calibrated probability for the tag. diff --git a/actions/govbot/action.yml b/actions/govbot/action.yml index cc0890f1..ee80424f 100644 --- a/actions/govbot/action.yml +++ b/actions/govbot/action.yml @@ -1,24 +1,21 @@ name: "Govbot" -description: "Clone repos, tag bills, and create RSS feeds from govbot.yml configuration" +description: "Pull datasets, classify bills with fastclass, and publish feeds from a govbot.yml manifest" branding: icon: 'rss' color: 'orange' inputs: - tags: - description: 'Comma-separated list of tags to include in feed (default: all tags from govbot.yml)' - required: false limit: - description: 'Limit number of entries per feed' + description: 'Limit number of entries per published artifact (use "none" for all)' required: false output-dir: - description: 'Output directory for RSS feed (default: from govbot.yml build.output_dir)' + description: 'Output directory override for publishers' required: false output-file: - description: 'Output filename for RSS feed (default: from govbot.yml build.output_file)' + description: 'Output filename override for publishers' required: false govbot-dir: - description: 'Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var)' + description: 'Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var)' required: false outputs: @@ -39,7 +36,7 @@ runs: curl -fsSL "https://github.com/chihacknight/govbot/releases/download/nightly/govbot-linux-x86_64" \ -o "${{ github.action_path }}/bin/govbot" chmod +x "${{ github.action_path }}/bin/govbot" - + - name: Set GOVBOT_DIR id: set-govbot-dir shell: bash @@ -52,8 +49,8 @@ runs: fi echo "GOVBOT_DIR=$GOVBOT_DIR" >> $GITHUB_ENV echo "repos-dir=$GOVBOT_DIR/repos" >> $GITHUB_OUTPUT - - - name: Restore repos cache + + - name: Restore datasets cache id: cache-repos uses: actions/cache@v4 with: @@ -61,110 +58,88 @@ 
runs: key: govbot-repos-${{ runner.os }}-${{ hashFiles('govbot.yml') }} restore-keys: | govbot-repos-${{ runner.os }}- - + - name: Debug cache status shell: bash run: | echo "Cache hit: ${{ steps.cache-repos.outputs.cache-hit }}" if [ "${{ steps.cache-repos.outputs.cache-hit }}" == "true" ]; then - echo "✅ Using cached repos - will only update existing repos" + echo "Using cached datasets - will only update existing datasets" else - echo "❌ Cache miss - will clone all repos" + echo "Cache miss - will pull all datasets" fi - - - name: Clone legislation repositories - id: clone + + - name: Pull legislation datasets + id: pull shell: bash working-directory: ${{ github.workspace }} run: | - # Check if govbot.yml exists + # govbot.yml must exist at the repository root. if [ ! -f "${{ github.workspace }}/govbot.yml" ]; then echo "::error::govbot.yml not found in repository root" exit 1 fi - - # If cache was hit, govbot clone will just update existing repos (git pull) - # If cache was missed, govbot clone will do fresh clones - # When no repos are specified, govbot clone updates existing repos only + + # Cache hit: `govbot pull` (no args) updates existing datasets only. + # Cache miss: `govbot pull all` does fresh clones. if [ "${{ steps.cache-repos.outputs.cache-hit }}" == "true" ]; then - echo "📥 Cache hit - updating existing repositories..." - # Update existing repos (no args = update existing only) - ${{ github.action_path }}/bin/govbot clone \ + echo "Cache hit - updating existing datasets..." + ${{ github.action_path }}/bin/govbot pull \ --govbot-dir "$GOVBOT_DIR" || true else - echo "📥 Cache miss - cloning all repositories..." - # Clone all repos - ${{ github.action_path }}/bin/govbot clone all \ + echo "Cache miss - pulling all datasets..." 
+ ${{ github.action_path }}/bin/govbot pull all \ --govbot-dir "$GOVBOT_DIR" || true fi - - - name: Save repos cache + + - name: Save datasets cache if: steps.cache-repos.outputs.cache-hit != 'true' uses: actions/cache@v4 with: path: ${{ steps.set-govbot-dir.outputs.repos-dir }} key: govbot-repos-${{ runner.os }}-${{ hashFiles('govbot.yml') }} - - - name: Tag bills + + - name: Classify and apply shell: bash working-directory: ${{ github.workspace }} run: | - echo "🏷️ Tagging bills..." - ${{ github.action_path }}/bin/govbot logs | ${{ github.action_path }}/bin/govbot tag || true - - - name: Generate RSS feed + # Classification is delegated to fastclass, an external transform that + # must be resolvable on PATH / ~/.cargo/bin / ~/.govbot/bin. The + # classifier bundle path is declared in govbot.yml under + # transforms..classifier. + echo "Classifying bills (source | fastclass classify | apply)..." + ${{ github.action_path }}/bin/govbot source --select docs \ + | fastclass classify - \ + | ${{ github.action_path }}/bin/govbot apply || true + + - name: Publish feeds id: publish shell: bash working-directory: ${{ github.workspace }} run: | - # Build command arguments ARGS="" - - # Add tags if specified - if [ -n "${{ inputs.tags }}" ]; then - # Convert comma-separated to space-separated - TAGS=$(echo "${{ inputs.tags }}" | tr ',' ' ') - ARGS="$ARGS --tags $TAGS" - fi - - # Add limit if specified if [ -n "${{ inputs.limit }}" ]; then ARGS="$ARGS --limit ${{ inputs.limit }}" fi - - # Add output directory if specified if [ -n "${{ inputs.output-dir }}" ]; then ARGS="$ARGS --output-dir ${{ inputs.output-dir }}" fi - - # Add output file if specified if [ -n "${{ inputs.output-file }}" ]; then ARGS="$ARGS --output-file ${{ inputs.output-file }}" fi - - # Add govbot-dir if specified if [ -n "${{ inputs.govbot-dir }}" ]; then ARGS="$ARGS --govbot-dir ${{ inputs.govbot-dir }}" fi - - # Run build command - # govbot.yml is automatically found in workspace root - ${{ 
github.action_path }}/bin/govbot build $ARGS - - # Determine output path (read from govbot.yml or use defaults) - if [ -n "${{ inputs.output-dir }}" ]; then - OUTPUT_DIR="${{ inputs.output-dir }}" - else - # Try to read from govbot.yml, default to docs - OUTPUT_DIR=$(grep -A 5 "^build:" "${{ github.workspace }}/govbot.yml" | grep "output_dir:" | awk '{print $2}' | tr -d '"' || echo "docs") - fi - - if [ -n "${{ inputs.output-file }}" ]; then - OUTPUT_FILE="${{ inputs.output-file }}" - else - # Try to read from govbot.yml, default to feed.xml - OUTPUT_FILE=$(grep -A 5 "^build:" "${{ github.workspace }}/govbot.yml" | grep "output_file:" | awk '{print $2}' | tr -d '"' || echo "feed.xml") - fi - + + # Run the manifest's publishers. govbot.yml is found in the workspace root. + ${{ github.action_path }}/bin/govbot publish $ARGS + + # Determine the primary output path from the override inputs, defaulting + # to docs/feed.xml (the wizard's default RSS publisher). + OUTPUT_DIR="${{ inputs.output-dir }}" + [ -z "$OUTPUT_DIR" ] && OUTPUT_DIR="docs" + OUTPUT_FILE="${{ inputs.output-file }}" + [ -z "$OUTPUT_FILE" ] && OUTPUT_FILE="feed.xml" + echo "feed-path=${{ github.workspace }}/$OUTPUT_DIR/$OUTPUT_FILE" >> $GITHUB_OUTPUT echo "feed-dir=${{ github.workspace }}/$OUTPUT_DIR" >> $GITHUB_OUTPUT diff --git a/actions/govbot/data/registry.json b/actions/govbot/data/registry.json new file mode 100644 index 00000000..e88f79e2 --- /dev/null +++ b/actions/govbot/data/registry.json @@ -0,0 +1,62 @@ +{ + "$schema_version": "govbot-registry-1", + "description": "The govbot dataset registry. Maps a dataset identifier to the git repo that holds its data, the data schema it follows, and the glob that locates its records within the repo. Datasets are git repos; this index is 'npm/docker for government data'. Bundled as a default in the govbot binary and overridable from a URL via GOVBOT_REGISTRY_URL. 
See actions/govbot/REGISTRY.md.", + "default_namespace": "us-legislation", + "datasets": { + "us-legislation/al": {"git_url": "https://github.com/chn-openstates-files/al-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Alabama"}, + "us-legislation/ak": {"git_url": "https://github.com/chn-openstates-files/ak-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Alaska"}, + "us-legislation/az": {"git_url": "https://github.com/chn-openstates-files/az-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Arizona"}, + "us-legislation/ar": {"git_url": "https://github.com/chn-openstates-files/ar-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Arkansas"}, + "us-legislation/ca": {"git_url": "https://github.com/chn-openstates-files/ca-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "California"}, + "us-legislation/co": {"git_url": "https://github.com/chn-openstates-files/co-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Colorado"}, + "us-legislation/ct": {"git_url": "https://github.com/chn-openstates-files/ct-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Connecticut"}, + "us-legislation/de": {"git_url": "https://github.com/chn-openstates-files/de-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Delaware"}, + "us-legislation/fl": {"git_url": "https://github.com/chn-openstates-files/fl-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Florida"}, + "us-legislation/ga": {"git_url": "https://github.com/chn-openstates-files/ga-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Georgia"}, + "us-legislation/hi": {"git_url": "https://github.com/chn-openstates-files/hi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": 
"Hawaii"}, + "us-legislation/id": {"git_url": "https://github.com/chn-openstates-files/id-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Idaho"}, + "us-legislation/il": {"git_url": "https://github.com/chn-openstates-files/il-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Illinois"}, + "us-legislation/in": {"git_url": "https://github.com/chn-openstates-files/in-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Indiana"}, + "us-legislation/ia": {"git_url": "https://github.com/chn-openstates-files/ia-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Iowa"}, + "us-legislation/ks": {"git_url": "https://github.com/chn-openstates-files/ks-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Kansas"}, + "us-legislation/ky": {"git_url": "https://github.com/chn-openstates-files/ky-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Kentucky"}, + "us-legislation/la": {"git_url": "https://github.com/chn-openstates-files/la-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Louisiana"}, + "us-legislation/me": {"git_url": "https://github.com/chn-openstates-files/me-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Maine"}, + "us-legislation/md": {"git_url": "https://github.com/chn-openstates-files/md-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Maryland"}, + "us-legislation/ma": {"git_url": "https://github.com/chn-openstates-files/ma-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Massachusetts"}, + "us-legislation/mi": {"git_url": "https://github.com/chn-openstates-files/mi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Michigan"}, + "us-legislation/mn": {"git_url": 
"https://github.com/chn-openstates-files/mn-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Minnesota"}, + "us-legislation/ms": {"git_url": "https://github.com/chn-openstates-files/ms-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Mississippi"}, + "us-legislation/mo": {"git_url": "https://github.com/chn-openstates-files/mo-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Missouri"}, + "us-legislation/mt": {"git_url": "https://github.com/chn-openstates-files/mt-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Montana"}, + "us-legislation/ne": {"git_url": "https://github.com/chn-openstates-files/ne-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Nebraska"}, + "us-legislation/nv": {"git_url": "https://github.com/chn-openstates-files/nv-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Nevada"}, + "us-legislation/nh": {"git_url": "https://github.com/chn-openstates-files/nh-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Hampshire"}, + "us-legislation/nj": {"git_url": "https://github.com/chn-openstates-files/nj-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Jersey"}, + "us-legislation/nm": {"git_url": "https://github.com/chn-openstates-files/nm-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Mexico"}, + "us-legislation/ny": {"git_url": "https://github.com/chn-openstates-files/ny-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New York"}, + "us-legislation/nc": {"git_url": "https://github.com/chn-openstates-files/nc-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "North Carolina"}, + "us-legislation/nd": {"git_url": "https://github.com/chn-openstates-files/nd-legislation.git", 
"schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "North Dakota"}, + "us-legislation/oh": {"git_url": "https://github.com/chn-openstates-files/oh-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Ohio"}, + "us-legislation/ok": {"git_url": "https://github.com/chn-openstates-files/ok-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Oklahoma"}, + "us-legislation/or": {"git_url": "https://github.com/chn-openstates-files/or-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Oregon"}, + "us-legislation/pa": {"git_url": "https://github.com/chn-openstates-files/pa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Pennsylvania"}, + "us-legislation/ri": {"git_url": "https://github.com/chn-openstates-files/ri-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Rhode Island"}, + "us-legislation/sc": {"git_url": "https://github.com/chn-openstates-files/sc-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "South Carolina"}, + "us-legislation/sd": {"git_url": "https://github.com/chn-openstates-files/sd-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "South Dakota"}, + "us-legislation/tn": {"git_url": "https://github.com/chn-openstates-files/tn-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Tennessee"}, + "us-legislation/tx": {"git_url": "https://github.com/chn-openstates-files/tx-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Texas"}, + "us-legislation/ut": {"git_url": "https://github.com/chn-openstates-files/ut-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Utah"}, + "us-legislation/vt": {"git_url": "https://github.com/chn-openstates-files/vt-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Vermont"}, 
+ "us-legislation/va": {"git_url": "https://github.com/chn-openstates-files/va-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Virginia"}, + "us-legislation/wa": {"git_url": "https://github.com/chn-openstates-files/wa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Washington"}, + "us-legislation/wv": {"git_url": "https://github.com/chn-openstates-files/wv-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "West Virginia"}, + "us-legislation/wi": {"git_url": "https://github.com/chn-openstates-files/wi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Wisconsin"}, + "us-legislation/wy": {"git_url": "https://github.com/chn-openstates-files/wy-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Wyoming"}, + "us-legislation/pr": {"git_url": "https://github.com/chn-openstates-files/pr-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Puerto Rico"}, + "us-legislation/mp": {"git_url": "https://github.com/chn-openstates-files/mp-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Northern Mariana Islands"}, + "us-legislation/vi": {"git_url": "https://github.com/chn-openstates-files/vi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "U.S. 
Virgin Islands"}, + "us-legislation/gu": {"git_url": "https://github.com/chn-openstates-files/gu-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Guam"}, + "us-legislation/usa": {"git_url": "https://github.com/chn-openstates-files/usa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "United States Congress"} + } +} diff --git a/actions/govbot/examples/govbot-clone-list.sh b/actions/govbot/examples/govbot-clone-list.sh deleted file mode 100644 index 843c33e2..00000000 --- a/actions/govbot/examples/govbot-clone-list.sh +++ /dev/null @@ -1 +0,0 @@ -govbot clone --list \ No newline at end of file diff --git a/actions/govbot/examples/govbot-pull-list.sh b/actions/govbot/examples/govbot-pull-list.sh new file mode 100644 index 00000000..182ab9d1 --- /dev/null +++ b/actions/govbot/examples/govbot-pull-list.sh @@ -0,0 +1 @@ +govbot pull --list \ No newline at end of file diff --git a/actions/govbot/examples/logs-basic.sh b/actions/govbot/examples/logs-basic.sh deleted file mode 100644 index 47118436..00000000 --- a/actions/govbot/examples/logs-basic.sh +++ /dev/null @@ -1 +0,0 @@ -govbot logs diff --git a/actions/govbot/examples/source-basic.sh b/actions/govbot/examples/source-basic.sh new file mode 100644 index 00000000..caa69b00 --- /dev/null +++ b/actions/govbot/examples/source-basic.sh @@ -0,0 +1 @@ +govbot source diff --git a/actions/govbot/justfile b/actions/govbot/justfile index 54ed620e..6fde2f32 100644 --- a/actions/govbot/justfile +++ b/actions/govbot/justfile @@ -56,9 +56,9 @@ build-release: # Usage: just govbot [COMMAND] [ARGS...] 
# Examples: # just govbot --help -# just govbot clone usa il -# just govbot clone --govbot-dir custom-dir usa -# just govbot logs --repos usa +# just govbot pull usa il +# just govbot pull --govbot-dir custom-dir usa +# just govbot source --repos usa govbot *ARGS: #!/usr/bin/env bash set -e @@ -140,38 +140,33 @@ run: run-args ARGS: cargo run -- {{ARGS}} -# Tag bills using AI (reads JSON lines from stdin) -# Usage: just govbot logs --repos il --limit 10 | just govbot tag --ai-tool "ollama run llama3" -# Example: just govbot logs --repos il --limit 10 | just govbot tag --ai-tool "ollama" -# Note: The tag command reads from stdin, so pipe the logs output to it -tag *ARGS: +# Apply fastclass classification results (reads result JSON from stdin) +# Usage: just govbot source --select docs | fastclass classify - | just apply +# Note: The apply command reads from stdin, so pipe the classify output to it +apply *ARGS: #!/usr/bin/env bash set -e - + DEV_DIR=".govbot" - + # Build release binary if it doesn't exist or if any source files are newer if [ ! -f "target/release/govbot" ] || [ "src" -nt "target/release/govbot" ] || find src -name "*.rs" -newer "target/release/govbot" 2>/dev/null | grep -q .; then echo "🔨 Building release target..." cargo build --release fi - + # Check if --govbot-dir is already in the arguments ARGS_STR="{{ARGS}}" if [[ "$ARGS_STR" =~ --govbot-dir ]]; then - ./target/release/govbot tag {{ARGS}} + ./target/release/govbot apply {{ARGS}} else - ./target/release/govbot tag {{ARGS}} --govbot-dir "$DEV_DIR" + ./target/release/govbot apply {{ARGS}} --govbot-dir "$DEV_DIR" fi # Run the release binary run-release: cargo run --release -# Generate locale enum from pipeline-manager config -generate: - cargo run --bin generate-locale-enum - # Update mocks by cloning/pulling repos and cleaning them up # Usage: just mocks [LOCALES...] 
# Example: just mocks usa il @@ -199,10 +194,10 @@ mocks *LOCALES: # Ensure binary is built cargo build --bin govbot - # Clone/pull repositories using govbot + # Pull datasets using govbot echo "" - echo "📥 Cloning/pulling repositories..." - cargo run --bin govbot -- clone --govbot-dir "$MOCKS_DIR" $LOCALES + echo "📥 Pulling datasets..." + cargo run --bin govbot -- pull --govbot-dir "$MOCKS_DIR" $LOCALES # Cleanup functions function delete_files_dir { diff --git a/actions/govbot/src/bin/generate-locale-enum.rs b/actions/govbot/src/bin/generate-locale-enum.rs deleted file mode 100644 index 6573339a..00000000 --- a/actions/govbot/src/bin/generate-locale-enum.rs +++ /dev/null @@ -1,204 +0,0 @@ -//! Generate a Rust enum from working locales in pipeline-manager config.yml -//! Run with: cargo run --bin generate-locale-enum - -use serde::Deserialize; -use std::collections::HashMap; -use std::fs; -use std::path::PathBuf; - -#[derive(Debug, Deserialize)] -struct Config { - locales: HashMap<String, LocaleConfig>, -} - -#[derive(Debug, Deserialize)] -struct LocaleConfig { - #[allow(dead_code)] - template: String, - #[serde(default)] - labels: Vec<String>, -} - -fn locale_to_variant(locale: &str) -> String { - // Two-letter codes should be uppercase (e.g., 'ar' -> 'AR', 'pr' -> 'PR') - // Longer codes should be capitalized (e.g., 'usa' -> 'Usa') - if locale.len() <= 2 { - locale.to_uppercase() - } else { - let mut chars = locale.chars(); - match chars.next() { - None => String::new(), - Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(), - } - } -} - -fn get_working_locales(config_path: &PathBuf) -> Result<Vec<String>, Box<dyn std::error::Error>> { - let content = fs::read_to_string(config_path)?; - let config: Config = serde_yaml::from_str(&content)?; - - let mut working_locales: Vec<String> = config - .locales - .into_iter() - .filter(|(_, locale_config)| locale_config.labels.contains(&"working".to_string())) - .map(|(locale, _)| locale) - .collect(); - - working_locales.sort(); - Ok(working_locales) -} - -fn generate_rust_enum( -
locales: &[String], - output_path: &PathBuf, -) -> Result<(), Box<dyn std::error::Error>> { - // Create mapping of locale -> variant name - let locale_variants: Vec<(String, String)> = locales - .iter() - .map(|loc| (loc.clone(), locale_to_variant(loc))) - .collect(); - - // Generate enum variants - let enum_variants: Vec<String> = locale_variants - .iter() - .map(|(_, variant)| format!(" {},", variant)) - .collect(); - - // Generate match arms for as_str - let as_str_arms: Vec<String> = locale_variants - .iter() - .map(|(locale, variant)| { - format!(" WorkingLocale::{} => \"{}\",", variant, locale) - }) - .collect(); - - // Generate match arms for as_lowercase - let as_lowercase_arms: Vec<String> = locale_variants - .iter() - .map(|(locale, variant)| { - format!( - " WorkingLocale::{} => \"{}\",", - variant, - locale.to_lowercase() - ) - }) - .collect(); - - // Generate match arms for From<&str> - let from_str_arms: Vec<String> = locale_variants - .iter() - .map(|(locale, variant)| { - format!( - " \"{}\" => WorkingLocale::{},", - locale.to_lowercase(), - variant - ) - }) - .collect(); - - // Generate all() vector items - let all_items: Vec<String> = locale_variants - .iter() - .map(|(_, variant)| format!(" WorkingLocale::{},", variant)) - .collect(); - - let rust_code = format!( - r#"//! Auto-generated locale enum from pipeline-manager config.yml -//! This file is generated by src/bin/generate-locale-enum.rs -//! 
Do not edit manually - regenerate using: just generate - -/// Locale codes for working pipelines -#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)] -#[serde(rename_all = "lowercase")] -pub enum WorkingLocale {{ - All, -{} -}} - -impl WorkingLocale {{ - /// Get all working locales as a vector (excludes All variant) - pub fn all() -> Vec<WorkingLocale> {{ - vec![ -{} - ] - }} - - /// Get the locale code as a string - pub fn as_str(&self) -> &'static str {{ - match self {{ - WorkingLocale::All => "all", -{} - }} - }} - - /// Get the locale code in lowercase - pub fn as_lowercase(&self) -> &'static str {{ - match self {{ - WorkingLocale::All => "all", -{} - }} - }} -}} - -impl From<&str> for WorkingLocale {{ - fn from(s: &str) -> Self {{ - match s.to_lowercase().as_str() {{ - "all" => WorkingLocale::All, -{} - _ => panic!("Invalid working locale: {{}}", s), - }} - }} -}} - -impl std::fmt::Display for WorkingLocale {{ - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {{ - write!(f, "{{}}", self.as_lowercase()) - }} -}} -"#, - enum_variants.join("\n"), - all_items.join("\n"), - as_str_arms.join("\n"), - as_lowercase_arms.join("\n"), - from_str_arms.join("\n") - ); - - fs::write(output_path, rust_code)?; - Ok(()) -} - -fn main() -> Result<(), Box<dyn std::error::Error>> { - // Get paths relative to the binary location - // manifest_dir is /Users/sartaj/Git/toolkit/actions/govbot - // config is at /Users/sartaj/Git/toolkit/actions/pipeline-manager/chn-openstates-files.yml - let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); - let config_path = manifest_dir - .parent() - .unwrap() // /Users/sartaj/Git/toolkit/actions - .join("pipeline-manager") - .join("chn-openstates-files.yml"); - let output_path = manifest_dir.join("src").join("locale_generated.rs"); - - if !config_path.exists() { - eprintln!( - "❌ Error: Config file not found at {}", - config_path.display() - ); - std::process::exit(1); - } - - let locales = 
get_working_locales(&config_path)?; - if locales.is_empty() { - eprintln!("⚠️ Warning: No working locales found in config"); - } - - generate_rust_enum(&locales, &output_path)?; - println!( - "✅ Generated {} with {} working locales", - output_path.display(), - locales.len() - ); - println!("📋 Working locales: {}", locales.join(", ")); - - Ok(()) -} diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs new file mode 100644 index 00000000..65501906 --- /dev/null +++ b/actions/govbot/src/bluesky.rs @@ -0,0 +1,539 @@ +//! The `bluesky` publisher — posts matched bills to a Bluesky account. +//! +//! This is a **posting bot**, not a hosted AT-Protocol feed-generator service: +//! it posts to a normal Bluesky account via the XRPC API and runs to +//! completion (from cron/CI), so it needs no always-on server. +//! +//! Flow: +//! 1. Authenticate — `com.atproto.server.createSession` with the account +//! handle + an **app password** read from the environment. +//! 2. Select records — keep those carrying a `select`ed tag whose calibrated +//! `final_score` clears `min_score`. +//! 3. For each record not already in the ledger, render a post (<=300 +//! chars) and `com.atproto.repo.createRecord` an `app.bsky.feed.post`. +//! 4. Append the record's id to the posted-state ledger so re-runs never +//! double-post. +//! +//! `--dry-run` renders the posts that *would* be sent and touches no network +//! and no ledger. +//! +//! Credentials are **environment-only** — never read from `govbot.yml`: +//! - `BLUESKY_HANDLE` — the account handle, e.g. `mybot.bsky.social` +//! - `BLUESKY_APP_PASSWORD` — an app password (Settings → App Passwords), +//! never the main account password +//! 
- `BLUESKY_SERVICE` — optional PDS base URL (default `https://bsky.social`) + +use crate::publish::PublishJob; +use anyhow::{Context, Result}; +use serde_json::{json, Value}; +use std::collections::HashSet; +use std::fs; +use std::io::Write; +use std::path::{Path, PathBuf}; + +/// Bluesky's hard post-text limit (graphemes; we approximate with chars). +const POST_TEXT_LIMIT: usize = 300; + +/// Default PDS service endpoint when `BLUESKY_SERVICE` is unset. +const DEFAULT_SERVICE: &str = "https://bsky.social"; + +/// The default post-text template. Kept deliberately simple — a future +/// `summarize` transform will improve framing. +const DEFAULT_TEMPLATE: &str = "{title}\n\n{tags} · {link}"; + +/// A post ready to be sent: the routing key (ledger id) plus rendered text. +#[derive(Debug)] +struct RenderedPost { + /// The ledger key — a stable per-record id (the entry GUID). + id: String, + /// The post body, already truncated to the Bluesky limit. + text: String, +} + +/// Run the `bluesky` publisher against its result stream. +/// +/// `dry_run` renders the would-be posts and touches no network and no ledger. +pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { + let p = job.publisher; + let select = p.select.clone().unwrap_or_default(); + let min_score = p.resolved_min_score(); + + // Resolve the ledger path (project-dir relative). Default: a per-publisher + // file under `.govbot/`. + let ledger_path = resolve_ledger_path(job); + + // Select records: a `select`ed tag must clear the calibrated threshold. 
+ let posts: Vec<RenderedPost> = job + .entries + .iter() + .filter(|e| record_clears_threshold(e, &select, min_score)) + .map(|e| render_post(e, p.post_template.as_deref())) + .collect(); + + if posts.is_empty() { + eprintln!( + "Publisher '{}' (bluesky): no records cleared min_score {} for tags {} — nothing to post.", + job.name, + min_score, + if select.is_empty() { "".to_string() } else { select.join(", ") } + ); + return Ok(()); + } + + // Idempotency: drop records already in the posted-state ledger. + let already_posted = read_ledger(&ledger_path)?; + let pending: Vec<&RenderedPost> = posts + .iter() + .filter(|post| !already_posted.contains(&post.id)) + .collect(); + + if dry_run { + eprintln!( + "Publisher '{}' (bluesky) --dry-run: {} record(s) cleared the threshold, \ + {} already posted, {} would be posted. No network, no ledger writes.", + job.name, + posts.len(), + posts.len() - pending.len(), + pending.len(), + ); + for (i, post) in pending.iter().enumerate() { + println!("--- post {} of {} (id: {}) ---", i + 1, pending.len(), post.id); + println!("{}", post.text); + println!(); + } + return Ok(()); + } + + if pending.is_empty() { + eprintln!( + "Publisher '{}' (bluesky): all {} matching record(s) already posted — nothing new.", + job.name, + posts.len() + ); + return Ok(()); + } + + // Authenticate — credentials are environment-only. + let service = std::env::var("BLUESKY_SERVICE") + .ok() + .filter(|s| !s.trim().is_empty()) + .unwrap_or_else(|| DEFAULT_SERVICE.to_string()); + let session = create_session(&service) + .context("Bluesky authentication failed")?; + + eprintln!( + "Publisher '{}' (bluesky): authenticated as {} — posting {} record(s) to {}.", + job.name, session.handle, pending.len(), service + ); + + // Post each pending record, appending to the ledger as we go so a + // mid-run failure never re-posts what already succeeded.
+ let mut posted = 0usize; + for post in &pending { + match create_post(&service, &session, &post.text) { + Ok(uri) => { + append_ledger(&ledger_path, &post.id)?; + posted += 1; + eprintln!(" ✓ posted {} -> {}", post.id, uri); + } + Err(e) => { + // Fail loudly but stop — leave the rest for the next run + // rather than hammering a failing endpoint. + anyhow::bail!( + "Publisher '{}' (bluesky): posted {}/{} record(s); failed on {}: {}", + job.name, + posted, + pending.len(), + post.id, + e + ); + } + } + } + + eprintln!( + "✓ Publisher '{}' (bluesky): posted {} record(s); ledger at {}", + job.name, + posted, + ledger_path.display() + ); + Ok(()) +} + +// ============================================================ +// Record selection + post rendering +// ============================================================ + +/// True when the record carries a `select`ed tag whose calibrated +/// `final_score` clears `min_score`. When `select` is empty, any tag counts. +/// +/// The `tags` field is a map `tag_name -> ScoreBreakdown`; the calibrated +/// probability is `tags..final_score` (STREAM_PROTOCOL §5). +fn record_clears_threshold(entry: &Value, select: &[String], min_score: f64) -> bool { + let tags = match entry.get("tags").and_then(|t| t.as_object()) { + Some(t) if !t.is_empty() => t, + _ => return false, + }; + tags.iter().any(|(name, score)| { + let selected = select.is_empty() || select.iter().any(|s| s == name); + if !selected { + return false; + } + score + .get("final_score") + .and_then(|v| v.as_f64()) + .map(|s| s >= min_score) + .unwrap_or(false) + }) +} + +/// Render a record into post text, applying the template and truncating to +/// the Bluesky character limit. 
+fn render_post(entry: &Value, template: Option<&str>) -> RenderedPost { + let id = crate::rss::extract_guid(entry); + let template = template.unwrap_or(DEFAULT_TEMPLATE); + + let title = bill_title(entry); + let tags = entry + .get("tags") + .and_then(|t| t.as_object()) + .map(|m| m.keys().cloned().collect::<Vec<_>>().join(", ")) + .unwrap_or_default(); + let link = crate::rss::extract_link(entry, None).unwrap_or_default(); + let identifier = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + .unwrap_or("") + .to_string(); + let session = entry + .get("bill") + .and_then(|b| b.get("legislative_session")) + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string(); + let score = top_score(entry) + .map(|s| format!("{:.2}", s)) + .unwrap_or_default(); + + let text = template + .replace("{title}", &title) + .replace("{tags}", &tags) + .replace("{link}", &link) + .replace("{identifier}", &identifier) + .replace("{session}", &session) + .replace("{score}", &score); + + RenderedPost { + id, + text: truncate_post(&text), + } +} + +/// The highest calibrated `final_score` across a record's tags. +fn top_score(entry: &Value) -> Option<f64> { + entry + .get("tags") + .and_then(|t| t.as_object()) + .and_then(|tags| { + tags.values() + .filter_map(|s| s.get("final_score").and_then(|v| v.as_f64())) + .fold(None, |acc, s| Some(acc.map_or(s, |a: f64| a.max(s)))) + }) +} + +/// Best-effort bill title — the bill's `title`, else its identifier, else a +/// generic fallback.
+fn bill_title(entry: &Value) -> String { + if let Some(t) = entry + .get("bill") + .and_then(|b| b.get("title")) + .and_then(|v| v.as_str()) + { + let t = t.trim(); + if !t.is_empty() { + return t.to_string(); + } + } + if let Some(id) = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + { + if !id.is_empty() { + return id.to_string(); + } + } + "Legislative update".to_string() +} + +/// Truncate post text to the Bluesky limit, appending an ellipsis when cut. +fn truncate_post(text: &str) -> String { + let trimmed = text.trim(); + if trimmed.chars().count() <= POST_TEXT_LIMIT { + return trimmed.to_string(); + } + let mut out: String = trimmed.chars().take(POST_TEXT_LIMIT - 1).collect(); + // Avoid cutting mid-word where reasonable. + if let Some(idx) = out.rfind(char::is_whitespace) { + if idx > POST_TEXT_LIMIT / 2 { + out.truncate(idx); + } + } + format!("{}…", out.trim_end()) +} + +// ============================================================ +// Posted-state ledger (idempotency) +// ============================================================ + +/// Resolve the ledger file path: the publisher's `ledger` field if set, +/// else `/.govbot/bluesky-.ledger`. Relative paths resolve +/// against the project directory (where `govbot.yml` lives). +fn resolve_ledger_path(job: &PublishJob) -> PathBuf { + match &job.publisher.ledger { + Some(p) => { + let p = PathBuf::from(p); + if p.is_absolute() { + p + } else { + job.project_dir.join(p) + } + } + None => job + .project_dir + .join(".govbot") + .join(format!("bluesky-{}.ledger", job.name)), + } +} + +/// Read the set of already-posted record ids from the ledger. A missing +/// ledger is an empty set (first run). The ledger is append-only, +/// newline-delimited, one record id per line. 
+fn read_ledger(path: &Path) -> Result<HashSet<String>> { + if !path.exists() { + return Ok(HashSet::new()); + } + let contents = fs::read_to_string(path) + .with_context(|| format!("Failed to read posted-state ledger: {}", path.display()))?; + Ok(contents + .lines() + .map(str::trim) + .filter(|l| !l.is_empty()) + .map(|l| l.to_string()) + .collect()) +} + +/// Append a posted record id to the ledger, creating it (and its parent +/// directory) if needed. +fn append_ledger(path: &Path, id: &str) -> Result<()> { + if let Some(parent) = path.parent() { + fs::create_dir_all(parent).with_context(|| { + format!("Failed to create ledger directory: {}", parent.display()) + })?; + } + let mut file = fs::OpenOptions::new() + .create(true) + .append(true) + .open(path) + .with_context(|| format!("Failed to open posted-state ledger: {}", path.display()))?; + writeln!(file, "{}", id) + .with_context(|| format!("Failed to append to ledger: {}", path.display()))?; + Ok(()) +} + +// ============================================================ +// AT Protocol XRPC +// ============================================================ + +/// An authenticated Bluesky session. +struct Session { + /// The bearer access token (`accessJwt`). + access_jwt: String, + /// The repo DID — the record owner for `createRecord`. + did: String, + /// The resolved account handle (for logging). + handle: String, +} + +/// Authenticate via `com.atproto.server.createSession`. +/// +/// Reads `BLUESKY_HANDLE` + `BLUESKY_APP_PASSWORD` from the environment; +/// these are required and never sourced from `govbot.yml`. +fn create_session(service: &str) -> Result<Session> { + let handle = require_env("BLUESKY_HANDLE")?; + let password = require_env("BLUESKY_APP_PASSWORD")?; + + let url = format!( + "{}/xrpc/com.atproto.server.createSession", + service.trim_end_matches('/') + ); + // `http_status_as_error(false)` keeps a non-2xx response an `Ok` so we can + // read its body for an actionable error; only transport errors are `Err`.
+ let response = ureq::post(&url) + .config() + .http_status_as_error(false) + .build() + .header("Content-Type", "application/json") + .send_json(json!({ "identifier": handle, "password": password })) + .context("createSession request failed")?; + + let status = response.status(); + let mut resp_body = response.into_body(); + if !status.is_success() { + let detail = resp_body + .read_to_string() + .unwrap_or_else(|_| "".to_string()); + anyhow::bail!( + "createSession returned HTTP {} — check BLUESKY_HANDLE / \ + BLUESKY_APP_PASSWORD (use an app password, not the main \ + password). Response: {}", + status.as_u16(), + detail + ); + } + let body: Value = resp_body + .read_json() + .context("Failed to parse createSession response")?; + + let access_jwt = body + .get("accessJwt") + .and_then(|v| v.as_str()) + .context("createSession response missing accessJwt")? + .to_string(); + let did = body + .get("did") + .and_then(|v| v.as_str()) + .context("createSession response missing did")? + .to_string(); + let handle = body + .get("handle") + .and_then(|v| v.as_str()) + .unwrap_or(&handle) + .to_string(); + + Ok(Session { access_jwt, did, handle }) +} + +/// Post one `app.bsky.feed.post` record via `com.atproto.repo.createRecord`. +/// Returns the AT URI of the created record. +fn create_post(service: &str, session: &Session, text: &str) -> Result { + let url = format!( + "{}/xrpc/com.atproto.repo.createRecord", + service.trim_end_matches('/') + ); + // RFC-3339 UTC timestamp, as the AT Protocol expects for `createdAt`. 
+ let created_at = chrono::Utc::now() + .format("%Y-%m-%dT%H:%M:%S%.3fZ") + .to_string(); + + let response = ureq::post(&url) + .config() + .http_status_as_error(false) + .build() + .header("Authorization", &format!("Bearer {}", session.access_jwt)) + .header("Content-Type", "application/json") + .send_json(json!({ + "repo": session.did, + "collection": "app.bsky.feed.post", + "record": { + "$type": "app.bsky.feed.post", + "text": text, + "createdAt": created_at, + } + })) + .context("createRecord request failed")?; + + let status = response.status(); + let mut resp_body = response.into_body(); + if !status.is_success() { + let detail = resp_body + .read_to_string() + .unwrap_or_else(|_| "".to_string()); + anyhow::bail!("createRecord returned HTTP {}: {}", status.as_u16(), detail); + } + let body: Value = resp_body + .read_json() + .context("Failed to parse createRecord response")?; + + Ok(body + .get("uri") + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string()) +} + +/// Read a required environment variable, with an actionable error message. +fn require_env(key: &str) -> Result { + std::env::var(key) + .ok() + .filter(|v| !v.trim().is_empty()) + .with_context(|| { + format!( + "the `bluesky` publisher needs the {key} environment variable. \ + Set BLUESKY_HANDLE and BLUESKY_APP_PASSWORD (an app password \ + from Bluesky Settings → App Passwords). Never put credentials \ + in govbot.yml." 
+ ) + }) +} + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + + #[test] + fn truncate_respects_limit() { + let long = "word ".repeat(100); + let out = truncate_post(&long); + assert!(out.chars().count() <= POST_TEXT_LIMIT); + assert!(out.ends_with('…')); + } + + #[test] + fn truncate_leaves_short_text_alone() { + assert_eq!(truncate_post(" hello "), "hello"); + } + + #[test] + fn threshold_selects_on_calibrated_score() { + let entry = json!({ + "tags": { "clean_energy": { "final_score": 0.8 } } + }); + assert!(record_clears_threshold(&entry, &[], 0.6)); + assert!(record_clears_threshold( + &entry, + &["clean_energy".to_string()], + 0.6 + )); + assert!(!record_clears_threshold(&entry, &[], 0.9)); + assert!(!record_clears_threshold( + &entry, + &["fossil_fuels".to_string()], + 0.6 + )); + } + + #[test] + fn threshold_rejects_untagged() { + assert!(!record_clears_threshold(&json!({}), &[], 0.0)); + assert!(!record_clears_threshold(&json!({ "tags": {} }), &[], 0.0)); + } + + #[test] + fn render_substitutes_template_placeholders() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { "title": "Renewable energy storage act", "identifier": "HB 1" }, + "tags": { "clean_energy": { "final_score": 0.92 } } + }); + let post = render_post(&entry, Some("{title} [{identifier}] {tags} {score}")); + assert!(post.text.contains("Renewable energy storage act")); + assert!(post.text.contains("[HB 1]")); + assert!(post.text.contains("clean_energy")); + assert!(post.text.contains("0.92")); + } +} diff --git a/actions/govbot/src/cache.rs b/actions/govbot/src/cache.rs new file mode 100644 index 00000000..54a0f21d --- /dev/null +++ b/actions/govbot/src/cache.rs @@ -0,0 +1,155 @@ +//! The shared, content-addressed dataset cache at `~/.govbot/cache/`. +//! +//! ## The problem this solves +//! +//! Before this, every govbot project cloned every dataset into its own +//! `/.govbot/repos/`. Ten climate projects on one laptop meant ten +//! 
clones of `wy-legislation`. The cache makes a dataset **cloned once per +//! machine**: the heavy git repo lives in `~/.govbot/cache/`, and a project's +//! `.govbot/repos/` is a lightweight reference into it. +//! +//! ## Layout +//! +//! ```text +//! ~/.govbot/ +//! cache/ +//! / a bare-ish working clone of one dataset@channel +//! registry.json most-recently fetched registry (see registry.rs) +//! ``` +//! +//! The cache **key** is content-addressed over the dataset's *identity* — its +//! git URL plus channel — as a short SHA-256 hex digest, prefixed with the +//! dataset's short name for human readability: +//! +//! ```text +//! wy-legislation-3f9a1c20e5b4 +//! us-counties__cook-7a2b... (a '/' in a namespace becomes '__') +//! ``` +//! +//! Keying on URL+channel (not on a resolved SHA) keeps the clone path stable +//! across `pull`s: the same dataset always maps to the same cache directory, +//! `git pull` updates it in place, and `govbot.lock` records the exact SHA. +//! A *second* `pull` in any project finds the cache populated and only fetches +//! deltas — no re-clone. +//! +//! ## How a project references the cache +//! +//! A project's `.govbot/repos/` is a symlink to the cache entry +//! (a plain directory copy is the fallback where symlinks are unavailable). +//! Downstream code (`source`, `load`) walks `.govbot/repos/` exactly as +//! before — it does not need to know the cache exists. + +use crate::error::{Error, Result}; +use sha2::{Digest, Sha256}; +use std::path::PathBuf; + +/// The govbot home directory: `~/.govbot`. Honors `GOVBOT_HOME` for tests. 
+pub fn govbot_home() -> Option<PathBuf> { + if let Some(explicit) = std::env::var_os("GOVBOT_HOME") { + let p = PathBuf::from(explicit); + if !p.as_os_str().is_empty() { + return Some(p); + } + } + std::env::var_os("HOME") + .or_else(|| std::env::var_os("USERPROFILE")) + .map(PathBuf::from) + .filter(|p| !p.as_os_str().is_empty()) + .map(|h| h.join(".govbot")) +} + +/// The shared content-addressed cache directory: `~/.govbot/cache`. +pub fn cache_dir() -> Result<PathBuf> { + let home = govbot_home() + .ok_or_else(|| Error::Config("Could not determine home directory for cache".into()))?; + Ok(home.join("cache")) +} + +/// Compute the content-addressed cache key for a dataset's identity. +/// +/// The key is `<short_name>-<digest>` where the digest is the first 12 hex +/// chars of `sha256(git_url + "@" + channel)`. A `/` in the short name (it +/// should not contain one, but be defensive) becomes `__`. +pub fn cache_key(short_name: &str, git_url: &str, channel: Option<&str>) -> String { + let mut hasher = Sha256::new(); + hasher.update(git_url.as_bytes()); + hasher.update(b"@"); + hasher.update(channel.unwrap_or("").as_bytes()); + let digest = hasher.finalize(); + let hex: String = digest.iter().take(6).map(|b| format!("{:02x}", b)).collect(); + let safe_name = short_name.replace('/', "__"); + format!("{}-{}", safe_name, hex) +} + +/// The absolute path of a dataset's entry in the shared cache. +pub fn cache_path(short_name: &str, git_url: &str, channel: Option<&str>) -> Result<PathBuf> { + Ok(cache_dir()?.join(cache_key(short_name, git_url, channel))) +} + +/// Link a project's `repos/` directory to a populated cache entry. +/// +/// Creates a symlink (cheap, shared) and returns an error if the link cannot +/// be created; no copy fallback is implemented here. Idempotent: an existing +/// correct link is a no-op; a stale link or pre-cache clone is replaced.
+pub fn link_into_project(cache_entry: &std::path::Path, project_repo: &std::path::Path) -> Result<()> { + if let Some(parent) = project_repo.parent() { + std::fs::create_dir_all(parent)?; + } + + // If the project repo path is already a symlink to the right place, done. + if let Ok(existing) = std::fs::read_link(project_repo) { + if existing == cache_entry { + return Ok(()); + } + // Stale symlink — remove it. + let _ = std::fs::remove_file(project_repo); + } else if project_repo.exists() { + // A real directory is sitting where the link should be (a pre-cache + // clone). Remove it so the cache becomes the single source of truth. + let _ = std::fs::remove_dir_all(project_repo); + } + + #[cfg(unix)] + { + std::os::unix::fs::symlink(cache_entry, project_repo).map_err(|e| { + Error::Config(format!( + "Failed to link cache entry {} into project: {}", + cache_entry.display(), + e + )) + })?; + return Ok(()); + } + + #[cfg(not(unix))] + { + std::os::windows::fs::symlink_dir(cache_entry, project_repo).map_err(|e| { + Error::Config(format!( + "Failed to link cache entry {} into project: {}", + cache_entry.display(), + e + )) + })?; + Ok(()) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn cache_key_is_stable_and_named() { + let k1 = cache_key("wy-legislation", "https://example.com/wy.git", None); + let k2 = cache_key("wy-legislation", "https://example.com/wy.git", None); + assert_eq!(k1, k2, "cache key must be deterministic"); + assert!(k1.starts_with("wy-legislation-")); + } + + #[test] + fn cache_key_differs_by_url_and_channel() { + let base = cache_key("wy", "https://a/wy.git", None); + assert_ne!(base, cache_key("wy", "https://b/wy.git", None)); + assert_ne!(base, cache_key("wy", "https://a/wy.git", Some("nightly"))); + } +} diff --git a/actions/govbot/src/config.rs b/actions/govbot/src/config.rs index dfbf0d4e..ace5c09c 100644 --- a/actions/govbot/src/config.rs +++ b/actions/govbot/src/config.rs @@ -1,5 +1,184 @@ use crate::error::{Error, 
Result}; -use std::path::PathBuf; +use serde::Deserialize; +use std::collections::BTreeMap; +use std::path::{Path, PathBuf}; + +// ============================================================ +// govbot.yml — the project manifest (datasets / transforms / +// publish / pipelines). This is the typed view of the schema in +// `schemas/govbot.schema.json`. It is the layer-2 contract config +// and is distinct from the pipeline-processor `Config` below +// (whose `repos` is CLI-arg state, not manifest state). +// ============================================================ + +/// A `govbot.yml` manifest. `additionalProperties: false` in the schema — +/// an unknown top-level key (notably the retired `tags:`) fails to parse. +#[derive(Debug, Clone, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct Manifest { + /// Optional `$schema` reference for editor autocomplete; ignored at runtime. + #[serde(default, rename = "$schema")] + pub schema: Option, + + /// Government-data sources the project pulls and processes. + pub datasets: Vec, + + /// Named external-process transforms, keyed by name. + #[serde(default)] + pub transforms: BTreeMap, + + /// Named publishers, keyed by name. + #[serde(default)] + pub publish: BTreeMap, + + /// Named `govbot run` targets — ordered lists of transform/publisher names. + #[serde(default)] + pub pipelines: BTreeMap>, +} + +/// A single external-process transform stage. +#[derive(Debug, Clone, Deserialize)] +pub struct Transform { + /// The external process to run. Either a shell-style string or an argv array. + pub command: Command_, + + /// The stream record kind this transform consumes (e.g. `docs`). + pub reads: String, + + /// The stream record kind this transform produces (e.g. `classification`). + pub writes: String, + + /// For a classify-style transform: the path to the fastclass classifier + /// bundle directory. govbot passes this path through unchanged. 
+ #[serde(default)] + pub classifier: Option, +} + +/// A transform `command`: either a single shell-style string or an argv array. +#[derive(Debug, Clone, Deserialize)] +#[serde(untagged)] +pub enum Command_ { + /// A single string, split on whitespace into argv. + Shell(String), + /// An explicit argv array (first element is the executable). + Argv(Vec<String>), +} + +impl Command_ { + /// Resolve to an argv vector. A `Shell` string is whitespace-split. + pub fn argv(&self) -> Vec<String> { + match self { + Command_::Shell(s) => s.split_whitespace().map(|s| s.to_string()).collect(), + Command_::Argv(v) => v.clone(), + } + } +} + +/// The publisher kind. Mirrors `govbot.schema.json`'s `publisher.type` enum. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Deserialize)] +#[serde(rename_all = "lowercase")] +pub enum PublisherKind { + Rss, + Html, + Json, + Duckdb, + /// Bluesky publisher: posts matched records over AT Protocol XRPC (see `bluesky.rs`). + Bluesky, +} + +/// A single publisher stage. Required fields depend on `type`. +#[derive(Debug, Clone, Deserialize)] +pub struct Publisher { + /// The publisher kind (`rss` / `html` / `json` / `duckdb` / `bluesky`). + #[serde(rename = "type")] + pub kind: PublisherKind, + + /// Tag names to include. Only records carrying one of these tags are + /// published; if omitted, all tagged records are published. + #[serde(default)] + pub select: Option<Vec<String>>, + + /// Base URL for generated links (required for `rss`/`html`). + #[serde(default)] + pub base_url: Option, + + /// Directory the publisher writes artifacts into (used by rss/html/json). + #[serde(default)] + pub output_dir: Option, + + /// Output filename for the primary artifact. + #[serde(default)] + pub output_file: Option, + + /// Custom feed/index title. + #[serde(default)] + pub title: Option, + + /// Custom feed/index description. + #[serde(default)] + pub description: Option, + + /// Maximum number of entries. The string `"none"` means no limit.
+ #[serde(default)] + pub limit: Option, + + // ---- bluesky-publisher fields ---------------------------------------- + // These configure the `bluesky` publisher only; other kinds ignore them. + // Credentials are NOT here — they are read from the environment + // (`BLUESKY_HANDLE` / `BLUESKY_APP_PASSWORD` / `BLUESKY_SERVICE`). + /// Minimum calibrated `final_score` a matched tag must reach for a record + /// to be posted. `final_score` is the contractually calibrated probability + /// from the fastclass result (STREAM_PROTOCOL §5). + #[serde(default)] + pub min_score: Option, + + /// Path to the append-only posted-state ledger that makes the publisher + /// idempotent — re-runs never double-post. Relative to the project + /// directory; defaults to `.govbot/bluesky-.ledger`. + #[serde(default)] + pub ledger: Option, + + /// Post-text template. `{placeholders}` are substituted per record: + /// `{title}`, `{tags}`, `{link}`, `{identifier}`, `{session}`, `{score}`. + /// If omitted, a sensible default template is used. + #[serde(default)] + pub post_template: Option, +} + +impl Publisher { + /// Resolve the calibrated-score threshold for the `bluesky` publisher. + /// Falls back to a conservative default so a misconfigured manifest does + /// not flood a feed with low-confidence matches. + pub fn resolved_min_score(&self) -> f64 { + self.min_score.unwrap_or(0.6) + } + + /// Resolve `limit` to an `Option`: `None` means unlimited, the + /// string `"none"` also means unlimited, an integer is the cap. + pub fn resolved_limit(&self, default: Option) -> Option { + match &self.limit { + None => default, + Some(serde_yaml::Value::String(s)) if s.eq_ignore_ascii_case("none") => None, + Some(serde_yaml::Value::String(s)) => s.parse().ok().or(default), + Some(serde_yaml::Value::Number(n)) => n.as_u64().map(|n| n as usize).or(default), + Some(_) => default, + } + } +} + +impl Manifest { + /// Load and parse a `govbot.yml` manifest. 
A manifest carrying the retired + /// `tags:` block (or any other unknown key) fails here via + /// `deny_unknown_fields`. + pub fn load(path: &Path) -> anyhow::Result { + use anyhow::Context; + let contents = std::fs::read_to_string(path) + .with_context(|| format!("Failed to read manifest: {}", path.display()))?; + let manifest: Manifest = serde_yaml::from_str(&contents) + .with_context(|| format!("Failed to parse govbot.yml manifest: {}", path.display()))?; + Ok(manifest) + } +} /// Sort order for log entries #[derive(Debug, Clone, Copy, PartialEq, Eq)] diff --git a/actions/govbot/src/embeddings.rs b/actions/govbot/src/embeddings.rs index feb68ab2..619194b9 100644 --- a/actions/govbot/src/embeddings.rs +++ b/actions/govbot/src/embeddings.rs @@ -1,30 +1,30 @@ -use ort::inputs; -use ort::session::Session; -use ort::value::Value; -use std::collections::HashMap; -use std::path::Path; -use tokenizers::Tokenizer; +//! Per-bill tag-file persistence types. +//! +//! govbot no longer classifies bills itself — classification is delegated to +//! `fastclass` (an external transform). What remains here is the on-disk +//! `.tag.json` format: `govbot apply` deserializes a `fastclass classify` +//! result and writes these structs into the dataset; `govbot publish` reads +//! them back. The ONNX embedding machinery that used to live in this module +//! has been removed. -use ndarray::Array1; -use regex::Regex; use serde::{Deserialize, Serialize}; use sha2::{Digest, Sha256}; +use std::collections::HashMap; -use crate::selectors::ocd_files_select_default; - -/// Breakdown of scoring components for a tag match +/// Breakdown of scoring components for a tag match. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct ScoreBreakdown { pub final_score: f64, pub base_embedding: Option, pub example_similarity: Option, - /// Keywords from include_keywords that matched in the text + /// Keywords from include_keywords that matched in the text. 
#[serde(default)] pub keyword_match: Vec, pub negative_penalty: f64, } -/// Tag file structure with metadata, text cache, and bill results +/// A per-tag `.tag.json` file: metadata, an optional text cache, and the +/// bills that matched the tag. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct TagFile { pub metadata: TagFileMetadata, @@ -34,7 +34,7 @@ pub struct TagFile { pub bills: HashMap, } -/// Metadata about the tag file +/// Metadata about a tag file. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct TagFileMetadata { pub last_run: String, @@ -42,21 +42,23 @@ pub struct TagFileMetadata { pub tag_config_hash: String, } -/// Result for a single bill +/// Result for a single bill within a tag file. #[derive(Debug, Clone, Serialize, Deserialize)] pub struct BillTagResult { pub text_hash: String, pub score: ScoreBreakdown, } -/// Hash text for deduplication +/// Hash text (SHA-256 hex) for deduplication / `tag_config` stamping. pub fn hash_text(text: &str) -> String { let mut hasher = Sha256::new(); hasher.update(text.as_bytes()); format!("{:x}", hasher.finalize()) } -/// Tag definition provided by the creator +/// A stub tag definition stamped into each tag file. The real taxonomy lives +/// in the fastclass classifier bundle, not here — `govbot apply` only records +/// the tag name. #[derive(Debug, Deserialize, Serialize, Clone)] pub struct TagDefinition { pub name: String, @@ -70,7 +72,7 @@ pub struct TagDefinition { pub exclude_keywords: Vec, #[serde(default)] pub negative_examples: Vec, - /// Minimum similarity score (0.0 - 1.0). Default to 0.5 if not provided. + /// Minimum similarity score (0.0 - 1.0). Defaults to 0.5. 
#[serde(default = "default_threshold")] pub threshold: f32, } @@ -78,419 +80,3 @@ pub struct TagDefinition { fn default_threshold() -> f32 { 0.5 } - -#[derive(Debug, Deserialize)] -pub struct RawTag { - #[serde(default)] - pub description: String, - #[serde(default)] - pub examples: Vec, - #[serde(default)] - pub include_keywords: Vec, - #[serde(default)] - pub exclude_keywords: Vec, - #[serde(default)] - pub negative_examples: Vec, - #[serde(default = "default_threshold")] - pub threshold: f32, -} - -#[derive(Debug, Deserialize)] -pub struct RawTagConfig { - pub tags: std::collections::HashMap, -} - -pub fn load_tags_config>(path: P) -> anyhow::Result> { - let contents = std::fs::read_to_string(path)?; - let raw: RawTagConfig = serde_yaml::from_str(&contents) - .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; - - let mut tags = Vec::new(); - for (name, raw_tag) in raw.tags { - tags.push(TagDefinition { - name, - description: raw_tag.description, - examples: raw_tag.examples, - include_keywords: raw_tag.include_keywords, - exclude_keywords: raw_tag.exclude_keywords, - negative_examples: raw_tag.negative_examples, - threshold: raw_tag.threshold, - }); - } - Ok(tags) -} - -/// Lightweight embedding service powered by ONNX Runtime -pub struct EmbeddingService { - session: Session, - tokenizer: Tokenizer, -} - -impl EmbeddingService { - pub fn new>(model_path: P, tokenizer_path: P) -> anyhow::Result { - let tokenizer = Tokenizer::from_file(tokenizer_path.as_ref()) - .map_err(|e| anyhow::anyhow!("Failed to load tokenizer: {}", e))?; - - let session = Session::builder()?.commit_from_file(model_path)?; - - Ok(Self { session, tokenizer }) - } - - /// Embed text using the configured model with mean pooling over last hidden state - pub fn embed(&mut self, text: &str) -> anyhow::Result> { - // Tokenize - let encoding = self - .tokenizer - .encode(text, true) - .map_err(|e| anyhow::anyhow!("Tokenizer encode failed: {}", e))?; - - let ids = 
encoding.get_ids(); - let mask = encoding.get_attention_mask(); - let type_ids = encoding.get_type_ids(); - - let input_ids: Vec = ids.iter().map(|&x| x as i64).collect(); - let attention_mask_vec: Vec = mask.iter().map(|&x| x as i64).collect(); - let token_type_vec: Vec = type_ids.iter().map(|&x| x as i64).collect(); - - let outputs = self.session.run(inputs![ - "input_ids" => Value::from_array((vec![1_i64, ids.len() as i64], input_ids))?, - "attention_mask" => Value::from_array((vec![1_i64, mask.len() as i64], attention_mask_vec))?, - "token_type_ids" => Value::from_array((vec![1_i64, type_ids.len() as i64], token_type_vec))?, - ])?; - - // Use last_hidden_state and mean-pool - let hidden = outputs["last_hidden_state"].try_extract_array::()?; - - // hidden shape: [batch, seq_len, hidden_dim] - let shape = hidden.shape(); - if shape.len() != 3 { - return Err(anyhow::anyhow!("Unexpected embedding shape {:?}", shape)); - } - let seq_len = shape[1]; - let hidden_dim = shape[2]; - - let mut pooled = vec![0f32; hidden_dim]; - for i in 0..seq_len { - for h in 0..hidden_dim { - pooled[h] += hidden[[0, i, h]]; - } - } - for h in 0..hidden_dim { - pooled[h] /= seq_len as f32; - } - let pooled = Array1::from(pooled); - - Ok(pooled) - } - - pub fn cosine_similarity(&self, a: &Array1, b: &Array1) -> f32 { - let dot = a.dot(b); - let norm_a = a.dot(a).sqrt(); - let norm_b = b.dot(b).sqrt(); - dot / (norm_a * norm_b).max(1e-9) - } -} - -/// Return all keywords from the list that appear in the text -/// (case-insensitive, word-boundary aware). 
-fn find_matching_keywords(text: &str, keywords: &[String]) -> Vec { - let text_lower = text.to_lowercase(); - let mut matches = Vec::new(); - - for keyword in keywords { - let keyword_lower = keyword.to_lowercase(); - // Check for exact word match or phrase match - // For multi-word keywords, use contains - // For single-word keywords, check word boundaries - let is_match = if keyword_lower.contains(' ') { - // Multi-word phrase: use contains - text_lower.contains(&keyword_lower) - } else { - // Single word: check word boundaries to avoid partial matches - // e.g., "trans" should not match "transport" or "transfer" - // But "lgbtq" should match "lgbtq+" (with punctuation) - let escaped = regex::escape(&keyword_lower); - let pattern = format!(r"\b{}(?:\+|\b)", escaped); - Regex::new(&pattern) - .map(|re| re.is_match(&text_lower)) - .unwrap_or_else(|_| text_lower.contains(&keyword_lower)) - }; - - if is_match { - matches.push(keyword.clone()); - } - } - - matches -} - -/// Matcher that precomputes tag embeddings and scores logs against them -pub struct TagMatcher { - embeddings: std::sync::Mutex, - tag_embeddings: HashMap>, - example_embeddings: HashMap>>, - negative_example_embeddings: HashMap>>, - tags: HashMap, -} - -impl TagMatcher { - pub fn from_files>( - model_path: P, - tokenizer_path: P, - tags_path: P, - ) -> anyhow::Result { - let mut embeddings = EmbeddingService::new(&model_path, &tokenizer_path)?; - - // Load tags YAML - let tag_defs = load_tags_config(tags_path)?; - - // Precompute tag embeddings - let mut tag_embeddings = HashMap::new(); - let mut example_embeddings = HashMap::new(); - let mut negative_example_embeddings = HashMap::new(); - let mut tags_map = HashMap::new(); - - for tag in tag_defs { - // Combine description + examples for richer embedding - let mut text = tag.description.clone(); - if !tag.examples.is_empty() { - text.push_str(" Examples: "); - text.push_str(&tag.examples.join(" | ")); - } - let emb = embeddings.embed(&text)?; - 
tag_embeddings.insert(tag.name.clone(), emb); - - // Precompute embeddings for individual examples - let mut example_embs = Vec::new(); - for example in &tag.examples { - let example_emb = embeddings.embed(example)?; - example_embs.push(example_emb); - } - example_embeddings.insert(tag.name.clone(), example_embs); - - // Precompute embeddings for negative examples - let mut neg_example_embs = Vec::new(); - for neg_example in &tag.negative_examples { - let neg_emb = embeddings.embed(neg_example)?; - neg_example_embs.push(neg_emb); - } - negative_example_embeddings.insert(tag.name.clone(), neg_example_embs); - - tags_map.insert(tag.name.clone(), tag); - } - - Ok(Self { - embeddings: std::sync::Mutex::new(embeddings), - tag_embeddings, - example_embeddings, - negative_example_embeddings, - tags: tags_map, - }) - } - - /// Calculate composite score using multiple signals - fn calculate_composite_score( - &self, - log_embedding: &Array1, - log_text: &str, - tag_name: &str, - tag_def: &TagDefinition, - embeddings: &mut EmbeddingService, - ) -> ScoreBreakdown { - // 4. Exclude keywords: zero out if exclude keywords match (check first). - // We don't currently expose which exclude keyword matched; we just block the tag. - if !tag_def.exclude_keywords.is_empty() { - let exclude_matches = find_matching_keywords(log_text, &tag_def.exclude_keywords); - if !exclude_matches.is_empty() { - return ScoreBreakdown { - final_score: 0.0, - base_embedding: None, - example_similarity: None, - keyword_match: Vec::new(), - negative_penalty: 0.0, - }; - } - } - - // 3. 
Include keywords: if keywords match, they have the heaviest impact - let include_matches = if tag_def.include_keywords.is_empty() { - Vec::new() - } else { - find_matching_keywords(log_text, &tag_def.include_keywords) - }; - let has_keyword_match = !include_matches.is_empty(); - - let mut score = 0.0; - let mut weight_sum = 0.0; - let mut base_embedding_score: Option = None; - let mut example_similarity_score: Option = None; - - // 1. Base score: embedding similarity to description + examples - // Industry standard: embeddings are the primary signal - if let Some(tag_emb) = self.tag_embeddings.get(tag_name) { - let base_score = embeddings.cosine_similarity(log_embedding, tag_emb); - base_embedding_score = Some(base_score); - // Weight embeddings less when keywords match (keywords will add boost) - let weight = if has_keyword_match { 0.35 } else { 0.5 }; - score += base_score * weight; - weight_sum += weight; - } - - // 2. Example similarity: max similarity to individual examples - if let Some(example_embs) = self.example_embeddings.get(tag_name) { - if !example_embs.is_empty() { - let max_example_score = example_embs - .iter() - .map(|example_emb| embeddings.cosine_similarity(log_embedding, example_emb)) - .fold(0.0f32, f32::max); - example_similarity_score = Some(max_example_score); - let weight = if has_keyword_match { 0.25 } else { 0.35 }; - score += max_example_score * weight; - weight_sum += weight; - } - } - - // 3. 
Keyword boost: additive boost when keywords match - // Keywords are explicit signals and should have strong weight - // This ensures keyword matches are strong but still respect embedding quality - if has_keyword_match { - // Strong boost for keywords - they are explicit signals from the tag definition - // Higher than typical industry systems because keywords are curated and highly reliable - let keyword_boost = 0.4; - score += keyword_boost; - weight_sum += keyword_boost; - } - - // Normalize the weighted combination - if weight_sum > 0.0 { - score = score / weight_sum; - } - - // If keywords matched, ensure minimum score meets threshold (before negative penalty) - // Keywords are explicit signals, so they should guarantee threshold unless negated - if has_keyword_match { - score = score.max(tag_def.threshold); - } - - // 5. Negative examples: penalty if too similar to negative examples - let mut negative_penalty = 0.0f32; - if let Some(neg_example_embs) = self.negative_example_embeddings.get(tag_name) { - if !neg_example_embs.is_empty() { - let max_neg_score = neg_example_embs - .iter() - .map(|neg_emb| embeddings.cosine_similarity(log_embedding, neg_emb)) - .fold(0.0f32, f32::max); - // Apply penalty: subtract up to 0.25 based on negative similarity - // Higher negative similarity = stronger penalty - negative_penalty = max_neg_score * 0.25; - score = (score - negative_penalty).max(0.0); - } - } - - // Clamp to [0, 1] - let final_score = score.min(1.0).max(0.0); - - ScoreBreakdown { - final_score: final_score as f64, - base_embedding: base_embedding_score.map(|s| s as f64), - example_similarity: example_similarity_score.map(|s| s as f64), - keyword_match: include_matches, - negative_penalty: negative_penalty as f64, - } - } - - /// Match a serde_json::Value log entry against tags, returning (tag, score_breakdown) - pub fn match_json_value( - &self, - value: &serde_json::Value, - ) -> anyhow::Result> { - let text = ocd_files_select_default(value); - let mut 
embeddings = self.embeddings.lock().unwrap(); - let log_embedding = embeddings.embed(&text)?; - - let mut results = Vec::new(); - for (name, tag_def) in &self.tags { - let score_breakdown = self.calculate_composite_score( - &log_embedding, - &text, - name, - tag_def, - &mut *embeddings, - ); - if score_breakdown.final_score >= tag_def.threshold as f64 { - results.push((name.clone(), score_breakdown)); - } - } - - // Sort descending by final score - results.sort_by(|a, b| { - b.1.final_score - .partial_cmp(&a.1.final_score) - .unwrap_or(std::cmp::Ordering::Equal) - }); - Ok(results) - } - - /// Access tag definitions (name -> definition) - pub fn tag_definitions(&self) -> &HashMap { - &self.tags - } -} - -/// Keyword-based fallback matcher when embedding mode is unavailable -/// Matches tags based on include_keywords and exclude_keywords from tag definitions -pub fn match_tags_keywords( - tag_defs: &[TagDefinition], - json_entry: &serde_json::Value, -) -> Vec<(String, ScoreBreakdown)> { - let text = ocd_files_select_default(json_entry); - let text_lower = text.to_lowercase(); - - let mut results = Vec::new(); - - for tag_def in tag_defs { - // Check exclude_keywords first - if any match, skip this tag - if !tag_def.exclude_keywords.is_empty() { - let exclude_matches = find_matching_keywords(&text_lower, &tag_def.exclude_keywords); - if !exclude_matches.is_empty() { - continue; - } - } - - // Check include_keywords - if any match, create a match - let include_matches = if tag_def.include_keywords.is_empty() { - Vec::new() - } else { - find_matching_keywords(&text_lower, &tag_def.include_keywords) - }; - - if !include_matches.is_empty() { - // If keywords match, assign a score based on threshold - // Use threshold as the base score, or 0.6 if threshold is lower - let score = tag_def.threshold.max(0.6) as f64; - - // Only include if score meets threshold - if score >= tag_def.threshold as f64 { - results.push(( - tag_def.name.clone(), - ScoreBreakdown { - final_score: 
score, - base_embedding: None, - example_similarity: None, - keyword_match: include_matches, - negative_penalty: 0.0, - }, - )); - } - } - } - - // Sort by score descending - results.sort_by(|a, b| { - b.1.final_score - .partial_cmp(&a.1.final_score) - .unwrap_or(std::cmp::Ordering::Equal) - }); - - results -} diff --git a/actions/govbot/src/git.rs b/actions/govbot/src/git.rs index de6c768d..86ad63d6 100644 --- a/actions/govbot/src/git.rs +++ b/actions/govbot/src/git.rs @@ -1,94 +1,32 @@ use crate::error::{Error, Result}; +use crate::registry::ResolvedDataset; use git2::{build::RepoBuilder, FetchOptions, RemoteCallbacks, Repository}; use std::fs; use std::path::{Path, PathBuf}; -// Repository URL template - fully configurable for any git hosting service +// ============================================================ +// Dataset git operations. // -// This template uses {locale} as a placeholder that will be replaced with the actual locale. -// You can configure it via the GOVBOT_REPO_URL_TEMPLATE environment variable. 
-// -// Examples: -// - GitHub: https://github.com/org/{locale}-suffix.git -// - GitLab: https://gitlab.com/org/{locale}-suffix.git -// - Bitbucket: https://bitbucket.org/org/{locale}-suffix.git -// - Self-hosted GitLab: https://git.example.com/group/{locale}-repo.git -// - Self-hosted Gitea: https://gitea.example.com/org/{locale}-data.git -// -// To use a custom URL template, set the environment variable: -// export GOVBOT_REPO_URL_TEMPLATE="https://gitlab.com/myorg/{locale}-data.git" -const DEFAULT_REPO_URL_TEMPLATE: &str = - "https://github.com/chn-openstates-files/{locale}-legislation.git"; - -/// Get the repository URL template from environment or use default -fn get_repo_url_template() -> String { - std::env::var("GOVBOT_REPO_URL_TEMPLATE") - .unwrap_or_else(|_| DEFAULT_REPO_URL_TEMPLATE.to_string()) -} - -/// Build the clone URL for a repository -pub fn build_clone_url(locale: &str) -> String { - let template = get_repo_url_template(); - template.replace("{locale}", locale) +// Datasets are git repos. Their URLs are NOT derived from a compiled locale +// enum or a `{locale}` URL template anymore — they are looked up at runtime in +// the dataset *registry* (`registry.rs`). A dataset is cloned ONCE per machine +// into the shared content-addressed cache (`cache.rs`); a project's +// `.govbot/repos/` is a symlink into that cache. +// ============================================================ + +/// The local directory name a dataset's clone is stored under, within a +/// project's `repos/` directory. This is the dataset's short (slash-free) +/// name plus the legacy `-legislation` data-repo suffix, so existing on-disk +/// layouts and downstream walkers (`source`, `load`) are unchanged. +/// +/// `wy` → `wy-legislation`. The suffix is overridable for tests/mocks via +/// `GOVBOT_REPO_SUFFIX` (the mock data uses `-data-pipeline`). 
+pub fn repo_dir_name(short_name: &str) -> String { + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + format!("{}{}", short_name, suffix) } -/// Extract repository name from URL template -/// For example: "https://github.com/org/{locale}-suffix.git" -> "{locale}-suffix" -fn extract_repo_name_pattern(template: &str) -> String { - // Extract the part after the last / and before .git - if let Some(start) = template.rfind('/') { - let after_slash = &template[start + 1..]; - if let Some(end) = after_slash.rfind(".git") { - after_slash[..end].to_string() - } else { - // No .git extension, just take after last / - after_slash.to_string() - } - } else { - // Fallback: return template as-is (might be just the pattern) - template.to_string() - } -} - -/// Extract organization/group from URL template -/// For example: "https://github.com/org/{locale}-suffix.git" -> "org" -fn extract_repo_org(template: &str) -> String { - // Extract the part between domain and repository name - // Format: https://domain.com/org/{locale}-suffix.git - if let Some(protocol_pos) = template.find("://") { - let after_protocol = &template[protocol_pos + 3..]; - if let Some(domain_end) = after_protocol.find('/') { - let after_domain = &after_protocol[domain_end + 1..]; - // Find the next / which should be before the repo name - if let Some(org_end) = after_domain.find('/') { - return after_domain[..org_end].to_string(); - } - // If no second /, the whole thing might be the org (unlikely but handle it) - if let Some(repo_start) = after_domain.find('{') { - return after_domain[..repo_start].trim_end_matches('/').to_string(); - } - } - } - // Fallback: return default org - "chn-openstates-files".to_string() -} - -/// Build the repository name (used for local directory names) -pub fn build_repo_name(locale: &str) -> String { - let template = get_repo_url_template(); - let pattern = extract_repo_name_pattern(&template); - pattern.replace("{locale}", 
locale) -} - -/// Build the repository path (org/repo-name format, used for display) -pub fn build_repo_path(locale: &str) -> String { - let template = get_repo_url_template(); - let org = extract_repo_org(&template); - let repo_name = build_repo_name(locale); - format!("{}/{}", org, repo_name) -} - -/// Get the default repos directory: $CWD/.govbot/repos +/// Get the default repos directory: `$CWD/.govbot/repos`. pub fn default_repos_dir() -> Result<PathBuf> { let cwd = std::env::current_dir() .map_err(|_| Error::Config("Could not determine current working directory.".to_string()))?; @@ -96,6 +34,17 @@ pub fn default_repos_dir() -> Result<PathBuf> { Ok(cwd.join(".govbot").join("repos")) } +/// The outcome of a clone/pull, plus the commit it landed on. +#[derive(Debug, Clone)] +pub struct PullOutcome { + /// `"clone"`, `"pulled"`, `"no_updates"`, or `"recloned"`. + pub action: &'static str, + /// The commit SHA the dataset is now checked out at. + pub commit: String, + /// The shared-cache key the dataset's clone lives under. + pub cache_key: String, +} + /// Build callbacks for git operations with optional token authentication fn build_callbacks(token: Option<&str>, show_progress: bool) -> RemoteCallbacks<'_> { let mut callbacks = RemoteCallbacks::new(); @@ -144,202 +93,205 @@ fn build_callbacks(token: Option<&str>, show_progress: bool) -> RemoteCallbacks< callbacks } -/// Clone or pull a repository for a given locale with quiet option -/// Returns action: "clone", "pulled", or "no_updates" -pub fn clone_or_pull_repo_quiet( - locale: &str, +/// Read the commit SHA `HEAD` currently resolves to in an open repository.
+fn head_commit(repo: &Repository) -> Result<String> { + let head = repo + .head() + .map_err(|e| Error::Config(format!("Failed to read HEAD: {}", e)))?; + let oid = head + .target() + .ok_or_else(|| Error::Config("HEAD has no commit target".to_string()))?; + Ok(oid.to_string()) +} + +/// Clone-or-pull a dataset into the shared content-addressed cache, then link +/// the cache entry into the project's `repos/` directory. +/// +/// This is the registry-driven replacement for the old locale-keyed +/// `clone_or_pull_repo_quiet`. It: +/// 1. resolves the dataset's cache key (URL + channel), +/// 2. clones into `~/.govbot/cache/` once, or `git pull`s it if present, +/// 3. symlinks `/` to that cache entry, +/// 4. returns the action taken plus the resolved commit SHA (for the lock). +/// +/// A second `pull` of the same dataset — in this or any other project — finds +/// the cache populated and only fetches deltas. +pub fn clone_or_pull_dataset( + dataset: &ResolvedDataset, repos_dir: &Path, token: Option<&str>, quiet: bool, -) -> Result<&'static str> { - let clone_url = build_clone_url(locale); - let repo_name = build_repo_name(locale); - let repo_path = build_repo_path(locale); - let target_dir = repos_dir.join(&repo_name); - let mut is_reclone = false; +) -> Result<PullOutcome> { + let short = dataset.short_name(); + let git_url = &dataset.entry.git_url; + let channel = dataset.channel.as_deref(); - // Check if repository already exists - if target_dir.exists() && Repository::open(&target_dir).is_ok() { - // Repository exists, pull instead - let repo = Repository::open(&target_dir) - .map_err(|e| Error::Config(format!("Failed to open repository: {}", e)))?; + let cache_entry = crate::cache::cache_path(short, git_url, channel)?; + let cache_key = crate::cache::cache_key(short, git_url, channel); + + let mut is_reclone = false; - // Pull the latest changes (credentials will be used if token is provided) + let outcome_action: &'static str = if cache_entry.exists() + &&
Repository::open(&cache_entry).is_ok() + { + // Cached already — pull deltas. + let repo = Repository::open(&cache_entry) + .map_err(|e| Error::Config(format!("Failed to open cached repository: {}", e)))?; match pull_repo_internal(&repo, token, quiet) { Ok(had_updates) => { - // Explicitly drop the repository to ensure all file handles are closed drop(repo); - - // Give the file system a moment to release all locks std::thread::sleep(std::time::Duration::from_millis(50)); - - return Ok(if had_updates { "pulled" } else { "no_updates" }); + if had_updates { + "pulled" + } else { + "no_updates" + } } Err(e) => { - // Check if this is a merge analysis error let error_msg = e.to_string(); if error_msg.contains("Failed to analyze merge") || error_msg.contains("object not found") { - // Close the repository first drop(repo); - - // Delete the corrupted repository and reclone if !quiet { - eprintln!( - "Merge analysis failed, deleting and recloning {}...", - repo_name - ); + eprintln!("Merge analysis failed, deleting and recloning {}...", short); } - - // Delete the repository - delete_repo(locale, repos_dir)?; - - // Mark that we're doing a reclone + remove_dir_all_robust(&cache_entry).map_err(|e| { + Error::Config(format!("Failed to clear corrupt cache entry: {}", e)) + })?; is_reclone = true; - - // Now fall through to clone it fresh + // fall through to clone + "" } else { - // For other errors, close repo and return the error drop(repo); return Err(e); } } } - } + } else { + "" + }; - // Remove existing directory if it exists (but is not a git repo) - if target_dir.exists() { - if !quiet { - eprintln!("Removing existing directory: {}", target_dir.display()); - } - std::fs::remove_dir_all(&target_dir)?; + // If the cache entry is populated and we already pulled, we are done with + // the heavy step — just link and report. 
+ if !outcome_action.is_empty() { + link_dataset(&cache_entry, repos_dir, short)?; + let repo = Repository::open(&cache_entry) + .map_err(|e| Error::Config(format!("Failed to reopen cached repository: {}", e)))?; + let commit = head_commit(&repo)?; + return Ok(PullOutcome { + action: outcome_action, + commit, + cache_key, + }); } - // Repository doesn't exist, clone it + // Clone into the cache. + if cache_entry.exists() { + // A non-repo directory is squatting the cache slot — clear it. + let _ = std::fs::remove_dir_all(&cache_entry); + } + if let Some(parent) = cache_entry.parent() { + std::fs::create_dir_all(parent)?; + } let mut fetch_options = FetchOptions::new(); - // Use a reasonable depth (50 commits) instead of depth=1 - // This provides enough history for merge analysis while still being faster than full clone - // 50 commits is typically enough for several weeks/months of history + // A 50-commit depth: enough history for merge analysis, faster than a + // full clone. fetch_options.depth(50); fetch_options.remote_callbacks(build_callbacks(token, !quiet)); let mut builder = RepoBuilder::new(); builder.fetch_options(fetch_options); + if let Some(channel) = channel { + builder.branch(channel); + } - builder.clone(&clone_url, &target_dir).map_err(|e| { - Error::Config(format!( - "Failed to shallow clone repository {}: {}", - repo_path, e - )) + builder.clone(git_url, &cache_entry).map_err(|e| { + Error::Config(format!("Failed to clone dataset {}: {}", dataset.id, e)) })?; - // After cloning, check if we need to set HEAD to main or master - let repo = Repository::open(&target_dir) + let repo = Repository::open(&cache_entry) .map_err(|e| Error::Config(format!("Failed to open cloned repository: {}", e)))?; - // Try to find the default branch (main or master) - // Check local branches first - let default_branch = if repo.find_branch("main", git2::BranchType::Local).is_ok() { - "main" - } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { - 
"master" - } else { - // Check remote branches - if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - // Create local main branch from remote - let remote_branch = repo.find_branch("origin/main", git2::BranchType::Remote)?; - let commit = remote_branch.get().target().ok_or_else(|| { - Error::Config("Failed to get commit from origin/main".to_string()) - })?; - let commit_obj = repo.find_commit(commit)?; - repo.branch("main", &commit_obj, false)?; - "main" - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - // Create local master branch from remote - let remote_branch = repo.find_branch("origin/master", git2::BranchType::Remote)?; - let commit = remote_branch.get().target().ok_or_else(|| { - Error::Config("Failed to get commit from origin/master".to_string()) - })?; - let commit_obj = repo.find_commit(commit)?; - repo.branch("master", &commit_obj, false)?; - "master" - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in repository".to_string(), - )); - } - }; - - // Set HEAD to the default branch if it's not already set correctly - if let Ok(head) = repo.head() { - if let Some(head_name) = head.name() { - if head_name != format!("refs/heads/{}", default_branch) { - // HEAD points to a different branch, update it - repo.set_head(&format!("refs/heads/{}", default_branch)) - .map_err(|e| { - Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)) - })?; - repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) - .map_err(|e| { - Error::Config(format!("Failed to checkout {}: {}", default_branch, e)) - })?; - } - } - } else { - // HEAD doesn't exist, set it to the default branch - repo.set_head(&format!("refs/heads/{}", default_branch)) - .map_err(|e| { - Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)) - })?; - repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) - .map_err(|e| 
Error::Config(format!("Failed to checkout {}: {}", default_branch, e)))?; + // Resolve to a sensible default branch (main/master) if no channel given. + if channel.is_none() { + ensure_default_branch(&repo)?; } - // Explicitly drop the repository to ensure all file handles are closed - // This is important on macOS where file handles can prevent deletion + let commit = head_commit(&repo)?; drop(repo); - - // Give the file system a moment to release all locks - // This helps on macOS where file handles might not be released immediately std::thread::sleep(std::time::Duration::from_millis(50)); - // Clear any progress line if !quiet { eprint!( "\r \r" ); } - // Return "recloned" if we deleted and recloned, otherwise "clone" - Ok(if is_reclone { "recloned" } else { "clone" }) -} + link_dataset(&cache_entry, repos_dir, short)?; -/// Clone or pull a repository for a given locale (clones if doesn't exist, pulls if it does) -pub fn clone_or_pull_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - clone_or_pull_repo_quiet(locale, repos_dir, token, false).map(|_| ()) + Ok(PullOutcome { + action: if is_reclone { "recloned" } else { "clone" }, + commit, + cache_key, + }) } -/// Clone a repository for a given locale (deprecated - use clone_or_pull_repo) -pub fn clone_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - clone_or_pull_repo(locale, repos_dir, token) +/// Link a populated cache entry into a project's `repos/` directory under the +/// dataset's `repo_dir_name`. 
+fn link_dataset(cache_entry: &Path, repos_dir: &Path, short_name: &str) -> Result<()> { + let project_repo = repos_dir.join(repo_dir_name(short_name)); + crate::cache::link_into_project(cache_entry, &project_repo) } -/// Clone a repository for a given locale with quiet option (deprecated - use clone_or_pull_repo_quiet) -pub fn clone_repo_quiet( - locale: &str, - repos_dir: &Path, - token: Option<&str>, - quiet: bool, -) -> Result<()> { - clone_or_pull_repo_quiet(locale, repos_dir, token, quiet).map(|_| ()) +/// Ensure a freshly cloned repo's HEAD points at `main` or `master`. +fn ensure_default_branch(repo: &Repository) -> Result<()> { + let default_branch = if repo.find_branch("main", git2::BranchType::Local).is_ok() { + "main" + } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { + "master" + } else if repo + .find_branch("origin/main", git2::BranchType::Remote) + .is_ok() + { + let remote_branch = repo.find_branch("origin/main", git2::BranchType::Remote)?; + let commit = remote_branch + .get() + .target() + .ok_or_else(|| Error::Config("Failed to get commit from origin/main".to_string()))?; + let commit_obj = repo.find_commit(commit)?; + repo.branch("main", &commit_obj, false)?; + "main" + } else if repo + .find_branch("origin/master", git2::BranchType::Remote) + .is_ok() + { + let remote_branch = repo.find_branch("origin/master", git2::BranchType::Remote)?; + let commit = remote_branch + .get() + .target() + .ok_or_else(|| Error::Config("Failed to get commit from origin/master".to_string()))?; + let commit_obj = repo.find_commit(commit)?; + repo.branch("master", &commit_obj, false)?; + "master" + } else { + return Err(Error::Config( + "Neither 'main' nor 'master' branch found in repository".to_string(), + )); + }; + + let needs_set = match repo.head() { + Ok(head) => head.name() != Some(&format!("refs/heads/{}", default_branch)[..]), + Err(_) => true, + }; + if needs_set { + repo.set_head(&format!("refs/heads/{}", default_branch)) + 
.map_err(|e| Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)))?; + repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) + .map_err(|e| Error::Config(format!("Failed to checkout {}: {}", default_branch, e)))?; + } + Ok(()) } /// Internal function to pull changes from a repository @@ -353,7 +305,8 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re let local_branch_name = head .name() .and_then(|name| name.strip_prefix("refs/heads/")) - .ok_or_else(|| Error::Config("Failed to determine local branch name".to_string()))?; + .ok_or_else(|| Error::Config("Failed to determine local branch name".to_string()))? + .to_string(); // Fetch from remote - try both main and master let mut remote = repo @@ -366,100 +319,70 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re let mut fetch_options = FetchOptions::new(); fetch_options.remote_callbacks(build_callbacks(token, !quiet)); - // If it's a shallow repo, we need to fetch more history for merge analysis to work - // The issue is that shallow clones only have 1 commit, so merge_analysis can't find - // the common ancestor. We need to fetch enough history to unshallow the repo. + // If it's a shallow repo, fetch more history so merge analysis can find the + // common ancestor — a shallow clone of 1 commit has none. if is_shallow { - // Fetch all refs to get full history - this unshallows the repository - // This ensures merge_analysis can find the common ancestor between local and remote let all_refs = vec!["+refs/*:refs/remotes/origin/*"]; let _ = remote.fetch(&all_refs, Some(&mut fetch_options), None); } - // Fetch both main and master branches (only fail if both fail) + // Fetch the current branch plus the usual defaults. 
+ let branch_refspec = format!( + "refs/heads/{0}:refs/remotes/origin/{0}", + local_branch_name + ); let refspecs = vec![ + branch_refspec.as_str(), "refs/heads/main:refs/remotes/origin/main", "refs/heads/master:refs/remotes/origin/master", ]; - // Try to fetch both branches - ignore errors for individual branches let fetch_result = remote.fetch(&refspecs, Some(&mut fetch_options), None); - // If fetch completely fails, return error if fetch_result.is_err() { - // Check if at least one branch exists remotely by trying to find them - let has_main = repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok(); - let has_master = repo - .find_branch("origin/master", git2::BranchType::Remote) + let has_branch = repo + .find_branch( + &format!("origin/{}", local_branch_name), + git2::BranchType::Remote, + ) .is_ok(); - - if !has_main && !has_master { + if !has_branch { return Err(Error::Config( - "Failed to fetch from remote and neither 'main' nor 'master' branch found" - .to_string(), + "Failed to fetch from remote and the tracked branch was not found".to_string(), )); } - // If at least one exists, continue (fetch might have partially succeeded) } - // Determine which remote branch to use based on local branch - // If local is main, use origin/main; if local is master, use origin/master - // Otherwise, prefer main over master - let (remote_branch_name, target_local_branch) = if local_branch_name == "main" { - if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } - } else if local_branch_name == "master" { - if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else if repo - .find_branch("origin/main", 
git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } + // Track the branch we are on; fall back to main/master if it's gone. + let (remote_branch_name, target_local_branch) = if repo + .find_branch( + &format!("origin/{}", local_branch_name), + git2::BranchType::Remote, + ) + .is_ok() + { + ( + format!("origin/{}", local_branch_name), + local_branch_name.clone(), + ) + } else if repo + .find_branch("origin/main", git2::BranchType::Remote) + .is_ok() + { + ("origin/main".to_string(), "main".to_string()) + } else if repo + .find_branch("origin/master", git2::BranchType::Remote) + .is_ok() + { + ("origin/master".to_string(), "master".to_string()) } else { - // Local branch is neither main nor master - prefer main, fallback to master - if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } + return Err(Error::Config( + "No tracked branch found in remote repository".to_string(), + )); }; let remote_branch = repo - .find_branch(remote_branch_name, git2::BranchType::Remote) + .find_branch(&remote_branch_name, git2::BranchType::Remote) .map_err(|e| { Error::Config(format!( "Failed to find remote branch {}: {}", @@ -477,13 +400,12 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re // If local branch doesn't match the target, switch to it if local_branch_name != target_local_branch { - // Check if local branch exists, if not create it if repo - .find_branch(target_local_branch, git2::BranchType::Local) + .find_branch(&target_local_branch, git2::BranchType::Local) .is_err() { let commit_obj = 
repo.find_commit(remote_commit)?; - repo.branch(target_local_branch, &commit_obj, false)?; + repo.branch(&target_local_branch, &commit_obj, false)?; } repo.set_head(&format!("refs/heads/{}", target_local_branch)) @@ -504,10 +426,8 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re .map_err(|e| Error::Config(format!("Failed to analyze merge: {}", e)))?; if analysis.0.is_up_to_date() { - // Already up to date - return Ok(false); + Ok(false) } else if analysis.0.is_fast_forward() { - // Fast-forward merge let mut reference = head .resolve() .map_err(|e| Error::Config(format!("Failed to resolve HEAD: {}", e)))?; @@ -518,65 +438,13 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re .map_err(|e| Error::Config(format!("Failed to set HEAD: {}", e)))?; repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) .map_err(|e| Error::Config(format!("Failed to checkout: {}", e)))?; - - // Updates were made - return Ok(true); + Ok(true) } else { - // Need to merge - return Err(Error::Config( + Err(Error::Config( "Repository has diverged and cannot be fast-forwarded. Please resolve manually." .to_string(), - )); - } -} - -/// Pull a repository for a given locale -pub fn pull_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - pull_repo_quiet(locale, repos_dir, token, false) -} - -/// Pull a repository for a given locale with quiet option -pub fn pull_repo_quiet( - locale: &str, - repos_dir: &Path, - token: Option<&str>, - quiet: bool, -) -> Result<()> { - let repo_name = build_repo_name(locale); - let repo_path = build_repo_path(locale); - let target_dir = repos_dir.join(&repo_name); - - let repo = match Repository::open(&target_dir) { - Ok(repo) => repo, - Err(_) => { - if !quiet { - eprintln!("Repository does not exist: {}. 
Skipping.", repo_path); - } - return Ok(()); - } - }; - - // Pull the latest changes (credentials will be used if token is provided) - if !quiet { - eprintln!("Pulling repository: {}", repo_path); - } - - pull_repo_internal(&repo, token, quiet)?; - - // Explicitly drop the repository to ensure all file handles are closed - drop(repo); - - // Give the file system a moment to release all locks - std::thread::sleep(std::time::Duration::from_millis(50)); - - // Clear any progress line - if !quiet { - eprint!( - "\r \r" - ); - eprintln!("Successfully pulled {}", repo_path); + )) } - Ok(()) } /// Calculate the size of a directory in bytes @@ -592,7 +460,6 @@ pub fn get_directory_size(path: &Path) -> Result { if metadata.is_file() { *total += metadata.len(); } else if metadata.is_dir() { - // Recursively calculate size of subdirectories for sub_entry in fs::read_dir(entry.path())? { let sub_entry = sub_entry?; calculate_size(&sub_entry, total)?; @@ -631,125 +498,51 @@ pub fn format_size(bytes: u64) -> String { } } -/// Get estimated remote repository size by doing a lightweight fetch -/// This fetches only refs and estimates size from transfer progress -pub fn get_remote_repo_size_estimate( - repo: &Repository, - token: Option<&str>, - _quiet: bool, -) -> Result { - use std::sync::{Arc, Mutex}; - - let mut remote = repo - .find_remote("origin") - .map_err(|e| Error::Config(format!("Failed to find remote 'origin': {}", e)))?; - - let size_estimate = Arc::new(Mutex::new(0u64)); - let size_estimate_clone = size_estimate.clone(); - - let mut fetch_options = FetchOptions::new(); - let token = token.map(|t| t.to_string()); - - let mut callbacks = RemoteCallbacks::new(); - callbacks.credentials(move |_url, _username, _allowed| { - if let Some(ref token) = token { - git2::Cred::userpass_plaintext("x-access-token", token) - } else { - git2::Cred::default() - } - }); - - // Track transfer progress to estimate size - callbacks.transfer_progress(move |stats| { - // received_bytes() 
gives us the total bytes received so far - let bytes = stats.received_bytes() as u64; - let mut size = size_estimate_clone.lock().unwrap(); - *size = bytes; - true - }); - - fetch_options.remote_callbacks(callbacks); - - // Do a lightweight fetch - fetch refs only, not objects - // This will give us size information without downloading everything - let _fetch_result = remote.fetch( - &["refs/heads/*:refs/remotes/origin/*"], - Some(&mut fetch_options), - None, - ); - - // Even if fetch fails, we might have gotten some size info - let estimated_size = *size_estimate.lock().unwrap(); - - if estimated_size > 0 { - Ok(estimated_size) - } else { - // Fallback: estimate from local pack files if they exist - let pack_dir = repo.path().join("objects").join("pack"); - if pack_dir.exists() { - Ok(get_directory_size(&pack_dir).unwrap_or(0)) - } else { - Ok(0) - } - } -} - -/// Extract suffix from URL template (everything after {locale}) -/// For example: "{locale}-legislation" -> "-legislation" -fn extract_repo_suffix(template: &str) -> String { - let pattern = extract_repo_name_pattern(template); - if let Some(locale_pos) = pattern.find("{locale}") { - // Get everything after {locale} - pattern[locale_pos + 8..].to_string() // 8 is length of "{locale}" - } else { - // Fallback: try common patterns - "-legislation".to_string() - } -} - -/// Get all available locale repositories in the repos directory -pub fn get_available_locales(repos_dir: &Path) -> Result> { +/// List the datasets locally present in a project's `repos/` directory, +/// returned as short names (the registry/manifest identifier form). +/// +/// A "dataset directory" is any directory (or symlink-to-directory) whose name +/// carries the dataset suffix — it need not be a live git repo, so mock data +/// and non-git extracts are listed too. 
+pub fn get_local_datasets(repos_dir: &Path) -> Result<Vec<String>> { if !repos_dir.exists() { return Ok(Vec::new()); } - let template = get_repo_url_template(); - let suffix = extract_repo_suffix(&template); - let mut locales = Vec::new(); + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + let mut datasets = Vec::new(); for entry in std::fs::read_dir(repos_dir)? { let entry = entry?; let path = entry.path(); - if path.is_dir() && Repository::open(&path).is_ok() { + // A symlink into the cache or a real clone — both count. `is_dir()` + // follows a symlink, so a cache symlink resolves correctly. + if path.is_dir() { if let Some(dir_name) = path.file_name().and_then(|n| n.to_str()) { - // Check for current format first, then old format for backward compatibility - if !suffix.is_empty() { - if let Some(locale) = dir_name.strip_suffix(&suffix) { - locales.push(locale.to_string()); - continue; - } + if let Some(short) = dir_name.strip_suffix(&suffix) { + datasets.push(short.to_string()); + continue; } - // Fallback to old format for backward compatibility - if let Some(locale) = dir_name.strip_suffix("-data-pipeline") { - locales.push(locale.to_string()); + // Legacy layout fallback. + if let Some(short) = dir_name.strip_suffix("-data-pipeline") { + datasets.push(short.to_string()); } } } } - - Ok(locales) + datasets.sort(); + Ok(datasets) } -/// Recursively remove a directory and all its contents -/// This is more robust than remove_dir_all on macOS +/// Recursively remove a directory and all its contents. +/// More robust than `remove_dir_all` on macOS.
fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { if !path.exists() { return Ok(()); } - if path.is_file() { - // Make file writable before removing + if path.is_file() || path.is_symlink() { let _ = std::fs::metadata(path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -759,14 +552,12 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { return std::fs::remove_file(path); } - // For directories, recursively remove contents first let entries: Vec<_> = std::fs::read_dir(path)?.collect(); for entry_result in entries { let entry = entry_result?; let entry_path = entry.path(); - // Make writable before trying to remove let _ = std::fs::metadata(&entry_path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -775,20 +566,16 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { }); if entry_path.is_dir() { - // Recursively remove subdirectory if remove_dir_all_robust(&entry_path).is_err() { - // If recursive removal fails, try a few more times for _ in 0..3 { std::thread::sleep(std::time::Duration::from_millis(100)); if remove_dir_all_robust(&entry_path).is_ok() { break; } } - // If still failing, try direct removal let _ = std::fs::remove_dir_all(&entry_path); } } else { - // Try to remove file multiple times let mut removed = false; for _ in 0..3 { if std::fs::remove_file(&entry_path).is_ok() { @@ -798,7 +585,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { std::thread::sleep(std::time::Duration::from_millis(50)); } if !removed { - // Last resort: try to make it writable again and remove let _ = std::fs::metadata(&entry_path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -810,7 +596,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } - // Make directory writable before removing let _ = std::fs::metadata(path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut 
perms = m.permissions(); @@ -818,8 +603,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { std::fs::set_permissions(path, perms) }); - // Now try to remove the directory itself - // Retry multiple times for macOS let mut last_error = None; for _ in 0..5 { match std::fs::remove_dir(path) { @@ -831,11 +614,9 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } - // Final attempt with remove_dir_all match std::fs::remove_dir_all(path) { Ok(_) => Ok(()), Err(e) => { - // Return the more specific error if available if let Some(prev_error) = last_error { Err(prev_error) } else { @@ -845,65 +626,58 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } -/// Delete a repository for a given locale -pub fn delete_repo(locale: &str, repos_dir: &Path) -> Result<()> { - let repo_name = build_repo_name(locale); - let target_dir = repos_dir.join(&repo_name); +/// Remove a dataset's clone from a project's `repos/` directory. +/// +/// This unlinks the project's reference (the symlink into the shared cache); +/// the cache entry itself is left intact, since other projects may use it. +pub fn delete_dataset(short_name: &str, repos_dir: &Path) -> Result<()> { + let target_dir = repos_dir.join(repo_dir_name(short_name)); - if !target_dir.exists() { - return Ok(()); // Repository doesn't exist, nothing to delete + if !target_dir.exists() && std::fs::symlink_metadata(&target_dir).is_err() { + return Ok(()); // Nothing to delete. } - // Try to open and close the repository first to release any locks - // This helps on macOS where git files might be locked + // A symlink into the cache: unlink it, leave the cache entry. + if let Ok(meta) = std::fs::symlink_metadata(&target_dir) { + if meta.file_type().is_symlink() { + return std::fs::remove_file(&target_dir).map_err(|e| { + Error::Config(format!( + "Failed to unlink dataset {}: {}", + short_name, e + )) + }); + } + } + + // A real directory (a pre-cache clone): remove it. 
if let Ok(repo) = Repository::open(&target_dir) { - // Try to close the index explicitly if possible - // The index file is often the one that gets locked - let git_dir = repo.path(); + let git_dir = repo.path().to_path_buf(); let index_path = git_dir.join("index"); - - // Force close the repository to release file handles drop(repo); - - // Give it a moment for file handles to be released std::thread::sleep(std::time::Duration::from_millis(100)); - - // Try to remove the index file explicitly if it exists - // This often helps on macOS if index_path.exists() { let _ = std::fs::remove_file(&index_path); } } - // Use robust removal that handles macOS edge cases if let Err(e) = remove_dir_all_robust(&target_dir) { - // If robust removal fails, try using shell command as fallback - // This is often more reliable on macOS for stubborn directories let output = std::process::Command::new("rm") .arg("-rf") .arg(&target_dir) .output(); - match output { - Ok(result) if result.status.success() => { - // Successfully removed via shell command - Ok(()) - } + Ok(result) if result.status.success() => Ok(()), Ok(result) => { - // Shell command failed, return original error with shell error info let shell_err = String::from_utf8_lossy(&result.stderr); Err(Error::Config(format!( - "Failed to delete repository {}: {} (shell fallback also failed: {})", - repo_name, e, shell_err - ))) - } - Err(shell_err) => { - // Couldn't execute shell command, return original error - Err(Error::Config(format!( - "Failed to delete repository {}: {} (shell fallback unavailable: {})", - repo_name, e, shell_err + "Failed to delete dataset {}: {} (shell fallback also failed: {})", + short_name, e, shell_err ))) } + Err(shell_err) => Err(Error::Config(format!( + "Failed to delete dataset {}: {} (shell fallback unavailable: {})", + short_name, e, shell_err + ))), } } else { Ok(()) diff --git a/actions/govbot/src/lib.rs b/actions/govbot/src/lib.rs index e2cd937d..a3087719 100644 --- 
a/actions/govbot/src/lib.rs +++ b/actions/govbot/src/lib.rs @@ -3,37 +3,43 @@ //! This library provides a reactive stream-based API for discovering, filtering, //! sorting, and processing JSON log files from pipeline repositories. +pub mod bluesky; +pub mod cache; pub mod config; pub mod embeddings; pub mod error; pub mod filter; pub mod git; -pub mod locale_generated; +pub mod lock; pub mod pipeline; pub mod processor; pub mod publish; +pub mod registry; pub mod rss; pub mod selectors; pub mod types; pub mod wizard; -pub use config::{Config, ConfigBuilder, JoinOption, SortOrder}; +pub use config::{ + Command_, Config, ConfigBuilder, JoinOption, Manifest, Publisher, PublisherKind, SortOrder, + Transform, +}; pub use embeddings::{ - hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, TagMatcher, + hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, }; pub use error::{Error, Result}; pub use filter::{FilterAlias, FilterManager, FilterResult, LogFilter}; -pub use locale::WorkingLocale; -pub use locale_generated as locale; +pub use lock::LockFile; pub use processor::PipelineProcessor; +pub use registry::{DatasetEntry, Registry, ResolvedDataset}; pub use types::{LogContent, LogEntry, Metadata, VoteEventResult}; /// Re-export commonly used types for convenience pub mod prelude { pub use crate::config::{Config, ConfigBuilder, JoinOption, SortOrder}; pub use crate::error::{Error, Result}; - pub use crate::locale::WorkingLocale; pub use crate::processor::PipelineProcessor; + pub use crate::registry::Registry; pub use crate::types::{LogContent, LogEntry, Metadata, VoteEventResult}; pub use futures::StreamExt; } diff --git a/actions/govbot/src/locale_generated.rs b/actions/govbot/src/locale_generated.rs deleted file mode 100644 index 6ca3b4b5..00000000 --- a/actions/govbot/src/locale_generated.rs +++ /dev/null @@ -1,302 +0,0 @@ -//! Auto-generated locale enum from pipeline-manager config.yml -//! 
This file is generated by src/bin/generate-locale-enum.rs -//! Do not edit manually - regenerate using: just generate - -/// Locale codes for working pipelines -#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)] -#[serde(rename_all = "lowercase")] -pub enum WorkingLocale { - All, - AK, - AL, - AR, - CA, - CO, - DE, - FL, - GA, - GU, - HI, - IA, - ID, - IL, - IN, - KS, - KY, - LA, - MA, - MD, - ME, - MI, - MN, - MO, - MP, - MS, - MT, - NC, - ND, - NE, - NH, - NJ, - NM, - NV, - NY, - OH, - OK, - OR, - PA, - PR, - RI, - SC, - SD, - TN, - Usa, - UT, - VI, - VT, - WA, - WI, - WV, - WY, -} - -impl WorkingLocale { - /// Get all working locales as a vector (excludes All variant) - pub fn all() -> Vec { - vec![ - WorkingLocale::AK, - WorkingLocale::AL, - WorkingLocale::AR, - WorkingLocale::CA, - WorkingLocale::CO, - WorkingLocale::DE, - WorkingLocale::FL, - WorkingLocale::GA, - WorkingLocale::GU, - WorkingLocale::HI, - WorkingLocale::IA, - WorkingLocale::ID, - WorkingLocale::IL, - WorkingLocale::IN, - WorkingLocale::KS, - WorkingLocale::KY, - WorkingLocale::LA, - WorkingLocale::MA, - WorkingLocale::MD, - WorkingLocale::ME, - WorkingLocale::MI, - WorkingLocale::MN, - WorkingLocale::MO, - WorkingLocale::MP, - WorkingLocale::MS, - WorkingLocale::MT, - WorkingLocale::NC, - WorkingLocale::ND, - WorkingLocale::NE, - WorkingLocale::NH, - WorkingLocale::NJ, - WorkingLocale::NM, - WorkingLocale::NV, - WorkingLocale::NY, - WorkingLocale::OH, - WorkingLocale::OK, - WorkingLocale::OR, - WorkingLocale::PA, - WorkingLocale::PR, - WorkingLocale::RI, - WorkingLocale::SC, - WorkingLocale::SD, - WorkingLocale::TN, - WorkingLocale::Usa, - WorkingLocale::UT, - WorkingLocale::VI, - WorkingLocale::VT, - WorkingLocale::WA, - WorkingLocale::WI, - WorkingLocale::WV, - WorkingLocale::WY, - ] - } - - /// Get the locale code as a string - pub fn as_str(&self) -> &'static str { - match self { - WorkingLocale::All => "all", - WorkingLocale::AK => "ak", - 
WorkingLocale::AL => "al", - WorkingLocale::AR => "ar", - WorkingLocale::CA => "ca", - WorkingLocale::CO => "co", - WorkingLocale::DE => "de", - WorkingLocale::FL => "fl", - WorkingLocale::GA => "ga", - WorkingLocale::GU => "gu", - WorkingLocale::HI => "hi", - WorkingLocale::IA => "ia", - WorkingLocale::ID => "id", - WorkingLocale::IL => "il", - WorkingLocale::IN => "in", - WorkingLocale::KS => "ks", - WorkingLocale::KY => "ky", - WorkingLocale::LA => "la", - WorkingLocale::MA => "ma", - WorkingLocale::MD => "md", - WorkingLocale::ME => "me", - WorkingLocale::MI => "mi", - WorkingLocale::MN => "mn", - WorkingLocale::MO => "mo", - WorkingLocale::MP => "mp", - WorkingLocale::MS => "ms", - WorkingLocale::MT => "mt", - WorkingLocale::NC => "nc", - WorkingLocale::ND => "nd", - WorkingLocale::NE => "ne", - WorkingLocale::NH => "nh", - WorkingLocale::NJ => "nj", - WorkingLocale::NM => "nm", - WorkingLocale::NV => "nv", - WorkingLocale::NY => "ny", - WorkingLocale::OH => "oh", - WorkingLocale::OK => "ok", - WorkingLocale::OR => "or", - WorkingLocale::PA => "pa", - WorkingLocale::PR => "pr", - WorkingLocale::RI => "ri", - WorkingLocale::SC => "sc", - WorkingLocale::SD => "sd", - WorkingLocale::TN => "tn", - WorkingLocale::Usa => "usa", - WorkingLocale::UT => "ut", - WorkingLocale::VI => "vi", - WorkingLocale::VT => "vt", - WorkingLocale::WA => "wa", - WorkingLocale::WI => "wi", - WorkingLocale::WV => "wv", - WorkingLocale::WY => "wy", - } - } - - /// Get the locale code in lowercase - pub fn as_lowercase(&self) -> &'static str { - match self { - WorkingLocale::All => "all", - WorkingLocale::AK => "ak", - WorkingLocale::AL => "al", - WorkingLocale::AR => "ar", - WorkingLocale::CA => "ca", - WorkingLocale::CO => "co", - WorkingLocale::DE => "de", - WorkingLocale::FL => "fl", - WorkingLocale::GA => "ga", - WorkingLocale::GU => "gu", - WorkingLocale::HI => "hi", - WorkingLocale::IA => "ia", - WorkingLocale::ID => "id", - WorkingLocale::IL => "il", - WorkingLocale::IN => "in", - 
WorkingLocale::KS => "ks", - WorkingLocale::KY => "ky", - WorkingLocale::LA => "la", - WorkingLocale::MA => "ma", - WorkingLocale::MD => "md", - WorkingLocale::ME => "me", - WorkingLocale::MI => "mi", - WorkingLocale::MN => "mn", - WorkingLocale::MO => "mo", - WorkingLocale::MP => "mp", - WorkingLocale::MS => "ms", - WorkingLocale::MT => "mt", - WorkingLocale::NC => "nc", - WorkingLocale::ND => "nd", - WorkingLocale::NE => "ne", - WorkingLocale::NH => "nh", - WorkingLocale::NJ => "nj", - WorkingLocale::NM => "nm", - WorkingLocale::NV => "nv", - WorkingLocale::NY => "ny", - WorkingLocale::OH => "oh", - WorkingLocale::OK => "ok", - WorkingLocale::OR => "or", - WorkingLocale::PA => "pa", - WorkingLocale::PR => "pr", - WorkingLocale::RI => "ri", - WorkingLocale::SC => "sc", - WorkingLocale::SD => "sd", - WorkingLocale::TN => "tn", - WorkingLocale::Usa => "usa", - WorkingLocale::UT => "ut", - WorkingLocale::VI => "vi", - WorkingLocale::VT => "vt", - WorkingLocale::WA => "wa", - WorkingLocale::WI => "wi", - WorkingLocale::WV => "wv", - WorkingLocale::WY => "wy", - } - } -} - -impl From<&str> for WorkingLocale { - fn from(s: &str) -> Self { - match s.to_lowercase().as_str() { - "all" => WorkingLocale::All, - "ak" => WorkingLocale::AK, - "al" => WorkingLocale::AL, - "ar" => WorkingLocale::AR, - "ca" => WorkingLocale::CA, - "co" => WorkingLocale::CO, - "de" => WorkingLocale::DE, - "fl" => WorkingLocale::FL, - "ga" => WorkingLocale::GA, - "gu" => WorkingLocale::GU, - "hi" => WorkingLocale::HI, - "ia" => WorkingLocale::IA, - "id" => WorkingLocale::ID, - "il" => WorkingLocale::IL, - "in" => WorkingLocale::IN, - "ks" => WorkingLocale::KS, - "ky" => WorkingLocale::KY, - "la" => WorkingLocale::LA, - "ma" => WorkingLocale::MA, - "md" => WorkingLocale::MD, - "me" => WorkingLocale::ME, - "mi" => WorkingLocale::MI, - "mn" => WorkingLocale::MN, - "mo" => WorkingLocale::MO, - "mp" => WorkingLocale::MP, - "ms" => WorkingLocale::MS, - "mt" => WorkingLocale::MT, - "nc" => 
WorkingLocale::NC, - "nd" => WorkingLocale::ND, - "ne" => WorkingLocale::NE, - "nh" => WorkingLocale::NH, - "nj" => WorkingLocale::NJ, - "nm" => WorkingLocale::NM, - "nv" => WorkingLocale::NV, - "ny" => WorkingLocale::NY, - "oh" => WorkingLocale::OH, - "ok" => WorkingLocale::OK, - "or" => WorkingLocale::OR, - "pa" => WorkingLocale::PA, - "pr" => WorkingLocale::PR, - "ri" => WorkingLocale::RI, - "sc" => WorkingLocale::SC, - "sd" => WorkingLocale::SD, - "tn" => WorkingLocale::TN, - "usa" => WorkingLocale::Usa, - "ut" => WorkingLocale::UT, - "vi" => WorkingLocale::VI, - "vt" => WorkingLocale::VT, - "wa" => WorkingLocale::WA, - "wi" => WorkingLocale::WI, - "wv" => WorkingLocale::WV, - "wy" => WorkingLocale::WY, - _ => panic!("Invalid working locale: {}", s), - } - } -} - -impl std::fmt::Display for WorkingLocale { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - write!(f, "{}", self.as_lowercase()) - } -} diff --git a/actions/govbot/src/lock.rs b/actions/govbot/src/lock.rs new file mode 100644 index 00000000..2307a4e4 --- /dev/null +++ b/actions/govbot/src/lock.rs @@ -0,0 +1,175 @@ +//! `govbot.lock` — the dataset lockfile, for reproducible runs. +//! +//! `govbot.yml` declares *which* datasets a project wants; `govbot.lock` +//! records the *exact git commit* each resolved to, so a run on another +//! machine (or a re-run weeks later) processes byte-identical data. It is the +//! `package-lock.json` / `Cargo.lock` of govbot. +//! +//! ## When it is written +//! +//! `govbot pull` and `govbot run` write/update `govbot.lock` next to +//! `govbot.yml` after resolving and fetching datasets — recording each +//! dataset's canonical id, git URL, channel, the cloned commit SHA, the +//! content-addressed cache key, and the resolve timestamp. +//! +//! ## Format +//! +//! `govbot.lock` is JSON (stable, diff-friendly, no YAML ambiguity): +//! +//! ```json +//! { +//! "lockfile_version": 1, +//! "generated_at": "2026-05-22T12:00:00Z", +//! 
"datasets": { +//! "us-legislation/wy": { +//! "git_url": "https://github.com/chn-openstates-files/wy-legislation.git", +//! "channel": null, +//! "commit": "a1b2c3d4e5f6...", +//! "cache_key": "wy-legislation-3f9a1c20e5b4", +//! "resolved_at": "2026-05-22T12:00:00Z" +//! } +//! } +//! } +//! ``` +//! +//! Keys are canonical `namespace/name` ids; the map is sorted for a stable +//! diff. The lockfile SHOULD be committed to a project's git repo. + +use crate::error::{Error, Result}; +use serde::{Deserialize, Serialize}; +use std::collections::BTreeMap; +use std::path::{Path, PathBuf}; + +/// The current lockfile format version. +pub const LOCKFILE_VERSION: u32 = 1; + +/// The lockfile filename, written next to `govbot.yml`. +pub const LOCKFILE_NAME: &str = "govbot.lock"; + +/// One pinned dataset. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct LockedDataset { + /// The git URL the dataset was cloned from. + pub git_url: String, + /// The requested channel (branch), if any. + pub channel: Option<String>, + /// The exact commit SHA the dataset is pinned to. + pub commit: String, + /// The shared-cache key the dataset's clone lives under. + pub cache_key: String, + /// When this dataset was last resolved (RFC 3339 UTC). + pub resolved_at: String, +} + +/// The whole `govbot.lock` file. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct LockFile { + /// Lockfile format version. + pub lockfile_version: u32, + /// When the lockfile was last written (RFC 3339 UTC). + pub generated_at: String, + /// Canonical `namespace/name` → pin. Sorted for a stable diff. + pub datasets: BTreeMap<String, LockedDataset>, +} + +impl Default for LockFile { + fn default() -> Self { + LockFile { + lockfile_version: LOCKFILE_VERSION, + generated_at: now_rfc3339(), + datasets: BTreeMap::new(), + } + } +} + +impl LockFile { + /// The lockfile path for a project (the directory holding `govbot.yml`).
+ pub fn path_for(project_dir: &Path) -> PathBuf { + project_dir.join(LOCKFILE_NAME) + } + + /// Load an existing lockfile, or an empty one if none exists yet. + pub fn load_or_default(project_dir: &Path) -> Result<LockFile> { + let path = LockFile::path_for(project_dir); + if !path.is_file() { + return Ok(LockFile::default()); + } + let contents = std::fs::read_to_string(&path).map_err(|e| { + Error::Config(format!("Failed to read {}: {}", path.display(), e)) + })?; + serde_json::from_str(&contents) + .map_err(|e| Error::Config(format!("Invalid {}: {}", path.display(), e))) + } + + /// Record (or overwrite) a dataset's pin. + pub fn pin( + &mut self, + canonical_id: &str, + git_url: &str, + channel: Option<&str>, + commit: &str, + cache_key: &str, + ) { + self.datasets.insert( + canonical_id.to_string(), + LockedDataset { + git_url: git_url.to_string(), + channel: channel.map(|c| c.to_string()), + commit: commit.to_string(), + cache_key: cache_key.to_string(), + resolved_at: now_rfc3339(), + }, + ); + } + + /// Write the lockfile to `/govbot.lock`, pretty-printed, + /// refreshing `generated_at`. + pub fn save(&mut self, project_dir: &Path) -> Result<()> { + self.lockfile_version = LOCKFILE_VERSION; + self.generated_at = now_rfc3339(); + let path = LockFile::path_for(project_dir); + let json = serde_json::to_string_pretty(self) + .map_err(|e| Error::Config(format!("Failed to serialize lockfile: {}", e)))?; + std::fs::write(&path, format!("{}\n", json)).map_err(|e| { + Error::Config(format!("Failed to write {}: {}", path.display(), e)) + })?; + Ok(()) + } +} + +/// The current time as an RFC 3339 UTC string.
+fn now_rfc3339() -> String { + chrono::Utc::now().to_rfc3339_opts(chrono::SecondsFormat::Secs, true) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn round_trips_through_disk() { + let dir = tempfile::tempdir().unwrap(); + let mut lock = LockFile::default(); + lock.pin( + "us-legislation/wy", + "https://example.com/wy.git", + None, + "abc123", + "wy-legislation-deadbeef", + ); + lock.save(dir.path()).unwrap(); + + let reloaded = LockFile::load_or_default(dir.path()).unwrap(); + assert_eq!(reloaded.lockfile_version, LOCKFILE_VERSION); + let wy = reloaded.datasets.get("us-legislation/wy").unwrap(); + assert_eq!(wy.commit, "abc123"); + assert_eq!(wy.cache_key, "wy-legislation-deadbeef"); + } + + #[test] + fn missing_lockfile_is_empty_default() { + let dir = tempfile::tempdir().unwrap(); + let lock = LockFile::load_or_default(dir.path()).unwrap(); + assert!(lock.datasets.is_empty()); + } +} diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 0b59ab15..2f974413 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -1,9 +1,10 @@ use clap::{Parser, Subcommand}; use govbot::git; -use govbot::{TagMatcher, hash_text, TagFile, TagFileMetadata, BillTagResult}; +use govbot::lock::LockFile; +use govbot::registry::Registry; +use govbot::{hash_text, TagFile, TagFileMetadata, BillTagResult}; use govbot::selectors::ocd_files_select_default; -use govbot::publish::{load_config, get_repos_from_config, filter_by_tags, deduplicate_entries, sort_by_timestamp}; -use govbot::rss; +use govbot::publish::{load_manifest, filter_by_tags, deduplicate_entries, sort_by_timestamp}; use futures::StreamExt; use futures::stream; use std::io::{self, Write, BufRead, BufReader}; @@ -40,12 +41,26 @@ fn write_json_line(line: &str) -> io::Result<()> { #[derive(Debug, Clone)] struct CloneResult { locale: String, - result: String, // "cloned", "pulled", "no_updates", "failed" + result: String, // emoji, or "failed" position: String, // "1/37" size: 
Option, local_size: Option, final_size: Option, error: Option, + /// On success: the canonical registry id, git URL, channel, resolved + /// commit SHA, and cache key — recorded into `govbot.lock`. + pin: Option, +} + +/// A resolved dataset pin, captured during a successful clone/pull for the +/// lockfile. +#[derive(Debug, Clone)] +struct DatasetPin { + canonical_id: String, + git_url: String, + channel: Option, + commit: String, + cache_key: String, } /// Type-safe, functional reactive processor for pipeline log files @@ -60,11 +75,11 @@ struct Args { #[derive(Subcommand, Debug)] enum Command { - /// Clone or pull data pipeline repositories (default: updates existing repos) - /// Clones if repository doesn't exist, pulls if it does - /// Use "govbot clone all" to clone all repos, or "govbot clone " for specific repos - Clone { - /// Repository names to clone/pull (e.g., usa, il, ca, or "all" for all repos). If not specified, updates existing repos. + /// Pull dataset repositories (default: updates existing datasets) + /// Clones if the dataset repository doesn't exist, pulls if it does + /// Use "govbot pull all" to pull all datasets, or "govbot pull " for specific ones + Pull { + /// Dataset names to pull (e.g., usa, il, ca, or "all" for all datasets). If not specified, updates existing datasets. 
#[arg(num_args = 0..)] repos: Vec, @@ -84,14 +99,14 @@ enum Command { #[arg(long)] verbose: bool, - /// List available repos instead of cloning/pulling + /// List available datasets instead of pulling #[arg(long)] list: bool, }, - /// Process and display pipeline log files - Logs { - /// Repos to output (default: `all`) `--repos="il,ca"` + /// Stream dataset records as JSON Lines (the govbot stream-protocol source) + Source { + /// Datasets to output (default: `all`) `--repos="il,ca"` #[arg(long, num_args = 0..)] repos: Vec, @@ -103,8 +118,11 @@ enum Command { #[arg(long, default_value = "bill,tags")] join: String, - /// Select/transform fields (default: `default`) - applies extract_text_from_json transformation - #[arg(long, default_value = "default", value_parser = ["default"])] + /// Select/transform fields (default: `default`). `docs` emits one + /// `{"id","text","kind":"docs"}` JSON object per entry carrying the + /// FULL bill text — the stream-protocol document `fastclass classify -` + /// consumes. + #[arg(long, default_value = "default", value_parser = ["default", "docs"])] select: String, /// Filter log entries based on per-repo AI generated filters (default: `default`) options: `default` | `none` @@ -165,50 +183,103 @@ enum Command { /// Downloads and installs the latest nightly build from GitHub releases Update, - /// Build RSS feed and HTML index from govbot.yml configuration - /// Generates a combined RSS feed and HTML index from logs filtered by tags in govbot.yml - Build { - /// Specific tags to include in feed (default: all tags from govbot.yml) - #[arg(long, num_args = 0..)] - tags: Vec, - - /// Limit number of entries per feed (default: 100, use "none" for all entries) + /// Run a publisher: emit feeds/indexes/dumps from a govbot.yml publisher + /// Reads the named publishers from govbot.yml `publish:` and emits their artifacts. 
+ Publish { + /// Publisher name(s) from govbot.yml `publish:` (default: every publisher) + #[arg(long = "publisher", num_args = 0..)] + publishers: Vec, + + /// Limit number of entries per artifact (default: 100, use "none" for all entries) #[arg(long)] limit: Option, - - /// Output directory for RSS feed and HTML (default: from govbot.yml build.output_dir, or "docs") + + /// Output directory override (default: from the publisher's output_dir, or "docs") #[arg(long)] output_dir: Option, - - /// Output filename for RSS feed (default: from govbot.yml build.output_file, or "feed.xml") + + /// Output filename override (default: from the publisher's output_file, or "feed.xml") #[arg(long)] output_file: Option, - + + /// Render but do not emit. The `bluesky` publisher honours this by + /// printing the posts it would send and touching no network/ledger. + #[arg(long = "dry-run")] + dry_run: bool, + /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] govbot_dir: Option, }, - /// Tag bills using semantic or built-in similarity based on govbot.yml in the current directory. - /// Reads JSON lines from stdin (from `govbot logs`), processes entries with bill identifiers, - /// and writes per-tag files under the directory containing govbot.yml. - /// By default, acts as a filter: only outputs lines that match tags. - /// If a tag name is provided, only processes and outputs lines matching that specific tag. - Tag { - /// Optional tag name to filter to a specific tag (e.g., "lgbtq", "budget") + /// Persist fastclass classification results into the dataset as tag files. + /// Reads `fastclass classify` result JSON from stdin — the apply sink of + /// `govbot source --select docs | fastclass classify - | govbot apply` — + /// and writes per-tag `.tag.json` files under each bill's session + /// directory, the files `govbot publish` turns into feeds. 
Classification + /// itself is done by fastclass; `govbot apply` only stores the results. + Apply { + /// Optional tag name: persist only this tag's matches tag_name: Option, /// Output directory (defaults to the directory containing govbot.yml) #[arg(long = "output-dir")] output_dir: Option, + /// Overwrite a bill's tag entry even if it is already present + #[arg(long)] + overwrite: bool, + }, + + /// Run the full govbot pipeline against the current directory's `govbot.yml`: + /// pull/update → `source --select docs | fastclass classify - | apply` → publish. + /// Equivalent to running `govbot` with no arguments. + Run { + /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) + #[arg(long = "govbot-dir")] + govbot_dir: Option, + }, + + /// Scaffold a new govbot.yml in the current directory (the setup wizard). + /// Interactive in a TTY; writes sensible defaults when non-interactive. + Init, + + /// Add one or more datasets to the project's `govbot.yml` `datasets:` list. + /// Each id is validated against the registry before it is added. + Add { + /// Dataset identifiers to add (e.g. `wy`, `il`, `us-legislation/ca`). + #[arg(num_args = 1..)] + datasets: Vec, + }, + + /// Remove one or more datasets from the project's `govbot.yml`. + Remove { + /// Dataset identifiers to remove from `datasets:`. + #[arg(num_args = 1..)] + datasets: Vec, + }, + + /// List datasets — the project's manifest datasets and the ones cached + /// locally. With no manifest, lists every dataset in the registry. + Ls { /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] govbot_dir: Option, - /// Force re-tagging even if bill already exists in tag files - #[arg(long)] - overwrite: bool, + /// Emit machine-readable JSON instead of a human table. + #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, + }, + + /// Search the dataset registry. A blank query lists every dataset. 
+ Search { + /// Query matched against dataset ids and names (case-insensitive). + #[arg(num_args = 0..)] + query: Vec, + + /// Emit machine-readable JSON instead of a human table. + #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, }, } @@ -227,65 +298,92 @@ fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { } } -/// Process a single locale clone/pull operation -fn process_single_locale( - locale: &str, +/// The directory holding the project's `govbot.yml` (and where `govbot.lock` +/// is written) — the current working directory. +fn project_dir() -> anyhow::Result { + std::env::current_dir().map_err(|e| anyhow::anyhow!("Could not determine cwd: {}", e)) +} + +/// Load the active dataset registry for the current project. +fn load_registry() -> anyhow::Result { + let dir = project_dir()?; + Registry::load(&dir).map_err(|e| anyhow::anyhow!("{}", e)) +} + +/// Process a single dataset clone/pull operation. +/// +/// Resolution is registry-driven: the dataset is cloned once into the shared +/// `~/.govbot/cache/` and linked into the project's `repos/`. The resolved +/// commit SHA is captured for `govbot.lock`. 
+fn process_single_dataset( + dataset: &govbot::ResolvedDataset, repos_dir: &PathBuf, token_str: Option<&str>, verbose: bool, ) -> CloneResult { - let repo_name = git::build_repo_name(locale); - let target_dir = repos_dir.join(&repo_name); - + let short = dataset.short_name().to_string(); + let target_dir = repos_dir.join(git::repo_dir_name(&short)); + let local_size = if target_dir.exists() { git::get_directory_size(&target_dir).unwrap_or(0) } else { 0 }; - - match git::clone_or_pull_repo_quiet(locale, repos_dir, token_str, !verbose) { - Ok(action) => { + + match git::clone_or_pull_dataset(dataset, repos_dir, token_str, !verbose) { + Ok(outcome) => { let final_size = if target_dir.exists() { git::get_directory_size(&target_dir).unwrap_or(0) } else { 0 }; - - let result = match action { + + let result = match outcome.action { "clone" => "🆕", "pulled" => "⬇️", "no_updates" => "✅", "recloned" => "🔄", _ => "processed", }; - + let mut clone_result = CloneResult { - locale: locale.to_string(), + locale: short.clone(), result: result.to_string(), - position: String::new(), // Will be set by caller + position: String::new(), size: None, local_size: None, final_size: None, error: None, + pin: Some(DatasetPin { + canonical_id: dataset.id.clone(), + git_url: dataset.entry.git_url.clone(), + channel: dataset.channel.clone(), + commit: outcome.commit.clone(), + cache_key: outcome.cache_key.clone(), + }), }; - - if action == "clone" || action == "recloned" || action == "no_updates" { + + if outcome.action == "clone" + || outcome.action == "recloned" + || outcome.action == "no_updates" + { clone_result.size = Some(git::format_size(final_size)); } else { clone_result.local_size = Some(git::format_size(local_size)); clone_result.final_size = Some(git::format_size(final_size)); } - + clone_result } Err(e) => CloneResult { - locale: locale.to_string(), + locale: short, result: "failed".to_string(), - position: String::new(), // Will be set by caller + position: String::new(), size: 
None, local_size: None, final_size: None, error: Some(e.to_string()), + pin: None, }, } } @@ -323,19 +421,19 @@ fn print_result(result: &CloneResult) { /// Perform clone/pull operations and print results as they complete async fn perform_clone_operations( - repos_to_clone: Vec, + datasets: Vec, repos_dir: PathBuf, token_str: Option<&str>, num_jobs: usize, verbose: bool, ) -> anyhow::Result> { - let total = repos_to_clone.len(); + let total = datasets.len(); let mut all_results = Vec::new(); - + if total == 1 || num_jobs == 1 { // Sequential clone/pull - print as we go - for (idx, locale) in repos_to_clone.iter().enumerate() { - let mut result = process_single_locale(locale, &repos_dir, token_str, verbose); + for (idx, dataset) in datasets.iter().enumerate() { + let mut result = process_single_dataset(dataset, &repos_dir, token_str, verbose); result.position = format!("{}/{}", idx + 1, total); print_result(&result); all_results.push(result); @@ -344,18 +442,17 @@ async fn perform_clone_operations( // Parallel clone/pull - print as results come in use std::sync::{Arc, Mutex}; let completed = Arc::new(Mutex::new(0usize)); - - let clone_futures = stream::iter(repos_to_clone.iter()) - .map(|locale| { - let locale = locale.clone(); + + let clone_futures = stream::iter(datasets.into_iter()) + .map(|dataset| { let repos_dir = repos_dir.clone(); let token = token_str.map(|s| s.to_string()); let completed = completed.clone(); let total = total; let verbose_flag = verbose; - + tokio::task::spawn_blocking(move || { - let mut result = process_single_locale(&locale, &repos_dir, token.as_deref(), verbose_flag); + let mut result = process_single_dataset(&dataset, &repos_dir, token.as_deref(), verbose_flag); let mut count = completed.lock().unwrap(); *count += 1; result.position = format!("{}/{}", *count, total); @@ -365,7 +462,7 @@ async fn perform_clone_operations( .buffer_unordered(num_jobs); let mut stream = clone_futures; - + while let Some(result) = stream.next().await { 
match result { Ok(data) => { @@ -381,6 +478,7 @@ async fn perform_clone_operations( local_size: None, final_size: None, error: Some(format!("Task error: {}", e)), + pin: None, }; print_result(&error_result); all_results.push(error_result); @@ -391,13 +489,45 @@ async fn perform_clone_operations( let _ = std::io::stderr().flush(); } } - + Ok(all_results) } +/// Write/update `govbot.lock` from a batch of successful clone/pull results. +/// Non-fatal: a lockfile-write failure prints a warning but does not abort. +fn update_lockfile(project_dir: &std::path::Path, results: &[CloneResult]) { + let mut lock = match LockFile::load_or_default(project_dir) { + Ok(l) => l, + Err(e) => { + eprintln!("⚠️ Could not read govbot.lock ({}); skipping pin update", e); + return; + } + }; + let mut pinned = 0usize; + for r in results { + if let Some(pin) = &r.pin { + lock.pin( + &pin.canonical_id, + &pin.git_url, + pin.channel.as_deref(), + &pin.commit, + &pin.cache_key, + ); + pinned += 1; + } + } + if pinned == 0 { + return; + } + match lock.save(project_dir) { + Ok(()) => eprintln!("🔒 Updated govbot.lock ({} datasets pinned)", pinned), + Err(e) => eprintln!("⚠️ Could not write govbot.lock: {}", e), + } +} + -async fn run_clone_command(cmd: Command) -> anyhow::Result<()> { - let Command::Clone { +async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { + let Command::Pull { repos, govbot_dir, token, @@ -408,106 +538,80 @@ async fn run_clone_command(cmd: Command) -> anyhow::Result<()> { unreachable!() }; + let registry = load_registry()?; + // If --list flag is set, show the list if list { - println!("Available repos:"); - let all_locales = govbot::locale::WorkingLocale::all(); - for locale in all_locales { - println!(" {}", locale.as_lowercase()); + println!("Available datasets:"); + for d in registry.all() { + println!(" {}", d.short_name()); } - println!(" all (clone all repos)"); + println!(" all (pull every dataset)"); return Ok(()); } let repos_dir = 
get_govbot_dir(govbot_dir)?; - + let proj_dir = project_dir()?; + // Get token from argument or environment variable let env_token = std::env::var("TOKEN").ok(); let token_str = token.as_deref().or(env_token.as_deref()); - + // Get parallelization setting let num_jobs = parallel .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) .unwrap_or(4); - // Parse repos and handle "all" - let mut repos_to_clone = Vec::new(); - - if repos.is_empty() { - // No repos specified: find existing repos to update - // Check all known locales to see which repos exist - let all_locales = govbot::locale::WorkingLocale::all(); - for locale in all_locales { - let locale_str = locale.as_lowercase(); - let repo_name = git::build_repo_name(&locale_str); - let repo_path = repos_dir.join(&repo_name); - - // Check if this is a git repository - if repo_path.exists() && repo_path.join(".git").exists() { - repos_to_clone.push(locale_str.to_string()); - } - } - - if repos_to_clone.is_empty() { - eprintln!("No repos downloaded yet in this directory"); - eprintln!("to download all gov data, do `govbot clone all`. future syncs are just `govbot clone`"); + // Resolve which datasets to pull. + let datasets_to_pull: Vec = if repos.is_empty() { + // No datasets specified: update whatever is already cloned locally. + // A locally-present dataset that is no longer in the registry is + // skipped with a warning rather than aborting the whole update. + let local = git::get_local_datasets(&repos_dir).unwrap_or_default(); + if local.is_empty() { + eprintln!("No datasets downloaded yet in this directory"); + eprintln!("to download all gov data, do `govbot pull all`. 
future syncs are just `govbot pull`"); return Ok(()); } - - // Create directory if it doesn't exist (needed for the clone operations) - std::fs::create_dir_all(&repos_dir)?; - } else { - // Create directory if it doesn't exist (needed for the clone operations) std::fs::create_dir_all(&repos_dir)?; - - // Parse specified repos - for repo in repos { - let repo = repo.trim().to_lowercase(); - if repo.is_empty() { - continue; - } - - if repo == "all" { - // Add all working locales - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - repos_to_clone.push(loc.as_lowercase().to_string()); - } - } else { - // Validate locale - let _ = govbot::locale::WorkingLocale::from(repo.as_str()); - repos_to_clone.push(repo); + let mut resolved = Vec::new(); + for short in &local { + match registry.resolve(short) { + Ok(d) => resolved.push(d), + Err(_) => eprintln!("⚠️ Skipping '{}' — not in the registry", short), } } - } + resolved + } else { + std::fs::create_dir_all(&repos_dir)?; + registry + .resolve_all(&repos) + .map_err(|e| anyhow::anyhow!("{}", e))? + }; - if repos_to_clone.is_empty() { + if datasets_to_pull.is_empty() { return Ok(()); -} + } // Print initial message with count - eprintln!("🔁 Syncing {} repos\n", repos_to_clone.len()); + eprintln!("🔁 Syncing {} datasets\n", datasets_to_pull.len()); // Perform clone operations and print results as they complete - let results = perform_clone_operations( - repos_to_clone, - repos_dir, - token_str, - num_jobs, - verbose, - ).await?; - + let results = + perform_clone_operations(datasets_to_pull, repos_dir, token_str, num_jobs, verbose).await?; + + // Pin resolved SHAs into govbot.lock for reproducibility. 
+ update_lockfile(&proj_dir, &results); + // Show summary - let errors: Vec<_> = results.iter() - .filter(|r| r.result == "failed") - .collect(); - + let errors: Vec<_> = results.iter().filter(|r| r.result == "failed").collect(); + if !errors.is_empty() { eprintln!("\n❌ Errors occurred: {}/{}", errors.len(), results.len()); } else if !results.is_empty() { - eprintln!("\n✅ Successfully processed all {} repos!", results.len()); + eprintln!("\n✅ Successfully processed all {} datasets!", results.len()); } - + Ok(()) } @@ -537,30 +641,31 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } let repos_dir = get_govbot_dir(govbot_dir)?; - + // Get parallelization setting let num_jobs = parallel .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) .unwrap_or(4); - // Parse locales and handle "all" + // Parse datasets and handle "all". `all` expands to whatever is cloned + // locally — there is nothing to delete that is not on disk. let mut locales_to_delete = Vec::new(); for locale in locales { let locale = locale.trim().to_lowercase(); if locale.is_empty() { continue; } - + if locale == "all" { - // Add all working locales - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - locales_to_delete.push(loc.as_lowercase().to_string()); + for short in git::get_local_datasets(&repos_dir).unwrap_or_default() { + locales_to_delete.push(short); } } else { - // Validate locale - let _ = govbot::locale::WorkingLocale::from(locale.as_str()); - locales_to_delete.push(locale); + // A dataset identifier may be namespaced; delete keys on the short + // (slash-free) name the clone directory uses. 
+ let short = locale.rsplit('/').next().unwrap_or(&locale).to_string(); + let short = short.split('@').next().unwrap_or(&short).to_string(); + locales_to_delete.push(short); } } @@ -579,15 +684,15 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { if total == 1 || num_jobs == 1 { // Sequential delete for (idx, locale) in locales_to_delete.iter().enumerate() { - let repo_name = format!("{}-data-pipeline", locale); + let repo_name = git::repo_dir_name(locale); let target_dir = repos_dir.join(&repo_name); - let existed = target_dir.exists(); - + let existed = target_dir.exists() || std::fs::symlink_metadata(&target_dir).is_ok(); + if verbose { eprintln!("[{}/{}] Deleting {}...", idx + 1, total, locale); } - - match git::delete_repo(locale, &repos_dir) { + + match git::delete_dataset(locale, &repos_dir) { Ok(_) => { if existed { eprintln!("{:<4} deleted", locale); @@ -618,18 +723,18 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let verbose_flag = verbose; tokio::task::spawn_blocking(move || { - let repo_name = format!("{}-data-pipeline", locale); + let repo_name = git::repo_dir_name(&locale); let target_dir = repos_dir.join(&repo_name); - + if verbose_flag { let d = deleted.lock().unwrap(); let f = failed.lock().unwrap(); let current = *d + *f + 1; eprintln!("[{}/{}] Deleting {}...", current, total, locale); } - - let existed = target_dir.exists(); - match git::delete_repo(&locale, &repos_dir) { + + let existed = target_dir.exists() || std::fs::symlink_metadata(&target_dir).is_ok(); + match git::delete_dataset(&locale, &repos_dir) { Ok(_) => { if existed { let mut d = deleted.lock().unwrap(); @@ -683,8 +788,35 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { Ok(()) } -async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { - let Command::Logs { +/// Collapse a fully-joined `govbot source` entry into the +/// `{"id","text","kind":"docs"}` document the govbot stream protocol defines +/// 
(`STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` consumes. +/// +/// `id` is the bill's dataset-relative directory path (derived from +/// `sources.log` by dropping the `/logs/.json` tail), so a classified +/// result can be routed back to the right place when `govbot apply` writes it. +/// `text` is the **full** bill text assembled from `metadata.json` (not just +/// titles) — the `docs` projection joins the complete bill so this is whole. +fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { + let id = entry + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + .and_then(|log_path| log_path.split("/logs/").next()) + .map(|s| s.to_string()) + .or_else(|| { + entry + .get("log") + .and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))) + .and_then(|v| v.as_str()) + .map(|s| s.to_string()) + }) + .unwrap_or_default(); + serde_json::json!({ "id": id, "text": ocd_files_select_default(entry), "kind": "docs" }) +} + +async fn run_source_command(cmd: Command) -> anyhow::Result<()> { + let Command::Source { govbot_dir, repos, sort: _sort, @@ -734,32 +866,29 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { repo_list.push("all".to_string()); } - // Expand "all" to existing repos in the directory, or convert locale names to repo names + // Expand "all" to the datasets cloned in the directory, or map dataset + // identifiers to their on-disk repo directory names. let mut repos_to_process = Vec::new(); for locale in repo_list { let locale = locale.trim().to_lowercase(); if locale.is_empty() { continue; } - + if locale == "all" { - // Find all existing repos in the directory + // Every dataset cloned locally — registry membership is not + // required here, only on-disk presence. 
if git_dir.exists() { - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - let locale_str = loc.as_lowercase(); - let repo_name = git::build_repo_name(&locale_str); - let repo_path = git_dir.join(&repo_name); - - // Only add repos that actually exist (for logs, we don't need .git, just the directory) - if repo_path.exists() && repo_path.is_dir() { - repos_to_process.push(repo_name); - } + for short in git::get_local_datasets(&git_dir).unwrap_or_default() { + repos_to_process.push(git::repo_dir_name(&short)); } } } else { - // Convert locale name to repo name using build_repo_name - repos_to_process.push(git::build_repo_name(&locale)); + // A dataset identifier may be namespaced; the clone directory is + // keyed on the short (slash-free) name. + let short = locale.rsplit('/').next().unwrap_or(&locale); + let short = short.split('@').next().unwrap_or(short); + repos_to_process.push(git::repo_dir_name(short)); } } @@ -771,8 +900,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { // Process each repo (with optional filtering) for repo_name in repos_to_process { + // A project's repo entry may be a symlink into the shared dataset + // cache. The walker reads through it transparently and reports child + // paths under `git_dir`, so `sources.log` stays project-relative. let repo_path = git_dir.join(&repo_name); - + if !repo_path.exists() { eprintln!("Warning: Repository not found: {}", repo_path.display()); continue; @@ -999,7 +1131,14 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { let mut output_value = serde_json::Value::Object(output); - // Apply select transformation if requested + // Apply select transformation if requested. + // `default` trims each entry to the familiar + // title/abstracts/subject shape. 
`docs` deliberately + // does NOT trim — it keeps the full joined `bill` + // (the whole metadata.json) so the {id,text,kind} + // document carries the FULL bill text per + // STREAM_PROTOCOL §1. The collapse to {id,text,kind} + // happens after the entry survives the filter. if select == "default" { // Select specific keys from nested objects, preserving structure let mut selected_output = serde_json::Map::new(); @@ -1075,6 +1214,13 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { }; if should_output { + // `docs` mode: collapse the surviving entry to the + // {id,text} document shape fastclass consumes. + let output_value = if select == "docs" { + ocd_entry_to_doc(&output_value) + } else { + output_value + }; // Deep prune empty/null values before serialization let pruned_value = deep_prune_json(output_value); @@ -1213,29 +1359,29 @@ fn extract_timestamp_from_path(path: &str) -> Option { None } -/// Compute relative path from git_dir to a file, following symlinks +/// Compute the relative path from `git_dir` to a walked file. +/// +/// Files are walked as `git_dir//...` — including through a `` +/// symlink into the shared dataset cache — so the direct (non-canonicalized) +/// diff is what keeps `sources.log` project-relative. Canonicalizing here +/// would resolve a cached dataset to `~/.govbot/cache/...` and escape +/// `git_dir`; it is used only as a last-resort fallback. fn compute_relative_source_path(file_path: &PathBuf, git_dir: &PathBuf) -> String { - // Canonicalize the file path to follow symlinks - let canonical_file = match file_path.canonicalize() { - Ok(p) => p, - Err(_) => file_path.clone(), - }; - - // Canonicalize git_dir for proper relative path calculation - let canonical_git_dir = match git_dir.canonicalize() { - Ok(p) => p, - Err(_) => git_dir.clone(), - }; - - // Get relative path from git_dir to the file + // Preferred: the path as walked, relative to git_dir. 
+ if let Some(rel) = pathdiff::diff_paths(file_path, git_dir) { + if !rel.starts_with("..") { + return rel.to_string_lossy().replace('\\', "/"); + } + } + + // Fallback: canonicalize both ends and diff. + let canonical_file = file_path.canonicalize().unwrap_or_else(|_| file_path.clone()); + let canonical_git_dir = git_dir.canonicalize().unwrap_or_else(|_| git_dir.clone()); match pathdiff::diff_paths(&canonical_file, &canonical_git_dir) { Some(rel_path) => rel_path.to_string_lossy().replace('\\', "/"), - None => { - // Fallback: use path relative to git_dir directly - pathdiff::diff_paths(file_path, git_dir) - .map(|p| p.to_string_lossy().replace('\\', "/")) - .unwrap_or_else(|| file_path.to_string_lossy().replace('\\', "/")) - } + None => pathdiff::diff_paths(file_path, git_dir) + .map(|p| p.to_string_lossy().replace('\\', "/")) + .unwrap_or_else(|| file_path.to_string_lossy().replace('\\', "/")), } } @@ -1254,7 +1400,7 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { // Check if directory exists if !repos_dir.exists() { eprintln!("Error: Govbot repos directory not found: {}", repos_dir.display()); - eprintln!("Run 'govbot clone all' first to clone repositories."); + eprintln!("Run 'govbot pull all' first to pull datasets."); return Ok(()); } @@ -1409,719 +1555,298 @@ fn extract_path_info(path: &str) -> Option<(String, String, String)> { Some((country, state, session_id)) } -/// Download a file from a URL to a local path -fn download_file(url: &str, path: &std::path::Path) -> anyhow::Result<()> { - eprintln!("Downloading {}...", url); - let response = reqwest::blocking::get(url)?; - if !response.status().is_success() { - return Err(anyhow::anyhow!("Failed to download {}: HTTP {}", url, response.status())); - } - let mut file = std::fs::File::create(path)?; - std::io::copy(&mut response.bytes()?.as_ref(), &mut file)?; - Ok(()) +/// The slice of a `fastclass classify` result that `govbot apply` consumes. 
+/// Unknown fields are ignored, so fastclass may evolve its output freely. +#[derive(serde::Deserialize)] +struct FastclassResult { + doc: String, + #[serde(default)] + text_hash: String, + #[serde(default)] + tags: HashMap, } -/// Ensure embedding model and tokenizer exist; if missing, download them from Hugging Face. -/// Returns true if files are present/ready, false otherwise. -fn ensure_embedding_files(model_dir: &std::path::Path) -> bool { - let model_path = model_dir.join("model.onnx"); - let tokenizer_path = model_dir.join("tokenizer.json"); - let _vocab_path = model_dir.join("vocab.txt"); - - if model_path.exists() && tokenizer_path.exists() { - return true; - } - - eprintln!("Embedding files not found. Downloading all-MiniLM-L6-v2 (ONNX) to {}...", model_dir.display()); +#[derive(serde::Deserialize)] +struct FastclassTag { + #[serde(default)] + matched: bool, + #[serde(default)] + fusion: FastclassFusion, +} - // Use Xenova ONNX exports - let onnx_url = "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx"; - let tokenizer_url = "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/tokenizer.json"; +#[derive(serde::Deserialize, Default)] +struct FastclassFusion { + #[serde(default)] + final_score: f64, +} - // Download tokenizer.json - if !tokenizer_path.exists() { - if let Err(e) = download_file(tokenizer_url, &tokenizer_path) { - eprintln!("Failed to download tokenizer.json: {}", e); - return false; - } - } +/// A bill's location in the dataset, parsed from a fastclass result's `doc` +/// id — which `govbot source --select docs` set to the bill's directory path. 
+struct BillRoute { + country: String, + state: String, + session: String, + bill_id: String, +} - // Download ONNX model - if !model_path.exists() { - if let Err(e) = download_file(onnx_url, &model_path) { - eprintln!("Failed to download ONNX model: {}", e); - return false; +/// Parse a `doc` id of the form +/// `/country:/state:/sessions//bills/` into the +/// pieces needed to place its `.tag.json` file. Returns `None` for any id that +/// is not a dataset bill path (e.g. a document from a non-govbot source). +fn parse_doc_route(doc: &str) -> Option<BillRoute> { + let segments: Vec<&str> = doc.split('/').collect(); + let (mut country, mut state, mut session, mut bill_id) = (None, None, None, None); + for (i, seg) in segments.iter().enumerate() { + if let Some(c) = seg.strip_prefix("country:") { + country = Some(c.to_string()); + } else if let Some(s) = seg.strip_prefix("state:") { + state = Some(s.to_string()); + } else if *seg == "sessions" { + session = segments.get(i + 1).map(|s| s.to_string()); + } else if *seg == "bills" { + bill_id = segments.get(i + 1).map(|s| s.to_string()); } } - - if !model_path.exists() || !tokenizer_path.exists() { - eprintln!( - "Download completed but model.onnx or tokenizer.json not found in {}", - model_dir.display() - ); - return false; - } - - eprintln!("✅ Successfully downloaded embedding files!"); - true + Some(BillRoute { + country: country?, + state: state?, + session: session?, + bill_id: bill_id?, + }) } -/// Tag result structure: (tag_key, score_breakdown) -type TagResult = (String, govbot::ScoreBreakdown); - -/// Check if a bill is already tagged in tag file(s) for the given session -/// If tag_name is Some, only checks that specific tag file -/// Returns a list of tag names that contain this bill -fn check_existing_tags( - tags_dir: &PathBuf, - bill_id: &str, - tag_name: Option<&str>, -) -> anyhow::Result<Vec<String>> { - let mut matched_tags = Vec::new(); - - if !tags_dir.exists() { - return Ok(matched_tags); - } - - // If a specific tag is 
requested, only check that tag file - if let Some(requested_tag) = tag_name { - let tag_path = tags_dir.join(format!("{}.tag.json", requested_tag)); - if tag_path.exists() { - match fs::read_to_string(&tag_path) { - Ok(contents) => { - if let Ok(tag_file) = serde_json::from_str::(&contents) { - if tag_file.bills.contains_key(bill_id) { - matched_tags.push(requested_tag.to_string()); - } - } - } - Err(_) => { - // Tag file exists but can't be read - return empty - } - } - } - return Ok(matched_tags); - } - - // Otherwise, scan all .tag.json files in the tags directory - for entry in fs::read_dir(tags_dir)? { - let entry = entry?; - let path = entry.path(); - - if let Some(ext) = path.extension() { - if ext == "json" { - if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) { - // Remove .tag suffix if present (e.g., "budget.tag" -> "budget") - let tag_name = stem.strip_suffix(".tag").unwrap_or(stem); - - match fs::read_to_string(&path) { - Ok(contents) => { - if let Ok(tag_file) = serde_json::from_str::(&contents) { - // Check if bill_id exists in bills map - if tag_file.bills.contains_key(bill_id) { - matched_tags.push(tag_name.to_string()); - } - } - } - Err(_) => { - // Skip files that can't be read - continue; - } - } - } - } - } +/// Build a fresh `TagFile` for `tag_key`. The taxonomy now lives in a fastclass +/// classifier bundle, not in `govbot.yml`, so `tag_defs` is normally empty and +/// each tag file gets a minimal stub `tag_config` derived from the tag name. 
+fn new_tag_file(tag_key: &str, tag_defs: &[govbot::TagDefinition], now: &str) -> TagFile { + let tag_def = tag_defs + .iter() + .find(|td| td.name == tag_key) + .cloned() + .unwrap_or_else(|| govbot::TagDefinition { + name: tag_key.to_string(), + description: String::new(), + examples: Vec::new(), + include_keywords: Vec::new(), + exclude_keywords: Vec::new(), + negative_examples: Vec::new(), + threshold: 0.5, + }); + let tag_config_hash = hash_text(&serde_json::to_string(&tag_def).unwrap_or_default()); + TagFile { + metadata: TagFileMetadata { + last_run: now.to_string(), + model: "fastclass".to_string(), + tag_config_hash, + }, + tag_config: tag_def, + text_cache: HashMap::new(), + bills: HashMap::new(), } - - Ok(matched_tags) } -async fn run_tag_command(cmd: Command) -> anyhow::Result<()> { - let Command::Tag { +/// `govbot apply` — the persistence sink of the tagging pipeline. +/// +/// It classifies nothing. It reads `fastclass classify` result JSON from +/// stdin — the apply sink of +/// `govbot source --select docs | fastclass classify - | govbot apply` — and +/// for every matched tag writes the bill into the per-tag `.tag.json` file +/// under the dataset's `sessions//tags/` directory. Those are the +/// files `govbot publish` later turns into feeds. +async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { + let Command::Apply { tag_name, output_dir, - govbot_dir, overwrite, - } = cmd else { + } = cmd + else { unreachable!() }; - // Check if govbot.yml exists in current directory let current_dir = std::env::current_dir()?; - let default_tags_cfg = current_dir.join("govbot.yml"); + // Tag files land under --output-dir when given, otherwise the current + // directory (which, for a govbot project, holds govbot.yml). 
+ let base_output_dir = output_dir + .as_ref() + .map(PathBuf::from) + .unwrap_or_else(|| current_dir.clone()); + + // The taxonomy now lives in a fastclass classifier bundle, not in + // govbot.yml — each `.tag.json` is stamped with a stub `tag_config` + // derived only from the matched tag name. + let tag_defs: Vec<govbot::TagDefinition> = Vec::new(); - // Model/tokenizer directory: prefer user-specified govbot-dir or env GOVBOT_DIR, else default .govbot - let model_dir: PathBuf = if let Some(ref dir) = govbot_dir { - PathBuf::from(dir) - } else if let Ok(dir) = std::env::var("GOVBOT_DIR") { - PathBuf::from(dir) - } else { - current_dir.join(".govbot") - }; - fs::create_dir_all(&model_dir)?; - let model_path = model_dir.join("model.onnx"); - let tokenizer_path = model_dir.join("tokenizer.json"); - - // Require govbot.yml - if !default_tags_cfg.exists() { - return Err(anyhow::anyhow!( - "govbot.yml not found in current directory" - )); - } - - // Load tag definitions (needed for both embedding and keyword fallback) - let tag_defs = govbot::embeddings::load_tags_config(&default_tags_cfg) - .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; - - // Try embedding mode first - let embedding_matcher = if ensure_embedding_files(&model_dir) { - let tags_path = default_tags_cfg.clone(); - - eprintln!("Using embedding mode:"); - eprintln!(" Model: {}", model_path.display()); - eprintln!(" Tokenizer: {}", tokenizer_path.display()); - eprintln!(" Tags config: {}", tags_path.display()); - - match TagMatcher::from_files(&model_path, &tokenizer_path, &tags_path) { - Ok(matcher) => Some(matcher), - Err(e) => { - eprintln!("Warning: Failed to initialize embedding matcher: {}", e); - eprintln!("Falling back to keyword-based matching."); - None - } - } - } else { - eprintln!("Embedding files not available; using keyword-based matching."); - eprintln!(" Tags config: {}", default_tags_cfg.display()); - None - }; - - // Determine output directory - // If govbot.yml exists, use its directory
as the base output directory - let base_output_dir = if default_tags_cfg.exists() { - // Use the directory containing govbot.yml - default_tags_cfg.parent() - .unwrap_or(&current_dir) - .to_path_buf() - } else if let Some(ref dir) = output_dir { - PathBuf::from(dir) - } else if let Some(ref dir) = govbot_dir { - PathBuf::from(dir) - } else if let Ok(dir) = std::env::var("GOVBOT_DIR") { - PathBuf::from(dir) - } else { - // Default to current directory - current_dir - }; - - // Read JSON lines from stdin let stdin = io::stdin(); let reader = BufReader::new(stdin.lock()); - - let mut processed_count = 0; - let mut skipped_count = 0; - let mut read_count: usize = 0; - - eprintln!("Reading JSON lines from stdin..."); - + let now = chrono::Utc::now().to_rfc3339(); + let mut written = 0usize; + let mut skipped = 0usize; + + eprintln!("Reading fastclass classification results from stdin..."); for line_result in reader.lines() { let line = line_result?; let line = line.trim(); if line.is_empty() { - read_count += 1; - if read_count % 100 == 0 { - eprintln!("Read {} lines (processed {}, skipped {})...", read_count, processed_count, skipped_count); - } continue; } - - read_count += 1; - // Parse JSON line (assumes default selector format) - match serde_json::from_str::(line) { - Ok(json_value) => { - // Extract bill_id from top-level "id" field (default selector format) - let bill_id_opt = json_value - .get("id") - .and_then(|id| id.as_str()); - - // Extract text from JSON for embedding comparison - let bill_text = ocd_files_select_default(&json_value); - - // Extract path info from sources.log (default selector format) - let path_info = json_value - .get("sources") - .and_then(|sources| sources.get("log")) - .and_then(|path| path.as_str()) - .and_then(|log_path| extract_path_info(log_path)) - .or_else(|| { - // Fallback: use default values if we can't determine - Some(("us".to_string(), "unknown".to_string(), "unknown".to_string())) - }); - - // Process if we have path info (from
sources.log in default selector format) - if let Some((country, state, session_id)) = path_info { - // Get bill_id - use "id" from default selector, or generate from text hash if missing - let bill_id = bill_id_opt.map(|s| s.to_string()).unwrap_or_else(|| { - let text_hash = hash_text(&bill_text); - format!("entry_{}", &text_hash[..8]) - }); - - // Determine tags directory - let tags_dir = base_output_dir - .join(&format!("country:{}", country)) - .join(&format!("state:{}", state)) - .join("sessions") - .join(&session_id) - .join("tags"); - - // Validate tag_name if provided - if let Some(ref requested_tag) = tag_name { - if !tag_defs.iter().any(|td| td.name == *requested_tag) { - return Err(anyhow::anyhow!( - "Tag '{}' not found in govbot.yml. Available tags: {}", - requested_tag, - tag_defs.iter().map(|td| td.name.clone()).collect::>().join(", ") - )); - } - } - - // Fast path: check if bill is already tagged (unless overwrite is set) - let mut matched_tags: Vec = Vec::new(); - let mut should_run_tagging = overwrite; - - if !overwrite { - match check_existing_tags(&tags_dir, &bill_id, tag_name.as_deref()) { - Ok(existing_tags) => { - if !existing_tags.is_empty() { - // Bill is already tagged - output the line and skip tagging - matched_tags = existing_tags; - should_run_tagging = false; - } else { - // Bill not found in tag file(s) - need to run tagging - should_run_tagging = true; - } - } - Err(e) => { - // Error checking tags - run tagging to be safe - eprintln!("Warning: Error checking existing tags for {}: {}", bill_id, e); - should_run_tagging = true; - } - } - } - - // Run tagging logic if needed - if should_run_tagging { - // Choose strategy based on mode - let mut tags: Vec = if let Some(matcher) = embedding_matcher.as_ref() { - match matcher.match_json_value(&json_value) { - Ok(results) => results, - Err(e) => { - eprintln!("Error running embedding matcher for bill {}: {}", bill_id, e); - eprintln!("Falling back to keyword-based matching for this 
entry."); - // Fall back to keyword matching for this entry - govbot::embeddings::match_tags_keywords(&tag_defs, &json_value) - } - } - } else { - // Use keyword-based fallback matcher - govbot::embeddings::match_tags_keywords(&tag_defs, &json_value) - }; - - // Filter to specific tag if requested - if let Some(ref requested_tag) = tag_name { - tags.retain(|(tag, _)| tag == requested_tag); - } - - // Extract tag names from results - matched_tags = tags.iter().map(|(tag_name, _)| tag_name.clone()).collect(); - - // Save tags to files if we found matches - if !tags.is_empty() { - let text_hash = hash_text(&bill_text); - - // Write per-tag files immediately - fs::create_dir_all(&tags_dir)?; - - // Get current timestamp for metadata - let now = chrono::Utc::now().to_rfc3339(); - let model_path_str = if embedding_matcher.is_some() { - model_path.to_string_lossy().to_string() - } else { - "keyword-fallback".to_string() - }; - - for (tag_key, score_breakdown) in tags { - let tag_path = tags_dir.join(format!("{}.tag.json", tag_key)); - - // Load or create TagFile structure - let mut tag_file: TagFile = if tag_path.exists() { - match fs::read_to_string(&tag_path) { - Ok(contents) => { - serde_json::from_str(&contents).unwrap_or_else(|_| { - // If parsing fails, create a new TagFile - let tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def).unwrap_or_default()); - - TagFile { - metadata: TagFileMetadata { - last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - }) - } - Err(_) => { - // Create new TagFile - let tag_def = tag_defs - .iter() 
- .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def)?); - - TagFile { - metadata: TagFileMetadata { - last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - } - } - } else { - // Create new TagFile - let tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def)?); - - TagFile { - metadata: TagFileMetadata { - last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - }; + let result: FastclassResult = match serde_json::from_str(line) { + Ok(r) => r, + Err(e) => { + eprintln!("Warning: skipping unparseable result line: {}", e); + skipped += 1; + continue; + } + }; + let Some(route) = parse_doc_route(&result.doc) else { + eprintln!( + "Warning: skipping '{}' — its id is not a dataset bill path. 
\ + Stream documents in with `govbot source --select docs`.", + result.doc + ); + skipped += 1; + continue; + }; - // Update metadata - tag_file.metadata.last_run = now.clone(); - tag_file.metadata.model = model_path_str.clone(); - - // Update tag config if it changed - let current_tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| tag_file.tag_config.clone()); - - let current_config_hash = hash_text(&serde_json::to_string(&current_tag_def)?); - if current_config_hash != tag_file.metadata.tag_config_hash { - tag_file.tag_config = current_tag_def; - tag_file.metadata.tag_config_hash = current_config_hash; - } - - // Add text to cache if not present - if !tag_file.text_cache.contains_key(&text_hash) { - tag_file.text_cache.insert(text_hash.clone(), bill_text.clone()); - } - - // Add/update bill result - tag_file.bills.insert(bill_id.to_string(), BillTagResult { - text_hash: text_hash.clone(), - score: score_breakdown, - }); - - // Write updated TagFile - let json_string = serde_json::to_string_pretty(&tag_file)?; - fs::write(&tag_path, json_string)?; - } - } - } - - // Output the line if it matches tags (filter mode) - // If a specific tag was requested, only output if that tag matches - // Otherwise, output if any tag matches - let should_output = if let Some(ref requested_tag) = tag_name { - matched_tags.contains(requested_tag) - } else { - !matched_tags.is_empty() - }; - - if should_output { - write_json_line(line)?; - } - - processed_count += 1; - if processed_count % 50 == 0 { - eprintln!("Processed {} entries (matched: {} tags)...", processed_count, matched_tags.len()); - } - } else { - // No path info - skip this entry (default selector should always provide sources.log) - skipped_count += 1; - } + // The tags this bill matched, optionally narrowed to one requested tag.
+ let mut matched: Vec<(String, f64)> = Vec::new(); + for (name, tag) in &result.tags { + if !tag.matched { + continue; } - Err(_e) => { - // Skip malformed/empty lines quietly - skipped_count += 1; + if let Some(req) = &tag_name { + if req != name { + continue; + } } + matched.push((name.clone(), tag.fusion.final_score)); + } + if matched.is_empty() { + continue; } - if read_count % 100 == 0 { - eprintln!("Read {} lines (processed {}, skipped {})...", read_count, processed_count, skipped_count); + let tags_dir = base_output_dir + .join(format!("country:{}", route.country)) + .join(format!("state:{}", route.state)) + .join("sessions") + .join(&route.session) + .join("tags"); + fs::create_dir_all(&tags_dir)?; + + for (tag_key, final_score) in matched { + let tag_path = tags_dir.join(format!("{}.tag.json", tag_key)); + + // Update the existing tag file, or start a fresh one. + let mut tag_file: TagFile = fs::read_to_string(&tag_path) + .ok() + .and_then(|c| serde_json::from_str(&c).ok()) + .unwrap_or_else(|| new_tag_file(&tag_key, &tag_defs, &now)); + + // With --overwrite off, an already-tagged bill is left untouched. 
+ if !overwrite && tag_file.bills.contains_key(&route.bill_id) { + continue; + } + + tag_file.metadata.last_run = now.clone(); + tag_file.metadata.model = "fastclass".to_string(); + tag_file.bills.insert( + route.bill_id.clone(), + BillTagResult { + text_hash: result.text_hash.clone(), + score: govbot::ScoreBreakdown { + final_score, + base_embedding: None, + example_similarity: None, + keyword_match: Vec::new(), + negative_penalty: 0.0, + }, + }, + ); + fs::write(&tag_path, serde_json::to_string_pretty(&tag_file)?)?; } + written += 1; } - - eprintln!("\nProcessed: {}, Skipped: {}", processed_count, skipped_count); - eprintln!("\n✅ Tagging complete!"); - + + eprintln!( + "\n✅ Persisted {} tagged bill(s) under {}; skipped {} entr(ies).", + written, + base_output_dir.display(), + skipped + ); Ok(()) } -async fn run_build_command(cmd: Command) -> anyhow::Result<()> { - let Command::Build { - tags, +/// `govbot publish` — run the manifest's publishers. +/// +/// Reads `govbot.yml`'s typed `publish:` map, collects the tagged result +/// stream from `govbot source`, and runs each named publisher (`rss`/`html`/ +/// `json`/`duckdb`) against it. The publisher's tag list comes from +/// `publish..select`; the retired `tags:` manifest block is gone. 
+async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { + let Command::Publish { + publishers, limit, output_dir, output_file, + dry_run, govbot_dir, } = cmd else { unreachable!() }; - - // Check if govbot.yml exists in current directory + let current_dir = std::env::current_dir()?; let config_path = current_dir.join("govbot.yml"); - if !config_path.exists() { return Err(anyhow::anyhow!("govbot.yml not found in current directory")); } - - // Load configuration - let config = load_config(&config_path)?; - - // Get tags configuration - let tags_config = config.get("tags") - .and_then(|t| t.as_object()) - .ok_or_else(|| anyhow::anyhow!("No tags found in configuration"))?; - - // Determine which tags to use - let tags_to_use: Vec = if tags.is_empty() { - // Use tags from build config, or all tags - if let Some(build_tags) = config.get("build") - .and_then(|p| p.get("tags")) - .and_then(|t| t.as_array()) - { - build_tags - .iter() - .filter_map(|v| v.as_str().map(|s| s.to_string())) - .collect() - } else { - tags_config.keys().cloned().collect() - } - } else { - tags - }; - - // Validate tags exist - for tag in &tags_to_use { - if !tags_config.contains_key(tag) { - return Err(anyhow::anyhow!("Tag '{}' not found in configuration", tag)); - } - } - - if tags_to_use.is_empty() { - return Err(anyhow::anyhow!("No valid tags to process")); + + // Typed manifest — `publish:` is the publisher map. 
+ let manifest = load_manifest(&config_path)?; + if manifest.publish.is_empty() { + return Err(anyhow::anyhow!( + "govbot.yml has no `publish:` publishers to run" + )); } - - // Get build configuration - let build_config = config.get("build").and_then(|p| p.as_object()); - - // Get output directory - let output_dir_path = if let Some(dir) = output_dir { - PathBuf::from(dir) - } else { - let dir_str = build_config - .and_then(|p| p.get("output_dir")) - .and_then(|d| d.as_str()) - .unwrap_or("docs"); - PathBuf::from(dir_str) - }; - - // Get output filename - let output_filename = if let Some(file) = output_file { - file + + // Which publishers to run: all of them, or the requested subset. + let names_to_run: Vec = if publishers.is_empty() { + manifest.publish.keys().cloned().collect() } else { - build_config - .and_then(|p| p.get("output_file")) - .and_then(|f| f.as_str()) - .unwrap_or("feed.xml") - .to_string() - }; - - // Get feed metadata - let feed_title = build_config - .and_then(|p| p.get("title")) - .and_then(|t| t.as_str()) - .map(|s| s.to_string()) - .unwrap_or_else(|| { - format!("{} Legislation", tags_to_use.iter() - .map(|t| t.replace('_', " ").split_whitespace() - .map(|w| { - let mut chars = w.chars(); - match chars.next() { - None => String::new(), - Some(f) => f.to_uppercase().collect::() + chars.as_str(), - } - }) - .collect::>() - .join(" ")) - .collect::>() - .join(" & ")) - }); - - let feed_description = build_config - .and_then(|p| p.get("description")) - .and_then(|d| d.as_str()) - .map(|s| s.to_string()) - .unwrap_or_else(|| { - let mut descs = Vec::new(); - for tag_name in &tags_to_use { - if let Some(tag_obj) = tags_config.get(tag_name).and_then(|t| t.as_object()) { - if let Some(desc) = tag_obj.get("description").and_then(|d| d.as_str()) { - let tag_title = tag_name.replace('_', " ").split_whitespace() - .map(|w| { - let mut chars = w.chars(); - match chars.next() { - None => String::new(), - Some(f) => f.to_uppercase().collect::() + 
chars.as_str(), - } - }) - .collect::>() - .join(" "); - descs.push(format!("{}: {}", tag_title, &desc[..desc.len().min(200)])); - } - } - } - if descs.is_empty() { - "Legislative updates".to_string() - } else { - descs.join(" | ") + for name in &publishers { + if !manifest.publish.contains_key(name) { + return Err(anyhow::anyhow!( + "publisher '{}' not found in govbot.yml `publish:`", + name + )); } - }); - - let feed_link = build_config - .and_then(|p| p.get("base_url")) - .and_then(|u| u.as_str()) - .unwrap_or("https://example.com"); - - let base_url = Some(feed_link); - - // Get repos - let repos = get_repos_from_config(&config); - - // Get repos to process - let repos_to_process: Vec = if repos == vec!["all".to_string()] { - Vec::new() // Empty means all repos - } else { - repos - }; - - // Get limit - parse "none" as no limit, otherwise parse as usize - // Default to 100 if not specified - let limit_str_opt = limit.or_else(|| { - build_config - .and_then(|p| p.get("limit")) - .and_then(|l| { - if let Some(s) = l.as_str() { - Some(s.to_string()) - } else if let Some(n) = l.as_u64() { - Some(n.to_string()) - } else { - None - } - }) - }); - - let limit_value: Option = if let Some(limit_str) = limit_str_opt { - if limit_str.to_lowercase() == "none" { - None // No limit - } else { - limit_str.parse().ok() } - } else { - Some(100) // Default to 100 items + publishers }; - - // Run logs command and collect entries - eprintln!("Collecting log entries for tags: {}", tags_to_use.join(", ")); - let mut entries = Vec::new(); - - // Get the base govbot directory (not the repos subdirectory) - // The logs command expects the base directory and will append /repos itself + + // Resolve the base govbot directory for the `source` subprocess. 
let base_govbot_dir = if let Some(ref gd) = govbot_dir { gd.clone() } else if let Ok(gd) = std::env::var("GOVBOT_DIR") { gd } else { - // Default: $CWD/.govbot std::env::current_dir() .unwrap_or_else(|_| PathBuf::from(".")) .join(".govbot") .to_string_lossy() .to_string() }; - - // Call logs command as subprocess and parse JSON output - // Use current executable (govbot binary) - let exe = std::env::current_exe() - .unwrap_or_else(|_| PathBuf::from("govbot")); - - let mut cmd = ProcessCommand::new(exe); - cmd.arg("logs") + + // Collect the dataset record stream once: `govbot source` over all + // datasets (an empty `--repos` means every dataset). + let datasets_to_process: Vec = if manifest.datasets == vec!["all".to_string()] { + Vec::new() + } else { + manifest.datasets.clone() + }; + + let exe = std::env::current_exe().unwrap_or_else(|_| PathBuf::from("govbot")); + let mut source_cmd = ProcessCommand::new(exe); + source_cmd + .arg("source") .arg("--join") .arg("bill,tags") .arg("--select") @@ -2130,134 +1855,109 @@ async fn run_build_command(cmd: Command) -> anyhow::Result<()> { .arg("default") .arg("--sort") .arg("DESC"); - - // Only add --govbot-dir if it's not the default if !base_govbot_dir.is_empty() && base_govbot_dir != ".govbot" { - cmd.arg("--govbot-dir").arg(&base_govbot_dir); + source_cmd.arg("--govbot-dir").arg(&base_govbot_dir); } - - if !repos_to_process.is_empty() { - cmd.arg("--repos"); - for repo in &repos_to_process { - cmd.arg(repo); + if !datasets_to_process.is_empty() { + source_cmd.arg("--repos"); + for d in &datasets_to_process { + source_cmd.arg(d); } } - - // Don't pass limit to logs command - we'll limit after filtering/sorting - // This ensures we get the best entries, not just the first N from each repo - - let output = cmd.output()?; - - // Check return code + + let output = source_cmd.output()?; if !output.status.success() { let stderr_str = String::from_utf8_lossy(&output.stderr); - eprintln!("Error: logs command failed with exit 
code: {:?}", output.status.code()); + eprintln!("Error: source command failed: {:?}", output.status.code()); eprintln!("Stderr: {}", stderr_str); - return Err(anyhow::anyhow!("Failed to collect log entries")); - } - - // Check if there were any errors in stderr (but compilation messages are OK) - if !output.stderr.is_empty() { - let stderr_str = String::from_utf8_lossy(&output.stderr); - // Filter out compilation messages - let filtered_stderr: Vec<&str> = stderr_str - .lines() - .filter(|line| !line.contains("Compiling") && !line.contains("Finished")) - .collect(); - if !filtered_stderr.is_empty() { - eprintln!("Warning from logs command: {}", filtered_stderr.join("\n")); - } + return Err(anyhow::anyhow!("Failed to collect dataset records")); } - - // Parse JSON lines from output - let mut total_entries = 0; - let mut filtered_entries = 0; + let stdout_str = String::from_utf8_lossy(&output.stdout); - if stdout_str.trim().is_empty() { - eprintln!("Warning: logs command returned no output. Make sure repositories are cloned and contain log files."); + eprintln!( + "Warning: source returned no output. Make sure datasets are pulled \ + and contain records." + ); } - + let mut all_entries: Vec = Vec::new(); for line in stdout_str.lines() { let line = line.trim(); if line.is_empty() { continue; } match serde_json::from_str::(line) { - Ok(entry) => { - total_entries += 1; - if filter_by_tags(&entry, &tags_to_use) { - entries.push(entry); - filtered_entries += 1; - } - } + Ok(entry) => all_entries.push(entry), Err(e) => { - // Skip invalid JSON lines (might be compilation output that leaked through) if !line.contains("Compiling") && !line.contains("Finished") { eprintln!("Warning: Failed to parse JSON line: {}", e); } } } } - - if total_entries == 0 { - eprintln!("Warning: No log entries found. 
Make sure repositories are cloned and contain log files."); - } else if filtered_entries == 0 && !tags_to_use.is_empty() { - eprintln!("Warning: Found {} entries but none matched the specified tags. Entries may not have tags yet - consider running 'govbot tag' first, or build without --tags to include all entries.", total_entries); - } - - // Deduplicate and sort - entries = deduplicate_entries(entries); - entries = sort_by_timestamp(entries); - - // Apply limit (default is 100) - let original_count = entries.len(); - if let Some(lim) = limit_value { - entries.truncate(lim); - if original_count > lim { - eprintln!("Limited feed to {} entries (RSS standard). Use --limit none to include all {} entries.", lim, original_count); + + // CLI `--limit` overrides every publisher's configured limit. + let cli_limit: Option> = limit.map(|s| { + if s.eq_ignore_ascii_case("none") { + None + } else { + s.parse().ok() } + }); + + // Run each named publisher against its filtered/sorted/limited stream. + for name in &names_to_run { + let publisher = manifest.publish.get(name).expect("checked above"); + let select = publisher.select.clone().unwrap_or_default(); + + eprintln!( + "\n=== Publisher '{}' ({:?}) — selecting tags: {} ===", + name, + publisher.kind, + if select.is_empty() { + "".to_string() + } else { + select.join(", ") + } + ); + + // Filter to the publisher's selected tags, dedup, sort. + let mut entries: Vec = all_entries + .iter() + .filter(|e| filter_by_tags(e, &select)) + .cloned() + .collect(); + entries = deduplicate_entries(entries); + entries = sort_by_timestamp(entries); + + // Apply the limit: CLI override, else the publisher's, else 100. + let limit_value: Option = match cli_limit { + Some(v) => v, + None => publisher.resolved_limit(Some(100)), + }; + let original_count = entries.len(); + if let Some(lim) = limit_value { + entries.truncate(lim); + if original_count > lim { + eprintln!( + "Limited '{}' to {} entries. 
Use --limit none for all {}.", + name, lim, original_count + ); + } + } + + let job = govbot::publish::PublishJob { + name, + publisher, + entries, + output_dir_override: output_dir.clone(), + output_file_override: output_file.clone(), + project_dir: current_dir.clone(), + dry_run, + }; + govbot::publish::run_publisher(&job)?; } - - // Create output directory - fs::create_dir_all(&output_dir_path)?; - - // Generate RSS - eprintln!("Generating RSS feed with {} entries...", entries.len()); - let rss_xml = rss::json_to_rss( - entries.clone(), - &feed_title, - &feed_description, - feed_link, - base_url.as_deref(), - "en-us", - ); - - // Write RSS feed - let rss_output_path = output_dir_path.join(&output_filename); - fs::write(&rss_output_path, rss_xml)?; - eprintln!("✓ Generated RSS feed: {}", rss_output_path.display()); - - // Generate HTML - eprintln!("Generating HTML index with {} entries...", entries.len()); - // Only pass title if it was explicitly set in config (not auto-generated) - let html_title = build_config - .and_then(|p| p.get("title")) - .and_then(|t| t.as_str()) - .filter(|s| !s.trim().is_empty()); - let html_content = rss::json_to_html( - entries, - html_title, - feed_link, - base_url.as_deref(), - ); - - // Write HTML index - let html_output_path = output_dir_path.join("index.html"); - fs::write(&html_output_path, html_content)?; - eprintln!("✓ Generated HTML index: {}", html_output_path.display()); - eprintln!(" Tags included: {}", tags_to_use.join(", ")); - + Ok(()) } @@ -2286,7 +1986,223 @@ async fn run_update_command() -> anyhow::Result<()> { } else { return Err(anyhow::anyhow!("Update failed with exit code: {}", status.code().unwrap_or(-1))); } - + + Ok(()) +} + +/// Locate the project's `govbot.yml`, erroring if there is none. +fn require_manifest_path() -> anyhow::Result { + let path = project_dir()?.join("govbot.yml"); + if !path.exists() { + anyhow::bail!( + "No govbot.yml in {}. 
Run `govbot init` to scaffold one.", + project_dir()?.display() + ); + } + Ok(path) +} + +/// `govbot add` — append validated dataset ids to `govbot.yml`'s `datasets:`. +fn run_add_command(cmd: Command) -> anyhow::Result<()> { + let Command::Add { datasets } = cmd else { + unreachable!() + }; + let manifest_path = require_manifest_path()?; + let registry = load_registry()?; + + // Validate every id against the registry before touching the file. + let mut to_add = Vec::new(); + for id in &datasets { + let id = id.trim(); + if id.is_empty() { + continue; + } + if id.eq_ignore_ascii_case("all") { + to_add.push("all".to_string()); + continue; + } + let resolved = registry + .resolve(id) + .map_err(|e| anyhow::anyhow!("{}", e))?; + // Add the identifier the user typed (keeps `wy` short and familiar); + // resolution proved it valid. + let _ = resolved; + to_add.push(id.to_string()); + } + + // Parse the manifest, mutate `datasets`, write it back. + let contents = std::fs::read_to_string(&manifest_path)?; + let mut doc: serde_yaml::Value = serde_yaml::from_str(&contents) + .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; + + let datasets_node = doc + .get_mut("datasets") + .and_then(|v| v.as_sequence_mut()) + .ok_or_else(|| anyhow::anyhow!("govbot.yml has no `datasets:` list"))?; + + let mut added = Vec::new(); + for id in to_add { + let already = datasets_node + .iter() + .any(|v| v.as_str() == Some(id.as_str())); + if already { + eprintln!(" · {} already in datasets", id); + } else { + datasets_node.push(serde_yaml::Value::String(id.clone())); + added.push(id); + } + } + + if added.is_empty() { + eprintln!("Nothing to add."); + return Ok(()); + } + + let yaml = serde_yaml::to_string(&doc) + .map_err(|e| anyhow::anyhow!("Failed to serialize govbot.yml: {}", e))?; + std::fs::write(&manifest_path, yaml)?; + for id in &added { + eprintln!(" + added {}", id); + } + eprintln!("✅ Updated {}. 
Run `govbot pull` to fetch.", manifest_path.display()); + Ok(()) +} + +/// `govbot remove` — drop dataset ids from `govbot.yml`'s `datasets:`. +fn run_remove_command(cmd: Command) -> anyhow::Result<()> { + let Command::Remove { datasets } = cmd else { + unreachable!() + }; + let manifest_path = require_manifest_path()?; + + let contents = std::fs::read_to_string(&manifest_path)?; + let mut doc: serde_yaml::Value = serde_yaml::from_str(&contents) + .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; + + let datasets_node = doc + .get_mut("datasets") + .and_then(|v| v.as_sequence_mut()) + .ok_or_else(|| anyhow::anyhow!("govbot.yml has no `datasets:` list"))?; + + let targets: Vec = datasets + .iter() + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty()) + .collect(); + + let before = datasets_node.len(); + let mut removed = Vec::new(); + datasets_node.retain(|v| { + if let Some(s) = v.as_str() { + if targets.iter().any(|t| t == s) { + removed.push(s.to_string()); + return false; + } + } + true + }); + + if datasets_node.len() == before { + eprintln!("No matching datasets found in govbot.yml."); + return Ok(()); + } + + let yaml = serde_yaml::to_string(&doc) + .map_err(|e| anyhow::anyhow!("Failed to serialize govbot.yml: {}", e))?; + std::fs::write(&manifest_path, yaml)?; + for id in &removed { + eprintln!(" - removed {}", id); + } + eprintln!("✅ Updated {}.", manifest_path.display()); + Ok(()) +} + +/// `govbot ls` — list the project's manifest datasets and locally-cached ones. +fn run_ls_command(cmd: Command) -> anyhow::Result<()> { + let Command::Ls { govbot_dir, output } = cmd else { + unreachable!() + }; + let registry = load_registry()?; + let repos_dir = get_govbot_dir(govbot_dir)?; + let local: Vec = git::get_local_datasets(&repos_dir).unwrap_or_default(); + + // The manifest's declared datasets, if a govbot.yml exists. 
+ let manifest_path = project_dir()?.join("govbot.yml"); + let manifest_datasets: Vec = if manifest_path.exists() { + match govbot::Manifest::load(&manifest_path) { + Ok(m) => m.datasets, + Err(_) => Vec::new(), + } + } else { + Vec::new() + }; + + if output == "json" { + let out = serde_json::json!({ + "manifest": manifest_datasets, + "cached": local, + "registry_total": registry.datasets.len(), + }); + println!("{}", serde_json::to_string_pretty(&out)?); + return Ok(()); + } + + if !manifest_datasets.is_empty() { + println!("Manifest datasets (govbot.yml):"); + for d in &manifest_datasets { + let cached = local.iter().any(|c| c == d) || d == "all"; + let mark = if cached { "✓" } else { "·" }; + println!(" {} {}", mark, d); + } + println!(); + } + + println!("Cached locally ({}):", local.len()); + if local.is_empty() { + println!(" (none — run `govbot pull` to fetch)"); + } else { + for d in &local { + println!(" {}", d); + } + } + Ok(()) +} + +/// `govbot search` — query the dataset registry. +fn run_search_command(cmd: Command) -> anyhow::Result<()> { + let Command::Search { query, output } = cmd else { + unreachable!() + }; + let registry = load_registry()?; + let query_str = query.join(" "); + let hits = registry.search(&query_str); + + if output == "json" { + let rows: Vec<_> = hits + .iter() + .map(|d| { + serde_json::json!({ + "id": d.id, + "name": d.entry.name, + "git_url": d.entry.git_url, + "schema": d.entry.schema, + "path_pattern": d.entry.path_pattern, + }) + }) + .collect(); + println!("{}", serde_json::to_string_pretty(&rows)?); + return Ok(()); + } + + if hits.is_empty() { + eprintln!("No datasets match '{}'.", query_str); + return Ok(()); + } + println!("{} dataset(s):", hits.len()); + for d in &hits { + let name = d.entry.name.as_deref().unwrap_or(""); + println!(" {:<28} {}", d.id, name); + } Ok(()) } @@ -2295,14 +2211,14 @@ async fn main() -> anyhow::Result<()> { let args = Args::parse(); match args.command { - Some(cmd @ Command::Clone { .. 
}) => { - run_clone_command(cmd).await + Some(cmd @ Command::Pull { .. }) => { + run_pull_command(cmd).await } Some(cmd @ Command::Delete { .. }) => { run_delete_command(cmd).await } - Some(cmd @ Command::Logs { .. }) => { - run_logs_command(cmd).await + Some(cmd @ Command::Source { .. }) => { + run_source_command(cmd).await } Some(cmd @ Command::Load { .. }) => { run_load_command(cmd).await @@ -2310,12 +2226,40 @@ async fn main() -> anyhow::Result<()> { Some(Command::Update) => { run_update_command().await } - Some(cmd @ Command::Tag { .. }) => { - run_tag_command(cmd).await + Some(cmd @ Command::Apply { .. }) => { + run_apply_command(cmd).await + } + Some(cmd @ Command::Publish { .. }) => { + run_publish_command(cmd).await + } + Some(Command::Run { govbot_dir }) => { + let cwd = std::env::current_dir()?; + let config_path = cwd.join("govbot.yml"); + if !config_path.exists() { + anyhow::bail!( + "No govbot.yml in {}. Run `govbot init` to scaffold one, then `govbot run`.", + cwd.display() + ); + } + govbot::pipeline::run_pipeline(&config_path, govbot_dir.as_deref()) } - Some(cmd @ Command::Build { .. }) => { - run_build_command(cmd).await + Some(Command::Init) => { + let cwd = std::env::current_dir()?; + let config_path = cwd.join("govbot.yml"); + if config_path.exists() { + eprintln!("govbot.yml already exists in {}.", cwd.display()); + return Ok(()); + } + if std::io::IsTerminal::is_terminal(&std::io::stdin()) { + govbot::wizard::run_wizard() + } else { + govbot::wizard::write_default_files(&cwd) + } } + Some(cmd @ Command::Add { .. }) => run_add_command(cmd), + Some(cmd @ Command::Remove { .. }) => run_remove_command(cmd), + Some(cmd @ Command::Ls { .. }) => run_ls_command(cmd), + Some(cmd @ Command::Search { .. }) => run_search_command(cmd), None => { let cwd = std::env::current_dir()?; let config_path = cwd.join("govbot.yml"); @@ -2330,7 +2274,7 @@ async fn main() -> anyhow::Result<()> { // to start the pipeline (matches the wizard's own message). 
return Ok(()); } - govbot::pipeline::run_pipeline(&config_path) + govbot::pipeline::run_pipeline(&config_path, None) } } } diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index f744cca1..6c14dbde 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs @@ -1,49 +1,71 @@ +use crate::config::{Command_, Manifest, Transform}; use anyhow::{Context, Result}; -use std::path::Path; +use std::collections::HashMap; +use std::path::{Path, PathBuf}; use std::process::{Command, Stdio}; -/// Run the full govbot pipeline: clone/update → tag → build. +/// Run the full govbot pipeline against the project's `govbot.yml`. /// -/// Smart update behavior: -/// - If `.govbot/repos/` exists with repos: just update existing repos (git pull) -/// - If `.govbot/repos/` does not exist: clone repos based on govbot.yml config -pub fn run_pipeline(config_path: &Path) -> Result<()> { +/// Stages: +/// 1. **pull/update** — clone or git-pull the manifest's `datasets`. +/// 2. **classify+apply** — the transform DAG: stream `source --select docs` +/// into each declared transform (an external process speaking the govbot +/// stream protocol) and pipe the final transform's output into +/// `govbot apply`. +/// 3. **publish** — run `govbot publish` to emit the manifest's publishers. +/// +/// Smart update behavior: if `/repos/` already has datasets, just +/// `git pull`; otherwise clone the manifest's `datasets`. +pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>) -> Result<()> { let govbot_bin = std::env::current_exe() .context("Failed to determine govbot binary path")?; - let cwd = config_path - .parent() - .unwrap_or_else(|| Path::new(".")); + let cwd = config_path.parent().unwrap_or_else(|| Path::new(".")); + + let manifest = Manifest::load(config_path)?; + + // The transforms govbot runs in step 2. 
If the manifest declares no + // pipeline, fall back to the classic single classify-transform DAG (a + // `fastclass classify` stage with the classifier bundle at `.`). + let transforms = resolve_pipeline_transforms(&manifest)?; + + // Fast-fail if a transform's binary cannot be resolved. + let resolved: Vec<(String, ResolvedTransform)> = transforms + .iter() + .map(|(name, t)| { + resolve_transform(t).map(|r| (name.clone(), r)) + }) + .collect::>()?; - let repos_dir = cwd.join(".govbot").join("repos"); + // Resolve the repos directory the way subcommands do. + let repos_dir = match govbot_dir { + Some(d) => Path::new(d).join("repos"), + None => cwd.join(".govbot").join("repos"), + }; let has_repos = repos_dir.exists() && std::fs::read_dir(&repos_dir) .map(|mut d| d.next().is_some()) .unwrap_or(false); - // Step 1: Clone or update repos + // Step 1: pull or update datasets. eprintln!(); - eprintln!("=== Step 1/3: {} repositories ===", if has_repos { "Updating" } else { "Cloning" }); + eprintln!( + "=== Step 1/3: {} datasets ===", + if has_repos { "Updating" } else { "Pulling" } + ); eprintln!(); - let clone_status = if has_repos { - // Update existing repos only - Command::new(&govbot_bin) - .arg("clone") - .current_dir(cwd) - .stdin(Stdio::inherit()) - .stdout(Stdio::inherit()) - .stderr(Stdio::inherit()) - .status() - } else { - // First run: clone based on config - let config = crate::publish::load_config(config_path)?; - let repos = crate::publish::get_repos_from_config(&config); - + let pull_status = { let mut cmd = Command::new(&govbot_bin); - cmd.arg("clone"); - for repo in &repos { - cmd.arg(repo); + cmd.arg("pull"); + if !has_repos { + // Initial pull: clone the manifest's datasets. 
+ for dataset in &manifest.datasets { + cmd.arg(dataset); + } + } + if let Some(d) = govbot_dir { + cmd.arg("--govbot-dir").arg(d); } cmd.current_dir(cwd) .stdin(Stdio::inherit()) @@ -51,87 +73,259 @@ pub fn run_pipeline(config_path: &Path) -> Result<()> { .stderr(Stdio::inherit()) .status() }; - - match clone_status { + match pull_status { Ok(status) if !status.success() => { - eprintln!("⚠️ Clone/update had errors (continuing anyway)"); + eprintln!("⚠️ Pull/update had errors (continuing anyway)"); } Err(e) => { - eprintln!("⚠️ Failed to run clone: {} (continuing anyway)", e); + eprintln!("⚠️ Failed to run pull: {} (continuing anyway)", e); } _ => {} } - // Step 2: Tag bills (govbot logs | govbot tag) + // Step 2: run the transform DAG (source | transform... | apply). eprintln!(); - eprintln!("=== Step 2/3: Tagging bills ==="); + eprintln!("=== Step 2/3: Running transforms (source | ... | apply) ==="); eprintln!(); - - let tag_result = run_logs_pipe_tag(&govbot_bin, cwd); - match tag_result { + match run_transform_dag(&govbot_bin, &resolved, cwd, govbot_dir) { Ok(false) => { - eprintln!("⚠️ Tagging had errors (continuing anyway)"); + eprintln!("⚠️ Transform stage had errors (continuing anyway)"); } Err(e) => { - eprintln!("⚠️ Failed to run tagging: {} (continuing anyway)", e); + eprintln!("⚠️ Failed to run transforms: {} (continuing anyway)", e); } _ => {} } - // Step 3: Build RSS feeds + // Step 3: publish. 
eprintln!(); - eprintln!("=== Step 3/3: Building RSS feeds ==="); + eprintln!("=== Step 3/3: Publishing ==="); eprintln!(); - - let build_status = Command::new(&govbot_bin) - .arg("build") + let mut publish_cmd = Command::new(&govbot_bin); + publish_cmd.arg("publish"); + if let Some(d) = govbot_dir { + publish_cmd.arg("--govbot-dir").arg(d); + } + let publish_status = publish_cmd .current_dir(cwd) .stdin(Stdio::inherit()) .stdout(Stdio::inherit()) .stderr(Stdio::inherit()) .status() - .context("Failed to run govbot build")?; - - if !build_status.success() { - anyhow::bail!("Build step failed with exit code: {}", build_status.code().unwrap_or(-1)); + .context("Failed to run govbot publish")?; + if !publish_status.success() { + anyhow::bail!( + "Publish step failed with exit code: {}", + publish_status.code().unwrap_or(-1) + ); } eprintln!(); eprintln!("Pipeline complete!"); - Ok(()) } -/// Run `govbot logs | govbot tag` by piping stdout of logs into stdin of tag. -/// Returns Ok(true) if both succeeded, Ok(false) if either failed. -fn run_logs_pipe_tag(govbot_bin: &Path, cwd: &Path) -> Result<bool> { - let mut logs_child = Command::new(govbot_bin) - .arg("logs") +/// Resolve which transforms `govbot run` executes. +/// +/// If the manifest declares pipelines, the first pipeline's stages that name a +/// `transforms:` entry are run, in order. (Publisher stages are handled by the +/// separate `publish` step.) If no pipeline / no transforms are declared, fall +/// back to a single `fastclass classify` transform with the classifier bundle +/// at `.` (the project directory). +fn resolve_pipeline_transforms(manifest: &Manifest) -> Result<Vec<(String, Transform)>> { + // Prefer an explicit pipeline; pick the first one deterministically. 
+ if let Some((_, stages)) = manifest.pipelines.iter().next() { + let mut out = Vec::new(); + for stage in stages { + if let Some(t) = manifest.transforms.get(stage) { + out.push((stage.clone(), t.clone())); + } + // A stage naming a publisher is handled by the publish step; + // a stage naming neither is a manifest error surfaced elsewhere. + } + if !out.is_empty() { + return Ok(out); + } + } + + // No pipeline transforms: run every declared transform, in name order. + if !manifest.transforms.is_empty() { + return Ok(manifest + .transforms + .iter() + .map(|(k, v)| (k.clone(), v.clone())) + .collect()); + } + + // Nothing declared — the classic single classify transform. The classifier + // bundle defaults to `.` (the project directory holding the bundle). + Ok(vec![( + "classify".to_string(), + Transform { + command: Command_::Argv(vec!["fastclass".to_string(), "classify".to_string(), "-".to_string()]), + reads: "docs".to_string(), + writes: "classification".to_string(), + classifier: Some(".".to_string()), + }, + )]) +} + +/// A transform whose binary has been resolved to an absolute path, with its +/// full argv assembled (including any `classifier=` argument). +struct ResolvedTransform { + /// The resolved executable path. + bin: PathBuf, + /// Arguments passed after the executable. + args: Vec<String>, +} + +/// Resolve a transform's command to an executable + argv. +/// +/// The first argv element is the binary, resolved against `$PATH` and the +/// standard install locations (`~/.cargo/bin`, `~/.govbot/bin`). For a +/// classify-style transform the `classifier=` field is appended as an +/// explicit argument — NOT hard-coded to the cwd. 
+fn resolve_transform(t: &Transform) -> Result<ResolvedTransform> { + let argv = t.command.argv(); + let (bin_name, rest) = argv + .split_first() + .context("transform `command` is empty")?; + + let bin = resolve_transform_binary(bin_name).ok_or_else(|| { + anyhow::anyhow!( + "transform binary `{}` not found on PATH, ~/.cargo/bin, or ~/.govbot/bin.\n\ + For the bundled classify transform, install fastclass:\n\ + cd && just install (or: cargo install --path .)", + bin_name + ) + })?; + + let mut args: Vec<String> = rest.to_vec(); + // Append the explicit classifier bundle path for classify-style transforms. + if let Some(classifier) = &t.classifier { + args.push(format!("classifier={}", classifier)); + } + + Ok(ResolvedTransform { bin, args }) +} + +/// The user's home directory, from `$HOME` (Unix) or `%USERPROFILE%` (Windows). +fn home_dir() -> Option<PathBuf> { + std::env::var_os("HOME") + .or_else(|| std::env::var_os("USERPROFILE")) + .map(PathBuf::from) + .filter(|p| !p.as_os_str().is_empty()) +} + +/// Resolve a transform binary by name: `$PATH` first, then the standard install +/// locations (`~/.cargo/bin`, `~/.govbot/bin`). An absolute/relative path that +/// already exists is used as-is. This generalizes the old `find_fastclass()`. +fn resolve_transform_binary(name: &str) -> Option<PathBuf> { + // An explicit path component — use it directly if it resolves. + if name.contains('/') || name.contains('\\') { + let p = PathBuf::from(name); + return p.is_file().then_some(p); + } + + let exe = if cfg!(windows) && !name.ends_with(".exe") { + format!("{}.exe", name) + } else { + name.to_string() + }; + + if let Ok(path) = std::env::var("PATH") { + if let Some(hit) = std::env::split_paths(&path) + .map(|p| p.join(&exe)) + .find(|p| p.is_file()) + { + return Some(hit); + } + } + let home = home_dir()?; + [".cargo/bin", ".govbot/bin"] + .into_iter() + .map(|d| home.join(d).join(&exe)) + .find(|p| p.is_file()) +} + +/// Run the transform DAG: `govbot source --select docs | | | ... | +/// govbot apply`. 
+/// +/// A **linear executor** — each transform is an external process speaking the +/// govbot stream protocol (newline-delimited JSON, `{id,text,kind}` in, +/// results out). Output of stage N is piped to the stdin of stage N+1. The +/// `transforms:`/`pipelines:` schema is DAG-capable; this runner walks it +/// linearly, which is sufficient for the single-classifier pipeline today. +/// +/// Returns `Ok(true)` when every stage exits successfully. +fn run_transform_dag( + govbot_bin: &Path, + transforms: &[(String, ResolvedTransform)], + cwd: &Path, + govbot_dir: Option<&str>, +) -> Result<bool> { + // Stage 0: the source — `govbot source --select docs`. + let mut source_cmd = Command::new(govbot_bin); + source_cmd.arg("source").arg("--select").arg("docs"); + if let Some(d) = govbot_dir { + source_cmd.arg("--govbot-dir").arg(d); + } + let mut source_child = source_cmd .current_dir(cwd) .stdout(Stdio::piped()) .stderr(Stdio::inherit()) .spawn() - .context("Failed to spawn govbot logs")?; - - let logs_stdout = logs_child + .context("Failed to spawn govbot source")?; + let mut prev_stdout: Stdio = source_child .stdout .take() - .context("Failed to capture logs stdout")?; + .context("Failed to capture source stdout")? + .into(); - let tag_child = Command::new(govbot_bin) - .arg("tag") + // Each transform stage reads the previous stage's stdout. + let mut transform_children = Vec::new(); + for (name, t) in transforms { + let mut child = Command::new(&t.bin) + .args(&t.args) + .current_dir(cwd) + .stdin(prev_stdout) + .stdout(Stdio::piped()) + .stderr(Stdio::inherit()) + .spawn() + .with_context(|| format!("Failed to spawn transform '{}'", name))?; + prev_stdout = child + .stdout + .take() + .with_context(|| format!("Failed to capture stdout of transform '{}'", name))? + .into(); + transform_children.push(child); + } + + // The sink: `govbot apply` consumes the final transform's result stream. 
+ let apply_child = Command::new(govbot_bin) + .arg("apply") .current_dir(cwd) - .stdin(logs_stdout) + .stdin(prev_stdout) .stdout(Stdio::inherit()) .stderr(Stdio::inherit()) .spawn() - .context("Failed to spawn govbot tag")?; + .context("Failed to spawn govbot apply")?; - let tag_output = tag_child + // Wait downstream-to-upstream so pipes drain. + let apply_output = apply_child .wait_with_output() - .context("Failed to wait for govbot tag")?; - - let logs_status = logs_child.wait().context("Failed to wait for govbot logs")?; + .context("Failed to wait for govbot apply")?; + let mut all_ok = apply_output.status.success(); + let mut statuses: HashMap<String, bool> = HashMap::new(); + for (child, (name, _)) in transform_children.iter_mut().zip(transforms.iter()) { + let status = child + .wait() + .with_context(|| format!("Failed to wait for transform '{}'", name))?; + statuses.insert(name.clone(), status.success()); + all_ok &= status.success(); + } + let source_status = source_child.wait().context("Failed to wait for govbot source")?; + all_ok &= source_status.success(); - Ok(logs_status.success() && tag_output.status.success()) + Ok(all_ok) } diff --git a/actions/govbot/src/processor.rs b/actions/govbot/src/processor.rs index b2bd5052..a4257f5c 100644 --- a/actions/govbot/src/processor.rs +++ b/actions/govbot/src/processor.rs @@ -78,7 +78,13 @@ impl PipelineProcessor { config .repos .iter() - .map(|repo| search_dir.join(git::build_repo_name(repo))) + .map(|repo| { + // A dataset identifier may be namespaced; the clone + // directory is keyed on the short (slash-free) name. 
+ let short = repo.rsplit('/').next().unwrap_or(repo); + let short = short.split('@').next().unwrap_or(short); + search_dir.join(git::repo_dir_name(short)) + }) .collect() }; @@ -87,6 +93,9 @@ impl PipelineProcessor { eprintln!("Warning: Expected repository directory does not exist: {}", search_path.display()); continue; } + // A project's repo entry may be a symlink into the shared dataset + // cache; jwalk reads through the root symlink transparently and + // reports child paths under `search_dir`, keeping them relative. // Use jwalk for fast parallel traversal // jwalk uses rayon internally for parallel processing diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index 6f9edb31..e02b3768 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -1,31 +1,200 @@ +use crate::config::{Manifest, Publisher, PublisherKind}; use crate::rss; use anyhow::{Context, Result}; use serde_json::Value; use std::collections::HashSet; use std::fs; -use std::path::Path; - -/// Load and parse govbot.yml configuration -pub fn load_config(config_path: &Path) -> Result { - let contents = fs::read_to_string(config_path) - .with_context(|| format!("Failed to read config file: {}", config_path.display()))?; - serde_yaml::from_str(&contents) - .with_context(|| format!("Failed to parse YAML: {}", config_path.display())) +use std::path::{Path, PathBuf}; + +/// Load and parse the `govbot.yml` manifest (datasets / transforms / publish / +/// pipelines). A manifest carrying the retired `tags:` block fails to parse. +pub fn load_manifest(config_path: &Path) -> Result { + Manifest::load(config_path) +} + +/// A resolved publishing job: a publisher definition plus the result stream +/// (already filtered, deduplicated, sorted, and limited) it should emit. +pub struct PublishJob<'a> { + /// The publisher name from `govbot.yml: publish:`. + pub name: &'a str, + /// The typed publisher definition. 
+ pub publisher: &'a Publisher, + /// The records to publish — the result stream this publisher consumes. + pub entries: Vec, + /// Output directory override (CLI `--output-dir`). + pub output_dir_override: Option, + /// Output filename override (CLI `--output-file`). + pub output_file_override: Option, + /// The project directory (where `govbot.yml` lives). Stateful publishers + /// (e.g. `bluesky`'s posted-state ledger) resolve relative paths here. + pub project_dir: PathBuf, + /// `--dry-run`: render but do not emit. The `bluesky` publisher honours + /// this by touching no network and no ledger. + pub dry_run: bool, } -/// Get repos list from config, handling 'all' special case -pub fn get_repos_from_config(config: &Value) -> Vec { - if let Some(repos) = config.get("repos") { - if let Some(arr) = repos.as_array() { - return arr - .iter() - .filter_map(|v| v.as_str().map(|s| s.to_string())) - .collect(); - } else if let Some(s) = repos.as_str() { - return vec![s.to_string()]; +/// Run a single publisher against its result stream and emit artifacts. +/// +/// govbot's built-in publishers each consume the result stream and emit +/// artifacts: `rss`/`html` write a feed + HTML index, `json` writes a JSON +/// dump, `duckdb` loads the records into a DuckDB database, and `bluesky` +/// posts matched bills to a Bluesky account (see `crate::bluesky`). 
+pub fn run_publisher(job: &PublishJob) -> Result<()> { + let p = job.publisher; + let select = p.select.clone().unwrap_or_default(); + + let output_dir = PathBuf::from( + job.output_dir_override + .clone() + .or_else(|| p.output_dir.clone()) + .unwrap_or_else(|| "docs".to_string()), + ); + + match p.kind { + PublisherKind::Rss | PublisherKind::Html => { + emit_rss_html(job, &select, &output_dir) } + PublisherKind::Json => emit_json(job, &output_dir), + PublisherKind::Duckdb => emit_duckdb(job, &output_dir), + PublisherKind::Bluesky => crate::bluesky::run_bluesky(job, job.dry_run), + } +} + +/// Title-case a tag name (`clean_energy` -> `Clean Energy`). +fn titlecase_tag(tag: &str) -> String { + tag.replace('_', " ") + .split_whitespace() + .map(|w| { + let mut chars = w.chars(); + match chars.next() { + None => String::new(), + Some(f) => f.to_uppercase().collect::() + chars.as_str(), + } + }) + .collect::>() + .join(" ") +} + +/// The `rss`/`html` publisher: a combined RSS feed + HTML index. +fn emit_rss_html(job: &PublishJob, select: &[String], output_dir: &Path) -> Result<()> { + let p = job.publisher; + + let output_file = job + .output_file_override + .clone() + .or_else(|| p.output_file.clone()) + .unwrap_or_else(|| "feed.xml".to_string()); + + let feed_link = p.base_url.as_deref().unwrap_or("https://example.com"); + + // Auto-derive a title from the selected tags when none is configured. + let feed_title = p.title.clone().unwrap_or_else(|| { + if select.is_empty() { + "Legislation".to_string() + } else { + format!( + "{} Legislation", + select.iter().map(|t| titlecase_tag(t)).collect::>().join(" & ") + ) + } + }); + + // The auto-description previously read each tag's `description` from + // `govbot.yml`; that taxonomy data now lives in the fastclass bundle, not + // here. Fall back to a simple tag-name-derived description. 
+ let feed_description = p.description.clone().unwrap_or_else(|| { + if select.is_empty() { + "Legislative updates".to_string() + } else { + format!( + "Legislative updates tagged {}", + select.iter().map(|t| titlecase_tag(t)).collect::>().join(", ") + ) + } + }); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + + eprintln!("Generating RSS feed with {} entries...", job.entries.len()); + let rss_xml = rss::json_to_rss( + job.entries.clone(), + &feed_title, + &feed_description, + feed_link, + Some(feed_link), + "en-us", + ); + let rss_path = output_dir.join(&output_file); + fs::write(&rss_path, rss_xml)?; + eprintln!("✓ Generated RSS feed: {}", rss_path.display()); + + eprintln!("Generating HTML index with {} entries...", job.entries.len()); + // Only pass an explicit (configured) title to the HTML header. + let html_title = p.title.as_deref().filter(|s| !s.trim().is_empty()); + let html = rss::json_to_html(job.entries.clone(), html_title, feed_link, Some(feed_link)); + let html_path = output_dir.join("index.html"); + fs::write(&html_path, html)?; + eprintln!("✓ Generated HTML index: {}", html_path.display()); + Ok(()) +} + +/// The `json` publisher: a JSON dump of the result stream. +fn emit_json(job: &PublishJob, output_dir: &Path) -> Result<()> { + let output_file = job + .output_file_override + .clone() + .or_else(|| job.publisher.output_file.clone()) + .unwrap_or_else(|| "feed.json".to_string()); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + let path = output_dir.join(&output_file); + fs::write(&path, serde_json::to_string_pretty(&job.entries)?)?; + eprintln!( + "✓ Generated JSON dump ({} entries): {}", + job.entries.len(), + path.display() + ); + Ok(()) +} + +/// The `duckdb` publisher: load the result stream into a DuckDB database by +/// writing the records to a JSON file and `read_json_auto`-ing them. 
+fn emit_duckdb(job: &PublishJob, output_dir: &Path) -> Result<()> { + use std::process::Command; + + let db_file = job + .output_file_override + .clone() + .or_else(|| job.publisher.output_file.clone()) + .unwrap_or_else(|| "feed.duckdb".to_string()); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + let db_path = output_dir.join(&db_file); + let json_path = output_dir.join(format!("{}.records.json", job.name)); + fs::write(&json_path, serde_json::to_string(&job.entries)?)?; + + let sql = format!( + "CREATE OR REPLACE TABLE records AS SELECT * FROM read_json_auto('{}');", + json_path.display() + ); + let status = Command::new("duckdb") + .arg(db_path.to_string_lossy().as_ref()) + .arg("-c") + .arg(&sql) + .status() + .context("Failed to run `duckdb` — is the DuckDB CLI installed?")?; + if !status.success() { + anyhow::bail!("duckdb publisher '{}' failed", job.name); } - vec!["all".to_string()] + eprintln!( + "✓ Loaded {} entries into DuckDB: {}", + job.entries.len(), + db_path.display() + ); + Ok(()) } /// Filter entries by tags diff --git a/actions/govbot/src/registry.rs b/actions/govbot/src/registry.rs new file mode 100644 index 00000000..61a361a4 --- /dev/null +++ b/actions/govbot/src/registry.rs @@ -0,0 +1,323 @@ +//! The govbot dataset registry — "npm/docker for government data." +//! +//! A registry maps a **dataset identifier** to the git repo holding its data, +//! the data schema it follows, and the glob that locates records within the +//! repo. Datasets are git repos; this index is what lets govbot resolve a +//! dataset at runtime instead of from a compiled enum. +//! +//! ## Identifier scheme +//! +//! A canonical identifier is `namespace/name[@channel]`: +//! - `namespace` — a grouping (`us-legislation`, a county set, an agency set). +//! - `name` — the dataset within the namespace (`wy`, `il`, …). +//! - `@channel` — an optional release channel / branch (defaults to the +//! 
repo's default branch). +//! +//! **Plain jurisdiction codes stay valid.** A bare identifier with no `/` +//! (e.g. `wy`) is resolved against the registry's `default_namespace`, so an +//! existing manifest `datasets: [wy]` keeps working unchanged. `all` is a +//! reserved alias meaning "every dataset in the registry." +//! +//! ## Where it lives / how it is fetched +//! +//! The default registry is the JSON file `data/registry.json`, **compiled into +//! the binary** via `include_str!` — so a fresh install resolves the 52 seed +//! jurisdictions with zero network access. A project can override it: +//! 1. `GOVBOT_REGISTRY_URL` — an `http(s)://` URL or a local file path. +//! 2. `/.govbot/registry.json` — a project-local registry file. +//! A fetched registry is cached at `~/.govbot/registry.json`. +//! +//! See `actions/govbot/REGISTRY.md` for the full format documentation. + +use crate::error::{Error, Result}; +use serde::{Deserialize, Serialize}; +use std::collections::BTreeMap; +use std::path::PathBuf; + +/// The bundled default registry, compiled into the binary. +const BUNDLED_REGISTRY: &str = include_str!("../data/registry.json"); + +/// A single dataset entry: where its data lives and how to read it. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct DatasetEntry { + /// The git repository URL the dataset's data is cloned from. + pub git_url: String, + + /// The data schema the dataset follows (e.g. `ocdfiles`). Informational — + /// it lets a future `source` projection pick the right reader. + #[serde(default)] + pub schema: Option, + + /// A glob, relative to the cloned repo root, that locates the dataset's + /// records. Replaces the hard-coded `**/logs/*.json` walk. + #[serde(default)] + pub path_pattern: Option, + + /// A human-readable display name (`Wyoming`, `Cook County`, …). + #[serde(default)] + pub name: Option, +} + +/// The parsed registry file. 
+#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Registry { + /// Registry format version, for forward-compatibility. + #[serde(default, rename = "$schema_version")] + pub schema_version: Option, + + /// Free-text description of this registry. + #[serde(default)] + pub description: Option, + + /// The namespace a bare (slash-free) identifier is resolved against. + #[serde(default = "default_namespace")] + pub default_namespace: String, + + /// Canonical `namespace/name` → entry map. + pub datasets: BTreeMap, +} + +fn default_namespace() -> String { + "us-legislation".to_string() +} + +/// A resolved dataset: its canonical id plus the entry it points at. +#[derive(Debug, Clone)] +pub struct ResolvedDataset { + /// The canonical `namespace/name` identifier (channel stripped). + pub id: String, + /// The optional channel (branch) requested via `@channel`. + pub channel: Option, + /// The registry entry. + pub entry: DatasetEntry, +} + +impl ResolvedDataset { + /// The short, slash-free name a clone directory is keyed on (`wy`, `il`). + /// Strips the namespace; this is also the legacy "locale" string. + pub fn short_name(&self) -> &str { + self.id.rsplit('/').next().unwrap_or(&self.id) + } +} + +impl Registry { + /// Parse the bundled default registry. Infallible in practice — the file + /// is validated at build time — but surfaces a `Config` error if not. + pub fn bundled() -> Result { + serde_json::from_str(BUNDLED_REGISTRY) + .map_err(|e| Error::Config(format!("Bundled registry is invalid: {}", e))) + } + + /// Load the active registry, honoring overrides in priority order: + /// 1. `GOVBOT_REGISTRY_URL` (an `http(s)://` URL or a filesystem path). + /// 2. `/.govbot/registry.json` — a project-local registry. + /// 3. The bundled default. + /// + /// `project_dir` is the directory holding `govbot.yml` (or the cwd). 
+ pub fn load(project_dir: &std::path::Path) -> Result { + if let Ok(src) = std::env::var("GOVBOT_REGISTRY_URL") { + if !src.trim().is_empty() { + return Registry::from_source(&src); + } + } + let project_registry = project_dir.join(".govbot").join("registry.json"); + if project_registry.is_file() { + return Registry::from_file(&project_registry); + } + Registry::bundled() + } + + /// Load a registry from a source string: an `http(s)://` URL is fetched + /// (and cached at `~/.govbot/registry.json`), anything else is a file path. + pub fn from_source(src: &str) -> Result { + if src.starts_with("http://") || src.starts_with("https://") { + Registry::fetch(src) + } else { + Registry::from_file(std::path::Path::new(src)) + } + } + + /// Parse a registry from a JSON file on disk. + pub fn from_file(path: &std::path::Path) -> Result { + let contents = std::fs::read_to_string(path).map_err(|e| { + Error::Config(format!("Failed to read registry {}: {}", path.display(), e)) + })?; + serde_json::from_str(&contents) + .map_err(|e| Error::Config(format!("Invalid registry {}: {}", path.display(), e))) + } + + /// Fetch a registry over HTTP and cache it at `~/.govbot/registry.json`. + pub fn fetch(url: &str) -> Result { + let body = ureq::get(url) + .call() + .map_err(|e| Error::Config(format!("Failed to fetch registry {}: {}", url, e)))? + .into_body() + .read_to_string() + .map_err(|e| Error::Config(format!("Failed to read registry body: {}", e)))?; + let registry: Registry = serde_json::from_str(&body) + .map_err(|e| Error::Config(format!("Fetched registry {} is invalid: {}", url, e)))?; + // Best-effort cache write — a failure here is non-fatal. + if let Some(cache) = registry_cache_path() { + if let Some(parent) = cache.parent() { + let _ = std::fs::create_dir_all(parent); + } + let _ = std::fs::write(&cache, &body); + } + Ok(registry) + } + + /// Canonicalize a dataset identifier to `namespace/name` (channel stripped). 
+ /// + /// `wy` → `/wy`; `us-counties/cook` is returned as-is; + /// `wy@nightly` → `/wy`. The channel is returned + /// separately by [`Registry::resolve`]. + pub fn canonical_id(&self, identifier: &str) -> (String, Option) { + let (base, channel) = match identifier.split_once('@') { + Some((b, c)) => (b, Some(c.to_string())), + None => (identifier, None), + }; + let id = if base.contains('/') { + base.to_string() + } else { + format!("{}/{}", self.default_namespace, base) + }; + (id, channel) + } + + /// Resolve a dataset identifier to its registry entry. + /// + /// Accepts a canonical `namespace/name[@channel]` id or a bare jurisdiction + /// code (resolved against `default_namespace`). Returns a `Config` error if + /// the identifier is not in the registry. + pub fn resolve(&self, identifier: &str) -> Result { + let (id, channel) = self.canonical_id(identifier); + let entry = self.datasets.get(&id).ok_or_else(|| { + Error::Config(format!( + "Unknown dataset '{}'. It is not in the registry. \ + Run `govbot search` to list available datasets.", + identifier + )) + })?; + Ok(ResolvedDataset { + id, + channel, + entry: entry.clone(), + }) + } + + /// Resolve a list of identifiers, expanding the `all` alias to every + /// dataset in the registry. Order is preserved; `all` expands in + /// canonical (sorted) order. + pub fn resolve_all(&self, identifiers: &[String]) -> Result> { + let mut out = Vec::new(); + for ident in identifiers { + let ident = ident.trim(); + if ident.is_empty() { + continue; + } + if ident.eq_ignore_ascii_case("all") { + for id in self.datasets.keys() { + out.push(self.resolve(id)?); + } + } else { + out.push(self.resolve(ident)?); + } + } + Ok(out) + } + + /// Every dataset in the registry, in canonical id order. + pub fn all(&self) -> Vec { + self.datasets + .iter() + .map(|(id, entry)| ResolvedDataset { + id: id.clone(), + channel: None, + entry: entry.clone(), + }) + .collect() + } + + /// Search the registry. 
A blank query matches everything; otherwise the + /// query is matched case-insensitively against the id and the name. + pub fn search(&self, query: &str) -> Vec { + let q = query.trim().to_lowercase(); + self.all() + .into_iter() + .filter(|d| { + if q.is_empty() { + return true; + } + d.id.to_lowercase().contains(&q) + || d.entry + .name + .as_deref() + .map(|n| n.to_lowercase().contains(&q)) + .unwrap_or(false) + }) + .collect() + } +} + +/// The path the most recently fetched registry is cached at: +/// `~/.govbot/registry.json`. +pub fn registry_cache_path() -> Option { + crate::cache::govbot_home().map(|h| h.join("registry.json")) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn bundled_registry_parses_and_has_seed_jurisdictions() { + let reg = Registry::bundled().expect("bundled registry must parse"); + assert!(reg.datasets.len() >= 52, "expected the 52-jurisdiction seed"); + assert!(reg.datasets.contains_key("us-legislation/wy")); + } + + #[test] + fn bare_code_resolves_via_default_namespace() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("wy").expect("`wy` must resolve"); + assert_eq!(d.id, "us-legislation/wy"); + assert_eq!(d.short_name(), "wy"); + assert!(d.entry.git_url.contains("wy-legislation")); + } + + #[test] + fn canonical_id_and_channel_split() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("wy@nightly").unwrap(); + assert_eq!(d.id, "us-legislation/wy"); + assert_eq!(d.channel.as_deref(), Some("nightly")); + } + + #[test] + fn namespaced_id_resolves_directly() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("us-legislation/il").unwrap(); + assert_eq!(d.id, "us-legislation/il"); + } + + #[test] + fn unknown_dataset_errors() { + let reg = Registry::bundled().unwrap(); + assert!(reg.resolve("atlantis").is_err()); + } + + #[test] + fn all_alias_expands_to_every_dataset() { + let reg = Registry::bundled().unwrap(); + let resolved = reg.resolve_all(&["all".to_string()]).unwrap(); 
+ assert_eq!(resolved.len(), reg.datasets.len()); + } + + #[test] + fn search_matches_id_and_name() { + let reg = Registry::bundled().unwrap(); + assert!(!reg.search("wyoming").is_empty()); + assert!(!reg.search("wy").is_empty()); + assert_eq!(reg.search("").len(), reg.datasets.len()); + } +} diff --git a/actions/govbot/src/selectors.rs b/actions/govbot/src/selectors.rs index 4e3de940..41c478ed 100644 --- a/actions/govbot/src/selectors.rs +++ b/actions/govbot/src/selectors.rs @@ -1,94 +1,156 @@ /// Default selector for OCDFiles-style JSON structures. -/// Extracts human-readable text content from a JSON value, focusing on bill and log content. +/// +/// Extracts the **full** human-readable text content of a bill from its +/// `metadata.json` projection — per the govbot stream protocol (`STREAM_PROTOCOL.md` +/// §1), the `docs` projection must emit the full bill text, not just titles, so +/// downstream transforms (classification, summarization) see the whole document. +/// +/// For an entry that joins `bill` (the full `metadata.json`) this assembles +/// every text-bearing field of the bill: title, identifier, every abstract, +/// every subject, action descriptions, sponsor names, version notes, related +/// bills, the legislative session and originating organization. For a bare log +/// entry it falls back to the action description. pub fn ocd_files_select_default(value: &serde_json::Value) -> String { + let mut texts = Vec::new(); + collect_bill_text(value, &mut texts); + // Drop empties and de-dup adjacent blanks; join with spaces. + texts + .into_iter() + .filter(|s| !s.trim().is_empty()) + .collect::>() + .join(" ") +} + +/// Append every text-bearing string of an OCD-files value into `texts`. 
+fn collect_bill_text(value: &serde_json::Value, texts: &mut Vec<String>) { match value { - serde_json::Value::String(s) => s.clone(), + serde_json::Value::String(s) => texts.push(s.clone()), serde_json::Value::Object(map) => { - let mut texts = Vec::new(); - - // Extract from bill object (if present) + // The full bill metadata, when joined under `bill`. if let Some(bill) = map.get("bill") { - if let Some(title) = bill.get("title").and_then(|v| v.as_str()) { - texts.push(title.to_string()); - } - if let Some(subjects) = bill.get("subject") { - texts.push(ocd_files_select_default(subjects)); - } - if let Some(abstracts) = bill.get("abstracts") { - texts.push(ocd_files_select_default(abstracts)); - } - if let Some(session) = bill.get("legislative_session").and_then(|v| v.as_str()) { - texts.push(session.to_string()); - } - if let Some(org) = bill.get("from_organization").and_then(|v| v.as_str()) { - texts.push(org.to_string()); - } + collect_bill_fields(bill, texts); } - - // Extract from log object (if present) + // A bare log object. if let Some(log) = map.get("log") { - if let Some(action) = log.get("action") { - // Extract description from action object - if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { - texts.push(desc.to_string()); - } - // Or if action is directly a string - if let Some(desc_str) = action.as_str() { - texts.push(desc_str.to_string()); - } - } - // Also check for bill_id in log - if let Some(bill_id) = log - .get("bill_id") - .or_else(|| log.get("bill_identifier")) - .and_then(|v| v.as_str()) - { - texts.push(bill_id.to_string()); - } + collect_log_fields(log, texts); + } + // The map *is* a bill metadata.json (no `bill`/`log` wrappers). + if map.get("bill").is_none() && map.get("log").is_none() { + collect_bill_fields(value, texts); } + } + serde_json::Value::Array(arr) => { + for item in arr { + collect_bill_text(item, texts); + } + } + _ => {} + } +} + +/// Append every text-bearing field of a bill `metadata.json` object.
+fn collect_bill_fields(bill: &serde_json::Value, texts: &mut Vec<String>) { + let serde_json::Value::Object(map) = bill else { + // Not an object — recurse generically (e.g. when `bill` is a string). + collect_strings(bill, texts); + return; + }; - // Extract from action object directly (if present at top level, e.g., when processing log object) - if let Some(action) = map.get("action") { - // Extract description from action object - if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { - texts.push(desc.to_string()); - } - // Or if action is directly a string - if let Some(desc_str) = action.as_str() { - texts.push(desc_str.to_string()); - } + push_str(map, "title", texts); + push_str(map, "identifier", texts); + push_str(map, "legislative_session", texts); + push_str(map, "from_organization", texts); + + // Free-text arrays and nested arrays of objects. + if let Some(v) = map.get("abstracts") { + collect_strings(v, texts); + } + if let Some(v) = map.get("subject") { + collect_strings(v, texts); + } + if let Some(v) = map.get("other_titles") { + collect_strings(v, texts); + } + if let Some(v) = map.get("other_identifiers") { + collect_strings(v, texts); + } + + // Action descriptions.
+ if let Some(actions) = map.get("actions").and_then(|v| v.as_array()) { + for action in actions { + if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { + texts.push(desc.to_string()); } + } + } - // Fallback: extract from all other text fields (excluding metadata) - for (key, val) in map { - if !key.starts_with("_") - && key != "id" - && key != "sources" - && key != "timestamp" - && key != "bill" - && key != "log" - && key != "title" - && key != "action" - && key != "subjects" - && key != "abstracts" - && key != "legislative_session" - && key != "from_organization" - { - if let Some(text) = val.as_str() { - texts.push(text.to_string()); - } else if val.is_object() || val.is_array() { - texts.push(ocd_files_select_default(val)); - } - } + // Sponsor names. + if let Some(sponsors) = map.get("sponsorships").and_then(|v| v.as_array()) { + for sponsor in sponsors { + if let Some(name) = sponsor.get("name").and_then(|v| v.as_str()) { + texts.push(name.to_string()); } + } + } - texts.join(" ") + // Version notes (the closest thing to bill body text in metadata.json). + if let Some(versions) = map.get("versions").and_then(|v| v.as_array()) { + for version in versions { + if let Some(note) = version.get("note").and_then(|v| v.as_str()) { + texts.push(note.to_string()); + } + } + } + + // Documents notes. + if let Some(docs) = map.get("documents").and_then(|v| v.as_array()) { + for doc in docs { + if let Some(note) = doc.get("note").and_then(|v| v.as_str()) { + texts.push(note.to_string()); + } + } + } +} + +/// Append the text-bearing fields of a log object (action description, bill id). 
+fn collect_log_fields(log: &serde_json::Value, texts: &mut Vec) { + if let Some(action) = log.get("action") { + if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { + texts.push(desc.to_string()); + } else if let Some(desc) = action.as_str() { + texts.push(desc.to_string()); + } + } + if let Some(bill_id) = log + .get("bill_id") + .or_else(|| log.get("bill_identifier")) + .and_then(|v| v.as_str()) + { + texts.push(bill_id.to_string()); + } +} + +/// Append a single string-valued map field, if present. +fn push_str(map: &serde_json::Map, key: &str, texts: &mut Vec) { + if let Some(s) = map.get(key).and_then(|v| v.as_str()) { + texts.push(s.to_string()); + } +} + +/// Append every string found anywhere in a JSON value (arrays, nested objects). +fn collect_strings(value: &serde_json::Value, texts: &mut Vec) { + match value { + serde_json::Value::String(s) => texts.push(s.clone()), + serde_json::Value::Array(arr) => { + for item in arr { + collect_strings(item, texts); + } + } + serde_json::Value::Object(map) => { + for v in map.values() { + collect_strings(v, texts); + } } - serde_json::Value::Array(arr) => arr - .iter() - .map(ocd_files_select_default) - .collect::>() - .join(" "), - _ => String::new(), + _ => {} } } diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index 9368fb37..b5db102a 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -6,8 +6,9 @@ use std::path::Path; /// Represents the user's choices during the wizard. /// Used both by the interactive wizard and by tests to simulate different paths. pub struct WizardChoices { - pub repos: Vec, - pub include_example_tag: bool, + /// The datasets the project consumes (`govbot.yml: datasets:`). + pub datasets: Vec, + /// Base URL for the RSS/HTML publisher. pub base_url: String, } @@ -31,53 +32,34 @@ impl WizardSession { // Welcome display.push_str("Welcome to govbot! 
Let's set up your project.\n\n"); - // Step 1: Sources - display.push_str("? What data sources do you want to track?\n"); - if choices.repos == ["all"] { - display.push_str("> All states (47 jurisdictions)\n"); - display.push_str(" Select specific states\n"); + // Step 1: Datasets + display.push_str("? What datasets do you want to track?\n"); + if choices.datasets == ["all"] { + display.push_str("> All jurisdictions in the registry\n"); + display.push_str(" Select specific datasets\n"); } else { - display.push_str(" All states (47 jurisdictions)\n"); - display.push_str("> Select specific states\n"); + display.push_str(" All jurisdictions in the registry\n"); + display.push_str("> Select specific datasets\n"); display.push('\n'); - display.push_str("Available states/jurisdictions:\n"); - let all_locales = crate::locale::WorkingLocale::all(); - let locale_strs: Vec = all_locales.iter().map(|l| l.as_str().to_string()).collect(); - for chunk in locale_strs.chunks(10) { - display.push_str(&format!(" {}\n", chunk.join(", "))); - } + display.push_str("Browse the registry with `govbot search`.\n"); display.push('\n'); - display.push_str(&format!("? Enter state codes separated by spaces: {}\n", choices.repos.join(" "))); + display.push_str(&format!( + "? Enter dataset ids separated by spaces: {}\n", + choices.datasets.join(" ") + )); } display.push('\n'); - // Step 2: Tags - display.push_str("Tags let govbot categorize legislation by topics you care about.\n"); - display.push_str("Here's an example tag definition:\n\n"); - display.push_str(" education:\n"); - display.push_str(" description: |\n"); - display.push_str(" Legislation related to schools, education funding,\n"); - display.push_str(" curriculum standards, and educational policy.\n"); - display.push_str(" examples:\n"); - display.push_str(" - \"Increases per-pupil funding for public schools\"\n"); - display.push_str(" - \"Mandates comprehensive sex education curriculum\"\n\n"); - - display.push_str("? 
How would you like to set up tags?\n"); - if choices.include_example_tag { - display.push_str("> Use the example \"education\" tag to start\n"); - display.push_str(" I'll create my own tags later\n"); - } else { - display.push_str(" Use the example \"education\" tag to start\n"); - display.push_str("> I'll create my own tags later\n"); - display.push('\n'); - display.push_str(&ai_prompt_template()); - } - display.push('\n'); + // Step 2: Classification (a separate fastclass bundle, not govbot.yml) + display.push_str("Classification is done by fastclass against a classifier bundle.\n"); + display.push_str("Point the manifest's `transforms.classify.classifier` at your\n"); + display.push_str("bundle directory (containing classifier.yml). See the fastclass\n"); + display.push_str("docs to build one.\n\n"); // Step 3: Publishing - display.push_str("Publishing is configured for RSS feeds by default.\n"); - display.push_str("Your feeds will be generated in the \"docs\" directory.\n\n"); - display.push_str(&format!("? Base URL for your feeds: {}\n\n", choices.base_url)); + display.push_str("Publishing is configured for an RSS feed by default.\n"); + display.push_str("Your feed will be generated in the \"docs\" directory.\n\n"); + display.push_str(&format!("? Base URL for your feed: {}\n\n", choices.base_url)); // Summary display.push_str(" ✓ Created govbot.yml\n"); @@ -85,7 +67,7 @@ impl WizardSession { display.push_str(" ✓ Created .github/workflows/build.yml\n\n"); display.push_str("Setup complete! Run 'govbot' again to start the pipeline.\n"); - let govbot_yml = generate_govbot_yml(&choices.repos, choices.include_example_tag, &choices.base_url); + let govbot_yml = generate_govbot_yml(&choices.datasets, &choices.base_url); let workflow_yml = github_workflow_content().to_string(); WizardSession { @@ -125,37 +107,11 @@ impl WizardSession { } } -/// The AI prompt template shown when users choose to create their own tags. 
-pub fn ai_prompt_template() -> String { - let mut s = String::new(); - s.push_str("To create a tag, copy this prompt into your preferred AI tool:\n\n"); - s.push_str("---\n"); - s.push_str("Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation.\n"); - s.push_str("The tag should have:\n"); - s.push_str("- A description (multiline, covering subtopics)\n"); - s.push_str("- 2-3 example bill descriptions that would match\n"); - s.push_str("- Optional: include_keywords and exclude_keywords lists\n\n"); - s.push_str("Format:\n"); - s.push_str(" tag_name:\n"); - s.push_str(" description: |\n"); - s.push_str(" ...\n"); - s.push_str(" examples:\n"); - s.push_str(" - \"...\"\n"); - s.push_str(" include_keywords:\n"); - s.push_str(" - keyword1\n"); - s.push_str(" exclude_keywords:\n"); - s.push_str(" - keyword1\n"); - s.push_str("---\n\n"); - s.push_str("Paste the result into your govbot.yml under the 'tags:' section.\n"); - s -} - /// Generate default govbot.yml and supporting files without interactive prompts. /// Used when `govbot init` is run in a non-interactive terminal. pub fn write_default_files(dir: &Path) -> Result<()> { let choices = WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: true, + datasets: vec!["all".to_string()], base_url: "https://example.com".to_string(), }; let session = WizardSession::from_choices(&choices); @@ -181,22 +137,21 @@ pub fn run_wizard() -> Result<()> { eprintln!("Welcome to govbot! Let's set up your project."); eprintln!(); - // Step 1: Sources - let repos = prompt_sources()?; + // Step 1: Datasets + let datasets = prompt_sources()?; - // Step 2: Tags - let include_example_tag = prompt_tags()?; + // Step 2: Classification — handled by a separate fastclass bundle. 
+ eprintln!(); + eprintln!("Classification is done by fastclass against a classifier bundle."); + eprintln!("Point the manifest's `transforms.classify.classifier` at your"); + eprintln!("bundle directory (containing classifier.yml)."); // Step 3: Publishing info let base_url = prompt_publishing()?; // Generate and write files let cwd = std::env::current_dir()?; - let choices = WizardChoices { - repos, - include_example_tag, - base_url, - }; + let choices = WizardChoices { datasets, base_url }; let session = WizardSession::from_choices(&choices); session.write_files(&cwd)?; @@ -209,8 +164,8 @@ pub fn run_wizard() -> Result<()> { fn prompt_sources() -> Result<Vec<String>> { let options = vec![ - "All states (47 jurisdictions)", - "Select specific states", + "All jurisdictions in the registry", + "Select specific datasets", ]; let selection = Select::new() @@ -223,19 +178,22 @@ fn prompt_sources() -> Result<Vec<String>> { return Ok(vec!["all".to_string()]); } - // Show available states and let user type them - let all_locales = crate::locale::WorkingLocale::all(); - let locale_strs: Vec<String> = all_locales.iter().map(|l| l.as_str().to_string()).collect(); - - eprintln!(); - eprintln!("Available states/jurisdictions:"); - for chunk in locale_strs.chunks(10) { - eprintln!(" {}", chunk.join(", ")); + // List the registry's datasets so the user can pick from them.
+ let cwd = std::env::current_dir().unwrap_or_else(|_| std::path::PathBuf::from(".")); + if let Ok(registry) = crate::registry::Registry::load(&cwd) { + let ids: Vec = registry.all().iter().map(|d| d.short_name().to_string()).collect(); + eprintln!(); + eprintln!("Available datasets ({}):", ids.len()); + for chunk in ids.chunks(10) { + eprintln!(" {}", chunk.join(", ")); + } + eprintln!(); + eprintln!("Tip: `govbot search ` searches the registry."); + eprintln!(); } - eprintln!(); let input: String = Input::new() - .with_prompt("Enter state codes separated by spaces (e.g., il ca ny)") + .with_prompt("Enter dataset ids separated by spaces (e.g., il ca ny)") .interact_text()?; let repos: Vec = input @@ -251,42 +209,6 @@ fn prompt_sources() -> Result> { } } -fn prompt_tags() -> Result { - eprintln!(); - eprintln!("Tags let govbot categorize legislation by topics you care about."); - eprintln!("Here's an example tag definition:"); - eprintln!(); - eprintln!(" education:"); - eprintln!(" description: |"); - eprintln!(" Legislation related to schools, education funding,"); - eprintln!(" curriculum standards, and educational policy."); - eprintln!(" examples:"); - eprintln!(" - \"Increases per-pupil funding for public schools\""); - eprintln!(" - \"Mandates comprehensive sex education curriculum\""); - eprintln!(); - - let options = vec![ - "Use the example \"education\" tag to start", - "I'll create my own tags later", - ]; - - let selection = Select::new() - .with_prompt("How would you like to set up tags?") - .items(&options) - .default(0) - .interact()?; - - if selection == 1 { - let template = ai_prompt_template(); - for line in template.lines() { - eprintln!("{}", line); - } - eprintln!(); - } - - Ok(selection == 0) -} - fn prompt_publishing() -> Result { eprintln!(); eprintln!("Publishing is configured for RSS feeds by default."); @@ -301,60 +223,52 @@ fn prompt_publishing() -> Result { Ok(base_url) } -/// Generate govbot.yml content from wizard answers. 
+/// Generate a `govbot.yml` manifest from wizard answers. +/// +/// The manifest declares `datasets` + `transforms` + `publish` + `pipelines` — +/// it is NOT a classifier. The tag taxonomy lives in a separate fastclass +/// classifier bundle that `transforms.classify.classifier` references by path. /// This is a pure function for easy testing. -pub fn generate_govbot_yml(repos: &[String], include_example_tag: bool, base_url: &str) -> String { +pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { let mut yml = String::new(); - yml.push_str("# Govbot Configuration\n"); + yml.push_str("# Govbot Manifest\n"); yml.push_str("# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n"); yml.push_str("$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n\n"); - // Repos section - yml.push_str("repos:\n"); - for repo in repos { - yml.push_str(&format!(" - {}\n", repo)); + // datasets — the government-data sources this project consumes. 
+ yml.push_str("datasets:\n"); + for dataset in datasets { + yml.push_str(&format!(" - {}\n", dataset)); } yml.push('\n'); - // Tags section - yml.push_str("tags:\n"); - if include_example_tag { - yml.push_str(" education:\n"); - yml.push_str(" description: |\n"); - yml.push_str(" Legislation related to schools, education funding, curriculum standards, and educational policy, including:\n"); - yml.push_str(" - K-12 public school funding, budgets, and resource allocation\n"); - yml.push_str(" - Curriculum standards, content requirements, and academic programs\n"); - yml.push_str(" - Teacher certification, training, professional development, and compensation\n"); - yml.push_str(" - Higher education policy, tuition, financial aid, and student loans\n"); - yml.push_str(" - Charter schools, school choice, vouchers, and alternative education models\n"); - yml.push_str(" - Special education services, accommodations, and individualized education plans\n"); - yml.push_str(" - School safety, security measures, and student discipline policies\n"); - yml.push_str(" - Early childhood education, pre-K programs, and childcare\n"); - yml.push_str(" - Standardized testing, assessments, and accountability measures\n"); - yml.push_str(" - School district governance, administration, and oversight\n"); - yml.push_str(" - Educational technology, digital learning, and online education\n"); - yml.push_str(" - Career and technical education, vocational training, and workforce development\n"); - yml.push_str(" examples:\n"); - yml.push_str(" - \"Increases per-pupil funding for public schools and establishes minimum teacher salary requirements\"\n"); - yml.push_str(" - \"Mandates comprehensive sex education curriculum in all public schools\"\n"); - yml.push_str(" - \"Expands eligibility for state financial aid programs to include part-time students\"\n"); - } else { - yml.push_str(" # Add your tags here. 
Example:\n"); - yml.push_str(" # my_topic:\n"); - yml.push_str(" # description: |\n"); - yml.push_str(" # Legislation related to ...\n"); - yml.push_str(" # examples:\n"); - yml.push_str(" # - \"Example bill description\"\n"); - yml.push_str(" {}\n"); - } + // transforms — external processes speaking the govbot stream protocol. + // The classify transform shells out to fastclass; point `classifier:` at + // your fastclass classifier bundle directory (containing classifier.yml). + yml.push_str("transforms:\n"); + yml.push_str(" classify:\n"); + yml.push_str(" command: [fastclass, classify, \"-\"]\n"); + yml.push_str(" reads: docs\n"); + yml.push_str(" writes: classification\n"); + yml.push_str(" # Path to your fastclass classifier bundle (containing classifier.yml).\n"); + yml.push_str(" classifier: ./classifier\n"); + yml.push('\n'); + + // publish — a publisher consumes the result stream and emits artifacts. + yml.push_str("publish:\n"); + yml.push_str(" feed:\n"); + yml.push_str(" type: rss\n"); + yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); + yml.push_str(" output_dir: \"docs\"\n"); + yml.push_str(" output_file: \"feed.xml\"\n"); yml.push('\n'); - // Build section - yml.push_str("build:\n"); - yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); - yml.push_str(" output_dir: \"docs\"\n"); - yml.push_str(" output_file: \"feed.xml\"\n"); + // pipelines — named `govbot run` targets, npm-script style. + yml.push_str("pipelines:\n"); + yml.push_str(" default:\n"); + yml.push_str(" - classify\n"); + yml.push_str(" - feed\n"); yml } @@ -386,7 +300,7 @@ pub fn write_gitignore(cwd: &Path) -> Result<()> { fn github_workflow_content() -> &'static str { r#"# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. +# Runs govbot to pull datasets, apply classifications, and publish feeds. 
name: Build Govbot @@ -399,12 +313,8 @@ on: - cron: '0 0 * * *' workflow_dispatch: inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' required: false type: string @@ -419,7 +329,6 @@ jobs: - name: Run Govbot uses: chihacknight/govbot/actions/govbot@main with: - tags: ${{ inputs.tags }} limit: ${{ inputs.limit }} "# } diff --git a/actions/govbot/tapes/govbot-clone-list.tape b/actions/govbot/tapes/govbot-pull-list.tape similarity index 50% rename from actions/govbot/tapes/govbot-clone-list.tape rename to actions/govbot/tapes/govbot-pull-list.tape index 393ece0f..e925b0bc 100644 --- a/actions/govbot/tapes/govbot-clone-list.tape +++ b/actions/govbot/tapes/govbot-pull-list.tape @@ -1,5 +1,5 @@ -# govbot clone --list output -Output tapes/govbot-clone-list.gif +# govbot pull --list output +Output tapes/govbot-pull-list.gif Set Shell bash Set FontSize 14 @@ -7,6 +7,6 @@ Set Width 900 Set Height 800 Set Padding 20 -Type "govbot clone --list" +Type "govbot pull --list" Enter Sleep 2s diff --git a/actions/govbot/tapes/logs-basic.tape b/actions/govbot/tapes/logs-basic.tape deleted file mode 100644 index ce824cfb..00000000 --- a/actions/govbot/tapes/logs-basic.tape +++ /dev/null @@ -1,12 +0,0 @@ -# govbot logs output with mock data -Output tapes/logs-basic.gif - -Set Shell bash -Set FontSize 12 -Set Width 1200 -Set Height 600 -Set Padding 20 - -Type "govbot logs --govbot-dir mocks/.govbot" -Enter -Sleep 3s diff --git a/actions/govbot/tapes/source-basic.tape b/actions/govbot/tapes/source-basic.tape new file mode 100644 index 00000000..7cb1da99 --- /dev/null +++ b/actions/govbot/tapes/source-basic.tape @@ -0,0 +1,12 @@ +# govbot source output with mock data +Output tapes/source-basic.gif + +Set Shell bash +Set FontSize 12 
+Set Width 1200 +Set Height 600 +Set Padding 20 + +Type "govbot source --govbot-dir mocks/.govbot" +Enter +Sleep 3s diff --git a/actions/govbot/tests/cli_example_snaps.rs b/actions/govbot/tests/cli_example_snaps.rs index 3f902337..bbe959c6 100644 --- a/actions/govbot/tests/cli_example_snaps.rs +++ b/actions/govbot/tests/cli_example_snaps.rs @@ -119,14 +119,12 @@ fn run_example_script(script_path: &Path) -> (String, String, i32) { let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); let govbot_dir = manifest_dir.join("mocks").join(".govbot"); - // Set URL template to match existing mock data (uses -data-pipeline suffix) - let repo_url_template = "https://github.com/chn-openstates-files/{locale}-data-pipeline.git"; - + // The mock data's clone directories use the default `-legislation` suffix, + // so no `GOVBOT_REPO_SUFFIX` override is needed. let output = Command::new(&binary) .args(&args) .current_dir(&manifest_dir) .env("GOVBOT_DIR", govbot_dir.to_string_lossy().as_ref()) - .env("GOVBOT_REPO_URL_TEMPLATE", repo_url_template) .output() .expect("Failed to execute command"); @@ -179,8 +177,8 @@ fn format_snapshot_with_script(script_path: &Path, output: &str) -> String { /// Check if a script requires test data to run fn script_requires_test_data(script_path: &Path) -> bool { if let Ok(content) = fs::read_to_string(script_path) { - // Commands that need test data (repos directory) - content.contains("govbot logs") + // Commands that need test data (datasets directory) + content.contains("govbot source") } else { false } diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 773949e9..a583596c 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -11,14 +11,20 @@ Process pipeline log files with type-safe reactive streams 
Usage: govbot [COMMAND] Commands: - clone Clone or pull data pipeline repositories (default: updates existing repos) Clones if repository doesn't exist, pulls if it does Use "govbot clone all" to clone all repos, or "govbot clone " for specific repos - logs Process and display pipeline log files - delete Delete data pipeline repositories Deletes local repository directories for specified locales - load Load bill metadata into a DuckDB database file Loads all metadata.json files from cloned repos into a DuckDB database for analysis. The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) - update Update govbot to the latest nightly version Downloads and installs the latest nightly build from GitHub releases - build Build RSS feed and HTML index from govbot.yml configuration Generates a combined RSS feed and HTML index from logs filtered by tags in govbot.yml - tag Tag bills using semantic or built-in similarity based on govbot.yml in the current directory. Reads JSON lines from stdin (from `govbot logs`), processes entries with bill identifiers, and writes per-tag files under the directory containing govbot.yml. By default, acts as a filter: only outputs lines that match tags. If a tag name is provided, only processes and outputs lines matching that specific tag - help Print this message or the help of the given subcommand(s) + pull Pull dataset repositories (default: updates existing datasets) Clones if the dataset repository doesn't exist, pulls if it does Use "govbot pull all" to pull all datasets, or "govbot pull " for specific ones + source Stream dataset records as JSON Lines (the govbot stream-protocol source) + delete Delete data pipeline repositories Deletes local repository directories for specified locales + load Load bill metadata into a DuckDB database file Loads all metadata.json files from cloned repos into a DuckDB database for analysis. 
The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) + update Update govbot to the latest nightly version Downloads and installs the latest nightly build from GitHub releases + publish Run a publisher: emit feeds/indexes/dumps from a govbot.yml publisher Reads the named publishers from govbot.yml `publish:` and emits their artifacts + apply Persist fastclass classification results into the dataset as tag files. Reads `fastclass classify` result JSON from stdin — the apply sink of `govbot source --select docs | fastclass classify - | govbot apply` — and writes per-tag `.tag.json` files under each bill's session directory, the files `govbot publish` turns into feeds. Classification itself is done by fastclass; `govbot apply` only stores the results + run Run the full govbot pipeline against the current directory's `govbot.yml`: pull/update → `source --select docs | fastclass classify - | apply` → publish. Equivalent to running `govbot` with no arguments + init Scaffold a new govbot.yml in the current directory (the setup wizard). Interactive in a TTY; writes sensible defaults when non-interactive + add Add one or more datasets to the project's `govbot.yml` `datasets:` list. Each id is validated against the registry before it is added + remove Remove one or more datasets from the project's `govbot.yml` + ls List datasets — the project's manifest datasets and the ones cached locally. With no manifest, lists every dataset in the registry + search Search the dataset registry. 
A blank query lists every dataset + help Print this message or the help of the given subcommand(s) Options: -h, --help Print help diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap similarity index 80% rename from actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap rename to actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap index 6679a69a..0c6b783e 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap @@ -3,15 +3,17 @@ source: tests/cli_example_snaps.rs expression: "&formatted_stdout" --- Command: -govbot clone --list +govbot pull --list Output: -Available repos: +Available datasets: ak al ar + az ca co + ct de fl ga @@ -50,12 +52,14 @@ Available repos: sc sd tn + tx usa ut + va vi vt wa wi wv wy - all (clone all repos) + all (pull every dataset) diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap deleted file mode 100644 index 3802b2b7..00000000 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap +++ /dev/null @@ -1,8 +0,0 @@ ---- -source: tests/cli_example_snaps.rs -expression: "&formatted_stdout" ---- -Command: -govbot logs - -Output: diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap new file mode 100644 index 00000000..9adc20f7 --- /dev/null +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap @@ -0,0 +1,14 @@ +--- +source: tests/cli_example_snaps.rs +assertion_line: 221 +expression: "&formatted_stdout" +--- +Command: +govbot source + +Output: 
+{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0001","legislative_session":"2025","title":"General government appropriations-2."},"id":"HB0001","log":{"action":{"classification":["filing"],"date":"2025-01-29T17:06:54+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0001"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250129T170654Z_h_received_for_introduction.json"},"timestamp":"20250129T170654Z"} +{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0002","legislative_session":"2025","title":"Hunting license application fees increase."},"id":"HB0002","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:16:16+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0002"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0002/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0002/logs/20250102T191616Z_h_received_for_introduction.json"},"timestamp":"20250102T191616Z"} +{"bill":{"abstracts":[{"abstract":"2025/Summaries/HB0005.pdf","note":"summary"}],"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0005","legislative_session":"2025","title":"Fishing outfitters and guides-registration of fishing boats."},"id":"HB0005","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:18:48+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0005"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0005/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0005/logs/20250102T191848Z_h_received_for_introduction.json"},"timestamp":"20250102T191848Z"} 
+{"bill":{"abstracts":[{"abstract":"2025/Summaries/HB0004.pdf","note":"summary"}],"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0004","legislative_session":"2025","title":"Snowmobile registration and user fees."},"id":"HB0004","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:17:44+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0004"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0004/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0004/logs/20250102T191744Z_h_received_for_introduction.json"},"timestamp":"20250102T191744Z"} +{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0003","legislative_session":"2025","title":"Animal abuse-predatory animals."},"id":"HB0003","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:17:11+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0003"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0003/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0003/logs/20250102T191711Z_h_received_for_introduction.json"},"timestamp":"20250102T191711Z"} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap new file mode 100644 index 00000000..553154a9 --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap @@ -0,0 +1,30 @@ +--- +source: tests/wizard_tests.rs +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - all + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: 
classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap deleted file mode 100644 index a3ae97fb..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap +++ /dev/null @@ -1,24 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... - # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap deleted file mode 100644 index df8d77bf..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap +++ /dev/null @@ -1,36 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content 
requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://myuser.github.io/my-govbot" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap new file mode 100644 index 00000000..3e6dbf1f --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap @@ -0,0 +1,90 @@ +--- +source: tests/wizard_tests.rs +expression: "&session.to_snapshot()" +--- +=== Wizard Session === + +Welcome to govbot! Let's set up your project. + +? What datasets do you want to track? +> All jurisdictions in the registry + Select specific datasets + +Classification is done by fastclass against a classifier bundle. +Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). See the fastclass +docs to build one. + +Publishing is configured for an RSS feed by default. 
+Your feed will be generated in the "docs" directory. + +? Base URL for your feed: https://myuser.github.io/my-govbot + + ✓ Created govbot.yml + ✓ Created .gitignore with .govbot + ✓ Created .github/workflows/build.yml + +Setup complete! Run 'govbot' again to start the pipeline. + +=== Generated: govbot.yml === + +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - all + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed + +=== Generated: .github/workflows/build.yml === + +# Run Govbot +# Runs govbot to pull datasets, apply classifications, and publish feeds. 
+ +name: Build Govbot + +on: + push: + branches: + - main + - master + schedule: + - cron: '0 0 * * *' + workflow_dispatch: + inputs: + limit: + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' + required: false + type: string + +jobs: + govbot: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Run Govbot + uses: chihacknight/govbot/actions/govbot@main + with: + limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap deleted file mode 100644 index 127602d2..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap +++ /dev/null @@ -1,122 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? -> All states (47 jurisdictions) - Select specific states - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? - Use the example "education" tag to start -> I'll create my own tags later - -To create a tag, copy this prompt into your preferred AI tool: - ---- -Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation. -The tag should have: -- A description (multiline, covering subtopics) -- 2-3 example bill descriptions that would match -- Optional: include_keywords and exclude_keywords lists - -Format: - tag_name: - description: | - ... - examples: - - "..." 
- include_keywords: - - keyword1 - exclude_keywords: - - keyword1 ---- - -Paste the result into your govbot.yml under the 'tags:' section. - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://example.com - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. - -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... - # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap deleted file mode 100644 index b0d13d03..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap +++ /dev/null @@ -1,111 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? -> All states (47 jurisdictions) - Select specific states - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://myuser.github.io/my-govbot - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! 
Run 'govbot' again to start the pipeline. - -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://myuser.github.io/my-govbot" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap index e5413262..6a3b5a68 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap @@ -6,39 +6,23 @@ expression: "&session.to_snapshot()" Welcome to govbot! Let's set up your project. -? What data sources do you want to track? - All states (47 jurisdictions) -> Select specific states +? What datasets do you want to track? + All jurisdictions in the registry +> Select specific datasets -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy +Browse the registry with `govbot search`. -? Enter state codes separated by spaces: wy +? Enter dataset ids separated by spaces: wy -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: +Classification is done by fastclass against a classifier bundle. +Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). 
See the fastclass +docs to build one. - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" +Publishing is configured for an RSS feed by default. +Your feed will be generated in the "docs" directory. -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://sartaj.me/govbot +? Base URL for your feed: https://sartaj.me/govbot ✓ Created govbot.yml ✓ Created .gitignore with .govbot @@ -48,43 +32,37 @@ Setup complete! Run 'govbot' again to start the pipeline. === Generated: govbot.yml === -# Govbot Configuration +# Govbot Manifest # Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json $schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -repos: +datasets: - wy -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and 
oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://sartaj.me/govbot" - output_dir: "docs" - output_file: "feed.xml" +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed === Generated: .github/workflows/build.yml === # Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. +# Runs govbot to pull datasets, apply classifications, and publish feeds. 
name: Build Govbot @@ -97,12 +75,8 @@ on: - cron: '0 0 * * *' workflow_dispatch: inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' required: false type: string @@ -117,5 +91,4 @@ jobs: - name: Run Govbot uses: chihacknight/govbot/actions/govbot@main with: - tags: ${{ inputs.tags }} limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap new file mode 100644 index 00000000..9230ac3b --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap @@ -0,0 +1,96 @@ +--- +source: tests/wizard_tests.rs +expression: "&session.to_snapshot()" +--- +=== Wizard Session === + +Welcome to govbot! Let's set up your project. + +? What datasets do you want to track? + All jurisdictions in the registry +> Select specific datasets + +Browse the registry with `govbot search`. + +? Enter dataset ids separated by spaces: il ca ny + +Classification is done by fastclass against a classifier bundle. +Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). See the fastclass +docs to build one. + +Publishing is configured for an RSS feed by default. +Your feed will be generated in the "docs" directory. + +? Base URL for your feed: https://activist.github.io/legislation + + ✓ Created govbot.yml + ✓ Created .gitignore with .govbot + ✓ Created .github/workflows/build.yml + +Setup complete! Run 'govbot' again to start the pipeline. 
+ +=== Generated: govbot.yml === + +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ca + - ny + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://activist.github.io/legislation" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed + +=== Generated: .github/workflows/build.yml === + +# Run Govbot +# Runs govbot to pull datasets, apply classifications, and publish feeds. + +name: Build Govbot + +on: + push: + branches: + - main + - master + schedule: + - cron: '0 0 * * *' + workflow_dispatch: + inputs: + limit: + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' + required: false + type: string + +jobs: + govbot: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Run Govbot + uses: chihacknight/govbot/actions/govbot@main + with: + limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap deleted file mode 100644 index 727838b5..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap +++ /dev/null @@ -1,134 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? 
- All states (47 jurisdictions) -> Select specific states - -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy - -? Enter state codes separated by spaces: il ca ny - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? - Use the example "education" tag to start -> I'll create my own tags later - -To create a tag, copy this prompt into your preferred AI tool: - ---- -Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation. -The tag should have: -- A description (multiline, covering subtopics) -- 2-3 example bill descriptions that would match -- Optional: include_keywords and exclude_keywords lists - -Format: - tag_name: - description: | - ... - examples: - - "..." - include_keywords: - - keyword1 - exclude_keywords: - - keyword1 ---- - -Paste the result into your govbot.yml under the 'tags:' section. - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://example.com - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. 
- -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... - # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. - -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap deleted file mode 100644 index 528b4933..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap +++ /dev/null @@ -1,123 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? 
- All states (47 jurisdictions) -> Select specific states - -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy - -? Enter state codes separated by spaces: il ca ny - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://activist.github.io/legislation - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. 
- -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://activist.github.io/legislation" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap new file mode 100644 index 00000000..0ba08307 --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap @@ -0,0 +1,30 @@ +--- +source: tests/wizard_tests.rs +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - wy + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). 
+ classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap deleted file mode 100644 index 3e513413..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap +++ /dev/null @@ -1,36 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - wy - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public 
schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://sartaj.me/govbot" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap new file mode 100644 index 00000000..cf8c6819 --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap @@ -0,0 +1,32 @@ +--- +source: tests/wizard_tests.rs +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ca + - ny + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://example.com" + output_dir: "docs" + output_file: "feed.xml" + +pipelines: + default: + - classify + - feed diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap deleted file mode 100644 index f3ab59ea..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap +++ /dev/null @@ -1,26 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... 
- # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/wizard_tests.rs b/actions/govbot/tests/wizard_tests.rs index e8f30725..77f2cfcf 100644 --- a/actions/govbot/tests/wizard_tests.rs +++ b/actions/govbot/tests/wizard_tests.rs @@ -1,5 +1,5 @@ +use govbot::config::Manifest; use govbot::wizard::{generate_govbot_yml, WizardChoices, WizardSession}; -use govbot::publish::{load_config, get_repos_from_config}; // ============================================================ // Full wizard session snapshots — shows the entire user experience @@ -7,66 +7,35 @@ use govbot::publish::{load_config, get_repos_from_config}; // ============================================================ #[test] -fn wizard_session_all_repos_with_example_tag() { +fn wizard_session_all_datasets() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: true, + datasets: vec!["all".to_string()], base_url: "https://myuser.github.io/my-govbot".to_string(), }); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_session_all_with_tag", &session.to_snapshot()); + insta::assert_snapshot!("wizard_session_all", &session.to_snapshot()); }); } #[test] -fn wizard_session_all_repos_own_tags() { +fn wizard_session_specific_datasets() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: false, - base_url: "https://example.com".to_string(), - }); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_session_all_own_tags", &session.to_snapshot()); - }); -} - -#[test] -fn wizard_session_specific_repos_with_example_tag() { - let session = WizardSession::from_choices(&WizardChoices { - 
repos: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], - include_example_tag: true, + datasets: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], base_url: "https://activist.github.io/legislation".to_string(), }); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_session_specific_with_tag", &session.to_snapshot()); - }); -} - -#[test] -fn wizard_session_specific_repos_own_tags() { - let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], - include_example_tag: false, - base_url: "https://example.com".to_string(), - }); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_session_specific_own_tags", &session.to_snapshot()); + insta::assert_snapshot!("wizard_session_specific", &session.to_snapshot()); }); } #[test] fn wizard_session_single_state() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["wy".to_string()], - include_example_tag: true, + datasets: vec!["wy".to_string()], base_url: "https://sartaj.me/govbot".to_string(), }); let mut settings = insta::Settings::clone_current(); @@ -81,116 +50,105 @@ fn wizard_session_single_state() { // ============================================================ #[test] -fn test_generate_govbot_yml_all_repos_with_example_tag() { - let yml = generate_govbot_yml(&["all".to_string()], true, "https://myuser.github.io/my-govbot"); +fn test_generate_govbot_yml_all_datasets() { + let yml = generate_govbot_yml(&["all".to_string()], "https://myuser.github.io/my-govbot"); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_all_with_tag", &yml); + insta::assert_snapshot!("wizard_all", &yml); }); } #[test] -fn 
test_generate_govbot_yml_specific_repos_no_tag() { +fn test_generate_govbot_yml_specific_datasets() { let yml = generate_govbot_yml( &["il".to_string(), "ca".to_string(), "ny".to_string()], - false, "https://example.com", ); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_specific_no_tag", &yml); + insta::assert_snapshot!("wizard_specific", &yml); }); } #[test] -fn test_generate_govbot_yml_all_repos_no_tag() { - let yml = generate_govbot_yml(&["all".to_string()], false, "https://example.com"); +fn test_generate_govbot_yml_single_dataset() { + let yml = generate_govbot_yml(&["wy".to_string()], "https://sartaj.me/govbot"); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_all_no_tag", &yml); - }); -} - -#[test] -fn test_generate_govbot_yml_single_repo_with_tag() { - let yml = generate_govbot_yml(&["wy".to_string()], true, "https://sartaj.me/govbot"); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_single_with_tag", &yml); + insta::assert_snapshot!("wizard_single", &yml); }); } // ============================================================ -// Round-trip tests — generate YAML, write to disk, parse back, -// and verify the parsed config has the expected structure +// Round-trip tests — generate YAML, write to disk, parse back +// as a typed Manifest, and verify the parsed manifest structure // ============================================================ #[test] -fn test_generated_yml_is_valid_yaml_with_tag() { - let yml = generate_govbot_yml(&["all".to_string()], true, "https://myuser.github.io/my-govbot"); +fn test_generated_yml_is_valid_manifest() { + let yml = generate_govbot_yml(&["all".to_string()], "https://myuser.github.io/my-govbot"); let dir = 
tempfile::tempdir().unwrap(); let config_path = dir.path().join("govbot.yml"); std::fs::write(&config_path, &yml).unwrap(); - let config = load_config(&config_path).expect("generated govbot.yml should be valid YAML"); - - // Verify repos - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["all"]); - - // Verify tags exist and have expected structure - let tags = config.get("tags").expect("should have tags key"); - let tags_obj = tags.as_object().expect("tags should be an object"); - assert!(tags_obj.contains_key("education"), "should contain education tag"); - let education = tags_obj.get("education").unwrap().as_object().unwrap(); - assert!(education.contains_key("description"), "education tag should have description"); - assert!(education.contains_key("examples"), "education tag should have examples"); - - // Verify build config - let build = config.get("build").expect("should have build key"); - let build_obj = build.as_object().expect("build should be an object"); - assert_eq!(build_obj.get("base_url").unwrap().as_str().unwrap(), "https://myuser.github.io/my-govbot"); - assert_eq!(build_obj.get("output_dir").unwrap().as_str().unwrap(), "docs"); - assert_eq!(build_obj.get("output_file").unwrap().as_str().unwrap(), "feed.xml"); + let manifest = Manifest::load(&config_path).expect("generated govbot.yml should parse"); + + // datasets + assert_eq!(manifest.datasets, vec!["all"]); + + // transforms — the classify transform shells out to fastclass. + let classify = manifest + .transforms + .get("classify") + .expect("should have a classify transform"); + assert_eq!(classify.reads, "docs"); + assert_eq!(classify.writes, "classification"); + assert!(classify.classifier.is_some(), "classify should reference a bundle"); + + // publish — the RSS feed publisher. 
+ let feed = manifest.publish.get("feed").expect("should have a feed publisher"); + assert_eq!( + feed.base_url.as_deref(), + Some("https://myuser.github.io/my-govbot") + ); + + // pipelines + assert!(manifest.pipelines.contains_key("default")); } #[test] -fn test_generated_yml_is_valid_yaml_without_tag() { - let yml = generate_govbot_yml( - &["il".to_string(), "ca".to_string()], - false, - "https://example.com", - ); +fn test_generated_yml_specific_datasets_round_trip() { + let yml = generate_govbot_yml(&["il".to_string(), "ca".to_string()], "https://example.com"); let dir = tempfile::tempdir().unwrap(); let config_path = dir.path().join("govbot.yml"); std::fs::write(&config_path, &yml).unwrap(); - let config = load_config(&config_path).expect("generated govbot.yml should be valid YAML"); - - // Verify repos - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["il", "ca"]); + let manifest = Manifest::load(&config_path).expect("generated govbot.yml should parse"); + assert_eq!(manifest.datasets, vec!["il", "ca"]); +} - // Verify tags is empty object - let tags = config.get("tags").expect("should have tags key"); - let tags_obj = tags.as_object().expect("tags should be an object"); - assert!(tags_obj.is_empty(), "tags should be empty when no example tag"); +/// A manifest carrying the retired `tags:` block must fail to parse. 
+#[test] +fn test_manifest_with_tags_block_fails() { + let yml = "datasets:\n - all\ntags:\n education:\n description: x\n"; + let dir = tempfile::tempdir().unwrap(); + let config_path = dir.path().join("govbot.yml"); + std::fs::write(&config_path, yml).unwrap(); - // Verify build config - let build = config.get("build").expect("should have build key"); - let build_obj = build.as_object().expect("build should be an object"); - assert_eq!(build_obj.get("base_url").unwrap().as_str().unwrap(), "https://example.com"); + let result = Manifest::load(&config_path); + assert!( + result.is_err(), + "a govbot.yml containing `tags:` must fail to parse" + ); } #[test] fn test_write_files_creates_govbot_yml() { let choices = WizardChoices { - repos: vec!["wy".to_string()], - include_example_tag: true, + datasets: vec!["wy".to_string()], base_url: "https://sartaj.me/govbot".to_string(), }; let session = WizardSession::from_choices(&choices); @@ -198,20 +156,19 @@ fn test_write_files_creates_govbot_yml() { session.write_files(dir.path()).expect("write_files should succeed"); - // Verify govbot.yml was created and is parseable + // Verify govbot.yml was created and parses as a Manifest. let config_path = dir.path().join("govbot.yml"); assert!(config_path.exists(), "govbot.yml should exist"); - let config = load_config(&config_path).expect("written govbot.yml should be valid YAML"); - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["wy"]); + let manifest = Manifest::load(&config_path).expect("written govbot.yml should parse"); + assert_eq!(manifest.datasets, vec!["wy"]); - // Verify .gitignore was created + // Verify .gitignore was created. let gitignore_path = dir.path().join(".gitignore"); assert!(gitignore_path.exists(), ".gitignore should exist"); let gitignore = std::fs::read_to_string(&gitignore_path).unwrap(); assert!(gitignore.contains(".govbot"), ".gitignore should contain .govbot"); - // Verify workflow was created + // Verify workflow was created. 
let workflow_path = dir.path().join(".github/workflows/build.yml"); assert!(workflow_path.exists(), "build.yml workflow should exist"); } diff --git a/schemas/README.md b/schemas/README.md index 21841a76..fb2520fa 100644 --- a/schemas/README.md +++ b/schemas/README.md @@ -8,7 +8,7 @@ The `*.schema.json` files are JSON Schema definitions that validate the structur ### Configuration Schemas -- **`govbot.schema.json`** - Schema for `govbot.yml` configuration files used by the govbot CLI tool. Defines the structure for repositories, tags, and RSS publishing configuration. +- **`govbot.schema.json`** - Schema for `govbot.yml` manifest files used by the govbot CLI tool. A `govbot.yml` is a project manifest: it declares the `datasets` a project consumes, the `transforms` it runs over them, the `publish` publishers that emit artifacts, and named `pipelines` that wire those stages together. It is **not** a classifier — the tag taxonomy lives in a separate fastclass classifier bundle (`classifier.yml`) that govbot only references by path. ### Data Schemas @@ -102,12 +102,24 @@ Schemas can be referenced in YAML files using the `$schema` key: ```yaml # govbot.yml -$schema: https://raw.githubusercontent.com/windy-civi/toolkit/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -repos: +datasets: - all -tags: - # ... 
tag definitions +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier # path to the fastclass classifier bundle +publish: + feed: + type: rss + base_url: "https://example.github.io/my-govbot" +pipelines: + default: + - classify + - feed ``` This enables: diff --git a/schemas/STREAM_PROTOCOL.md b/schemas/STREAM_PROTOCOL.md new file mode 100644 index 00000000..e9105f4b --- /dev/null +++ b/schemas/STREAM_PROTOCOL.md @@ -0,0 +1,116 @@ +# govbot stack — frozen cross-domain contract + +**Status:** FROZEN for the layering refactor (build-sequence steps 1–6). +**Owner:** head architect. Subagents treat this as read-only input. A subagent that +finds the contract unworkable must escalate to the architect for a re-freeze — it does +not change the contract unilaterally. + +This is the load-bearing interface between the three layers — `fastclass` (classifier), +`govbot` (gov-data tool), and userland apps. The layers compose over **process +boundaries** (newline-delimited JSON on stdio), never as linked libraries. + +--- + +## 1. The input stream protocol (govbot → transform) + +`govbot` streams documents to a transform (e.g. `fastclass classify -`) as +**newline-delimited JSON** — one object per line, UTF-8, `\n`-terminated: + +```json +{"id": "", "text": "", "kind": "docs"} +``` + +- **`id`** — an opaque routing key. The transform treats it as opaque and **echoes it + back unchanged** in the result. For govbot's `docs` projection it is the bill's + dataset path; no consumer parses its structure. +- **`text`** — the document body. For govbot's `docs` projection this is the **full + bill text** assembled from `metadata.json` (not just titles). +- **`kind`** — **required**. Tags the stream record type (`docs` today; future + `summary`, etc.). A transform that does not recognize a `kind` **passes the record + through untouched** rather than erroring. 
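
The record shape described in the bullets above can be sketched from the producer side. This is a minimal illustration using only the Rust standard library, not govbot's actual implementation. The names `json_escape` and `to_stream_line` and the sample `id` value are hypothetical. A real producer would more likely use a JSON library such as `serde_json`. The sketch only shows the two properties the protocol requires of each line: exactly one JSON object per line, and a terminating `\n`, with any newline inside `text` escaped so it cannot split the record.

```rust
// Illustrative sketch only: helper names are hypothetical, not part of govbot.

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one stream record as a single newline-terminated JSON line.
/// `id` is an opaque routing key, `text` the document body, `kind` the record type.
fn to_stream_line(id: &str, text: &str, kind: &str) -> String {
    format!(
        "{{\"id\": \"{}\", \"text\": \"{}\", \"kind\": \"{}\"}}\n",
        json_escape(id),
        json_escape(text),
        json_escape(kind)
    )
}

fn main() {
    // The id below is an assumed example value; the protocol treats it as opaque.
    let line = to_stream_line("il/2025/HB1", "An act\nconcerning solar.", "docs");
    print!("{}", line);
}
```

Escaping the embedded newline is what keeps the stream line-oriented: a consumer can split on `\n` without parsing JSON first.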
+ +A transform reads this stream line-by-line and emits one result line per input line. + +## 2. The classify result (`ClassifyResult`) + +`fastclass classify` emits one `ClassifyResult` JSON object per input document. The +echoed identifier field is named **`doc`** (NOT `id`) — this is frozen; downstream +sinks (`govbot apply`) route on `doc`. Full shape — see +`fastclass/schemas/result.schema.json` for the machine-readable schema: + +```json +{ + "doc": "", + "text_hash": "sha256:", + "classifier_version": "sha256:<12-hex>", + "fusion_version": "fusion-v1", + "tags": { + "": { + "matched": true, + "threshold": 0.3, + "matcher_outputs": [ + {"kind": "keyword", "version": "...", "role": "scorer", + "raw_score": 1.0, "evidence": [{"kind": "keyword_hit", "detail": "solar"}]} + ], + "fusion": {"version": "fusion-v1", "final_score": 0.92, "gated": false} + } + } +} +``` + +- `matcher_outputs[].role` is one of `scorer` | `gate` | `penalty`. +- `tags` is ordered by tag name (byte-stable, snapshot-testable). + +## 3. `fastclass describe` + +`fastclass describe classifier=` emits a single JSON object so govbot can +type-check a transform DAG and validate that `publish.*.select:` tag names exist: + +```json +{ + "reads": ["docs"], + "writes": ["classification"], + "tags": ["clean_energy", "conservation", "emissions_and_climate", "fossil_fuels"], + "classifier_version": "sha256:<12-hex>", + "fusion_version": "fusion-v1" +} +``` + +- `tags` is the sorted list of active tag names from the bundle. +- `describe` is a **subcommand** (not a `classify` flag). + +## 4. The classifier-bundle layout + +A classifier bundle is a **directory**. `fastclass` owns its contents; `govbot` only +passes the path (`classifier=`). `fastclass` must NOT know the word "govbot" — +`govbot.yml` is **not** a recognized bundle file. 
+ +``` +/ + classifier.yml taxonomy (REQUIRED; `fastclass.yml` is an accepted alias) + fusion.yml fusion weights + the cascade "uncertainty band" (optional) + eval/ + constitution.yml frozen gold set — never enters an LLM context + rolling.yml refreshable working eval set (optional) + proposals/ improvement-proposal history + model/ optional embedding model + fastclass.lock pins bundle + binary versions for lineage +``` + +## 5. Calibrated scores + +`fusion.final_score` is contractually a **calibrated** probability in `[0, 1]` — +downstream consumers (publisher thresholds, summarizer gating) may threshold it +directly. Calibration **regression** is *flagged* in the backtest verdict, **not +blocked** (soft gate; hardening deferred). + +--- + +## Layer ownership rule + +Each layer owns its own config; **no file is shared across a layer boundary.** + +- `classifier.yml` / `fusion.yml` / the bundle — fastclass's. +- `govbot.yml` (the manifest: `datasets` / `transforms` / `publish` / `pipelines`) — + govbot's. It has **no `tags:`**. +- A userland repo merely *contains* both — it owns neither tool's internals. diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json index 555a2318..9d6dd518 100644 --- a/schemas/govbot.schema.json +++ b/schemas/govbot.schema.json @@ -1,61 +1,126 @@ { "$schema": "http://json-schema.org/draft-07/schema#", - "title": "Govbot Configuration Schema", - "description": "Schema for validating govbot.yml configuration files", + "$id": "https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json", + "title": "Govbot Manifest Schema", + "description": "Schema for validating govbot.yml manifest files. govbot.yml is a project manifest -- it declares the datasets a project consumes, the transforms it runs over them, the publishers that emit artifacts, and named pipelines that wire those stages together. 
It is NOT a classifier: the tag taxonomy lives in a separate fastclass classifier bundle (classifier.yml) which govbot only references by path.", "type": "object", "properties": { - "repos": { - "description": "List of repositories to clone and process. Use 'all' to include all available repositories.", + "$schema": { + "description": "Optional reference to this schema for editor autocomplete and validation.", + "type": "string" + }, + "datasets": { + "description": "Government-data sources the project pulls and processes. Each entry is a dataset identifier -- a short code such as 'wy' today, or a 'namespace/name[@channel]' registry reference in a later wave. 'all' selects every dataset known to govbot. The schema is intentionally permissive (array of strings); structured registry references are validated by govbot at resolve time.", "type": "array", "items": { - "type": "string" + "type": "string", + "description": "A dataset identifier (e.g. 'wy', 'all', or a future 'namespace/name@channel' registry reference)." }, + "minItems": 1, "default": ["all"] }, - "tags": { - "description": "Tag definitions for categorizing legislation. Each tag should have a description and optional examples.", + "transforms": { + "description": "Named external-process transforms. A transform is a separate program that speaks the govbot stream protocol (newline-delimited JSON on stdio, stable 'id', typed 'kind'). govbot streams records of the transform's 'reads' kind into it and routes the records of its 'writes' kind back by 'id'. fastclass classification is one such transform; a future local-LLM summarizer is another.", "type": "object", "additionalProperties": { - "$ref": "#/definitions/tag" + "$ref": "#/definitions/transform" } }, "publish": { - "description": "RSS feed publishing configuration", + "description": "Named publishers. A publisher consumes a result stream and emits artifacts (a feed, an HTML index, a JSON dump, a DuckDB database, or Bluesky posts). 
Each publisher declares a 'type' plus type-specific configuration.", + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/publisher" + } + }, + "pipelines": { + "description": "Named 'govbot run' targets, npm-script style. Each pipeline is an ordered list of stage references -- names of entries in 'transforms' and 'publish' -- executed in sequence. 'govbot run ' runs the named pipeline.", + "type": "object", + "additionalProperties": { + "$ref": "#/definitions/pipeline" + } + } + }, + "required": ["datasets"], + "additionalProperties": false, + "definitions": { + "transform": { "type": "object", + "description": "A single external-process transform stage.", "properties": { - "base_url": { - "description": "Base URL for RSS feed links (required for GitHub Pages). Should match your GitHub Pages URL.", + "command": { + "description": "The external process to run, given the govbot stream on stdin and emitting results on stdout. Either a single string (parsed as a shell-style command) or an argv array (first element is the executable, the rest are arguments).", + "oneOf": [ + { + "type": "string" + }, + { + "type": "array", + "items": { + "type": "string" + }, + "minItems": 1 + } + ] + }, + "reads": { + "description": "The stream record kind this transform consumes. govbot only feeds records of this kind into the transform; records of other kinds pass through untouched. 'docs' is the document-projection kind defined by the stream protocol.", "type": "string", - "format": "uri", - "default": "https://example.com" + "examples": ["docs", "classification", "summary"] }, - "output_dir": { - "description": "Directory where RSS feeds are generated", + "writes": { + "description": "The stream record kind this transform produces. govbot routes records of this kind back into the dataset (or onward to the next stage) by their 'id'. 
The classify transform writes 'classification'.", "type": "string", - "default": "feeds" + "examples": ["classification", "summary"] }, - "output_file": { - "description": "Output filename for the RSS feed", + "classifier": { + "description": "For a classify-style transform: the path to the fastclass classifier bundle directory (containing classifier.yml). govbot passes this path to the transform unchanged and never reads the bundle's contents itself.", + "type": "string" + } + }, + "required": ["command", "reads", "writes"], + "additionalProperties": true + }, + "publisher": { + "type": "object", + "description": "A single publisher stage. The required fields depend on 'type'.", + "properties": { + "type": { + "description": "The publisher kind. 'rss' and 'html' emit feed/index files; 'json' emits a JSON dump; 'duckdb' loads results into a DuckDB database; 'bluesky' posts to a Bluesky account.", "type": "string", - "default": "feed.xml" + "enum": ["rss", "html", "json", "duckdb", "bluesky"] }, - "tags": { - "description": "Specific tags to include in the combined RSS feed. If not specified, all tags are included.", + "select": { + "description": "Tag names to include. Only records carrying at least one of these tags are published; if omitted, all tagged records are published. Tag names must exist in the classifier bundle (govbot validates them against 'fastclass describe').", "type": "array", "items": { "type": "string" } }, + "base_url": { + "description": "Base URL for generated links (required for rss/html on GitHub Pages -- should match the GitHub Pages URL).", + "type": "string", + "format": "uri" + }, + "output_dir": { + "description": "Directory where the publisher writes its artifacts (used by rss/html/json).", + "type": "string", + "default": "docs" + }, + "output_file": { + "description": "Output filename for the primary artifact (e.g. 
the RSS feed file, the JSON dump, or the DuckDB database file).", + "type": "string" + }, "title": { - "description": "Custom feed title. If not specified, defaults to combined tag names.", + "description": "Custom feed/index title. If omitted, defaults to a title derived from the selected tag names.", "type": "string" }, "description": { - "description": "Custom feed description. If not specified, defaults to combined tag descriptions.", + "description": "Custom feed/index description. If omitted, defaults to a description derived from the selected tags.", "type": "string" }, "limit": { - "description": "Limit number of entries per RSS feed. Use 'none' for no limit, or a number. Default is 15 (RSS standard).", + "description": "Maximum number of entries to include. Use the string 'none' for no limit, or a positive integer.", "oneOf": [ { "type": "string", @@ -66,30 +131,48 @@ "minimum": 1 } ] - } - }, - "required": ["base_url", "output_dir", "output_file"] - } - }, - "required": ["repos", "tags"], - "definitions": { - "tag": { - "type": "object", - "properties": { - "description": { - "description": "Detailed description of what legislation this tag covers. Use YAML multiline strings (|) for formatting.", + }, + "min_score": { + "description": "Bluesky publisher only. The minimum calibrated 'final_score' a matched tag must reach for a record to be posted. 'final_score' is the calibrated probability in [0,1] guaranteed by the stream protocol, so it can be thresholded directly. Defaults to 0.6 -- a conservative value so a misconfigured manifest does not flood a feed with low-confidence matches.", + "type": "number", + "minimum": 0, + "maximum": 1, + "default": 0.6 + }, + "ledger": { + "description": "Bluesky publisher only. Path to the append-only posted-state ledger that makes the publisher idempotent -- it records the id of every record posted so re-runs never double-post. Relative paths resolve against the project directory. 
Defaults to '.govbot/bluesky-.ledger'.", "type": "string" }, - "examples": { - "description": "Example bill descriptions that would match this tag", - "type": "array", - "items": { - "type": "string" + "post_template": { + "description": "Bluesky publisher only. The post-text template. Supported placeholders, substituted per record: '{title}', '{tags}', '{link}', '{identifier}', '{session}', '{score}'. Rendered text is truncated to Bluesky's 300-character limit. If omitted, a sensible default template is used. Bluesky credentials are NEVER schema fields -- they are read from the environment (BLUESKY_HANDLE, BLUESKY_APP_PASSWORD, optional BLUESKY_SERVICE).", + "type": "string" + } + }, + "required": ["type"], + "additionalProperties": true, + "allOf": [ + { + "if": { + "properties": { + "type": { + "enum": ["rss", "html"] + } + } + }, + "then": { + "required": ["base_url"] } } + ] + }, + "pipeline": { + "type": "array", + "description": "An ordered list of stage references. Each item names an entry in the manifest's 'transforms' or 'publish' map; govbot runs them in order.", + "items": { + "type": "string", + "description": "The name of a transform or publisher defined elsewhere in this manifest." }, - "required": ["description"] + "minItems": 1 } } } - From 267f820c4ae9656a0837759a4f41a0dd5944c604 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 18:42:11 -0500 Subject: [PATCH 02/32] Tidy: cargo fmt, rename stale 'just govbot logs' refs in filter boilerplate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two mechanical sweeps after the layering refactor landed: 1. cargo fmt — applies rustfmt's preferences uniformly; no behavioral changes. Touches bluesky.rs, cache.rs, git.rs, lock.rs, main.rs, pipeline.rs, processor.rs, publish.rs, registry.rs, wizard.rs. 2. In the 55 per-jurisdiction filter boilerplate files under src/filters/, replace 'just govbot logs --repos=...' with 'just govbot source --repos=...'. 
The 'logs' subcommand was renamed to 'source' in the refactor; the filter-update LLM prompt embedded in each file still referenced the old name. Tests: 30/30 pass, snapshots unchanged (the diff is purely in comments and rustfmt cosmetics). Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/bluesky.rs | 26 ++- .../src/filters/ak-legislation/default.rs | 4 +- .../src/filters/al-legislation/default.rs | 4 +- .../src/filters/ar-legislation/default.rs | 4 +- .../src/filters/az-legislation/default.rs | 6 +- .../src/filters/ca-legislation/default.rs | 4 +- .../src/filters/co-legislation/default.rs | 4 +- .../src/filters/ct-legislation/default.rs | 4 +- .../src/filters/de-legislation/default.rs | 4 +- .../src/filters/fl-legislation/default.rs | 4 +- .../src/filters/ga-legislation/default.rs | 4 +- .../src/filters/gu-legislation/default.rs | 4 +- .../src/filters/hi-legislation/default.rs | 4 +- .../src/filters/ia-legislation/default.rs | 4 +- .../src/filters/id-legislation/default.rs | 4 +- .../src/filters/il-legislation/default.rs | 4 +- .../src/filters/in-legislation/default.rs | 4 +- .../src/filters/ks-legislation/default.rs | 4 +- .../src/filters/ky-legislation/default.rs | 4 +- .../src/filters/la-legislation/default.rs | 4 +- .../src/filters/ma-legislation/default.rs | 4 +- .../src/filters/md-legislation/default.rs | 4 +- .../src/filters/me-legislation/default.rs | 4 +- .../src/filters/mi-legislation/default.rs | 4 +- .../src/filters/mn-legislation/default.rs | 4 +- .../src/filters/mo-legislation/default.rs | 4 +- .../src/filters/mp-legislation/default.rs | 4 +- .../src/filters/ms-legislation/default.rs | 4 +- .../src/filters/mt-legislation/default.rs | 4 +- .../src/filters/nc-legislation/default.rs | 4 +- .../src/filters/nd-legislation/default.rs | 4 +- .../src/filters/ne-legislation/default.rs | 4 +- .../src/filters/nh-legislation/default.rs | 4 +- .../src/filters/nj-legislation/default.rs | 4 +- .../src/filters/nm-legislation/default.rs | 4 +- 
.../src/filters/nv-legislation/default.rs | 4 +- .../src/filters/ny-legislation/default.rs | 4 +- .../src/filters/oh-legislation/default.rs | 4 +- .../src/filters/ok-legislation/default.rs | 4 +- .../src/filters/or-legislation/default.rs | 4 +- .../src/filters/pa-legislation/default.rs | 4 +- .../src/filters/pr-legislation/default.rs | 4 +- .../src/filters/ri-legislation/default.rs | 4 +- .../src/filters/sc-legislation/default.rs | 4 +- .../src/filters/sd-legislation/default.rs | 4 +- .../src/filters/tn-legislation/default.rs | 4 +- .../src/filters/tx-legislation/default.rs | 4 +- .../src/filters/usa-legislation/default.rs | 6 +- .../src/filters/ut-legislation/default.rs | 4 +- .../src/filters/va-legislation/default.rs | 4 +- .../src/filters/vi-legislation/default.rs | 4 +- .../src/filters/vt-legislation/default.rs | 4 +- .../src/filters/wa-legislation/default.rs | 4 +- .../src/filters/wi-legislation/default.rs | 4 +- .../src/filters/wv-legislation/default.rs | 4 +- .../src/filters/wy-legislation/default.rs | 4 +- actions/govbot/src/git.rs | 156 +++++++++--------- actions/govbot/src/lock.rs | 10 +- actions/govbot/src/pipeline.rs | 21 +-- actions/govbot/src/processor.rs | 59 ++++--- actions/govbot/src/registry.rs | 5 +- actions/govbot/tests/api_snaps.rs | 15 +- 62 files changed, 270 insertions(+), 246 deletions(-) diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index 65501906..f55f6b30 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -97,7 +97,12 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { pending.len(), ); for (i, post) in pending.iter().enumerate() { - println!("--- post {} of {} (id: {}) ---", i + 1, pending.len(), post.id); + println!( + "--- post {} of {} (id: {}) ---", + i + 1, + pending.len(), + post.id + ); println!("{}", post.text); println!(); } @@ -118,12 +123,14 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { .ok() .filter(|s| 
!s.trim().is_empty()) .unwrap_or_else(|| DEFAULT_SERVICE.to_string()); - let session = create_session(&service) - .context("Bluesky authentication failed")?; + let session = create_session(&service).context("Bluesky authentication failed")?; eprintln!( "Publisher '{}' (bluesky): authenticated as {} — posting {} record(s) to {}.", - job.name, session.handle, pending.len(), service + job.name, + session.handle, + pending.len(), + service ); // Post each pending record, appending to the ledger as we go so a @@ -330,9 +337,8 @@ fn read_ledger(path: &Path) -> Result> { /// directory) if needed. fn append_ledger(path: &Path, id: &str) -> Result<()> { if let Some(parent) = path.parent() { - fs::create_dir_all(parent).with_context(|| { - format!("Failed to create ledger directory: {}", parent.display()) - })?; + fs::create_dir_all(parent) + .with_context(|| format!("Failed to create ledger directory: {}", parent.display()))?; } let mut file = fs::OpenOptions::new() .create(true) @@ -414,7 +420,11 @@ fn create_session(service: &str) -> Result { .unwrap_or(&handle) .to_string(); - Ok(Session { access_jwt, did, handle }) + Ok(Session { + access_jwt, + did, + handle, + }) } /// Post one `app.bsky.feed.post` record via `com.atproto.repo.createRecord`. diff --git a/actions/govbot/src/filters/ak-legislation/default.rs b/actions/govbot/src/filters/ak-legislation/default.rs index 882b5efe..8ae70400 100644 --- a/actions/govbot/src/filters/ak-legislation/default.rs +++ b/actions/govbot/src/filters/ak-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ak-legislation (Alaska): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ak --limit=100` +// `just govbot source --repos=ak --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ak --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ak --limit=100 --filter=default` // // Current filter removes: routine committee abbreviations, minutes, hearings, referrals, and filing actions // ====================================== diff --git a/actions/govbot/src/filters/al-legislation/default.rs b/actions/govbot/src/filters/al-legislation/default.rs index 4884e3db..6432c71c 100644 --- a/actions/govbot/src/filters/al-legislation/default.rs +++ b/actions/govbot/src/filters/al-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=al --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=al --limit=100` to see recent log entries // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=al --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=al --limit=100 --filter=default` // // Current filter removes: routine filing, first reading/referral, and pending committee status updates // ====================================== diff --git a/actions/govbot/src/filters/ar-legislation/default.rs b/actions/govbot/src/filters/ar-legislation/default.rs index 99fcc315..fc77eef5 100644 --- a/actions/govbot/src/filters/ar-legislation/default.rs +++ b/actions/govbot/src/filters/ar-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=ar --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=ar --limit=100` to see recent log entries // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ar --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ar --limit=100 --filter=default` // // Current filter removes: routine filing, first reading/referrals, and routine procedural actions // ====================================== diff --git a/actions/govbot/src/filters/az-legislation/default.rs b/actions/govbot/src/filters/az-legislation/default.rs index 6837a4bd..db902452 100644 --- a/actions/govbot/src/filters/az-legislation/default.rs +++ b/actions/govbot/src/filters/az-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=az --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=az --limit=100` to see recent log entries // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,14 +15,14 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=az --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=az --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== // Filter for az-legislation (Arizona) // Note: Repository not found in test data - keeping default filter for now -// TODO: Analyze output from `just govbot logs --repos=az --limit=100` when data is available +// TODO: Analyze output from `just govbot source --repos=az --limit=100` when data is available use crate::filter::FilterResult; use serde_json::Value; diff --git a/actions/govbot/src/filters/ca-legislation/default.rs b/actions/govbot/src/filters/ca-legislation/default.rs index 335e7ff3..0a775e97 100644 --- a/actions/govbot/src/filters/ca-legislation/default.rs +++ b/actions/govbot/src/filters/ca-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ca-legislation (California): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ca --limit=100` +// `just govbot source --repos=ca --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ca --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ca --limit=100 --filter=default` // // Current filter removes: routine committee referrals, introductions, and routine reading actions // ====================================== diff --git a/actions/govbot/src/filters/co-legislation/default.rs b/actions/govbot/src/filters/co-legislation/default.rs index a3ad5579..449842d4 100644 --- a/actions/govbot/src/filters/co-legislation/default.rs +++ b/actions/govbot/src/filters/co-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR co-legislation (Colorado): // 1. First, gather real data by running this command: -// `just govbot logs --repos=co --limit=100` +// `just govbot source --repos=co --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=co --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=co --limit=100 --filter=default` // // Current filter removes: routine introductions and committee referrals // ====================================== diff --git a/actions/govbot/src/filters/ct-legislation/default.rs b/actions/govbot/src/filters/ct-legislation/default.rs index d4fe808f..b06ce867 100644 --- a/actions/govbot/src/filters/ct-legislation/default.rs +++ b/actions/govbot/src/filters/ct-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ct-legislation (Connecticut): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ct --limit=100` +// `just govbot source --repos=ct --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ct --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ct --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/de-legislation/default.rs b/actions/govbot/src/filters/de-legislation/default.rs index f9fa1cf5..3e991a07 100644 --- a/actions/govbot/src/filters/de-legislation/default.rs +++ b/actions/govbot/src/filters/de-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR de-legislation (Delaware): // 1. First, gather real data by running this command: -// `just govbot logs --repos=de --limit=100` +// `just govbot source --repos=de --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=de --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=de --limit=100 --filter=default` // // Current filter removes: routine introductions, committee assignments, and "Not Worked" status updates // ====================================== diff --git a/actions/govbot/src/filters/fl-legislation/default.rs b/actions/govbot/src/filters/fl-legislation/default.rs index 7006a46b..164b5ee0 100644 --- a/actions/govbot/src/filters/fl-legislation/default.rs +++ b/actions/govbot/src/filters/fl-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR fl-legislation (Florida): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=fl --limit=100` +// `just govbot source --repos=fl --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=fl --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=fl --limit=100 --filter=default` // // Current filter removes: routine filings, referrals, and committee status updates // ====================================== diff --git a/actions/govbot/src/filters/ga-legislation/default.rs b/actions/govbot/src/filters/ga-legislation/default.rs index 3ea1da95..611c3adf 100644 --- a/actions/govbot/src/filters/ga-legislation/default.rs +++ b/actions/govbot/src/filters/ga-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ga-legislation (Georgia): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ga --limit=100` +// `just govbot source --repos=ga --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ga --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ga --limit=100 --filter=default` // // Current filter removes: routine hopper entries, first readers, and routine referrals // ====================================== diff --git a/actions/govbot/src/filters/gu-legislation/default.rs b/actions/govbot/src/filters/gu-legislation/default.rs index 6b4156f3..1de32ba1 100644 --- a/actions/govbot/src/filters/gu-legislation/default.rs +++ b/actions/govbot/src/filters/gu-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR gu-legislation (Guam): // 1. First, gather real data by running this command: -// `just govbot logs --repos=gu --limit=100` +// `just govbot source --repos=gu --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=gu --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=gu --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/hi-legislation/default.rs b/actions/govbot/src/filters/hi-legislation/default.rs index ea7af414..815f8e79 100644 --- a/actions/govbot/src/filters/hi-legislation/default.rs +++ b/actions/govbot/src/filters/hi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR hi-legislation (Hawaii): // 1. First, gather real data by running this command: -// `just govbot logs --repos=hi --limit=100` +// `just govbot source --repos=hi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=hi --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=hi --limit=100 --filter=default` // // Current filter removes: routine introductions, first readings, and committee referral patterns // ====================================== diff --git a/actions/govbot/src/filters/ia-legislation/default.rs b/actions/govbot/src/filters/ia-legislation/default.rs index 4c180fcc..05729144 100644 --- a/actions/govbot/src/filters/ia-legislation/default.rs +++ b/actions/govbot/src/filters/ia-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ia-legislation (Iowa): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ia --limit=100` +// `just govbot source --repos=ia --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ia --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ia --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and subcommittee notifications // ====================================== diff --git a/actions/govbot/src/filters/id-legislation/default.rs b/actions/govbot/src/filters/id-legislation/default.rs index da05b8e2..5668ef6c 100644 --- a/actions/govbot/src/filters/id-legislation/default.rs +++ b/actions/govbot/src/filters/id-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR id-legislation (Idaho): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=id --limit=100` +// `just govbot source --repos=id --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=id --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=id --limit=100 --filter=default` // // Current filter removes: routine introductions, readings, and status updates // ====================================== diff --git a/actions/govbot/src/filters/il-legislation/default.rs b/actions/govbot/src/filters/il-legislation/default.rs index 43184df3..d1892ef4 100644 --- a/actions/govbot/src/filters/il-legislation/default.rs +++ b/actions/govbot/src/filters/il-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR il-legislation (Illinois): // 1. First, gather real data by running this command: -// `just govbot logs --repos=il --limit=100` +// `just govbot source --repos=il --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=il --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=il --limit=100 --filter=default` // // Current filter removes: routine co-sponsor additions, Rules Committee referrals, and filings // ====================================== diff --git a/actions/govbot/src/filters/in-legislation/default.rs b/actions/govbot/src/filters/in-legislation/default.rs index fc539141..8d2342cb 100644 --- a/actions/govbot/src/filters/in-legislation/default.rs +++ b/actions/govbot/src/filters/in-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR in-legislation (Indiana): // 1. First, gather real data by running this command: -// `just govbot logs --repos=in --limit=100` +// `just govbot source --repos=in --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=in --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=in --limit=100 --filter=default` // // Current filter removes: routine first readings, referrals, and authorship notifications // ====================================== diff --git a/actions/govbot/src/filters/ks-legislation/default.rs b/actions/govbot/src/filters/ks-legislation/default.rs index 955526db..9255ddbf 100644 --- a/actions/govbot/src/filters/ks-legislation/default.rs +++ b/actions/govbot/src/filters/ks-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ks-legislation (Kansas): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ks --limit=100` +// `just govbot source --repos=ks --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ks --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ks --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and hearing notifications // ====================================== diff --git a/actions/govbot/src/filters/ky-legislation/default.rs b/actions/govbot/src/filters/ky-legislation/default.rs index 3e1e83a6..6005bb88 100644 --- a/actions/govbot/src/filters/ky-legislation/default.rs +++ b/actions/govbot/src/filters/ky-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ky-legislation (Kentucky): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ky --limit=100` +// `just govbot source --repos=ky --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ky --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ky --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/la-legislation/default.rs b/actions/govbot/src/filters/la-legislation/default.rs index eac82569..7f01a69e 100644 --- a/actions/govbot/src/filters/la-legislation/default.rs +++ b/actions/govbot/src/filters/la-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR la-legislation (Louisiana): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=la --limit=100` +// `just govbot source --repos=la --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=la --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=la --limit=100 --filter=default` // // Current filter removes: routine prefiling, referrals, and reading actions // ====================================== diff --git a/actions/govbot/src/filters/ma-legislation/default.rs b/actions/govbot/src/filters/ma-legislation/default.rs index f5f5d87b..95b8b1bc 100644 --- a/actions/govbot/src/filters/ma-legislation/default.rs +++ b/actions/govbot/src/filters/ma-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ma-legislation (Massachusetts): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ma --limit=100` +// `just govbot source --repos=ma --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ma --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ma --limit=100 --filter=default` // // Current filter removes: routine hearing scheduling, concurrences, and referrals // ====================================== diff --git a/actions/govbot/src/filters/md-legislation/default.rs b/actions/govbot/src/filters/md-legislation/default.rs index 2a8465c7..ac50ac1b 100644 --- a/actions/govbot/src/filters/md-legislation/default.rs +++ b/actions/govbot/src/filters/md-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR md-legislation (Maryland): // 1. First, gather real data by running this command: -// `just govbot logs --repos=md --limit=100` +// `just govbot source --repos=md --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=md --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=md --limit=100 --filter=default` // // Current filter removes: routine pre-filings, first readings, and hearing notifications // ====================================== diff --git a/actions/govbot/src/filters/me-legislation/default.rs b/actions/govbot/src/filters/me-legislation/default.rs index 27244f4e..922cf2d4 100644 --- a/actions/govbot/src/filters/me-legislation/default.rs +++ b/actions/govbot/src/filters/me-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR me-legislation (Maine): // 1. First, gather real data by running this command: -// `just govbot logs --repos=me --limit=100` +// `just govbot source --repos=me --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=me --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=me --limit=100 --filter=default` // // Current filter removes: routine referrals, author additions, and status updates // ====================================== diff --git a/actions/govbot/src/filters/mi-legislation/default.rs b/actions/govbot/src/filters/mi-legislation/default.rs index 8c97cfd6..266bebc7 100644 --- a/actions/govbot/src/filters/mi-legislation/default.rs +++ b/actions/govbot/src/filters/mi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mi-legislation (Michigan): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mi --limit=100` +// `just govbot source --repos=mi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mi --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mi --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and first readings // ====================================== diff --git a/actions/govbot/src/filters/mn-legislation/default.rs b/actions/govbot/src/filters/mn-legislation/default.rs index bc7d5ffe..2595fec1 100644 --- a/actions/govbot/src/filters/mn-legislation/default.rs +++ b/actions/govbot/src/filters/mn-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mn-legislation (Minnesota): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=mn --limit=100` +// `just govbot source --repos=mn --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mn --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mn --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and author additions // ====================================== diff --git a/actions/govbot/src/filters/mo-legislation/default.rs b/actions/govbot/src/filters/mo-legislation/default.rs index 37bdb0d4..c67e4bce 100644 --- a/actions/govbot/src/filters/mo-legislation/default.rs +++ b/actions/govbot/src/filters/mo-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mo-legislation (Missouri): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mo --limit=100` +// `just govbot source --repos=mo --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mo --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mo --limit=100 --filter=default` // // Current filter removes: routine prefiling actions // ====================================== diff --git a/actions/govbot/src/filters/mp-legislation/default.rs b/actions/govbot/src/filters/mp-legislation/default.rs index d69f7119..44c86204 100644 --- a/actions/govbot/src/filters/mp-legislation/default.rs +++ b/actions/govbot/src/filters/mp-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mp-legislation (Northern Mariana Islands): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mp --limit=100` +// `just govbot source --repos=mp --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=mp --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mp --limit=100 --filter=default` // // Current filter removes: routine introduction and reading actions // ====================================== diff --git a/actions/govbot/src/filters/ms-legislation/default.rs b/actions/govbot/src/filters/ms-legislation/default.rs index 105a3154..1cc8e68e 100644 --- a/actions/govbot/src/filters/ms-legislation/default.rs +++ b/actions/govbot/src/filters/ms-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ms-legislation (Mississippi): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ms --limit=100` +// `just govbot source --repos=ms --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ms --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ms --limit=100 --filter=default` // // Current filter removes: routine referrals and status updates // ====================================== diff --git a/actions/govbot/src/filters/mt-legislation/default.rs b/actions/govbot/src/filters/mt-legislation/default.rs index c1e2dd58..a9b25fdf 100644 --- a/actions/govbot/src/filters/mt-legislation/default.rs +++ b/actions/govbot/src/filters/mt-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mt-legislation (Montana): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=mt --limit=100` +// `just govbot source --repos=mt --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mt --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mt --limit=100 --filter=default` // // Current filter removes: routine hearings, scheduling, and draft status updates // ====================================== diff --git a/actions/govbot/src/filters/nc-legislation/default.rs b/actions/govbot/src/filters/nc-legislation/default.rs index d7036d60..e479fb2a 100644 --- a/actions/govbot/src/filters/nc-legislation/default.rs +++ b/actions/govbot/src/filters/nc-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nc-legislation (North Carolina): // 1. First, gather real data by running this command: -// `just govbot logs --repos=nc --limit=100` +// `just govbot source --repos=nc --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=nc --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nc --limit=100 --filter=default` // // Current filter removes: routine referrals, readings, and status updates // ====================================== diff --git a/actions/govbot/src/filters/nd-legislation/default.rs b/actions/govbot/src/filters/nd-legislation/default.rs index d804b3a1..49c3b75a 100644 --- a/actions/govbot/src/filters/nd-legislation/default.rs +++ b/actions/govbot/src/filters/nd-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nd-legislation (North Dakota): // 1. First, gather real data by running this command: -// `just govbot logs --repos=nd --limit=100` +// `just govbot source --repos=nd --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=nd --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nd --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and committee hearings // ====================================== diff --git a/actions/govbot/src/filters/ne-legislation/default.rs b/actions/govbot/src/filters/ne-legislation/default.rs index 5115aeef..a00381ba 100644 --- a/actions/govbot/src/filters/ne-legislation/default.rs +++ b/actions/govbot/src/filters/ne-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ne-legislation (Nebraska): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ne --limit=100` +// `just govbot source --repos=ne --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ne --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ne --limit=100 --filter=default` // // Current filter removes: routine referrals, hearing notifications, and filing actions // ====================================== diff --git a/actions/govbot/src/filters/nh-legislation/default.rs b/actions/govbot/src/filters/nh-legislation/default.rs index d33675c1..018b7344 100644 --- a/actions/govbot/src/filters/nh-legislation/default.rs +++ b/actions/govbot/src/filters/nh-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nh-legislation (New Hampshire): // 1. First, gather real data by running this command: -// `just govbot logs --repos=nh --limit=100` +// `just govbot source --repos=nh --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=nh --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nh --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and hearing scheduling // ====================================== diff --git a/actions/govbot/src/filters/nj-legislation/default.rs b/actions/govbot/src/filters/nj-legislation/default.rs index 3948854b..76a55526 100644 --- a/actions/govbot/src/filters/nj-legislation/default.rs +++ b/actions/govbot/src/filters/nj-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nj-legislation (New Jersey): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=nj --limit=100` +// `just govbot source --repos=nj --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=nj --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nj --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and routine status updates // ====================================== diff --git a/actions/govbot/src/filters/nm-legislation/default.rs b/actions/govbot/src/filters/nm-legislation/default.rs index b723ddc6..5b716ad4 100644 --- a/actions/govbot/src/filters/nm-legislation/default.rs +++ b/actions/govbot/src/filters/nm-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nm-legislation (New Mexico): // 1. First, gather real data by running this command: -// `just govbot logs --repos=nm --limit=100` +// `just govbot source --repos=nm --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=nm --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nm --limit=100 --filter=default` // // Current filter removes: routine committee referrals and routine committee actions // ====================================== diff --git a/actions/govbot/src/filters/nv-legislation/default.rs b/actions/govbot/src/filters/nv-legislation/default.rs index 84d8f8a4..fb92075e 100644 --- a/actions/govbot/src/filters/nv-legislation/default.rs +++ b/actions/govbot/src/filters/nv-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR nv-legislation (Nevada): // 1. First, gather real data by running this command: -// `just govbot logs --repos=nv --limit=100` +// `just govbot source --repos=nv --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=nv --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=nv --limit=100 --filter=default` // // Current filter removes: routine prefiling, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/ny-legislation/default.rs b/actions/govbot/src/filters/ny-legislation/default.rs index 35b741f9..e32b8edd 100644 --- a/actions/govbot/src/filters/ny-legislation/default.rs +++ b/actions/govbot/src/filters/ny-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ny-legislation (New York): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ny --limit=100` +// `just govbot source --repos=ny --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ny --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ny --limit=100 --filter=default` // // Current filter removes: routine referrals, introductions, and status updates // ====================================== diff --git a/actions/govbot/src/filters/oh-legislation/default.rs b/actions/govbot/src/filters/oh-legislation/default.rs index 7b5a2921..886b4688 100644 --- a/actions/govbot/src/filters/oh-legislation/default.rs +++ b/actions/govbot/src/filters/oh-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR oh-legislation (Ohio): // 1. First, gather real data by running this command: -// `just govbot logs --repos=oh --limit=100` +// `just govbot source --repos=oh --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=oh --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=oh --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/ok-legislation/default.rs b/actions/govbot/src/filters/ok-legislation/default.rs index 19f1ad33..ca2b6fc4 100644 --- a/actions/govbot/src/filters/ok-legislation/default.rs +++ b/actions/govbot/src/filters/ok-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ok-legislation (Oklahoma): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=ok --limit=100` +// `just govbot source --repos=ok --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ok --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ok --limit=100 --filter=default` // // Current filter removes: routine introductions, readings, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/or-legislation/default.rs b/actions/govbot/src/filters/or-legislation/default.rs index 24599c51..fdf051b1 100644 --- a/actions/govbot/src/filters/or-legislation/default.rs +++ b/actions/govbot/src/filters/or-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR or-legislation (Oregon): // 1. First, gather real data by running this command: -// `just govbot logs --repos=or --limit=100` +// `just govbot source --repos=or --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=or --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=or --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and calendar scheduling // ====================================== diff --git a/actions/govbot/src/filters/pa-legislation/default.rs b/actions/govbot/src/filters/pa-legislation/default.rs index 44b820e5..87c7f1b2 100644 --- a/actions/govbot/src/filters/pa-legislation/default.rs +++ b/actions/govbot/src/filters/pa-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR pa-legislation (Pennsylvania): // 1. First, gather real data by running this command: -// `just govbot logs --repos=pa --limit=100` +// `just govbot source --repos=pa --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=pa --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=pa --limit=100 --filter=default` // // Current filter removes: routine referrals, readings, and status updates // ====================================== diff --git a/actions/govbot/src/filters/pr-legislation/default.rs b/actions/govbot/src/filters/pr-legislation/default.rs index 932a3d0a..badaca7d 100644 --- a/actions/govbot/src/filters/pr-legislation/default.rs +++ b/actions/govbot/src/filters/pr-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR pr-legislation (Puerto Rico): // 1. First, gather real data by running this command: -// `just govbot logs --repos=pr --limit=100` +// `just govbot source --repos=pr --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=pr --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=pr --limit=100 --filter=default` // // Current filter removes: routine referrals, introductions, and calendar scheduling // ====================================== diff --git a/actions/govbot/src/filters/ri-legislation/default.rs b/actions/govbot/src/filters/ri-legislation/default.rs index 1c84699b..dcae2fe6 100644 --- a/actions/govbot/src/filters/ri-legislation/default.rs +++ b/actions/govbot/src/filters/ri-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ri-legislation (Rhode Island): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ri --limit=100` +// `just govbot source --repos=ri --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ri --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ri --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and scheduling // ====================================== diff --git a/actions/govbot/src/filters/sc-legislation/default.rs b/actions/govbot/src/filters/sc-legislation/default.rs index e0c2c3e7..e4503522 100644 --- a/actions/govbot/src/filters/sc-legislation/default.rs +++ b/actions/govbot/src/filters/sc-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR sc-legislation (South Carolina): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=sc --limit=100` +// `just govbot source --repos=sc --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=sc --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=sc --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, filings, and readings // ====================================== diff --git a/actions/govbot/src/filters/sd-legislation/default.rs b/actions/govbot/src/filters/sd-legislation/default.rs index de9ce815..352b8732 100644 --- a/actions/govbot/src/filters/sd-legislation/default.rs +++ b/actions/govbot/src/filters/sd-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR sd-legislation (South Dakota): // 1. First, gather real data by running this command: -// `just govbot logs --repos=sd --limit=100` +// `just govbot source --repos=sd --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=sd --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=sd --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/tn-legislation/default.rs b/actions/govbot/src/filters/tn-legislation/default.rs index e57a52ea..b477ca6b 100644 --- a/actions/govbot/src/filters/tn-legislation/default.rs +++ b/actions/govbot/src/filters/tn-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR tn-legislation (Tennessee): // 1. First, gather real data by running this command: -// `just govbot logs --repos=tn --limit=100` +// `just govbot source --repos=tn --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=tn --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=tn --limit=100 --filter=default` // // Current filter removes: routine filings, introductions, referrals, and calendar scheduling // ====================================== diff --git a/actions/govbot/src/filters/tx-legislation/default.rs b/actions/govbot/src/filters/tx-legislation/default.rs index abba88b2..55d8095e 100644 --- a/actions/govbot/src/filters/tx-legislation/default.rs +++ b/actions/govbot/src/filters/tx-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR tx-legislation (Texas): // 1. First, gather real data by running this command: -// `just govbot logs --repos=tx --limit=100` +// `just govbot source --repos=tx --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=tx --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=tx --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/usa-legislation/default.rs b/actions/govbot/src/filters/usa-legislation/default.rs index 9b2c63b4..5faedcba 100644 --- a/actions/govbot/src/filters/usa-legislation/default.rs +++ b/actions/govbot/src/filters/usa-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR usa-legislation (United States): // 1. First, gather real data by running this command: -// `just govbot logs --repos=usa --limit=100` +// `just govbot source --repos=usa --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,13 +15,13 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=usa --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=usa --limit=100 --filter=default` // // Current filter: TODO - Analyze output to identify noisy patterns // ====================================== // Filter for usa-legislation (United States) -// TODO: Analyze output from `just govbot logs --limit=10` to identify noisy patterns +// TODO: Analyze output from `just govbot source --limit=10` to identify noisy patterns // and add specific filters for this locale use crate::filter::FilterResult; diff --git a/actions/govbot/src/filters/ut-legislation/default.rs b/actions/govbot/src/filters/ut-legislation/default.rs index 6aba440e..11ff6475 100644 --- a/actions/govbot/src/filters/ut-legislation/default.rs +++ b/actions/govbot/src/filters/ut-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ut-legislation (Utah): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ut --limit=100` +// `just govbot source --repos=ut --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ut --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ut --limit=100 --filter=default` // // Current filter removes: routine readings, status updates, and transfers // ====================================== diff --git a/actions/govbot/src/filters/va-legislation/default.rs b/actions/govbot/src/filters/va-legislation/default.rs index 089d2dc1..561dc239 100644 --- a/actions/govbot/src/filters/va-legislation/default.rs +++ b/actions/govbot/src/filters/va-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR va-legislation (Virginia): // 1. First, gather real data by running this command: -// `just govbot logs --repos=va --limit=100` +// `just govbot source --repos=va --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=va --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=va --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/vi-legislation/default.rs b/actions/govbot/src/filters/vi-legislation/default.rs index 2ede5768..8faf3efd 100644 --- a/actions/govbot/src/filters/vi-legislation/default.rs +++ b/actions/govbot/src/filters/vi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR vi-legislation (U.S. Virgin Islands): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=vi --limit=100` +// `just govbot source --repos=vi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=vi --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=vi --limit=100 --filter=default` // // Current filter removes: routine status updates and transfers // ====================================== diff --git a/actions/govbot/src/filters/vt-legislation/default.rs b/actions/govbot/src/filters/vt-legislation/default.rs index 78e6a091..61256700 100644 --- a/actions/govbot/src/filters/vt-legislation/default.rs +++ b/actions/govbot/src/filters/vt-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR vt-legislation (Vermont): // 1. First, gather real data by running this command: -// `just govbot logs --repos=vt --limit=100` +// `just govbot source --repos=vt --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=vt --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=vt --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and calendar scheduling // ====================================== diff --git a/actions/govbot/src/filters/wa-legislation/default.rs b/actions/govbot/src/filters/wa-legislation/default.rs index ff086464..3879a7dc 100644 --- a/actions/govbot/src/filters/wa-legislation/default.rs +++ b/actions/govbot/src/filters/wa-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wa-legislation (Washington): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wa --limit=100` +// `just govbot source --repos=wa --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=wa --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wa --limit=100 --filter=default` // // Current filter removes: routine readings, referrals, and scheduling // ====================================== diff --git a/actions/govbot/src/filters/wi-legislation/default.rs b/actions/govbot/src/filters/wi-legislation/default.rs index 6c04c657..d0915f04 100644 --- a/actions/govbot/src/filters/wi-legislation/default.rs +++ b/actions/govbot/src/filters/wi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wi-legislation (Wisconsin): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wi --limit=100` +// `just govbot source --repos=wi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wi --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=wi --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/wv-legislation/default.rs b/actions/govbot/src/filters/wv-legislation/default.rs index 996440ad..3a77365b 100644 --- a/actions/govbot/src/filters/wv-legislation/default.rs +++ b/actions/govbot/src/filters/wv-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wv-legislation (West Virginia): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wv --limit=100` +// `just govbot source --repos=wv --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wv --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wv --limit=100 --filter=default` // // Current filter removes: routine filings, introductions, referrals, and readings // ====================================== diff --git a/actions/govbot/src/filters/wy-legislation/default.rs b/actions/govbot/src/filters/wy-legislation/default.rs index 1a779daf..3ddba8d9 100644 --- a/actions/govbot/src/filters/wy-legislation/default.rs +++ b/actions/govbot/src/filters/wy-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wy-legislation (Wyoming): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=wy --limit=100` +// `just govbot source --repos=wy --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wy --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wy --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/git.rs b/actions/govbot/src/git.rs index 86ad63d6..d76f1b26 100644 --- a/actions/govbot/src/git.rs +++ b/actions/govbot/src/git.rs @@ -131,46 +131,45 @@ pub fn clone_or_pull_dataset( let mut is_reclone = false; - let outcome_action: &'static str = if cache_entry.exists() - && Repository::open(&cache_entry).is_ok() - { - // Cached already — pull deltas. 
- let repo = Repository::open(&cache_entry) - .map_err(|e| Error::Config(format!("Failed to open cached repository: {}", e)))?; - match pull_repo_internal(&repo, token, quiet) { - Ok(had_updates) => { - drop(repo); - std::thread::sleep(std::time::Duration::from_millis(50)); - if had_updates { - "pulled" - } else { - "no_updates" - } - } - Err(e) => { - let error_msg = e.to_string(); - if error_msg.contains("Failed to analyze merge") - || error_msg.contains("object not found") - { + let outcome_action: &'static str = + if cache_entry.exists() && Repository::open(&cache_entry).is_ok() { + // Cached already — pull deltas. + let repo = Repository::open(&cache_entry) + .map_err(|e| Error::Config(format!("Failed to open cached repository: {}", e)))?; + match pull_repo_internal(&repo, token, quiet) { + Ok(had_updates) => { drop(repo); - if !quiet { - eprintln!("Merge analysis failed, deleting and recloning {}...", short); + std::thread::sleep(std::time::Duration::from_millis(50)); + if had_updates { + "pulled" + } else { + "no_updates" + } + } + Err(e) => { + let error_msg = e.to_string(); + if error_msg.contains("Failed to analyze merge") + || error_msg.contains("object not found") + { + drop(repo); + if !quiet { + eprintln!("Merge analysis failed, deleting and recloning {}...", short); + } + remove_dir_all_robust(&cache_entry).map_err(|e| { + Error::Config(format!("Failed to clear corrupt cache entry: {}", e)) + })?; + is_reclone = true; + // fall through to clone + "" + } else { + drop(repo); + return Err(e); } - remove_dir_all_robust(&cache_entry).map_err(|e| { - Error::Config(format!("Failed to clear corrupt cache entry: {}", e)) - })?; - is_reclone = true; - // fall through to clone - "" - } else { - drop(repo); - return Err(e); } } - } - } else { - "" - }; + } else { + "" + }; // If the cache entry is populated and we already pulled, we are done with // the heavy step — just link and report. 
@@ -207,9 +206,9 @@ pub fn clone_or_pull_dataset( builder.branch(channel); } - builder.clone(git_url, &cache_entry).map_err(|e| { - Error::Config(format!("Failed to clone dataset {}: {}", dataset.id, e)) - })?; + builder + .clone(git_url, &cache_entry) + .map_err(|e| Error::Config(format!("Failed to clone dataset {}: {}", dataset.id, e)))?; let repo = Repository::open(&cache_entry) .map_err(|e| Error::Config(format!("Failed to open cloned repository: {}", e)))?; @@ -247,39 +246,38 @@ fn link_dataset(cache_entry: &Path, repos_dir: &Path, short_name: &str) -> Resul /// Ensure a freshly cloned repo's HEAD points at `main` or `master`. fn ensure_default_branch(repo: &Repository) -> Result<()> { - let default_branch = if repo.find_branch("main", git2::BranchType::Local).is_ok() { - "main" - } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { - "master" - } else if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - let remote_branch = repo.find_branch("origin/main", git2::BranchType::Remote)?; - let commit = remote_branch - .get() - .target() - .ok_or_else(|| Error::Config("Failed to get commit from origin/main".to_string()))?; - let commit_obj = repo.find_commit(commit)?; - repo.branch("main", &commit_obj, false)?; - "main" - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - let remote_branch = repo.find_branch("origin/master", git2::BranchType::Remote)?; - let commit = remote_branch - .get() - .target() - .ok_or_else(|| Error::Config("Failed to get commit from origin/master".to_string()))?; - let commit_obj = repo.find_commit(commit)?; - repo.branch("master", &commit_obj, false)?; - "master" - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in repository".to_string(), - )); - }; + let default_branch = + if repo.find_branch("main", git2::BranchType::Local).is_ok() { + "main" + } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { + 
"master" + } else if repo + .find_branch("origin/main", git2::BranchType::Remote) + .is_ok() + { + let remote_branch = repo.find_branch("origin/main", git2::BranchType::Remote)?; + let commit = remote_branch.get().target().ok_or_else(|| { + Error::Config("Failed to get commit from origin/main".to_string()) + })?; + let commit_obj = repo.find_commit(commit)?; + repo.branch("main", &commit_obj, false)?; + "main" + } else if repo + .find_branch("origin/master", git2::BranchType::Remote) + .is_ok() + { + let remote_branch = repo.find_branch("origin/master", git2::BranchType::Remote)?; + let commit = remote_branch.get().target().ok_or_else(|| { + Error::Config("Failed to get commit from origin/master".to_string()) + })?; + let commit_obj = repo.find_commit(commit)?; + repo.branch("master", &commit_obj, false)?; + "master" + } else { + return Err(Error::Config( + "Neither 'main' nor 'master' branch found in repository".to_string(), + )); + }; let needs_set = match repo.head() { Ok(head) => head.name() != Some(&format!("refs/heads/{}", default_branch)[..]), @@ -287,7 +285,9 @@ fn ensure_default_branch(repo: &Repository) -> Result<()> { }; if needs_set { repo.set_head(&format!("refs/heads/{}", default_branch)) - .map_err(|e| Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)))?; + .map_err(|e| { + Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)) + })?; repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) .map_err(|e| Error::Config(format!("Failed to checkout {}: {}", default_branch, e)))?; } @@ -327,10 +327,7 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re } // Fetch the current branch plus the usual defaults. 
- let branch_refspec = format!( - "refs/heads/{0}:refs/remotes/origin/{0}", - local_branch_name - ); + let branch_refspec = format!("refs/heads/{0}:refs/remotes/origin/{0}", local_branch_name); let refspecs = vec![ branch_refspec.as_str(), "refs/heads/main:refs/remotes/origin/main", @@ -641,10 +638,7 @@ pub fn delete_dataset(short_name: &str, repos_dir: &Path) -> Result<()> { if let Ok(meta) = std::fs::symlink_metadata(&target_dir) { if meta.file_type().is_symlink() { return std::fs::remove_file(&target_dir).map_err(|e| { - Error::Config(format!( - "Failed to unlink dataset {}: {}", - short_name, e - )) + Error::Config(format!("Failed to unlink dataset {}: {}", short_name, e)) }); } } diff --git a/actions/govbot/src/lock.rs b/actions/govbot/src/lock.rs index 2307a4e4..b01d2813 100644 --- a/actions/govbot/src/lock.rs +++ b/actions/govbot/src/lock.rs @@ -94,9 +94,8 @@ impl LockFile { if !path.is_file() { return Ok(LockFile::default()); } - let contents = std::fs::read_to_string(&path).map_err(|e| { - Error::Config(format!("Failed to read {}: {}", path.display(), e)) - })?; + let contents = std::fs::read_to_string(&path) + .map_err(|e| Error::Config(format!("Failed to read {}: {}", path.display(), e)))?; serde_json::from_str(&contents) .map_err(|e| Error::Config(format!("Invalid {}: {}", path.display(), e))) } @@ -130,9 +129,8 @@ impl LockFile { let path = LockFile::path_for(project_dir); let json = serde_json::to_string_pretty(self) .map_err(|e| Error::Config(format!("Failed to serialize lockfile: {}", e)))?; - std::fs::write(&path, format!("{}\n", json)).map_err(|e| { - Error::Config(format!("Failed to write {}: {}", path.display(), e)) - })?; + std::fs::write(&path, format!("{}\n", json)) + .map_err(|e| Error::Config(format!("Failed to write {}: {}", path.display(), e)))?; Ok(()) } } diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index 6c14dbde..a98b1f48 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs 
@@ -17,8 +17,7 @@ use std::process::{Command, Stdio}; /// Smart update behavior: if `/repos/` already has datasets, just /// `git pull`; otherwise clone the manifest's `datasets`. pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>) -> Result<()> { - let govbot_bin = std::env::current_exe() - .context("Failed to determine govbot binary path")?; + let govbot_bin = std::env::current_exe().context("Failed to determine govbot binary path")?; let cwd = config_path.parent().unwrap_or_else(|| Path::new(".")); @@ -32,9 +31,7 @@ pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>) -> Result<()> // Fast-fail if a transform's binary cannot be resolved. let resolved: Vec<(String, ResolvedTransform)> = transforms .iter() - .map(|(name, t)| { - resolve_transform(t).map(|r| (name.clone(), r)) - }) + .map(|(name, t)| resolve_transform(t).map(|r| (name.clone(), r))) .collect::>()?; // Resolve the repos directory the way subcommands do. @@ -162,7 +159,11 @@ fn resolve_pipeline_transforms(manifest: &Manifest) -> Result Result { let argv = t.command.argv(); - let (bin_name, rest) = argv - .split_first() - .context("transform `command` is empty")?; + let (bin_name, rest) = argv.split_first().context("transform `command` is empty")?; let bin = resolve_transform_binary(bin_name).ok_or_else(|| { anyhow::anyhow!( @@ -324,7 +323,9 @@ fn run_transform_dag( statuses.insert(name.clone(), status.success()); all_ok &= status.success(); } - let source_status = source_child.wait().context("Failed to wait for govbot source")?; + let source_status = source_child + .wait() + .context("Failed to wait for govbot source")?; all_ok &= source_status.success(); Ok(all_ok) diff --git a/actions/govbot/src/processor.rs b/actions/govbot/src/processor.rs index a4257f5c..4e46b1d0 100644 --- a/actions/govbot/src/processor.rs +++ b/actions/govbot/src/processor.rs @@ -1,10 +1,7 @@ use crate::config::Config; use crate::error::{Error, Result}; use crate::git; -use crate::types::{ - 
FileWithTimestamp, LogContent, LogEntry, Metadata, - VoteEventResult, -}; +use crate::types::{FileWithTimestamp, LogContent, LogEntry, Metadata, VoteEventResult}; use async_stream::stream; use futures::Stream; use jwalk::WalkDir; @@ -90,7 +87,10 @@ impl PipelineProcessor { for search_path in search_paths { if !search_path.exists() { - eprintln!("Warning: Expected repository directory does not exist: {}", search_path.display()); + eprintln!( + "Warning: Expected repository directory does not exist: {}", + search_path.display() + ); continue; } // A project's repo entry may be a symlink into the shared dataset @@ -213,29 +213,37 @@ impl PipelineProcessor { /// Calculate relative path from search directory fn calculate_relative_path(path: &Path, search_dir: &Path) -> Result { let search_dir_abs = search_dir.canonicalize().map_err(|_| { - Error::Path(format!("Failed to canonicalize search directory: {}", search_dir.display())) + Error::Path(format!( + "Failed to canonicalize search directory: {}", + search_dir.display() + )) })?; - - let path_abs = path.parent() - .ok_or_else(|| Error::Path(format!("Failed to get parent of path: {}", path.display())))? + + let path_abs = path + .parent() + .ok_or_else(|| { + Error::Path(format!("Failed to get parent of path: {}", path.display())) + })? 
.canonicalize() - .map_err(|_| { - Error::Path(format!("Failed to canonicalize path: {}", path.display())) - })?; + .map_err(|_| Error::Path(format!("Failed to canonicalize path: {}", path.display())))?; let relative = pathdiff::diff_paths(&path_abs, &search_dir_abs) .ok_or_else(|| Error::Path("Failed to calculate relative path".to_string()))?; // Reconstruct the full relative path including the filename - let filename = path.file_name() + let filename = path + .file_name() .ok_or_else(|| Error::Path(format!("Failed to get filename: {}", path.display())))?; - + Ok(relative.join(filename).to_string_lossy().to_string()) } /// Sort files by timestamp according to sort order /// Uses relative_path as a secondary sort key to ensure deterministic ordering - fn sort_files_internal(config: &Config, mut files: Vec) -> Vec { + fn sort_files_internal( + config: &Config, + mut files: Vec, + ) -> Vec { match config.sort_order { crate::config::SortOrder::Descending => { files.sort_by(|a, b| { @@ -280,7 +288,10 @@ impl PipelineProcessor { } /// Apply limit to files - fn apply_limit_internal(config: &Config, files: Vec) -> Vec { + fn apply_limit_internal( + config: &Config, + files: Vec, + ) -> Vec { if let Some(limit) = config.limit { files.into_iter().take(limit).collect() } else { @@ -289,7 +300,10 @@ impl PipelineProcessor { } /// Process a single file and return a log entry - async fn process_file_internal(config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_file_internal( + config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Check if it's a vote event file let is_vote_event = file.relative_path.contains(".vote_event."); @@ -301,7 +315,10 @@ impl PipelineProcessor { } /// Process a vote event file - async fn process_vote_event_file_internal(_config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_vote_event_file_internal( + _config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Extract vote event result from 
filename let vote_event_regex = Regex::new(r"\.vote_event\.([^.]+)\.")?; let result = vote_event_regex @@ -321,7 +338,10 @@ impl PipelineProcessor { } /// Process a regular (non-vote-event) file - async fn process_regular_file_internal(_config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_regular_file_internal( + _config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Read and parse JSON content let json_content = tokio::fs::read_to_string(&file.path).await?; let log_value: serde_json::Value = serde_json::from_str(&json_content)?; @@ -367,4 +387,3 @@ impl PipelineProcessor { Ok(Some(metadata)) } } - diff --git a/actions/govbot/src/registry.rs b/actions/govbot/src/registry.rs index 61a361a4..b32e3674 100644 --- a/actions/govbot/src/registry.rs +++ b/actions/govbot/src/registry.rs @@ -272,7 +272,10 @@ mod tests { #[test] fn bundled_registry_parses_and_has_seed_jurisdictions() { let reg = Registry::bundled().expect("bundled registry must parse"); - assert!(reg.datasets.len() >= 52, "expected the 52-jurisdiction seed"); + assert!( + reg.datasets.len() >= 52, + "expected the 52-jurisdiction seed" + ); assert!(reg.datasets.contains_key("us-legislation/wy")); } diff --git a/actions/govbot/tests/api_snaps.rs b/actions/govbot/tests/api_snaps.rs index 245b302e..536c98d7 100644 --- a/actions/govbot/tests/api_snaps.rs +++ b/actions/govbot/tests/api_snaps.rs @@ -1,10 +1,10 @@ -use govbot::prelude::*; use futures::StreamExt; +use govbot::prelude::*; use insta; /// Snapshot test for the pipeline processor -/// +/// /// This test processes log files and compares the output against stored snapshots. 
/// To update snapshots after making changes, run: /// cargo insta review @@ -12,7 +12,7 @@ use insta; async fn test_pipeline_processor_snapshot() { // Use the same test data directory as the example let git_dir = "tmp/git/repos"; - + // Build configuration matching the render-snapshots.sh script let config = ConfigBuilder::new(git_dir) .sort_order_str("DESC") @@ -21,7 +21,7 @@ async fn test_pipeline_processor_snapshot() { .join_options_str("bill") .unwrap() .build(); - + // Skip test if git_dir doesn't exist (e.g., in CI without test data) let config = match config { Ok(c) => c, @@ -37,7 +37,7 @@ async fn test_pipeline_processor_snapshot() { // Collect all entries from the stream let mut stream = processor.process(); let mut entries = Vec::new(); - + while let Some(result) = stream.next().await { match result { Ok(entry) => entries.push(entry), @@ -49,8 +49,8 @@ async fn test_pipeline_processor_snapshot() { } // Serialize to JSON for snapshot comparison - let json_output = serde_json::to_string_pretty(&entries) - .expect("Failed to serialize entries to JSON"); + let json_output = + serde_json::to_string_pretty(&entries).expect("Failed to serialize entries to JSON"); // Use insta's assert_snapshot! macro for string comparison // The snapshot will be stored in tests/snapshots/api_snapshot_tests__test_pipeline_processor_snapshot.snap @@ -88,4 +88,3 @@ async fn test_vote_event_processing() { // Test vote event result serialization insta::assert_json_snapshot!("vote_event_results", &results); } - From 457190910e9ba061b218b5aff6c7583f1112bd65 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 18:42:51 -0500 Subject: [PATCH 03/32] Polish CLI help, wizard, and ls behavior MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CLI help (main.rs): - Replace the legacy top-level 'about' ('Process pipeline log files with type-safe reactive streams') with a description of what govbot actually does today. 
- Rewrite every subcommand's help to reflect current behavior: drop retired
  'locales' terminology in 'delete', and expand 'load' / 'update' / 'publish' /
  'run' from one-liners into actually-useful paragraphs.
- 'source' now takes '--datasets', with visible_alias 'repos', so the flag
  name lines up with the manifest field that was renamed in the refactor.
  The old '--repos' continues to work.

Wizard (.gitignore):
- Generate a richer .gitignore that covers the publisher output dirs
  ('dist/', 'docs/') and the credential file ('.env'), not just '.govbot/'.
  A fresh project from 'govbot init' is now safer to push to GitHub without
  accidentally committing generated artifacts or credentials.
- The writer is now idempotent: re-running on an existing .gitignore only
  adds the missing entries.

ls behavior:
- 'govbot ls' help promises 'with no manifest, lists every dataset in the
  registry'. Previously it did not; now it does. Discovery in a bare
  directory works as advertised.

Cleanup (clippy):
- Drop a redundant 'use serde_json;' and a couple of small lints
  ('DESC | _' wildcard, an 'if !x.is_empty() {""} else {""}' dead branch in
  the HTML index footer, a stylistic 'return' before a cfg-gated block
  ending).

Snapshots: updated 'cli_example_snaps@govbot_help' + the three wizard
session snapshots to match. 30/30 tests still green.
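
For reviewers, the idempotent .gitignore behavior described under "Wizard"
can be sketched roughly as below. This is an illustration only, not the code
in src/wizard.rs: `merge_gitignore` and `REQUIRED_ENTRIES` are hypothetical
names, and the entry list is taken from this commit message rather than from
the implementation.

```rust
/// Entries a fresh project should ignore, per the commit message above.
/// (Assumed list; the real writer may include more or fewer entries.)
const REQUIRED_ENTRIES: [&str; 4] = [".govbot/", "dist/", "docs/", ".env"];

/// Return `existing` with any missing required entries appended.
/// Hypothetical helper: entries already present are left untouched, so
/// applying the function twice gives the same result as applying it once.
fn merge_gitignore(existing: &str, required: &[&str]) -> String {
    let mut out = existing.to_string();
    for entry in required {
        // Compare against trimmed lines so stray whitespace does not
        // cause an entry to be added a second time.
        let present = out.lines().any(|line| line.trim() == *entry);
        if !present {
            // Make sure the new entry starts on its own line.
            if !out.is_empty() && !out.ends_with('\n') {
                out.push('\n');
            }
            out.push_str(entry);
            out.push('\n');
        }
    }
    out
}
```

Keeping the merge a pure string-to-string function makes the idempotence
property easy to check in a unit test without touching the filesystem; the
caller would read the existing file (or start from an empty string) and
write the result back.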
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/cache.rs | 13 +- actions/govbot/src/config.rs | 2 +- actions/govbot/src/main.rs | 518 ++++++++++++------ actions/govbot/src/publish.rs | 21 +- actions/govbot/src/rss.rs | 2 - actions/govbot/src/wizard.rs | 72 ++- ...i_example_snaps__snapshot@govbot_help.snap | 16 +- .../wizard_tests__wizard_session_all.snap | 2 +- ...rd_tests__wizard_session_single_state.snap | 2 +- ...wizard_tests__wizard_session_specific.snap | 2 +- actions/govbot/tests/wizard_tests.rs | 19 +- 11 files changed, 446 insertions(+), 223 deletions(-) diff --git a/actions/govbot/src/cache.rs b/actions/govbot/src/cache.rs index 54a0f21d..388822f7 100644 --- a/actions/govbot/src/cache.rs +++ b/actions/govbot/src/cache.rs @@ -76,7 +76,11 @@ pub fn cache_key(short_name: &str, git_url: &str, channel: Option<&str>) -> Stri hasher.update(b"@"); hasher.update(channel.unwrap_or("").as_bytes()); let digest = hasher.finalize(); - let hex: String = digest.iter().take(6).map(|b| format!("{:02x}", b)).collect(); + let hex: String = digest + .iter() + .take(6) + .map(|b| format!("{:02x}", b)) + .collect(); let safe_name = short_name.replace('/', "__"); format!("{}-{}", safe_name, hex) } @@ -91,7 +95,10 @@ pub fn cache_path(short_name: &str, git_url: &str, channel: Option<&str>) -> Res /// Prefers a symlink (cheap, shared); falls back to recording the cache path /// when symlinks are unavailable. Idempotent — an existing correct link is a /// no-op; a stale link is replaced. 
-pub fn link_into_project(cache_entry: &std::path::Path, project_repo: &std::path::Path) -> Result<()> { +pub fn link_into_project( + cache_entry: &std::path::Path, + project_repo: &std::path::Path, +) -> Result<()> { if let Some(parent) = project_repo.parent() { std::fs::create_dir_all(parent)?; } @@ -118,7 +125,7 @@ pub fn link_into_project(cache_entry: &std::path::Path, project_repo: &std::path e )) })?; - return Ok(()); + Ok(()) } #[cfg(not(unix))] diff --git a/actions/govbot/src/config.rs b/actions/govbot/src/config.rs index ace5c09c..769c10b7 100644 --- a/actions/govbot/src/config.rs +++ b/actions/govbot/src/config.rs @@ -191,7 +191,7 @@ impl From<&str> for SortOrder { fn from(s: &str) -> Self { match s.to_uppercase().as_str() { "ASC" => SortOrder::Ascending, - "DESC" | _ => SortOrder::Descending, + _ => SortOrder::Descending, } } } diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 2f974413..147e69bb 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -1,19 +1,18 @@ use clap::{Parser, Subcommand}; +use futures::stream; +use futures::StreamExt; use govbot::git; use govbot::lock::LockFile; +use govbot::publish::{deduplicate_entries, filter_by_tags, load_manifest, sort_by_timestamp}; use govbot::registry::Registry; -use govbot::{hash_text, TagFile, TagFileMetadata, BillTagResult}; use govbot::selectors::ocd_files_select_default; -use govbot::publish::{load_manifest, filter_by_tags, deduplicate_entries, sort_by_timestamp}; -use futures::StreamExt; -use futures::stream; -use std::io::{self, Write, BufRead, BufReader}; -use std::path::PathBuf; -use serde_json; +use govbot::{hash_text, BillTagResult, TagFile, TagFileMetadata}; use jwalk::WalkDir; +use std::collections::HashMap; use std::fs; +use std::io::{self, BufRead, BufReader, Write}; +use std::path::PathBuf; use std::process::Command as ProcessCommand; -use std::collections::HashMap; /// Write a line to stdout, gracefully handling broken pipe errors /// This is 
essential for piping to tools like yq, jq, etc. @@ -41,7 +40,7 @@ fn write_json_line(line: &str) -> io::Result<()> { #[derive(Debug, Clone)] struct CloneResult { locale: String, - result: String, // emoji, or "failed" + result: String, // emoji, or "failed" position: String, // "1/37" size: Option, local_size: Option, @@ -63,10 +62,12 @@ struct DatasetPin { cache_key: String, } -/// Type-safe, functional reactive processor for pipeline log files +/// govbot — gov-data package manager and transform/publish orchestrator. #[derive(Parser, Debug)] #[command(name = "govbot")] -#[command(about = "Process pipeline log files with type-safe reactive streams")] +#[command( + about = "Government-data tool: pull dataset repositories, run transforms over them, and publish artifacts (RSS / HTML / JSON / DuckDB / Bluesky). Configured by a `govbot.yml` manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook." +)] #[command(version)] struct Args { #[command(subcommand)] @@ -75,11 +76,12 @@ struct Args { #[derive(Subcommand, Debug)] enum Command { - /// Pull dataset repositories (default: updates existing datasets) - /// Clones if the dataset repository doesn't exist, pulls if it does - /// Use "govbot pull all" to pull all datasets, or "govbot pull " for specific ones + /// Pull (clone or update) dataset repositories into the shared cache and + /// link them into the project. Use `govbot pull all` to pull every dataset, + /// `govbot pull ...` for specific ones, or `govbot pull` with no args + /// to refresh whatever's already linked into the project. Pull { - /// Dataset names to pull (e.g., usa, il, ca, or "all" for all datasets). If not specified, updates existing datasets. + /// Dataset identifiers to pull (e.g. `wy`, `il`, `us-legislation/ca`, or `all`). With no args, refreshes datasets already linked into the project. 
#[arg(num_args = 0..)] repos: Vec, @@ -104,12 +106,15 @@ enum Command { list: bool, }, - /// Stream dataset records as JSON Lines (the govbot stream-protocol source) + /// Stream dataset records as JSON Lines — the govbot stream-protocol + /// `source` stage. Pipe into a transform (`fastclass classify -`) or into + /// `govbot apply` for the persistence sink. See `schemas/STREAM_PROTOCOL.md`. Source { - /// Datasets to output (default: `all`) `--repos="il,ca"` - #[arg(long, num_args = 0..)] + /// Datasets to emit (default: every linked dataset). Accepts the same + /// identifiers as `govbot pull` (`wy`, `il`, `us-legislation/ca`). + #[arg(long = "datasets", visible_alias = "repos", num_args = 0..)] repos: Vec, - + /// Per repo limit (default: 100) options: `none` | number #[arg(long, default_value = "100")] limit: String, @@ -135,13 +140,15 @@ enum Command { /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] - govbot_dir: Option, + govbot_dir: Option, }, - /// Delete data pipeline repositories - /// Deletes local repository directories for specified locales + /// Delete locally-linked dataset clones from the project's `.govbot/repos/`. + /// Use `govbot delete all` to clear every linked dataset, or + /// `govbot delete ...` for specific ones. The shared cache at + /// `~/.govbot/cache/` is not touched — a subsequent `pull` re-links instantly. Delete { - /// Locale names to delete (e.g., usa, il, ca, or "all" for all locales). Use "all" to delete all repositories. + /// Dataset identifiers to unlink (e.g. `wy`, `il`, `us-legislation/ca`, or `all`). #[arg(num_args = 0..)] locales: Vec, @@ -158,9 +165,11 @@ enum Command { verbose: bool, }, - /// Load bill metadata into a DuckDB database file - /// Loads all metadata.json files from cloned repos into a DuckDB database for analysis. 
- /// The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) + /// Load bill metadata into a DuckDB database for SQL analysis. Walks every + /// linked dataset's `metadata.json` files, creates a `bills` table + a + /// `bills_summary` view, and writes the database into the base govbot + /// directory (default `./.govbot/govbot.duckdb`). Requires the `duckdb` CLI + /// on PATH. Load { /// Output database filename (default: govbot.duckdb). Saved in the base govbot directory. #[arg(long, default_value = "govbot.duckdb")] @@ -179,12 +188,16 @@ enum Command { threads: Option, }, - /// Update govbot to the latest nightly version - /// Downloads and installs the latest nightly build from GitHub releases + /// Update the installed govbot binary to the latest nightly build from + /// GitHub releases. Installs into `~/.govbot/bin/govbot` and prefers the + /// platform-native `.tar.gz` asset. Update, - /// Run a publisher: emit feeds/indexes/dumps from a govbot.yml publisher - /// Reads the named publishers from govbot.yml `publish:` and emits their artifacts. + /// Run one or more publishers from `govbot.yml: publish:`. A publisher + /// consumes the tagged result stream and emits artifacts: `rss`/`html`/`json` + /// write feed/index/dump files, `duckdb` loads records into a database, + /// `bluesky` posts matches to a Bluesky account (always dry-run first with + /// `--dry-run`). Publish { /// Publisher name(s) from govbot.yml `publish:` (default: every publisher) #[arg(long = "publisher", num_args = 0..)] @@ -231,9 +244,11 @@ enum Command { overwrite: bool, }, - /// Run the full govbot pipeline against the current directory's `govbot.yml`: - /// pull/update → `source --select docs | fastclass classify - | apply` → publish. - /// Equivalent to running `govbot` with no arguments. 
+ /// Run the full pipeline against the current directory's `govbot.yml`: + /// pull/update datasets → `source --select docs | fastclass classify - | apply` + /// (the classify transform) → publish every configured publisher. + /// `govbot` with no arguments is equivalent (and falls back to `init` if no + /// `govbot.yml` is present). Run { /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] @@ -283,7 +298,6 @@ enum Command { }, } - fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { // Check flag first, then environment variable, then default if let Some(govbot_dir) = govbot_dir { @@ -400,15 +414,17 @@ fn print_result(result: &CloneResult) { } else { let size_str = if let Some(ref size) = result.size { size.clone() - } else if let (Some(ref local), Some(ref final_size)) = (&result.local_size, &result.final_size) { + } else if let (Some(ref local), Some(ref final_size)) = + (&result.local_size, &result.final_size) + { format!("{} -> {}", local, final_size) } else { String::new() }; - + // result.result now contains the emoji directly (🆕, ⬇️, ✅, 🔄) let action_emoji = &result.result; - + if !size_str.is_empty() { eprintln!("{} {:<6} [{}]", action_emoji, result.locale, size_str); } else { @@ -452,7 +468,12 @@ async fn perform_clone_operations( let verbose_flag = verbose; tokio::task::spawn_blocking(move || { - let mut result = process_single_dataset(&dataset, &repos_dir, token.as_deref(), verbose_flag); + let mut result = process_single_dataset( + &dataset, + &repos_dir, + token.as_deref(), + verbose_flag, + ); let mut count = completed.lock().unwrap(); *count += 1; result.position = format!("{}/{}", *count, total); @@ -499,7 +520,10 @@ fn update_lockfile(project_dir: &std::path::Path, results: &[CloneResult]) { let mut lock = match LockFile::load_or_default(project_dir) { Ok(l) => l, Err(e) => { - eprintln!("⚠️ Could not read govbot.lock ({}); skipping pin update", e); + eprintln!( + "⚠️ Could not read govbot.lock 
({}); skipping pin update", + e + ); return; } }; @@ -525,7 +549,6 @@ fn update_lockfile(project_dir: &std::path::Path, results: &[CloneResult]) { } } - async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { let Command::Pull { repos, @@ -534,7 +557,8 @@ async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { parallel, verbose, list, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -559,7 +583,11 @@ async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { // Get parallelization setting let num_jobs = parallel - .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) + .or_else(|| { + std::env::var("GOVBOT_JOBS") + .ok() + .and_then(|s| s.parse().ok()) + }) .unwrap_or(4); // Resolve which datasets to pull. @@ -609,20 +637,23 @@ async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { if !errors.is_empty() { eprintln!("\n❌ Errors occurred: {}/{}", errors.len(), results.len()); } else if !results.is_empty() { - eprintln!("\n✅ Successfully processed all {} datasets!", results.len()); + eprintln!( + "\n✅ Successfully processed all {} datasets!", + results.len() + ); } Ok(()) } - async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let Command::Delete { locales, govbot_dir, parallel, verbose, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -644,7 +675,11 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { // Get parallelization setting let num_jobs = parallel - .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) + .or_else(|| { + std::env::var("GOVBOT_JOBS") + .ok() + .and_then(|s| s.parse().ok()) + }) .unwrap_or(4); // Parse datasets and handle "all". 
`all` expands to whatever is cloned @@ -680,7 +715,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let total = locales_to_delete.len(); let mut deleted_count = 0; let mut failed_count = 0; - + if total == 1 || num_jobs == 1 { // Sequential delete for (idx, locale) in locales_to_delete.iter().enumerate() { @@ -712,7 +747,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { use std::sync::{Arc, Mutex}; let deleted = Arc::new(Mutex::new(0usize)); let failed = Arc::new(Mutex::new(0usize)); - + let delete_futures = stream::iter(locales_to_delete.iter()) .map(|locale| { let locale = locale.clone(); @@ -721,7 +756,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let failed = failed.clone(); let total = total; let verbose_flag = verbose; - + tokio::task::spawn_blocking(move || { let repo_name = git::repo_dir_name(&locale); let target_dir = repos_dir.join(&repo_name); @@ -733,7 +768,8 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { eprintln!("[{}/{}] Deleting {}...", current, total, locale); } - let existed = target_dir.exists() || std::fs::symlink_metadata(&target_dir).is_ok(); + let existed = + target_dir.exists() || std::fs::symlink_metadata(&target_dir).is_ok(); match git::delete_dataset(&locale, &repos_dir) { Ok(_) => { if existed { @@ -755,7 +791,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { .buffer_unordered(num_jobs); let mut stream = delete_futures; - + while let Some(result) = stream.next().await { match result { Ok((locale, Ok(status))) => { @@ -771,11 +807,11 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } } } - + deleted_count = *deleted.lock().unwrap(); failed_count = *failed.lock().unwrap(); } - + // Show summary if failed_count > 0 { eprintln!("\n❌ Errors occurred: {}/{}", failed_count, total); @@ -784,11 +820,11 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } else { eprintln!("\n✅ No repositories found to 
delete."); } - + Ok(()) } -/// Collapse a fully-joined `govbot logs` entry into the +/// Collapse a fully-joined `govbot source` entry into the /// `{"id","text","kind":"docs"}` document the govbot stream protocol defines /// (`STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` consumes. /// @@ -824,10 +860,11 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { join, select, filter, - } = cmd else { + } = cmd + else { unreachable!() }; - + // Parse join options - now supports field paths like "bill.title" and special "tags" let mut join_specs: Vec<(String, Vec)> = Vec::new(); let mut join_tags = false; @@ -847,7 +884,11 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { let limit_parsed: Option = if limit.to_lowercase() == "none" { None } else { - Some(limit.parse().map_err(|e| anyhow::anyhow!("Invalid limit value '{}': {}", limit, e))?) + Some( + limit + .parse() + .map_err(|e| anyhow::anyhow!("Invalid limit value '{}': {}", limit, e))?, + ) }; // Parse comma-separated repos if provided as single string @@ -913,7 +954,7 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { // Walk the repo directory to find log files matching the pattern: // repo_name/country:{country}/state:{state}/sessions/{session_name}/logs/*.json let mut file_count = 0; - + for entry_result in WalkDir::new(&repo_path) .process_read_dir(|_depth, _path, _read_dir_state, _children| { // Optional: customize directory reading behavior @@ -933,7 +974,7 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { } let path = entry.path(); - + // Check if it's a JSON file in a logs directory if !path.is_file() { continue; @@ -946,7 +987,7 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { // Check if path matches: country:{country}/state:{state}/sessions/{session_name}/logs/*.json let path_str = path.to_string_lossy(); let repo_prefix = repo_path.to_string_lossy(); - + // Get relative path by stripping the repo 
prefix // Handle both absolute and relative paths let relative_path = if let Some(stripped) = path_str.strip_prefix(&*repo_prefix) { @@ -956,11 +997,11 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { // If prefix doesn't match, skip this file continue; }; - + // Match pattern: country:*/state:*/sessions/*/logs/*.json // Use a simple regex-like check: must have these components in order - if relative_path.starts_with("country:") - && relative_path.contains("/state:") + if relative_path.starts_with("country:") + && relative_path.contains("/state:") && relative_path.contains("/sessions/") && relative_path.contains("/logs/") && relative_path.ends_with(".json") @@ -970,12 +1011,12 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { let state_pos = relative_path.find("/state:").unwrap_or(usize::MAX); let sessions_pos = relative_path.find("/sessions/").unwrap_or(usize::MAX); let logs_pos = relative_path.find("/logs/").unwrap_or(usize::MAX); - + // Verify order: country < state < sessions < logs if country_pos < state_pos && state_pos < sessions_pos && sessions_pos < logs_pos { // Compute relative source path let source_path_str = compute_relative_source_path(&path, &git_dir); - + // Read JSON file, parse it, and build extensible output structure match fs::read_to_string(&path) { Ok(contents) => { @@ -989,19 +1030,22 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { .or_else(|| json_value.get("bill_identifier")) .and_then(|id| id.as_str()) .map(|s| s.to_string()); - + // Build output with extensible structure: // - Data keys (log, bill, etc.) 
are singular entity names matching source keys // - sources object automatically tracks all data sources let mut output = serde_json::Map::new(); - + // Add the log data with key "log" (matching sources.log) output.insert("log".to_string(), json_value); - + // Add sources with the log path let mut sources = serde_json::Map::new(); - sources.insert("log".to_string(), serde_json::Value::String(source_path_str.clone())); - + sources.insert( + "log".to_string(), + serde_json::Value::String(source_path_str.clone()), + ); + // Join additional datasets if requested for (dataset_name, field_path) in &join_specs { match dataset_name.as_str() { @@ -1013,36 +1057,59 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { Ok(p) => p, Err(_) => path.clone(), }; - - let metadata_path = canonical_log_path.parent() + + let metadata_path = canonical_log_path + .parent() .and_then(|logs_dir| { logs_dir.parent().map(|bill_dir| { bill_dir.join("metadata.json") }) }); - + if let Some(ref metadata_path) = metadata_path { if metadata_path.exists() { match fs::read_to_string(metadata_path) { Ok(metadata_contents) => { - match serde_json::from_str::(&metadata_contents) { + match serde_json::from_str::< + serde_json::Value, + >( + &metadata_contents + ) { Ok(metadata_value) => { // If field_path is specified, extract just that field // Otherwise, include the full bill data if field_path.is_empty() { // No field path specified, include full bill data - output.insert("bill".to_string(), metadata_value); + output.insert( + "bill".to_string(), + metadata_value, + ); } else { // Extract specific field(s) from bill data - if let Some(field_value) = extract_json_field(&metadata_value, field_path) { + if let Some( + field_value, + ) = + extract_json_field( + &metadata_value, + field_path, + ) + { // Use the full join path as the key (e.g., "bill.title") - let output_key = format!("{}.{}", dataset_name, field_path.join(".")); - output.insert(output_key, field_value); + let output_key = 
format!( + "{}.{}", + dataset_name, + field_path + .join(".") + ); + output.insert( + output_key, + field_value, + ); } else { eprintln!("Warning: Field path {:?} not found in metadata from {}", field_path, metadata_path.display()); } } - + // Add bill source path let bill_source_path = compute_relative_source_path(metadata_path, &git_dir); sources.insert("bill".to_string(), serde_json::Value::String(bill_source_path)); @@ -1064,38 +1131,54 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { } } _ => { - eprintln!("Warning: Unknown join dataset: {}", dataset_name); + eprintln!( + "Warning: Unknown join dataset: {}", + dataset_name + ); } } } - + // Join tags if requested if join_tags { // Extract country, state, session_id from the path - if let Some((country, state, session_id)) = extract_path_info(&source_path_str) { + if let Some((country, state, session_id)) = + extract_path_info(&source_path_str) + { // Use bill_id extracted earlier if let Some(ref bill_id) = bill_id_opt { // Look for tags in cwd/country:us/state:{state}/sessions/{session_id}/tags/ - let cwd = std::env::current_dir().unwrap_or_else(|_| PathBuf::from(".")); + let cwd = std::env::current_dir() + .unwrap_or_else(|_| PathBuf::from(".")); let tags_dir = cwd .join(&format!("country:{}", country)) .join(&format!("state:{}", state)) .join("sessions") .join(&session_id) .join("tags"); - + if tags_dir.exists() && tags_dir.is_dir() { let mut matched_tags = serde_json::Map::new(); if let Ok(entries) = fs::read_dir(&tags_dir) { for entry in entries.flatten() { let path = entry.path(); // Check for both .tag.json and .json files - if let Some(ext) = path.extension().and_then(|s| s.to_str()) { + if let Some(ext) = path + .extension() + .and_then(|s| s.to_str()) + { if ext == "json" { - if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) { + if let Some(stem) = path + .file_stem() + .and_then(|s| s.to_str()) + { // Remove .tag suffix if present (e.g., "budget.tag" -> "budget") - let 
tag_name = stem.strip_suffix(".tag").unwrap_or(stem); - match fs::read_to_string(&path) { + let tag_name = stem + .strip_suffix(".tag") + .unwrap_or(stem); + match fs::read_to_string( + &path, + ) { Ok(contents) => { if let Ok(tag_file) = serde_json::from_str::(&contents) { // Check if bill_id exists in bills map @@ -1113,24 +1196,33 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { } } if !matched_tags.is_empty() { - output.insert("tags".to_string(), serde_json::Value::Object(matched_tags)); + output.insert( + "tags".to_string(), + serde_json::Value::Object(matched_tags), + ); } } } } } - - output.insert("sources".to_string(), serde_json::Value::Object(sources)); - + + output.insert( + "sources".to_string(), + serde_json::Value::Object(sources), + ); + // Extract timestamp from sources.log path (after "logs/" and before "_") // Do this after sources is inserted so we can use the final sources.log value let timestamp = extract_timestamp_from_path(&source_path_str); if let Some(ref ts) = timestamp { - output.insert("timestamp".to_string(), serde_json::Value::String(ts.clone())); + output.insert( + "timestamp".to_string(), + serde_json::Value::String(ts.clone()), + ); } - + let mut output_value = serde_json::Value::Object(output); - + // Apply select transformation if requested. // `default` trims each entry to the familiar // title/abstracts/subject shape. 
`docs` deliberately @@ -1142,26 +1234,44 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { if select == "default" { // Select specific keys from nested objects, preserving structure let mut selected_output = serde_json::Map::new(); - + // Top: id (from log.bill_id), then log object with selected fields - if let Some(id) = output_value.get("log").and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))).and_then(|v| v.as_str()) { - selected_output.insert("id".to_string(), serde_json::Value::String(id.to_string())); + if let Some(id) = output_value + .get("log") + .and_then(|l| { + l.get("bill_id") + .or_else(|| l.get("bill_identifier")) + }) + .and_then(|v| v.as_str()) + { + selected_output.insert( + "id".to_string(), + serde_json::Value::String(id.to_string()), + ); } - + // Create log object with only action and bill_id if let Some(log) = output_value.get("log") { let mut log_obj = serde_json::Map::new(); if let Some(action) = log.get("action") { - log_obj.insert("action".to_string(), action.clone()); + log_obj + .insert("action".to_string(), action.clone()); } - if let Some(bill_id) = log.get("bill_id").or_else(|| log.get("bill_identifier")) { - log_obj.insert("bill_id".to_string(), bill_id.clone()); + if let Some(bill_id) = log + .get("bill_id") + .or_else(|| log.get("bill_identifier")) + { + log_obj + .insert("bill_id".to_string(), bill_id.clone()); } if !log_obj.is_empty() { - selected_output.insert("log".to_string(), serde_json::Value::Object(log_obj)); + selected_output.insert( + "log".to_string(), + serde_json::Value::Object(log_obj), + ); } } - + // Create bill object with only selected fields if let Some(bill) = output_value.get("bill") { let mut bill_obj = serde_json::Map::new(); @@ -1169,50 +1279,74 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { bill_obj.insert("title".to_string(), title.clone()); } if let Some(abstracts) = bill.get("abstracts") { - bill_obj.insert("abstracts".to_string(), 
abstracts.clone()); + bill_obj.insert( + "abstracts".to_string(), + abstracts.clone(), + ); } if let Some(subject) = bill.get("subject") { - bill_obj.insert("subject".to_string(), subject.clone()); + bill_obj + .insert("subject".to_string(), subject.clone()); } if let Some(identifier) = bill.get("identifier") { - bill_obj.insert("identifier".to_string(), identifier.clone()); + bill_obj.insert( + "identifier".to_string(), + identifier.clone(), + ); } if let Some(session) = bill.get("legislative_session") { - bill_obj.insert("legislative_session".to_string(), session.clone()); + bill_obj.insert( + "legislative_session".to_string(), + session.clone(), + ); } if let Some(org) = bill.get("from_organization") { - bill_obj.insert("from_organization".to_string(), org.clone()); + bill_obj.insert( + "from_organization".to_string(), + org.clone(), + ); } if !bill_obj.is_empty() { - selected_output.insert("bill".to_string(), serde_json::Value::Object(bill_obj)); + selected_output.insert( + "bill".to_string(), + serde_json::Value::Object(bill_obj), + ); } } - + // Always include tags (even if empty/null) since it's part of the default selector if let Some(tags) = output_value.get("tags") { - selected_output.insert("tags".to_string(), tags.clone()); + selected_output + .insert("tags".to_string(), tags.clone()); } else { // Include empty tags object if not present - selected_output.insert("tags".to_string(), serde_json::Value::Null); + selected_output.insert( + "tags".to_string(), + serde_json::Value::Null, + ); } - + // Bottom: sources, timestamp if let Some(sources) = output_value.get("sources") { - selected_output.insert("sources".to_string(), sources.clone()); + selected_output + .insert("sources".to_string(), sources.clone()); } if let Some(timestamp) = output_value.get("timestamp") { - selected_output.insert("timestamp".to_string(), timestamp.clone()); + selected_output + .insert("timestamp".to_string(), timestamp.clone()); } - + output_value = 
serde_json::Value::Object(selected_output); } - + // Apply filter - let should_output = match filter_manager.should_keep(&output_value, &repo_name) { + let should_output = match filter_manager + .should_keep(&output_value, &repo_name) + { govbot::FilterResult::Keep => true, govbot::FilterResult::FilterOut => false, }; - + if should_output { // `docs` mode: collapse the surviving entry to the // {id,text} document shape fastclass consumes. @@ -1223,7 +1357,7 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { }; // Deep prune empty/null values before serialization let pruned_value = deep_prune_json(output_value); - + // Serialize as compact JSON (single line) match serde_json::to_string(&pruned_value) { Ok(json_line) => { @@ -1233,7 +1367,11 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { } } Err(e) => { - eprintln!("Error serializing JSON from {}: {}", path.display(), e); + eprintln!( + "Error serializing JSON from {}: {}", + path.display(), + e + ); } } } @@ -1255,28 +1393,30 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { Ok(()) } - /// Parse a join string like "bill.title" into (dataset_name, field_path) fn parse_join_string(join_str: &str) -> Option<(String, Vec)> { let parts: Vec<&str> = join_str.split('.').collect(); if parts.is_empty() { return None; } - + let dataset_name = parts[0].to_string(); let field_path = if parts.len() > 1 { parts[1..].iter().map(|s| s.to_string()).collect() } else { Vec::new() }; - + Some((dataset_name, field_path)) } /// Extract a value from JSON using a field path (e.g., ["title"] or ["bill", "title"]) -fn extract_json_field(value: &serde_json::Value, field_path: &[String]) -> Option { +fn extract_json_field( + value: &serde_json::Value, + field_path: &[String], +) -> Option { let mut current = value; - + for field in field_path { match current { serde_json::Value::Object(map) => { @@ -1292,7 +1432,7 @@ fn extract_json_field(value: &serde_json::Value, field_path: 
&[String]) -> Optio _ => return None, } } - + Some(current.clone()) } @@ -1375,7 +1515,9 @@ fn compute_relative_source_path(file_path: &PathBuf, git_dir: &PathBuf) -> Strin } // Fallback: canonicalize both ends and diff. - let canonical_file = file_path.canonicalize().unwrap_or_else(|_| file_path.clone()); + let canonical_file = file_path + .canonicalize() + .unwrap_or_else(|_| file_path.clone()); let canonical_git_dir = git_dir.canonicalize().unwrap_or_else(|_| git_dir.clone()); match pathdiff::diff_paths(&canonical_file, &canonical_git_dir) { Some(rel_path) => rel_path.to_string_lossy().replace('\\', "/"), @@ -1391,7 +1533,8 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { govbot_dir, memory_limit, threads, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -1399,23 +1542,25 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { // Check if directory exists if !repos_dir.exists() { - eprintln!("Error: Govbot repos directory not found: {}", repos_dir.display()); + eprintln!( + "Error: Govbot repos directory not found: {}", + repos_dir.display() + ); eprintln!("Run 'govbot pull all' first to pull datasets."); return Ok(()); } // Get base govbot directory (parent of repos) // e.g., if repos_dir is ./.govbot/repos, base_dir is ./.govbot - let base_govbot_dir = repos_dir.parent() + let base_govbot_dir = repos_dir + .parent() .ok_or_else(|| anyhow::anyhow!("Could not determine base govbot directory"))?; - + // Ensure base directory exists std::fs::create_dir_all(base_govbot_dir)?; // Check if duckdb is available - let duckdb_check = ProcessCommand::new("duckdb") - .arg("--version") - .output(); + let duckdb_check = ProcessCommand::new("duckdb").arg("--version").output(); if duckdb_check.is_err() { eprintln!("Error: 'duckdb' command not found."); @@ -1425,7 +1570,8 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { // Database file goes in the base govbot directory // Resolve to absolute path to ensure it's created in 
the right location - let db_path = base_govbot_dir.canonicalize() + let db_path = base_govbot_dir + .canonicalize() .unwrap_or_else(|_| base_govbot_dir.to_path_buf()) .join(&database); let db_path_str = db_path.to_string_lossy().to_string(); @@ -1468,7 +1614,10 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { sql_script.push_str("SELECT \n"); sql_script.push_str(" *,\n"); sql_script.push_str(" filename as source_file\n"); - sql_script.push_str(&format!("FROM read_json_auto('{}/**/bills/*/metadata.json', \n", repos_dir_str)); + sql_script.push_str(&format!( + "FROM read_json_auto('{}/**/bills/*/metadata.json', \n", + repos_dir_str + )); sql_script.push_str(" filename=true, \n"); sql_script.push_str(" union_by_name=true);\n"); sql_script.push_str("\n"); @@ -1500,7 +1649,7 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { duckdb_cmd.stderr(std::process::Stdio::piped()); let mut child = duckdb_cmd.spawn()?; - + // Write SQL to stdin if let Some(mut stdin) = child.stdin.take() { stdin.write_all(sql_script.as_bytes())?; @@ -1539,19 +1688,25 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { fn extract_path_info(path: &str) -> Option<(String, String, String)> { // Find country: pattern let country_start = path.find("country:")?; - let country_end = path[country_start + 8..].find('/').unwrap_or(path.len() - country_start - 8); + let country_end = path[country_start + 8..] + .find('/') + .unwrap_or(path.len() - country_start - 8); let country = path[country_start + 8..country_start + 8 + country_end].to_string(); - + // Find state: pattern let state_start = path.find("/state:")?; - let state_end = path[state_start + 7..].find('/').unwrap_or(path.len() - state_start - 7); + let state_end = path[state_start + 7..] 
+ .find('/') + .unwrap_or(path.len() - state_start - 7); let state = path[state_start + 7..state_start + 7 + state_end].to_string(); - + // Find sessions/ pattern let sessions_start = path.find("/sessions/")?; - let session_end = path[sessions_start + 10..].find('/').unwrap_or(path.len() - sessions_start - 10); + let session_end = path[sessions_start + 10..] + .find('/') + .unwrap_or(path.len() - sessions_start - 10); let session_id = path[sessions_start + 10..sessions_start + 10 + session_end].to_string(); - + Some((country, state, session_id)) } @@ -1789,7 +1944,8 @@ async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { output_file, dry_run, govbot_dir, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -1963,28 +2119,34 @@ async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { async fn run_update_command() -> anyhow::Result<()> { let install_script_url = "https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh"; - + eprintln!("🔄 Updating govbot to latest nightly version..."); - eprintln!("Downloading and running install script from: {}", install_script_url); - + eprintln!( + "Downloading and running install script from: {}", + install_script_url + ); + // Execute the install script by piping curl directly to sh // This avoids issues with shebang lines being interpreted as commands let mut cmd = ProcessCommand::new("sh"); cmd.arg("-c"); cmd.arg(&format!("curl -fsSL {} | sh", install_script_url)); - + // Inherit stdin/stdout/stderr so the install script can interact with the user cmd.stdin(std::process::Stdio::inherit()); cmd.stdout(std::process::Stdio::inherit()); cmd.stderr(std::process::Stdio::inherit()); - + let status = cmd.status()?; - + if status.success() { eprintln!("\n✅ Update completed successfully!"); eprintln!("You may need to restart your terminal or run 'source ~/.zshrc' (or your shell profile) to use the updated version."); } else { - return Err(anyhow::anyhow!("Update 
failed with exit code: {}", status.code().unwrap_or(-1))); + return Err(anyhow::anyhow!( + "Update failed with exit code: {}", + status.code().unwrap_or(-1) + )); } Ok(()) @@ -2021,9 +2183,7 @@ fn run_add_command(cmd: Command) -> anyhow::Result<()> { to_add.push("all".to_string()); continue; } - let resolved = registry - .resolve(id) - .map_err(|e| anyhow::anyhow!("{}", e))?; + let resolved = registry.resolve(id).map_err(|e| anyhow::anyhow!("{}", e))?; // Add the identifier the user typed (keeps `wy` short and familiar); // resolution proved it valid. let _ = resolved; @@ -2064,7 +2224,10 @@ fn run_add_command(cmd: Command) -> anyhow::Result<()> { for id in &added { eprintln!(" + added {}", id); } - eprintln!("✅ Updated {}. Run `govbot pull` to fetch.", manifest_path.display()); + eprintln!( + "✅ Updated {}. Run `govbot pull` to fetch.", + manifest_path.display() + ); Ok(()) } @@ -2165,6 +2328,17 @@ fn run_ls_command(cmd: Command) -> anyhow::Result<()> { println!(" {}", d); } } + + // With no project manifest, list the registry — the help promises this so + // `govbot ls` in a bare directory is genuinely useful for discovery. + if manifest_datasets.is_empty() { + println!(); + println!("Registry ({} dataset(s) — run `govbot search` to filter):", registry.datasets.len()); + for d in registry.all() { + let name = d.entry.name.as_deref().unwrap_or(""); + println!(" {:<28} {}", d.id, name); + } + } Ok(()) } @@ -2211,27 +2385,13 @@ async fn main() -> anyhow::Result<()> { let args = Args::parse(); match args.command { - Some(cmd @ Command::Pull { .. }) => { - run_pull_command(cmd).await - } - Some(cmd @ Command::Delete { .. }) => { - run_delete_command(cmd).await - } - Some(cmd @ Command::Source { .. }) => { - run_source_command(cmd).await - } - Some(cmd @ Command::Load { .. }) => { - run_load_command(cmd).await - } - Some(Command::Update) => { - run_update_command().await - } - Some(cmd @ Command::Apply { .. 
}) => { - run_apply_command(cmd).await - } - Some(cmd @ Command::Publish { .. }) => { - run_publish_command(cmd).await - } + Some(cmd @ Command::Pull { .. }) => run_pull_command(cmd).await, + Some(cmd @ Command::Delete { .. }) => run_delete_command(cmd).await, + Some(cmd @ Command::Source { .. }) => run_source_command(cmd).await, + Some(cmd @ Command::Load { .. }) => run_load_command(cmd).await, + Some(Command::Update) => run_update_command().await, + Some(cmd @ Command::Apply { .. }) => run_apply_command(cmd).await, + Some(cmd @ Command::Publish { .. }) => run_publish_command(cmd).await, Some(Command::Run { govbot_dir }) => { let cwd = std::env::current_dir()?; let config_path = cwd.join("govbot.yml"); diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index e02b3768..b0ade976 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -51,9 +51,7 @@ pub fn run_publisher(job: &PublishJob) -> Result<()> { ); match p.kind { - PublisherKind::Rss | PublisherKind::Html => { - emit_rss_html(job, &select, &output_dir) - } + PublisherKind::Rss | PublisherKind::Html => emit_rss_html(job, &select, &output_dir), PublisherKind::Json => emit_json(job, &output_dir), PublisherKind::Duckdb => emit_duckdb(job, &output_dir), PublisherKind::Bluesky => crate::bluesky::run_bluesky(job, job.dry_run), @@ -94,7 +92,11 @@ fn emit_rss_html(job: &PublishJob, select: &[String], output_dir: &Path) -> Resu } else { format!( "{} Legislation", - select.iter().map(|t| titlecase_tag(t)).collect::>().join(" & ") + select + .iter() + .map(|t| titlecase_tag(t)) + .collect::>() + .join(" & ") ) } }); @@ -108,7 +110,11 @@ fn emit_rss_html(job: &PublishJob, select: &[String], output_dir: &Path) -> Resu } else { format!( "Legislative updates tagged {}", - select.iter().map(|t| titlecase_tag(t)).collect::>().join(", ") + select + .iter() + .map(|t| titlecase_tag(t)) + .collect::>() + .join(", ") ) } }); @@ -129,7 +135,10 @@ fn emit_rss_html(job: &PublishJob, 
select: &[String], output_dir: &Path) -> Resu fs::write(&rss_path, rss_xml)?; eprintln!("✓ Generated RSS feed: {}", rss_path.display()); - eprintln!("Generating HTML index with {} entries...", job.entries.len()); + eprintln!( + "Generating HTML index with {} entries...", + job.entries.len() + ); // Only pass an explicit (configured) title to the HTML header. let html_title = p.title.as_deref().filter(|s| !s.trim().is_empty()); let html = rss::json_to_html(job.entries.clone(), html_title, feed_link, Some(feed_link)); diff --git a/actions/govbot/src/rss.rs b/actions/govbot/src/rss.rs index fdf55d1e..524e542c 100644 --- a/actions/govbot/src/rss.rs +++ b/actions/govbot/src/rss.rs @@ -559,7 +559,6 @@ pub fn json_to_html(
- {}
"#, @@ -572,7 +571,6 @@ pub fn json_to_html( .and_then(|t| t.as_str()) .unwrap_or(""), date_html, - if !date_html.is_empty() { "" } else { "" } )); } diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index b5db102a..959a6500 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -59,11 +59,14 @@ impl WizardSession { // Step 3: Publishing display.push_str("Publishing is configured for an RSS feed by default.\n"); display.push_str("Your feed will be generated in the \"docs\" directory.\n\n"); - display.push_str(&format!("? Base URL for your feed: {}\n\n", choices.base_url)); + display.push_str(&format!( + "? Base URL for your feed: {}\n\n", + choices.base_url + )); // Summary display.push_str(" ✓ Created govbot.yml\n"); - display.push_str(" ✓ Created .gitignore with .govbot\n"); + display.push_str(" ✓ Created .gitignore\n"); display.push_str(" ✓ Created .github/workflows/build.yml\n\n"); display.push_str("Setup complete! Run 'govbot' again to start the pipeline.\n"); @@ -181,7 +184,11 @@ fn prompt_sources() -> Result> { // List the registry's datasets so the user can pick from them. let cwd = std::env::current_dir().unwrap_or_else(|_| std::path::PathBuf::from(".")); if let Ok(registry) = crate::registry::Registry::load(&cwd) { - let ids: Vec = registry.all().iter().map(|d| d.short_name().to_string()).collect(); + let ids: Vec = registry + .all() + .iter() + .map(|d| d.short_name().to_string()) + .collect(); eprintln!(); eprintln!("Available datasets ({}):", ids.len()); for chunk in ids.chunks(10) { @@ -273,26 +280,57 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { yml } -/// Write .gitignore with .govbot entry +/// Write .gitignore with govbot's generated dirs and secret-bearing files. +/// +/// Everything under `.govbot/` (cloned datasets, ledgers, lockfile state), +/// every publisher output dir (`dist/`, `docs/`), and any local `.env` is +/// untracked. 
The userland repo is a few dozen text files plus tool artifacts; +/// the artifacts never belong in git. pub fn write_gitignore(cwd: &Path) -> Result<()> { let gitignore_path = cwd.join(".gitignore"); - let gitignore_entry = ".govbot\n"; + // Single canonical block — easy to grep, easy to update. + let block = "\ +# govbot — generated, reconstructed on every run +.govbot/ +dist/ +docs/ + +# Secrets — never commit +.env +"; + + // Idempotency: only append entries that are not already present. + let existing = if gitignore_path.exists() { + fs::read_to_string(&gitignore_path)? + } else { + String::new() + }; - if gitignore_path.exists() { - let mut content = fs::read_to_string(&gitignore_path)?; - if content.contains(".govbot") { - eprintln!(" ✓ .gitignore already contains .govbot"); - } else { - if !content.ends_with('\n') { - content.push('\n'); + let mut updated = existing.clone(); + let mut added = Vec::new(); + for line in block.lines() { + let trimmed = line.trim(); + if trimmed.is_empty() || trimmed.starts_with('#') { + continue; + } + if !existing.lines().any(|l| l.trim() == trimmed) { + if !updated.is_empty() && !updated.ends_with('\n') { + updated.push('\n'); } - content.push_str(gitignore_entry); - fs::write(&gitignore_path, content)?; - eprintln!(" ✓ Updated .gitignore to include .govbot"); + updated.push_str(line); + updated.push('\n'); + added.push(trimmed.to_string()); } + } + + if existing.is_empty() { + fs::write(&gitignore_path, block)?; + eprintln!(" ✓ Created .gitignore"); + } else if !added.is_empty() { + fs::write(&gitignore_path, &updated)?; + eprintln!(" ✓ Updated .gitignore ({} entries added)", added.len()); } else { - fs::write(&gitignore_path, gitignore_entry)?; - eprintln!(" ✓ Created .gitignore with .govbot"); + eprintln!(" ✓ .gitignore already covers govbot's generated dirs"); } Ok(()) diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap 
b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index a583596c..5f75b259 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -6,19 +6,19 @@ Command: govbot --help Output: -Process pipeline log files with type-safe reactive streams +Government-data tool: pull dataset repositories, run transforms over them, and publish artifacts (RSS / HTML / JSON / DuckDB / Bluesky). Configured by a `govbot.yml` manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook. Usage: govbot [COMMAND] Commands: - pull Pull dataset repositories (default: updates existing datasets) Clones if the dataset repository doesn't exist, pulls if it does Use "govbot pull all" to pull all datasets, or "govbot pull " for specific ones - source Stream dataset records as JSON Lines (the govbot stream-protocol source) - delete Delete data pipeline repositories Deletes local repository directories for specified locales - load Load bill metadata into a DuckDB database file Loads all metadata.json files from cloned repos into a DuckDB database for analysis. The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) - update Update govbot to the latest nightly version Downloads and installs the latest nightly build from GitHub releases - publish Run a publisher: emit feeds/indexes/dumps from a govbot.yml publisher Reads the named publishers from govbot.yml `publish:` and emits their artifacts + pull Pull (clone or update) dataset repositories into the shared cache and link them into the project. Use `govbot pull all` to pull every dataset, `govbot pull ...` for specific ones, or `govbot pull` with no args to refresh whatever's already linked into the project + source Stream dataset records as JSON Lines — the govbot stream-protocol `source` stage. 
Pipe into a transform (`fastclass classify -`) or into `govbot apply` for the persistence sink. See `schemas/STREAM_PROTOCOL.md` + delete Delete locally-linked dataset clones from the project's `.govbot/repos/`. Use `govbot delete all` to clear every linked dataset, or `govbot delete ...` for specific ones. The shared cache at `~/.govbot/cache/` is not touched — a subsequent `pull` re-links instantly + load Load bill metadata into a DuckDB database for SQL analysis. Walks every linked dataset's `metadata.json` files, creates a `bills` table + a `bills_summary` view, and writes the database into the base govbot directory (default `./.govbot/govbot.duckdb`). Requires the `duckdb` CLI on PATH + update Update the installed govbot binary to the latest nightly build from GitHub releases. Installs into `~/.govbot/bin/govbot` and prefers the platform-native `.tar.gz` asset + publish Run one or more publishers from `govbot.yml: publish:`. A publisher consumes the tagged result stream and emits artifacts: `rss`/`html`/`json` write feed/index/dump files, `duckdb` loads records into a database, `bluesky` posts matches to a Bluesky account (always dry-run first with `--dry-run`) apply Persist fastclass classification results into the dataset as tag files. Reads `fastclass classify` result JSON from stdin — the apply sink of `govbot source --select docs | fastclass classify - | govbot apply` — and writes per-tag `.tag.json` files under each bill's session directory, the files `govbot publish` turns into feeds. Classification itself is done by fastclass; `govbot apply` only stores the results - run Run the full govbot pipeline against the current directory's `govbot.yml`: pull/update → `source --select docs | fastclass classify - | apply` → publish. 
Equivalent to running `govbot` with no arguments + run Run the full pipeline against the current directory's `govbot.yml`: pull/update datasets → `source --select docs | fastclass classify - | apply` (the classify transform) → publish every configured publisher. `govbot` with no arguments is equivalent (and falls back to `init` if no `govbot.yml` is present) init Scaffold a new govbot.yml in the current directory (the setup wizard). Interactive in a TTY; writes sensible defaults when non-interactive add Add one or more datasets to the project's `govbot.yml` `datasets:` list. Each id is validated against the registry before it is added remove Remove one or more datasets from the project's `govbot.yml` diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap index 3e6dbf1f..7799b348 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap @@ -21,7 +21,7 @@ Your feed will be generated in the "docs" directory. ? Base URL for your feed: https://myuser.github.io/my-govbot ✓ Created govbot.yml - ✓ Created .gitignore with .govbot + ✓ Created .gitignore ✓ Created .github/workflows/build.yml Setup complete! Run 'govbot' again to start the pipeline. diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap index 6a3b5a68..2ecec7d9 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap @@ -25,7 +25,7 @@ Your feed will be generated in the "docs" directory. ? Base URL for your feed: https://sartaj.me/govbot ✓ Created govbot.yml - ✓ Created .gitignore with .govbot + ✓ Created .gitignore ✓ Created .github/workflows/build.yml Setup complete! 
Run 'govbot' again to start the pipeline. diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap index 9230ac3b..e96ed593 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap @@ -25,7 +25,7 @@ Your feed will be generated in the "docs" directory. ? Base URL for your feed: https://activist.github.io/legislation ✓ Created govbot.yml - ✓ Created .gitignore with .govbot + ✓ Created .gitignore ✓ Created .github/workflows/build.yml Setup complete! Run 'govbot' again to start the pipeline. diff --git a/actions/govbot/tests/wizard_tests.rs b/actions/govbot/tests/wizard_tests.rs index 77f2cfcf..0ac38cdb 100644 --- a/actions/govbot/tests/wizard_tests.rs +++ b/actions/govbot/tests/wizard_tests.rs @@ -106,10 +106,16 @@ fn test_generated_yml_is_valid_manifest() { .expect("should have a classify transform"); assert_eq!(classify.reads, "docs"); assert_eq!(classify.writes, "classification"); - assert!(classify.classifier.is_some(), "classify should reference a bundle"); + assert!( + classify.classifier.is_some(), + "classify should reference a bundle" + ); // publish — the RSS feed publisher. - let feed = manifest.publish.get("feed").expect("should have a feed publisher"); + let feed = manifest + .publish + .get("feed") + .expect("should have a feed publisher"); assert_eq!( feed.base_url.as_deref(), Some("https://myuser.github.io/my-govbot") @@ -154,7 +160,9 @@ fn test_write_files_creates_govbot_yml() { let session = WizardSession::from_choices(&choices); let dir = tempfile::tempdir().unwrap(); - session.write_files(dir.path()).expect("write_files should succeed"); + session + .write_files(dir.path()) + .expect("write_files should succeed"); // Verify govbot.yml was created and parses as a Manifest. 
let config_path = dir.path().join("govbot.yml"); @@ -166,7 +174,10 @@ fn test_write_files_creates_govbot_yml() { let gitignore_path = dir.path().join(".gitignore"); assert!(gitignore_path.exists(), ".gitignore should exist"); let gitignore = std::fs::read_to_string(&gitignore_path).unwrap(); - assert!(gitignore.contains(".govbot"), ".gitignore should contain .govbot"); + assert!( + gitignore.contains(".govbot"), + ".gitignore should contain .govbot" + ); // Verify workflow was created. let workflow_path = dir.path().join(".github/workflows/build.yml"); From d79480c194bd144eba079c061d380b59928a190e Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 18:45:10 -0500 Subject: [PATCH 04/32] Docs: align README / AGENT.md / CLAUDE.md with the post-refactor stack MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit README.md: - Drop the '47 states' boilerplate (the bundled registry now lists 55 datasets — every state legislature + DC + the territories + federal Congress). - Fix the wizard description: it no longer 'guides you through creating tags' — that moved into the fastclass classifier bundle when the layers split. - Rewrite 'Other Commands' to cover the new registry-backed verbs (search/add/remove/ls) and call out 'publish --dry-run' for Bluesky. - Add a 'Classifying with fastclass' section pointing at AGENT.md and STREAM_PROTOCOL.md so the README explains the two-tool composition instead of just listing flags. - Replace the 'WIP / GitHub Actionable' note with a real description of the data catalog: coverage + GOVBOT_REGISTRY_URL override. AGENT.md: - Be honest about the fastclass install: '' was a placeholder that would have stopped a fresh session cold. Now the playbook tells the agent to ask the user for their fastclass checkout path (the public remote is still an architecture open question) and to STOP rather than scaffold a broken project if the user has no fastclass install. 
CLAUDE.md: - Fix the publish help-line which only listed RSS/HTML/JSON/DuckDB and omitted Bluesky. - Update the '~47 jurisdictions' 10x-data line to ~55 dataset repos and point at the registry as what makes 10x feasible. registry.rs: - Update the '52-jurisdiction seed' comment + test message to reflect the catalog's actual size; switch the floor assertion from '>= 52' to '>= 55' and include the actual count in the failure message. All 30 tests still pass. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 13 +++++- CLAUDE.md | 4 +- README.md | 79 +++++++++++++++++++++++----------- actions/govbot/src/main.rs | 5 ++- actions/govbot/src/registry.rs | 10 +++-- 5 files changed, 77 insertions(+), 34 deletions(-) diff --git a/AGENT.md b/AGENT.md index ad6fcbe4..55b6afb0 100644 --- a/AGENT.md +++ b/AGENT.md @@ -108,12 +108,21 @@ If `govbot` is missing, install the nightly: sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh)" ``` -If `fastclass` is missing, build it from source (it is a separate repo): +If `fastclass` is missing, build it from source. `fastclass` is a separate +repo; its public home is still being decided (architecture open question), so +ask the user where their fastclass checkout lives and adapt: ```bash -git clone && cd fastclass && just install # -> ~/.cargo/bin/fastclass +# In the user's fastclass checkout: +just install # -> ~/.cargo/bin/fastclass +# or: +cargo install --path . # same effect, without `just` ``` +If the user has no checkout yet, ask them for a path; if they have neither, the +classify stage cannot run — say so and stop here rather than scaffold a broken +project. + Ensure `~/.cargo/bin` and `~/.govbot/bin` are on `PATH`: ```bash diff --git a/CLAUDE.md b/CLAUDE.md index 10ee566a..56b44cdd 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -42,7 +42,7 @@ Use these meta-prompts to guide architectural decisions and code quality. 
### Performance & Scale -- **"What happens with 10x the data?"** - Current scale is ~47 jurisdictions. Consider: What if we add counties? Cities? Federal agencies? +- **"What happens with 10x the data?"** - Current scale is ~55 dataset repos (all US state/territory legislatures + federal). The runtime registry (`registry.json`) is what makes 10x feasible — adding counties, cities, or agencies is a data change, not a recompile. - **"Can this be parallelized?"** - State-level operations are inherently parallel. Pipelines should support concurrent execution. @@ -86,7 +86,7 @@ govbot pull wy il # Download specific states govbot source # Stream legislative activity as JSON Lines govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply govbot load # Load bill metadata into DuckDB -govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB) +govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky) govbot run # Run the full pipeline: pull -> classify -> apply -> publish ``` diff --git a/README.md b/README.md index 95a60ef5..1914c37e 100644 --- a/README.md +++ b/README.md @@ -5,10 +5,10 @@ # 🏛️ govbot -- Download the legislation of [47 states/jurisdicitions](github.com/govbot-data) in under 1 minute. -- Tag/summarize bills with with private/local models optimized to run on free Github Actions. +- Download the legislation of [50+ states/jurisdictions](https://github.com/govbot-data) in under 1 minute. +- Classify and summarize bills with private/local models — runs on free GitHub Actions. -`govbot` enables distributed data anaylsis of government updates via a friendly terminal interface. Git repos function as datasets, including the legislation of all 47 states/jurisdictions. +`govbot` is a CLI for distributed analysis of government data. Git repos function as datasets — the legislation of every US state, DC, the territories, and federal Congress. 
It composes with [`fastclass`](#classifying-with-fastclass) (the classifier) over a Unix pipe; together they pull, classify, and publish a tagged feed of legislation to RSS, HTML, JSON, DuckDB, or a Bluesky posting bot. ## 🤖 Build a newsbot with Claude Code @@ -39,51 +39,78 @@ sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/a ### 2. Set up your project ```bash -govbot +govbot init # or just `govbot` — the wizard runs when no govbot.yml is present ``` -Running `govbot` with no config file launches an interactive setup wizard that: -1. Asks what data sources you want (all 47 states or specific ones) -2. Guides you through creating tags for topics you care about -3. Creates `govbot.yml`, `.gitignore`, and a GitHub Actions workflow +Running `govbot init` (or `govbot` in an empty directory) launches an interactive setup wizard that: +1. Asks which datasets you want — all jurisdictions or a hand-picked subset (browse with `govbot search`). +2. Writes a `govbot.yml` manifest (`datasets` / `transforms` / `publish` / `pipelines`), a `.gitignore`, and a GitHub Actions workflow. + +Classification lives in a separate [`fastclass`](#classifying-with-fastclass) bundle — point `transforms.classify.classifier` at it. ### 3. Run the pipeline ```bash -govbot +govbot run # or just `govbot` — runs the pipeline when a govbot.yml is present ``` -With a `govbot.yml` in your directory, running `govbot` executes the full pipeline: -1. Clones/updates legislation repositories -2. Tags bills based on your tag definitions -3. Generates RSS feeds in the `docs/` directory +With a `govbot.yml` in your directory, `govbot run` executes the full pipeline: +1. Pulls/updates the declared dataset repositories. +2. Classifies bills against your fastclass bundle (`source --select docs | fastclass classify - | apply`). +3. Runs every publisher in `govbot.yml: publish:` — RSS / HTML / JSON / DuckDB / Bluesky. 
### Other Commands ```bash -govbot pull all # download all state legislation datasets -govbot pull il ca ny # download specific states -govbot source # stream legislative activity as JSON Lines +govbot search wyoming # search the dataset registry +govbot add wy il # add datasets to govbot.yml (validated against the registry) +govbot remove wy # remove datasets from govbot.yml +govbot ls # list the manifest's datasets + what is cached locally +govbot pull all # clone/update every dataset +govbot pull il ca ny # clone/update specific datasets +govbot source # stream dataset records as JSON Lines govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply -govbot publish # run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky) +govbot apply # persist a fastclass result stream into the dataset +govbot publish # run every configured publisher (RSS / HTML / JSON / DuckDB / Bluesky) +govbot publish --publisher bluesky --dry-run # ALWAYS dry-run Bluesky first govbot run # the full pipeline: pull -> classify -> apply -> publish govbot load # load bill metadata into DuckDB -govbot delete all # remove all downloaded data -govbot update # update govbot to latest version +govbot delete all # unlink all locally-linked datasets (the shared cache stays) +govbot update # update govbot to the latest nightly govbot --help # see all commands and options ``` -# 🏛️ Govbot Legislation Data Catalogs +## Classifying with fastclass + +govbot does not classify bills itself — it streams them to a separate +[`fastclass`](#) CLI (a token-free, deterministic text classifier) and writes +the result back. The pipe: -See the data catalogs [here](github.com/govbot-data). +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +`govbot run` wires this automatically. The classifier is a **bundle directory** +(`classifier.yml` + `fusion.yml` + `eval/`) owned by fastclass; govbot only +references its path. 
See [`AGENT.md`](AGENT.md) for the end-to-end newsbot +playbook (make / manage / update) and the [stream protocol](schemas/STREAM_PROTOCOL.md) +for the wire format. + +# 🏛️ Govbot Legislation Data Catalogs -- Nearly all state governments -- Federal +govbot pulls data from a registry of git-repo datasets. The bundled default +registry (`actions/govbot/data/registry.json`) ships every US state, DC, the +territories, and federal Congress — see [`actions/govbot/REGISTRY.md`](actions/govbot/REGISTRY.md) +for the format, and the [govbot-data org](https://github.com/govbot-data) for +the dataset repos themselves. -WIP: Ideally, these scripts should be accessible via the following ways. +Coverage today: +- Every US state legislature +- US territories (DC, PR, GU, VI, MP) +- US federal (Congress) -- CLI / Unix pipe friendliness where possible. CLI is the most portable of solutions. -- GitHub Actionable if possible +Override the registry with `GOVBOT_REGISTRY_URL=` or a project-local +`.govbot/registry.json`. ## Contribute diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 147e69bb..3656c5eb 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -2333,7 +2333,10 @@ fn run_ls_command(cmd: Command) -> anyhow::Result<()> { // `govbot ls` in a bare directory is genuinely useful for discovery. if manifest_datasets.is_empty() { println!(); - println!("Registry ({} dataset(s) — run `govbot search` to filter):", registry.datasets.len()); + println!( + "Registry ({} dataset(s) — run `govbot search` to filter):", + registry.datasets.len() + ); for d in registry.all() { let name = d.entry.name.as_deref().unwrap_or(""); println!(" {:<28} {}", d.id, name); diff --git a/actions/govbot/src/registry.rs b/actions/govbot/src/registry.rs index b32e3674..f5161e81 100644 --- a/actions/govbot/src/registry.rs +++ b/actions/govbot/src/registry.rs @@ -21,7 +21,7 @@ //! ## Where it lives / how it is fetched //! //! 
The default registry is the JSON file `data/registry.json`, **compiled into -//! the binary** via `include_str!` — so a fresh install resolves the 52 seed +//! the binary** via `include_str!` — so a fresh install resolves the seed //! jurisdictions with zero network access. A project can override it: //! 1. `GOVBOT_REGISTRY_URL` — an `http(s)://` URL or a local file path. //! 2. `/.govbot/registry.json` — a project-local registry file. @@ -272,9 +272,13 @@ mod tests { #[test] fn bundled_registry_parses_and_has_seed_jurisdictions() { let reg = Registry::bundled().expect("bundled registry must parse"); + // The seed catalog covers every US state legislature + DC + the + // territories + federal Congress. Asserting the floor (not an exact + // count) keeps the test stable when datasets are added. assert!( - reg.datasets.len() >= 52, - "expected the 52-jurisdiction seed" + reg.datasets.len() >= 55, + "expected the seed catalog (>= 55 datasets); got {}", + reg.datasets.len() ); assert!(reg.datasets.contains_key("us-legislation/wy")); } From 80617ac0a217f393e738bc7ee91d27abcbd930b7 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:22:49 -0500 Subject: [PATCH 05/32] apply: route tag files back to source dataset by default Symptom: with a project manifest at the repo root, `govbot apply` materialized `country:us/state:wy/sessions//tags/.tag.json` at the project root, not inside the dataset's own `.govbot/repos//` directory. Userland had to patch its `.gitignore` to suppress the misplaced files, and the dataset's on-disk layout no longer mirrored where the bill metadata came from. Root cause: `run_apply_command` defaulted its base output dir to the current working dir, ignoring the dataset's identity which is already carried in the result's `doc` field (`/country:.../sessions/.../bills/...`). 
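As a rough illustration of that point, the dataset identity can be read straight off a `doc` id. The sketch below is a simplified, standalone approximation, not the code in `main.rs`: `dataset_prefix` and `default_tag_root` are hypothetical names, and the `.govbot/repos/` layout is taken from the symptom described above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: everything in front of the first `country:` segment,
/// joined with `/`. Returns `None` when there is no `country:` segment or
/// nothing precedes it.
fn dataset_prefix(doc: &str) -> Option<String> {
    let segments: Vec<&str> = doc.split('/').collect();
    let idx = segments.iter().position(|s| s.starts_with("country:"))?;
    let prefix: Vec<&str> = segments[..idx]
        .iter()
        .copied()
        .filter(|s| !s.is_empty())
        .collect();
    if prefix.is_empty() {
        None
    } else {
        Some(prefix.join("/"))
    }
}

/// Hypothetical helper: the root a tag file would be written under when no
/// explicit output directory is given.
fn default_tag_root(project_dir: &Path, doc: &str) -> PathBuf {
    match dataset_prefix(doc) {
        Some(dataset) => project_dir.join(".govbot").join("repos").join(dataset),
        None => project_dir.to_path_buf(),
    }
}

fn main() {
    let doc = "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001";
    println!("{}", default_tag_root(Path::new("/proj"), doc).display());
}
```

A doc id without a prefix resolves to the project directory in this sketch, so the record is kept rather than dropped.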
Fix: `parse_doc_route` now also extracts the leading `` segment (the dataset's `short_name`); when `--output-dir` is unset `apply` routes each tag file under `/.govbot/repos//country:.../sessions/.../tags/` so the file lands alongside the bill's `metadata.json`. An explicit `--output-dir` preserves current behaviour (verbatim root). A prefix-less doc id (non-govbot source) still falls back to the project dir rather than dropping the record. Added unit tests for `parse_doc_route`. Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 114 +++++++++++++++++++++++++++++++++---- 1 file changed, 104 insertions(+), 10 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 3656c5eb..dcac4bcc 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -1738,6 +1738,10 @@ struct FastclassFusion { /// A bill's location in the dataset, parsed from a fastclass result's `doc` /// id — which `govbot source --select docs` set to the bill's directory path. struct BillRoute { + /// The dataset's `short_name` — the path segment before `country:` in + /// the doc id (e.g. `wy-legislation`). `None` if the doc id has no + /// recognisable prefix. + dataset: Option, country: String, state: String, session: String, @@ -1745,15 +1749,23 @@ struct BillRoute { } /// Parse a `doc` id of the form -/// `/country:/state:/sessions//bills/` into the +/// `/country:/state:/sessions//bills/` into the /// pieces needed to place its `.tag.json` file. Returns `None` for any id that /// is not a dataset bill path (e.g. a document from a non-govbot source). +/// +/// The leading `` segment is the dataset's `short_name` (e.g. +/// `wy-legislation`); it is what lets `govbot apply` route each tag file back +/// to `/.govbot/repos//` by default. 
fn parse_doc_route(doc: &str) -> Option<BillRoute> { let segments: Vec<&str> = doc.split('/').collect(); let (mut country, mut state, mut session, mut bill_id) = (None, None, None, None); + let mut country_idx: Option<usize> = None; for (i, seg) in segments.iter().enumerate() { if let Some(c) = seg.strip_prefix("country:") { country = Some(c.to_string()); + if country_idx.is_none() { + country_idx = Some(i); + } } else if let Some(s) = seg.strip_prefix("state:") { state = Some(s.to_string()); } else if *seg == "sessions" { @@ -1762,7 +1774,24 @@ fn parse_doc_route(doc: &str) -> Option<BillRoute> { bill_id = segments.get(i + 1).map(|s| s.to_string()); } } + // Anything sitting in front of `country:` is the dataset short_name. + // For today's `/country:/...` shape that is exactly one + // segment, but tolerate nested prefixes by joining everything before the + // `country:` segment (skipping empties from a leading `/`). + let dataset = country_idx.and_then(|i| { + let prefix: Vec<&str> = segments[..i] + .iter() + .copied() + .filter(|s| !s.is_empty()) + .collect(); + if prefix.is_empty() { + None + } else { + Some(prefix.join("/")) + } + }); Some(BillRoute { + dataset, country: country?, state: state?, session: session?, @@ -1819,12 +1848,13 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { }; let current_dir = std::env::current_dir()?; - // Tag files land under --output-dir when given, otherwise the current - // directory (which, for a govbot project, holds govbot.yml). - let base_output_dir = output_dir - .as_ref() - .map(PathBuf::from) - .unwrap_or_else(|| current_dir.clone()); + // Tag files land under --output-dir when given. When unset, each tag file + // is routed back to its source dataset under + // `/.govbot/repos//country:.../sessions/.../tags/` + // — mirroring the path the bill's `metadata.json` came from — using the + // first segment of the fastclass result's `doc` field. The explicit + // `--output-dir` override stays a verbatim root for back-compat. 
+ let explicit_output_dir = output_dir.as_ref().map(PathBuf::from); // The taxonomy now lives in a fastclass classifier bundle, not in // govbot.yml — each `.tag.json` is stamped with a stub `tag_config` @@ -1838,6 +1868,9 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { let mut skipped = 0usize; eprintln!("Reading fastclass classification results from stdin..."); + // Track the distinct output roots written to so the final summary reflects where the + // tag files actually landed. + let mut written_dirs: std::collections::BTreeSet<PathBuf> = Default::default(); for line_result in reader.lines() { let line = line_result?; let line = line.trim(); @@ -1879,6 +1912,18 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { continue; } + // Resolve where this bill's tag files land. With an explicit + // `--output-dir`, that path is the root and the dataset short_name is + // dropped (back-compat). With no override, route the file back to its + // source dataset under `/.govbot/repos//...` so the + // file lands alongside the bill's `metadata.json`. If the `doc` id + // lacks a recognisable dataset prefix (a non-govbot source), fall + // back to the project directory so the record is still persisted. 
+ let base_output_dir = match (&explicit_output_dir, &route.dataset) { + (Some(root), _) => root.clone(), + (None, Some(dataset)) => current_dir.join(".govbot").join("repos").join(dataset), + (None, None) => current_dir.clone(), + }; let tags_dir = base_output_dir .join(format!("country:{}", route.country)) .join(format!("state:{}", route.state)) @@ -1886,6 +1931,7 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { .join(&route.session) .join("tags"); fs::create_dir_all(&tags_dir)?; + written_dirs.insert(base_output_dir.clone()); for (tag_key, final_score) in matched { let tag_path = tags_dir.join(format!("{}.tag.json", tag_key)); @@ -1921,11 +1967,21 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { written += 1; } + let dirs_summary = if written_dirs.is_empty() { + explicit_output_dir + .as_ref() + .map(|d| d.display().to_string()) + .unwrap_or_else(|| current_dir.display().to_string()) + } else { + written_dirs + .iter() + .map(|p| p.display().to_string()) + .collect::<Vec<_>>() + .join(", ") + }; eprintln!( "\n✅ Persisted {} tagged bill(s) under {}; skipped {} entr(ies).", - written, - base_output_dir.display(), - skipped + written, dirs_summary, skipped ); Ok(()) } @@ -2441,3 +2497,41 @@ async fn main() -> anyhow::Result<()> { } } } + +#[cfg(test)] +mod tests { + use super::*; + + /// A typical `govbot source --select docs` id — the leading dataset + /// `short_name` is what `govbot apply` uses to route the `.tag.json` back + /// to `/.govbot/repos//...` by default. 
+ #[test] + fn parse_doc_route_extracts_dataset_prefix() { + let route = + parse_doc_route("wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001") + .expect("dataset path should parse"); + assert_eq!(route.dataset.as_deref(), Some("wy-legislation")); + assert_eq!(route.country, "us"); + assert_eq!(route.state, "wy"); + assert_eq!(route.session, "2025"); + assert_eq!(route.bill_id, "HB0001"); + } + + /// A doc id with no dataset prefix — `apply` falls back to the project + /// dir rather than dropping the record on the floor. + #[test] + fn parse_doc_route_handles_missing_dataset_prefix() { + let route = parse_doc_route("country:us/state:wy/sessions/2025/bills/HB0001") + .expect("dataset path without prefix should still parse"); + assert!(route.dataset.is_none()); + assert_eq!(route.bill_id, "HB0001"); + } + + /// A non-bill doc id (e.g. a future stream-kind) — `None` so `apply` + /// skips the record with a warning. + #[test] + fn parse_doc_route_rejects_non_bill_ids() { + assert!(parse_doc_route("just-some-other-id").is_none()); + assert!(parse_doc_route("wy-legislation/country:us").is_none()); + } +} From 877367ed9e0cafd46e6ad6705162685e9b2a5466 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:26:30 -0500 Subject: [PATCH 06/32] run: --dry-run flag; bluesky skips with WARN when creds absent MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: a first-time `govbot run` against a manifest with a `bluesky` publisher dies red — even after classify / apply / rss / html all succeeded — with "the bluesky publisher needs the BLUESKY_HANDLE environment variable". Activists setting up the bot before they have an app password see a hostile error and no way to preview what the pipeline produces. Root cause: the bluesky publisher unconditionally authenticated as soon as it had pending records, and `govbot run` had no flag to propagate `--dry-run` to the publishers it spawned. 
The existing `govbot publish --dry-run` only covered the standalone-publish flow. Fix (both, the bug report asked for either): - `govbot run --dry-run`: a new flag on the `Run` subcommand that propagates to every publisher (passes `--dry-run` through to `govbot publish`). Recommended first invocation: every publisher renders, nothing emits. - `bluesky` publisher: when `BLUESKY_HANDLE` / `BLUESKY_APP_PASSWORD` are not set and the publisher is not in dry-run, log a `WARN` and skip rather than bailing. The rest of the pipeline (rss / html / json / duckdb) keeps running. AGENT.md §1.4 and §2.3 and README.md updated to mention both behaviours. Added `creds_present_reflects_env` unit test (and an in-test ENV_LOCK mutex so the env mutation is safe under parallel tests). Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 18 +++++++- README.md | 11 ++++- actions/govbot/src/bluesky.rs | 76 +++++++++++++++++++++++++++++++++- actions/govbot/src/main.rs | 17 ++++++-- actions/govbot/src/pipeline.rs | 9 +++- 5 files changed, 123 insertions(+), 8 deletions(-) diff --git a/AGENT.md b/AGENT.md index 55b6afb0..47e4d835 100644 --- a/AGENT.md +++ b/AGENT.md @@ -371,7 +371,8 @@ elsewhere. ```bash govbot pull il ny # clone the datasets (or: govbot pull all) -govbot run # pull -> classify -> apply -> publish +govbot run --dry-run # pull -> classify -> apply -> publish (render-only) +govbot run # same, but actually emits / posts ``` `govbot pull` clones each dataset once into the shared `~/.govbot/cache/` and @@ -379,7 +380,15 @@ writes `govbot.lock` pinning the exact commit each resolved to. Commit `govbot.lock` to the project repo — it makes runs reproducible. A second `pull` (here or in any other project) reuses the cache instead of re-cloning. -For the Bluesky publisher, **always dry-run first** — see §2. +`govbot run --dry-run` propagates `--dry-run` to every publisher — the +`bluesky` publisher honours this by rendering the posts it *would* send and +touching no network and no ledger. 
Pair the dry-run with §2.3 before going +live with the Bluesky bot. + +When the Bluesky creds (`BLUESKY_HANDLE` / `BLUESKY_APP_PASSWORD`) are not +set, the `bluesky` publisher logs a `WARN` and **skips** rather than failing +the pipeline — so a first-time `govbot run` without creds still emits the +RSS / HTML feeds. --- @@ -422,6 +431,8 @@ Credentials are **never** config fields — they are env-only. ```bash govbot publish --publisher bluesky --dry-run +# or, end-to-end through the whole pipeline: +govbot run --dry-run ``` `--dry-run` renders the posts that *would* be sent and **touches no network @@ -429,6 +440,9 @@ and no ledger**. Review the rendered text with the user — check the template, the 300-char truncation, and that `min_score` is neither too loose (spam) nor too tight (silence). Adjust `post_template` / `min_score` and re-dry-run. +`govbot run --dry-run` is the recommended first invocation: it propagates +`--dry-run` to every publisher and exits clean even without Bluesky creds. + ### 2.4 Go live ```bash diff --git a/README.md b/README.md index 1914c37e..e8c47541 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,8 @@ Classification lives in a separate [`fastclass`](#classifying-with-fastclass) bu ### 3. Run the pipeline ```bash -govbot run # or just `govbot` — runs the pipeline when a govbot.yml is present +govbot run --dry-run # render-only: every publisher previews its output +govbot run # or just `govbot` — runs the pipeline when a govbot.yml is present ``` With a `govbot.yml` in your directory, `govbot run` executes the full pipeline: @@ -59,6 +60,13 @@ With a `govbot.yml` in your directory, `govbot run` executes the full pipeline: 2. Classifies bills against your fastclass bundle (`source --select docs | fastclass classify - | apply`). 3. Runs every publisher in `govbot.yml: publish:` — RSS / HTML / JSON / DuckDB / Bluesky. 
+`govbot run --dry-run` propagates `--dry-run` to every publisher — the +`bluesky` publisher renders posts to stderr/stdout and touches no network or +ledger. Without `--dry-run`, a `bluesky` publisher whose `BLUESKY_HANDLE` / +`BLUESKY_APP_PASSWORD` env vars are not set is **skipped with a `WARN`** +rather than failing the pipeline — first-time runs without creds still emit +the RSS / HTML feeds. + ### Other Commands ```bash @@ -73,6 +81,7 @@ govbot source --select docs | fastclass classify - classifier=./classifier | gov govbot apply # persist a fastclass result stream into the dataset govbot publish # run every configured publisher (RSS / HTML / JSON / DuckDB / Bluesky) govbot publish --publisher bluesky --dry-run # ALWAYS dry-run Bluesky first +govbot run --dry-run # full pipeline, every publisher dry-run (recommended first run) govbot run # the full pipeline: pull -> classify -> apply -> publish govbot load # load bill metadata into DuckDB govbot delete all # unlink all locally-linked datasets (the shared cache stays) diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index f55f6b30..38763792 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -118,7 +118,23 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { return Ok(()); } - // Authenticate — credentials are environment-only. + // Authenticate — credentials are environment-only. If they are absent, + // skip the publisher with a WARN rather than failing the whole pipeline: + // first-time activists running `govbot run` without Bluesky creds yet + // should still get their RSS / HTML feeds rather than a red error. + // Pair this with `govbot run --dry-run` to render-only without + // requiring creds at all. + if !creds_present() { + eprintln!( + "⚠️ Publisher '{}' (bluesky): BLUESKY_HANDLE / BLUESKY_APP_PASSWORD \ + not set — skipping. 
Set them (an app password from Bluesky \ + Settings → App Passwords) to go live; or use `govbot run \ + --dry-run` / `govbot publish --publisher {} --dry-run` to \ + render-only.", + job.name, job.name + ); + return Ok(()); + } let service = std::env::var("BLUESKY_SERVICE") .ok() .filter(|s| !s.trim().is_empty()) @@ -475,6 +491,21 @@ fn create_post(service: &str, session: &Session, text: &str) -> Result { .to_string()) } +/// True when both required Bluesky credential env vars are set and non-empty. +/// Used to decide whether the publisher should skip-with-WARN (missing creds) +/// or attempt the live authentication flow. +fn creds_present() -> bool { + env_nonempty("BLUESKY_HANDLE") && env_nonempty("BLUESKY_APP_PASSWORD") +} + +/// True when `key` is set to a non-empty (and non-whitespace) value. +fn env_nonempty(key: &str) -> bool { + std::env::var(key) + .ok() + .map(|v| !v.trim().is_empty()) + .unwrap_or(false) +} + /// Read a required environment variable, with an actionable error message. fn require_env(key: &str) -> Result { std::env::var(key) @@ -533,6 +564,49 @@ mod tests { assert!(!record_clears_threshold(&json!({ "tags": {} }), &[], 0.0)); } + /// When BLUESKY_HANDLE / BLUESKY_APP_PASSWORD are absent, `creds_present` + /// reports `false` — the signal that lets `run_bluesky` skip with a WARN + /// instead of failing the whole pipeline. With both set non-empty, + /// `true`. + /// + /// This test mutates process env; `cargo test` runs threads in parallel by + /// default, so it locks a mutex around the env touch. + #[test] + fn creds_present_reflects_env() { + // Serialise env mutation across the env-touching tests so parallel + // test threads can't see each other's writes mid-check. + use std::sync::Mutex; + static ENV_LOCK: Mutex<()> = Mutex::new(()); + let _g = ENV_LOCK.lock().unwrap(); + + // Snapshot original values to restore at the end. 
+ let prev_h = std::env::var("BLUESKY_HANDLE").ok(); + let prev_p = std::env::var("BLUESKY_APP_PASSWORD").ok(); + + std::env::remove_var("BLUESKY_HANDLE"); + std::env::remove_var("BLUESKY_APP_PASSWORD"); + assert!(!creds_present()); + + std::env::set_var("BLUESKY_HANDLE", "x.bsky.social"); + assert!(!creds_present()); // password still missing + + std::env::set_var("BLUESKY_APP_PASSWORD", "abcd-efgh-ijkl-mnop"); + assert!(creds_present()); + + std::env::set_var("BLUESKY_HANDLE", " "); // whitespace-only + assert!(!creds_present()); + + // Restore. + match prev_h { + Some(v) => std::env::set_var("BLUESKY_HANDLE", v), + None => std::env::remove_var("BLUESKY_HANDLE"), + } + match prev_p { + Some(v) => std::env::set_var("BLUESKY_APP_PASSWORD", v), + None => std::env::remove_var("BLUESKY_APP_PASSWORD"), + } + } + #[test] fn render_substitutes_template_placeholders() { let entry = json!({ diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index dcac4bcc..cd459f5b 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -253,6 +253,14 @@ enum Command { /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] govbot_dir: Option<String>, + + /// Render but do not emit. Propagates to every publisher — the + /// `bluesky` publisher honours this by printing the posts it would + /// send and touching no network/ledger. Recommended for first runs: + /// a missing-cred `bluesky` publisher already auto-skips with a + /// WARN, but `--dry-run` makes it explicit. + #[arg(long = "dry-run")] + dry_run: bool, }, /// Scaffold a new govbot.yml in the current directory (the setup wizard). @@ -2451,7 +2459,10 @@ async fn main() -> anyhow::Result<()> { Some(Command::Update) => run_update_command().await, Some(cmd @ Command::Apply { .. }) => run_apply_command(cmd).await, Some(cmd @ Command::Publish { .. 
}) => run_publish_command(cmd).await, - Some(Command::Run { govbot_dir }) => { + Some(Command::Run { + govbot_dir, + dry_run, + }) => { let cwd = std::env::current_dir()?; let config_path = cwd.join("govbot.yml"); if !config_path.exists() { @@ -2460,7 +2471,7 @@ async fn main() -> anyhow::Result<()> { cwd.display() ); } - govbot::pipeline::run_pipeline(&config_path, govbot_dir.as_deref()) + govbot::pipeline::run_pipeline(&config_path, govbot_dir.as_deref(), dry_run) } Some(Command::Init) => { let cwd = std::env::current_dir()?; @@ -2493,7 +2504,7 @@ async fn main() -> anyhow::Result<()> { // to start the pipeline (matches the wizard's own message). return Ok(()); } - govbot::pipeline::run_pipeline(&config_path, None) + govbot::pipeline::run_pipeline(&config_path, None, false) } } } diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index a98b1f48..18045b01 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs @@ -14,9 +14,13 @@ use std::process::{Command, Stdio}; /// `govbot apply`. /// 3. **publish** — run `govbot publish` to emit the manifest's publishers. /// +/// `dry_run` is passed through to step 3 so publishers render but do not +/// emit; the `bluesky` publisher in particular honours it by touching no +/// network and no ledger. +/// /// Smart update behavior: if `/repos/` already has datasets, just /// `git pull`; otherwise clone the manifest's `datasets`. 
-pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>) -> Result<()> { +pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>, dry_run: bool) -> Result<()> { let govbot_bin = std::env::current_exe().context("Failed to determine govbot binary path")?; let cwd = config_path.parent().unwrap_or_else(|| Path::new(".")); @@ -103,6 +107,9 @@ pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>) -> Result<()> if let Some(d) = govbot_dir { publish_cmd.arg("--govbot-dir").arg(d); } + if dry_run { + publish_cmd.arg("--dry-run"); + } let publish_status = publish_cmd .current_dir(cwd) .stdin(Stdio::inherit()) From 29b97031407b356db3d2ecb22a479a986e69b51c Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:28:25 -0500 Subject: [PATCH 07/32] run: skip cache write when project has a local dataset seed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: with `.govbot/repos/wy-legislation/` already populated (seeded locally rather than `govbot pull`-ed), `govbot run` still ran `govbot pull` which tried to clone-or-pull through the shared `~/.govbot/cache/` cache. In a sandboxed environment (read-only HOME) that surfaced as `❌ wy IO error: Operation not permitted` inside an otherwise-green pipeline. Root cause: `pipeline::run_pipeline` always shelled out to `govbot pull`, regardless of whether the project already had every declared dataset sitting in `.govbot/repos/`. Fix: before step 1, classify each manifest dataset against the project-local `repos/` directory. If every dataset has a non-empty `//` directory, the pull substep is reduced to one `📂 using local seed: ` line per dataset and the subprocess is not spawned at all. When *any* dataset is missing locally, the pull subprocess still runs as today (cache failures during a real pull stay loud — they are only silenced when the seed is what `pull` would have synced to). Added `is_local_seed_detects_populated_dir` unit test. 
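The skip decision described above can be approximated in isolation. This is a minimal sketch under stated assumptions, not the code in `pipeline.rs`: `all_locally_seeded` is a hypothetical name, and `seed_dir_name` here simply appends `-legislation`, standing in for the real `repo_dir_name` mapping.

```rust
use std::fs;
use std::path::Path;

/// True when `p` is a readable directory with at least one entry.
fn dir_has_entries(p: &Path) -> bool {
    fs::read_dir(p)
        .map(|mut it| it.next().is_some())
        .unwrap_or(false)
}

/// Assumed mapping from a manifest id to its on-disk directory name:
/// strip any `namespace/` prefix, then append `-legislation`.
fn seed_dir_name(manifest_id: &str) -> String {
    let bare = manifest_id.rsplit('/').next().unwrap_or(manifest_id);
    format!("{}-legislation", bare)
}

/// True only when the manifest declares at least one dataset and every
/// declared dataset already has a populated directory under `repos_dir`.
fn all_locally_seeded(repos_dir: &Path, datasets: &[&str]) -> bool {
    !datasets.is_empty()
        && datasets
            .iter()
            .all(|id| *id != "all" && dir_has_entries(&repos_dir.join(seed_dir_name(id))))
}

fn main() {
    // Build a throwaway repos/ directory with one seeded dataset.
    let repos = std::env::temp_dir().join(format!("govbot-seed-sketch-{}", std::process::id()));
    fs::create_dir_all(repos.join("wy-legislation")).unwrap();
    fs::write(repos.join("wy-legislation").join("data.json"), b"{}").unwrap();

    println!("wy only: {}", all_locally_seeded(&repos, &["wy"]));
    println!("wy + il: {}", all_locally_seeded(&repos, &["wy", "il"]));

    fs::remove_dir_all(&repos).unwrap();
}
```

Requiring every dataset to be present, rather than any, is what keeps a genuinely missing dataset on the normal pull path, where cache errors stay visible.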
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/pipeline.rs | 137 +++++++++++++++++++++++++++------ 1 file changed, 112 insertions(+), 25 deletions(-) diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index 18045b01..8cc16739 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs @@ -1,4 +1,5 @@ use crate::config::{Command_, Manifest, Transform}; +use crate::git::repo_dir_name; use anyhow::{Context, Result}; use std::collections::HashMap; use std::path::{Path, PathBuf}; @@ -48,40 +49,68 @@ pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>, dry_run: bool) .map(|mut d| d.next().is_some()) .unwrap_or(false); + // Classify each manifest dataset: is the project-local seed already + // populated for it? If every declared dataset has a non-empty + // project-local directory under `repos/`, the pull substep is a no-op — + // skip it (and the shared `~/.govbot/cache/` write the registry-driven + // pull would attempt) so a sandbox / read-only HOME does not error out a + // run that has all the data it needs sitting right there. + let locally_seeded: Vec<&String> = manifest + .datasets + .iter() + .filter(|name| name.as_str() != "all" && is_local_seed(&repos_dir, name)) + .collect(); + let all_locally_seeded = + !manifest.datasets.is_empty() && locally_seeded.len() == manifest.datasets.len(); + // Step 1: pull or update datasets. eprintln!(); eprintln!( "=== Step 1/3: {} datasets ===", - if has_repos { "Updating" } else { "Pulling" } + if all_locally_seeded { + "Using local seed for" + } else if has_repos { + "Updating" + } else { + "Pulling" + } ); eprintln!(); - let pull_status = { - let mut cmd = Command::new(&govbot_bin); - cmd.arg("pull"); - if !has_repos { - // Initial pull: clone the manifest's datasets. 
- for dataset in &manifest.datasets { - cmd.arg(dataset); - } - } - if let Some(d) = govbot_dir { - cmd.arg("--govbot-dir").arg(d); - } - cmd.current_dir(cwd) - .stdin(Stdio::inherit()) - .stdout(Stdio::inherit()) - .stderr(Stdio::inherit()) - .status() - }; - match pull_status { - Ok(status) if !status.success() => { - eprintln!("⚠️ Pull/update had errors (continuing anyway)"); + if all_locally_seeded { + for name in &locally_seeded { + let seed = repos_dir.join(seed_dir_name(name)); + eprintln!("📂 using local seed: {}", seed.display()); } - Err(e) => { - eprintln!("⚠️ Failed to run pull: {} (continuing anyway)", e); + // Skip the cache-touching pull subprocess entirely. + } else { + let pull_status = { + let mut cmd = Command::new(&govbot_bin); + cmd.arg("pull"); + if !has_repos { + // Initial pull: clone the manifest's datasets. + for dataset in &manifest.datasets { + cmd.arg(dataset); + } + } + if let Some(d) = govbot_dir { + cmd.arg("--govbot-dir").arg(d); + } + cmd.current_dir(cwd) + .stdin(Stdio::inherit()) + .stdout(Stdio::inherit()) + .stderr(Stdio::inherit()) + .status() + }; + match pull_status { + Ok(status) if !status.success() => { + eprintln!("⚠️ Pull/update had errors (continuing anyway)"); + } + Err(e) => { + eprintln!("⚠️ Failed to run pull: {} (continuing anyway)", e); + } + _ => {} } - _ => {} } // Step 2: run the transform DAG (source | transform... | apply). @@ -337,3 +366,61 @@ fn run_transform_dag( Ok(all_ok) } + +/// Map a manifest dataset id to the on-disk directory name under `repos/`. +/// +/// A manifest id can be a bare jurisdiction code (`wy`) — which the registry +/// resolves to a `short_name`, then `repo_dir_name` suffixes (`wy-legislation`) +/// — or it can already match the on-disk dir name. We try the suffixed form +/// first; the raw id is the fallback for the (rare) namespaced-id case. 
+fn seed_dir_name(manifest_id: &str) -> String { + // Strip a `namespace/` prefix (`us-legislation/wy` -> `wy`) so the + // suffixed form matches `wy-legislation`. + let bare = manifest_id.rsplit('/').next().unwrap_or(manifest_id); + repo_dir_name(bare) +} + +/// True when `//` (or the raw name) exists and +/// has at least one entry. The directory walks `govbot source` does for the +/// dataset will succeed iff this is the case. +fn is_local_seed(repos_dir: &Path, manifest_id: &str) -> bool { + let candidate1 = repos_dir.join(seed_dir_name(manifest_id)); + let candidate2 = repos_dir.join(manifest_id); + [candidate1, candidate2] + .into_iter() + .any(|p| dir_has_entries(&p)) +} + +/// True when `p` is a directory (or a symlink resolving to one) with at least +/// one child entry. +fn dir_has_entries(p: &Path) -> bool { + std::fs::read_dir(p) + .map(|mut it| it.next().is_some()) + .unwrap_or(false) +} + +#[cfg(test)] +mod tests { + use super::*; + + /// `govbot run` should detect a project-local dataset seed + /// (`.govbot/repos//`) and skip the cache-touching pull substep. + /// We test the detector — the substep skip itself is exercised by the + /// integration repro in the bug 3 PR description. + #[test] + fn is_local_seed_detects_populated_dir() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let repos = tmp.path(); + // Empty repos/: no seed. + assert!(!is_local_seed(repos, "wy")); + + // Create the expected dataset dir with a file inside. + let seed = repos.join("wy-legislation"); + std::fs::create_dir_all(&seed).unwrap(); + std::fs::write(seed.join("data.json"), b"{}").unwrap(); + assert!(is_local_seed(repos, "wy")); + + // Namespaced id — still finds the suffixed dir. 
+ assert!(is_local_seed(repos, "us-legislation/wy")); + } +} From c989129e351c04e14bfa63cdd7932537bc94ebbb Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:30:09 -0500 Subject: [PATCH 08/32] bluesky: render {link} via publisher base_url, fall back to bill source MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: with the userland manifest's `post_template: "{title}\n\n{tags} · {link}"`, a dry-run rendered `clean_energy ·` with nothing after the bullet — the `{link}` placeholder was always empty. Root cause: `bluesky::render_post` called `rss::extract_link(entry, None)` — `extract_link`'s `base_url` argument is what prefixes the record's source-relative path, so without it the rss/html path was never assembled and the empty string fell back through. Fix: thread the publisher's `base_url` field — same shape as the rss/html publishers — into `render_post`, then into `extract_link`. When unset, `extract_link`'s existing fallback to `bill.sources[0].url` kicks in so manifests without a base_url still get a sensible link (or, last-resort, an empty string). `schemas/govbot.schema.json`: extended the `base_url` description to mention the bluesky `{link}` placeholder semantics. (The field was already declared on every publisher; this is documentation.) AGENT.md §1.3 manifest template and §2.2 bluesky publisher table updated. **Userland note:** the next AGENT.md make-flow should add `base_url:` to the bluesky publisher in climate-activist-gov-news-bot's `govbot.yml` (e.g. the same URL as the rss/html publishers'). Without it, `{link}` will still resolve via `bill.sources[0].url` when available, so it is not strictly required. Added `render_link_uses_publisher_base_url` and `render_link_falls_back_to_bill_source_url` unit tests. 
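The resolution order for `{link}` can be sketched as follows. This is an illustration under assumptions rather than the body of `rss::extract_link`: `Record` and `resolve_link` are hypothetical, the record is reduced to the two fields the link needs, and the single-slash join is inferred from the expected URL in the new unit test.

```rust
/// Hypothetical, simplified record: only the fields the link needs.
struct Record<'a> {
    /// Source-relative path of the bill, if the record carries one.
    source_path: Option<&'a str>,
    /// Stands in for `bill.sources[0].url`.
    source_url: Option<&'a str>,
}

/// Resolve `{link}`: prefer `base_url` joined to the source path, then the
/// bill's own source URL, and finally an empty string.
fn resolve_link(rec: &Record, base_url: Option<&str>) -> String {
    if let (Some(base), Some(path)) = (base_url, rec.source_path) {
        // Join with exactly one slash regardless of how the inputs end/start.
        return format!(
            "{}/{}",
            base.trim_end_matches('/'),
            path.trim_start_matches('/')
        );
    }
    rec.source_url.unwrap_or("").to_string()
}

fn main() {
    let rec = Record {
        source_path: Some("wy-legislation/HB0001/metadata.json"),
        source_url: Some("https://wyoleg.gov/2025/Bills/HB0001"),
    };
    println!("{}", resolve_link(&rec, Some("https://example.org/climate-tracker/")));
    println!("{}", resolve_link(&rec, None));
}
```

In this sketch a configured `base_url` with no source path also falls through to the source URL; whether the real helper behaves that way is not stated in this patch.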
Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 6 ++++ actions/govbot/src/bluesky.rs | 62 ++++++++++++++++++++++++++++++++--- schemas/govbot.schema.json | 2 +- 3 files changed, 65 insertions(+), 5 deletions(-) diff --git a/AGENT.md b/AGENT.md index 47e4d835..a3343b01 100644 --- a/AGENT.md +++ b/AGENT.md @@ -177,6 +177,11 @@ publish: type: bluesky select: [transit_funding, transit_safety] # tag names from classifier.yml min_score: 0.6 # calibrated final_score threshold; 0..1 + # base_url is the prefix used by the `{link}` placeholder below — set it + # to wherever the bill page actually lives (e.g. the GitHub Pages URL of + # the rss/html publisher), otherwise `{link}` falls back to the bill's + # bill.sources[0].url (if any) or renders empty. + base_url: "https://.github.io/" post_template: "{title}\n\n{tags} · {link}" # ledger: .govbot/bluesky-bluesky.ledger # default; tracks posted bills @@ -422,6 +427,7 @@ Under `govbot.yml: publish:` (see the template in §1.3): | `type: bluesky` | selects the Bluesky publisher | | `select` | tag names to post — must exist in the classifier bundle | | `min_score` | minimum calibrated `final_score` (0..1) to post; default `0.6` | +| `base_url` | prefix used to render `{link}` in `post_template`; same shape as the rss/html publishers' `base_url`. 
Falls back to `bill.sources[0].url` when unset | | `post_template` | post text; placeholders `{title} {tags} {link} {identifier} {session} {score}`; truncated to 300 chars | | `ledger` | posted-state ledger path; default `.govbot/bluesky-.ledger` | diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index 38763792..784ca0a8 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -67,7 +67,7 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { .entries .iter() .filter(|e| record_clears_threshold(e, &select, min_score)) - .map(|e| render_post(e, p.post_template.as_deref())) + .map(|e| render_post(e, p.post_template.as_deref(), p.base_url.as_deref())) .collect(); if posts.is_empty() { @@ -212,7 +212,13 @@ fn record_clears_threshold(entry: &Value, select: &[String], min_score: f64) -> /// Render a record into post text, applying the template and truncating to /// the Bluesky character limit. -fn render_post(entry: &Value, template: Option<&str>) -> RenderedPost { +/// +/// `base_url` — the publisher's `base_url` field — is the prefix prepended to +/// the bill's source-relative path when assembling `{link}`. Mirrors the +/// rss/html publishers' shape so a user can put a public URL into post text. +/// If `base_url` is `None`, the link falls back to the bill's +/// `bill.sources[0].url` (if present); otherwise `{link}` renders empty. 
+fn render_post(entry: &Value, template: Option<&str>, base_url: Option<&str>) -> RenderedPost { let id = crate::rss::extract_guid(entry); let template = template.unwrap_or(DEFAULT_TEMPLATE); @@ -222,7 +228,7 @@ fn render_post(entry: &Value, template: Option<&str>) -> RenderedPost { .and_then(|t| t.as_object()) .map(|m| m.keys().cloned().collect::>().join(", ")) .unwrap_or_default(); - let link = crate::rss::extract_link(entry, None).unwrap_or_default(); + let link = crate::rss::extract_link(entry, base_url).unwrap_or_default(); let identifier = entry .get("bill") .and_then(|b| b.get("identifier")) @@ -614,10 +620,58 @@ mod tests { "bill": { "title": "Renewable energy storage act", "identifier": "HB 1" }, "tags": { "clean_energy": { "final_score": 0.92 } } }); - let post = render_post(&entry, Some("{title} [{identifier}] {tags} {score}")); + let post = render_post(&entry, Some("{title} [{identifier}] {tags} {score}"), None); assert!(post.text.contains("Renewable energy storage act")); assert!(post.text.contains("[HB 1]")); assert!(post.text.contains("clean_energy")); assert!(post.text.contains("0.92")); } + + /// `{link}` renders the publisher's `base_url` joined to the bill's + /// source-log path — same shape as the rss/html publishers. Before the + /// fix, bluesky passed `None` and `{link}` rendered empty. 
+ #[test] + fn render_link_uses_publisher_base_url() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { "title": "Wind energy permitting act", "identifier": "HB 1" }, + "sources": { "bill": "wy-legislation/.../HB0001/metadata.json" }, + "tags": { "clean_energy": { "final_score": 0.91 } } + }); + let post = render_post( + &entry, + Some("{title} {link}"), + Some("https://example.org/climate-tracker"), + ); + assert!( + post.text.contains( + "https://example.org/climate-tracker/wy-legislation/.../HB0001/metadata.json" + ), + "expected base_url to be prepended to source path; got: {}", + post.text + ); + } + + /// Without a configured `base_url`, `{link}` falls back to the bill's + /// `bill.sources[0].url` (when present) — preserves the historical + /// shape and gives manifest authors a sensible default before they pick + /// a base_url. + #[test] + fn render_link_falls_back_to_bill_source_url() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { + "title": "Solar tax-credit act", + "identifier": "HB 1", + "sources": [{ "url": "https://wyoleg.gov/2025/Bills/HB0001" }] + }, + "tags": { "clean_energy": { "final_score": 0.9 } } + }); + let post = render_post(&entry, Some("{title} -> {link}"), None); + assert!( + post.text.contains("https://wyoleg.gov/2025/Bills/HB0001"), + "expected bill.sources[0].url to render as {{link}}; got: {}", + post.text + ); + } } diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json index 9d6dd518..bc2d65e1 100644 --- a/schemas/govbot.schema.json +++ b/schemas/govbot.schema.json @@ -98,7 +98,7 @@ } }, "base_url": { - "description": "Base URL for generated links (required for rss/html on GitHub Pages -- should match the GitHub Pages URL).", + "description": "Base URL for generated links. Required for rss/html (e.g. the GitHub Pages URL). 
Optional for bluesky -- when set, it is the prefix used to render the '{link}' placeholder in 'post_template'; when omitted, '{link}' falls back to the bill's 'bill.sources[0].url' (or renders empty).", "type": "string", "format": "uri" }, From 2798dd190b335af8c6936e1310e2e5338dc39ff8 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:31:28 -0500 Subject: [PATCH 09/32] source: document the default --filter policy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: a wy-legislation bill seeded with only a 2025-01 log was silently absent from `govbot source` output until a 2025-12 log was added. The `--filter default` rule wasn't documented anywhere and its behaviour was opaque: users assumed the cut was date-based when it is actually action-based. Root cause: `--filter default` runs the per-dataset `default.rs` under `actions/govbot/src/filters//`. Those filters drop routine log actions ("introduction", "referral-committee", "Bill Number Assigned", "Placed on General File", etc.) so the stream emits only substantive events. A freshly-filed bill whose only logs are routine actions emits zero records until a substantive event lands. Nothing about this was documented. Fix: documentation only — preserving current behaviour. Per the bug brief, prefer documentation + an explicit `--filter none` opt-out over silently widening the cut (a behaviour change). - `govbot source --help`'s `--filter` description now spells out what `default` drops and notes that `--filter none` is the "is this a filter problem?" troubleshooting flag. - CLAUDE.md gains a "govbot source" section under Common Commands covering the `--filter default` policy and the `--select docs` projection. `--filter none` already exists as the opt-out — no new `--all-logs` flag is introduced. 
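For illustration, a minimal sketch of an action-based rule of this kind. The routine list is assembled only from the examples named above, and `LogEntry`, `is_routine`, and `apply_filter` are hypothetical names chosen for this sketch; the real per-dataset `default.rs` files may match actions differently.

```rust
/// Hypothetical routine-action list, taken from the examples in this
/// message. The real per-dataset lists are assumed to be longer.
const ROUTINE_ACTIONS: &[&str] = &[
    "introduction",
    "referral-committee",
    "bill number assigned",
    "placed on general file",
];

/// Assumed shape of one log entry: a date and the action text.
pub struct LogEntry<'a> {
    pub date: &'a str,
    pub action: &'a str,
}

/// True when the action is routine. Only the action text is inspected;
/// the entry's date is never consulted.
pub fn is_routine(action: &str) -> bool {
    let lowered = action.trim().to_lowercase();
    ROUTINE_ACTIONS.iter().any(|r| lowered.as_str() == *r)
}

/// `filter` mirrors the CLI values: "default" drops routine entries,
/// "none" keeps every entry.
pub fn apply_filter<'a>(logs: &'a [LogEntry<'a>], filter: &str) -> Vec<&'a LogEntry<'a>> {
    logs.iter()
        .filter(|e| filter == "none" || !is_routine(e.action))
        .collect()
}
```

Under this sketch, a bill whose only entry is an "Introduction" log yields no records with `default` and one record with `none`, whatever the entry's date, which is the behaviour the symptom above describes.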
Co-Authored-By: Claude Opus 4.7 --- CLAUDE.md | 38 ++++++++++++++++++++++++++++++++++++++ actions/govbot/src/main.rs | 14 +++++++++++++- 2 files changed, 51 insertions(+), 1 deletion(-) diff --git a/CLAUDE.md b/CLAUDE.md index 56b44cdd..967a4fff 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -90,6 +90,44 @@ govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB govbot run # Run the full pipeline: pull -> classify -> apply -> publish ``` +## govbot source — streaming legislative activity + +`govbot source` walks every linked dataset and emits one JSON record per +bill log entry. It is the **source** stage of the stream protocol — the +records `govbot publish` and `fastclass classify` consume. + +### The `--filter default` policy + +`--filter` defaults to `default`, which applies the per-dataset filter under +`actions/govbot/src/filters//default.rs`. Each dataset's `default.rs` +implements an **action-based** rule that drops *routine* log entries — +introductions, committee referrals, "Bill Number Assigned", "Placed on +General File", boilerplate "President Signed" lines, prefiling, status +updates — so the stream emits only **substantive** events (passage votes, +executive signatures, amendments, defeats, committee reports with content). + +This is not a recency cut. A bill whose only log entries are routine +actions — e.g. a freshly-filed bill with just an "Introduction" log — +emits **zero records** under `--filter default` until a substantive event +lands. The bill itself is not deleted; it simply produces no stream rows +yet. Once a substantive log appears (e.g. a passage vote later in the +session), the bill flows through. + +If a bill is unexpectedly missing from `source` output: +```bash +govbot source --filter none --repos # confirm it's the filter +``` +If `--filter none` shows the bill and `--filter default` does not, the +fix is to add a substantive log entry, not to change the filter. 
+ +### The `--select docs` projection + +`--select docs` collapses each surviving entry to the +`{"id","text","kind":"docs"}` document the stream protocol defines +(`schemas/STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` +consumes. The default `--select default` keeps the full joined record +for `govbot publish` and ad-hoc analysis. + ## DuckDB Integration The `govbot load` command loads bill metadata into a DuckDB database for SQL analysis. diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index cd459f5b..22a2c865 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -130,7 +130,19 @@ enum Command { #[arg(long, default_value = "default", value_parser = ["default", "docs"])] select: String, - /// Filter log entries based on per-repo AI generated filters (default: `default`) options: `default` | `none` + /// Per-repo log filter (default: `default`). Options: `default` | + /// `none`. `default` applies the per-dataset filter under + /// `src/filters//default.rs` — it drops *routine* log + /// actions (introductions, committee referrals, "Bill Number + /// Assigned", "Placed on General File", boilerplate "President + /// Signed" log lines, etc.) so the stream emits only **substantive** + /// events: passage votes, executive signatures, amendments, defeats. + /// `none` keeps every log entry. The default filter is action-based, + /// not date-based: a bill whose only logs are routine actions + /// (e.g. a freshly-filed bill with just an "Introduction" log) will + /// emit zero records under `--filter default` until a substantive + /// event lands. Use `--filter none` to confirm a bill is missing + /// because of the filter rather than a data problem. 
#[arg(long, default_value = "default", value_parser = ["default", "none"])] filter: String, From 6cbb12e82dc9aeefe5872b97804e6e56f1e0eb79 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 21:41:08 -0500 Subject: [PATCH 10/32] source: read tags from dataset, fall back to cwd-rooted layout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: after Wave 6 (80617ac) moved tag files into the dataset (`.govbot/repos//.../tags/`), `govbot source --join tags` returned 0 tag joins and the publishers reported `Generating RSS feed with 0 entries...` even when tag files were present at the new location. Root cause: the `--join tags` branch built `tags_dir` from `std::env::current_dir()` joined with `country:.../state:.../sessions//tags` — the *old* project-root layout. The apply side moved; the consumer side did not. Fix: derive the tags dir from the dataset path the source iterator is already walking. `resolve_tags_dir` walks up from the log file path until it finds an ancestor whose immediate child is `bills/` and returns the sibling `tags/`. If the dataset-rooted lookup yields no matching tags, fall back to the legacy cwd-rooted construction so any pre-existing project-root layouts (and any explicit `--output-dir` overrides that landed there) still resolve. Tag-matching is pulled into `match_tags_in_dir` so both lookups share one implementation. Unit tests cover the resolver (canonical layout, loose path outside the layout) and the matcher (hit, miss, absent dir). 
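The lookup order can be sketched without touching the filesystem. This is an illustration under assumptions, not the code in the diff below: `join_tags` and `TagScores` are hypothetical names, and the `lookup` closure stands in for `match_tags_in_dir`.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Tag name -> score. A simplified stand-in for the JSON map the real
/// code builds.
pub type TagScores = BTreeMap<String, f64>;

/// Try the dataset-rooted tags dir first. The legacy cwd-rooted dir is
/// consulted only when the primary lookup returns nothing.
pub fn join_tags<F>(primary: Option<&Path>, legacy: Option<&Path>, lookup: F) -> TagScores
where
    F: Fn(&Path) -> TagScores,
{
    // A missing primary dir behaves like an empty result.
    let mut matched = primary.map(|dir| lookup(dir)).unwrap_or_default();
    if matched.is_empty() {
        if let Some(dir) = legacy {
            matched = lookup(dir);
        }
    }
    matched
}
```

Passing the lookup as a closure keeps the ordering testable on its own: a test can hand in a closure that answers only for the legacy path and check that the fallback is reached, or one that answers for both and check that the primary wins.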
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 269 ++++++++++++++++++++++++++++--------- 1 file changed, 207 insertions(+), 62 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 22a2c865..9a854ec5 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -11,7 +11,7 @@ use jwalk::WalkDir; use std::collections::HashMap; use std::fs; use std::io::{self, BufRead, BufReader, Write}; -use std::path::PathBuf; +use std::path::{Path, PathBuf}; use std::process::Command as ProcessCommand; /// Write a line to stdout, gracefully handling broken pipe errors @@ -1159,70 +1159,63 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { } } - // Join tags if requested + // Join tags if requested. + // + // Wave 6 (`80617ac`) moved tag files from + // the project root *into* the dataset, so + // the primary lookup walks up from the log + // file's actual path to find the dataset's + // own `tags/` dir. The cwd-rooted layout is + // kept as a fallback so any pre-existing + // project-root layouts (and any explicit + // `--output-dir` overrides that landed + // there) still resolve. 
if join_tags { - // Extract country, state, session_id from the path - if let Some((country, state, session_id)) = - extract_path_info(&source_path_str) - { - // Use bill_id extracted earlier - if let Some(ref bill_id) = bill_id_opt { - // Look for tags in cwd/country:us/state:{state}/sessions/{session_id}/tags/ - let cwd = std::env::current_dir() - .unwrap_or_else(|_| PathBuf::from(".")); - let tags_dir = cwd - .join(&format!("country:{}", country)) - .join(&format!("state:{}", state)) - .join("sessions") - .join(&session_id) - .join("tags"); - - if tags_dir.exists() && tags_dir.is_dir() { - let mut matched_tags = serde_json::Map::new(); - if let Ok(entries) = fs::read_dir(&tags_dir) { - for entry in entries.flatten() { - let path = entry.path(); - // Check for both .tag.json and .json files - if let Some(ext) = path - .extension() - .and_then(|s| s.to_str()) - { - if ext == "json" { - if let Some(stem) = path - .file_stem() - .and_then(|s| s.to_str()) - { - // Remove .tag suffix if present (e.g., "budget.tag" -> "budget") - let tag_name = stem - .strip_suffix(".tag") - .unwrap_or(stem); - match fs::read_to_string( - &path, - ) { - Ok(contents) => { - if let Ok(tag_file) = serde_json::from_str::(&contents) { - // Check if bill_id exists in bills map - if let Some(bill_result) = tag_file.bills.get(bill_id) { - // Return the score breakdown - matched_tags.insert(tag_name.to_string(), serde_json::to_value(&bill_result.score).unwrap_or(serde_json::Value::Null)); - } - } - } - Err(_) => {} - } - } - } - } - } - } - if !matched_tags.is_empty() { - output.insert( - "tags".to_string(), - serde_json::Value::Object(matched_tags), - ); - } + if let Some(ref bill_id) = bill_id_opt { + let mut matched_tags: serde_json::Map< + String, + serde_json::Value, + > = serde_json::Map::new(); + + // Primary: dataset-rooted tags dir + // sibling of the bills/ dir the log + // file lives under. 
+ if let Some(dataset_tags_dir) = resolve_tags_dir(&path) + { + matched_tags = + match_tags_in_dir(&dataset_tags_dir, bill_id); + } + + // Fallback: legacy project-root + // layout (`cwd/country:.../state:.../ + // sessions//tags/`). Only + // consulted when the dataset-rooted + // lookup found nothing. + if matched_tags.is_empty() { + if let Some((country, state, session_id)) = + extract_path_info(&source_path_str) + { + let cwd = std::env::current_dir() + .unwrap_or_else(|_| PathBuf::from(".")); + let legacy_tags_dir = cwd + .join(format!("country:{}", country)) + .join(format!("state:{}", state)) + .join("sessions") + .join(&session_id) + .join("tags"); + matched_tags = match_tags_in_dir( + &legacy_tags_dir, + bill_id, + ); } } + + if !matched_tags.is_empty() { + output.insert( + "tags".to_string(), + serde_json::Value::Object(matched_tags), + ); + } } } @@ -1730,6 +1723,71 @@ fn extract_path_info(path: &str) -> Option<(String, String, String)> { Some((country, state, session_id)) } +/// Resolve the dataset-rooted `tags/` directory for a given log file path. +/// +/// Wave 6 (`80617ac`) moved tag files from the project root into the dataset +/// (`/country:.../state:.../sessions//tags/`). This walks **up** +/// from the log file path until it finds a directory whose immediate child is +/// `bills/` — that ancestor is the session dir, and `tags/` is its sibling of +/// `bills/`. Returns `None` if no such ancestor exists (e.g. a path outside +/// the canonical dataset layout). +fn resolve_tags_dir(log_path: &Path) -> Option { + let mut cursor = log_path.parent(); + while let Some(dir) = cursor { + let bills_child = dir.join("bills"); + if bills_child.is_dir() { + return Some(dir.join("tags")); + } + cursor = dir.parent(); + } + None +} + +/// Read every `*.json` / `*.tag.json` file in `tags_dir`, parse each as a +/// `TagFile`, and return the subset whose `bills` map contains `bill_id`, +/// keyed by tag name (file stem with any `.tag` suffix stripped). 
Returns an +/// empty map if `tags_dir` does not exist or contains no matching tags. +/// +/// Pulled out so the same logic serves the dataset-rooted lookup *and* the +/// project-root fallback below without duplication. +fn match_tags_in_dir(tags_dir: &Path, bill_id: &str) -> serde_json::Map { + let mut matched = serde_json::Map::new(); + if !tags_dir.is_dir() { + return matched; + } + let entries = match fs::read_dir(tags_dir) { + Ok(e) => e, + Err(_) => return matched, + }; + for entry in entries.flatten() { + let path = entry.path(); + if path.extension().and_then(|s| s.to_str()) != Some("json") { + continue; + } + let stem = match path.file_stem().and_then(|s| s.to_str()) { + Some(s) => s, + None => continue, + }; + // `budget.tag.json` -> `budget`; plain `budget.json` -> `budget`. + let tag_name = stem.strip_suffix(".tag").unwrap_or(stem); + let contents = match fs::read_to_string(&path) { + Ok(c) => c, + Err(_) => continue, + }; + let tag_file: govbot::TagFile = match serde_json::from_str(&contents) { + Ok(t) => t, + Err(_) => continue, + }; + if let Some(bill_result) = tag_file.bills.get(bill_id) { + matched.insert( + tag_name.to_string(), + serde_json::to_value(&bill_result.score).unwrap_or(serde_json::Value::Null), + ); + } + } + matched +} + /// The slice of a `fastclass classify` result that `govbot apply` consumes. /// Unknown fields are ignored, so fastclass may evolve its output freely. #[derive(serde::Deserialize)] @@ -2557,4 +2615,91 @@ mod tests { assert!(parse_doc_route("just-some-other-id").is_none()); assert!(parse_doc_route("wy-legislation/country:us").is_none()); } + + /// Regression for the Wave 6 follow-up: `source --join tags` must read + /// from the dataset-rooted `tags/` dir (sibling of the `bills/` the log + /// lives under), not from a cwd-rooted layout. 
After `80617ac` moved + /// `govbot apply` to write tag files into the dataset, the consumer side + /// stayed pointed at the old project-root path and the publishers saw 0 + /// entries. `resolve_tags_dir` is what closes that loop. + #[test] + fn resolve_tags_dir_finds_sibling_of_bills() { + let tmp = tempfile::tempdir().expect("tempdir"); + let session = tmp + .path() + .join("wy-legislation") + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025"); + let log_path = session + .join("bills") + .join("HB0001") + .join("logs") + .join("2025-01-15T12:00:00Z.json"); + fs::create_dir_all(log_path.parent().unwrap()).unwrap(); + fs::write(&log_path, "{}").unwrap(); + // The `bills/` sibling must exist for the resolver to find this + // ancestor — that is the whole signal that says "this is a session + // dir". Creating the log under `bills/HB0001/logs/` already does it. + + let resolved = resolve_tags_dir(&log_path).expect("resolver should find a tags dir"); + assert_eq!(resolved, session.join("tags")); + } + + /// A log file outside any dataset layout (no `bills/` ancestor) yields + /// `None`, letting the caller fall back to the legacy cwd-rooted lookup. + #[test] + fn resolve_tags_dir_returns_none_outside_dataset_layout() { + let tmp = tempfile::tempdir().expect("tempdir"); + let stray = tmp.path().join("loose").join("file.json"); + fs::create_dir_all(stray.parent().unwrap()).unwrap(); + fs::write(&stray, "{}").unwrap(); + assert!(resolve_tags_dir(&stray).is_none()); + } + + /// End-to-end of the helper: a tag file in the dataset-rooted `tags/` + /// dir produces a `{tag_name: score}` map for the bill it lists, and an + /// empty map for a bill it does not list. 
+ #[test] + fn match_tags_in_dir_returns_scores_for_matching_bill() { + let tmp = tempfile::tempdir().expect("tempdir"); + let tags_dir = tmp.path().join("tags"); + fs::create_dir_all(&tags_dir).unwrap(); + let tag_file = serde_json::json!({ + "metadata": { + "last_run": "2025-01-15T12:00:00Z", + "model": "fastclass-test", + "tag_config_hash": "abc123" + }, + "tag_config": { + "name": "clean_energy" + }, + "bills": { + "HB0001": { + "text_hash": "deadbeef", + "score": { + "final_score": 0.92, + "base_embedding": null, + "example_similarity": null, + "keyword_match": [], + "negative_penalty": 0.0 + } + } + } + }); + fs::write(tags_dir.join("clean_energy.tag.json"), tag_file.to_string()).unwrap(); + + let matched = match_tags_in_dir(&tags_dir, "HB0001"); + assert_eq!(matched.len(), 1); + assert!(matched.contains_key("clean_energy")); + + let missing = match_tags_in_dir(&tags_dir, "HB9999"); + assert!(missing.is_empty()); + + // Missing dir is not an error — callers chain dataset-rooted then + // cwd-rooted lookups, and a non-existent dir is the common case. + let absent = match_tags_in_dir(&tmp.path().join("no-such-dir"), "HB0001"); + assert!(absent.is_empty()); + } } From 24fb5638f0b35a38df69af1a72a58652696dade3 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Fri, 22 May 2026 23:32:32 -0500 Subject: [PATCH 11/32] apply/source: move tag files out of .govbot/ to project tags/ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `.govbot/` is the tool's cache — the equivalent of `node_modules/`, `target/`, or `.venv/`. It holds cloned datasets, sync state, and an optional registry override. It must stay tool-owned and user-edit-free so a `rm -rf .govbot/` always restores a clean cache without losing the bot's actual work. Bug 1's fix (and Bug 6's read-side follow-up `6cbb12e`) moved tag files *into* the cache. 
That solved the dirty-project-root complaint, but it was the wrong answer:
tag files are derived classification **outputs**, not cache contents. They
are produced by `apply`, consumed by publishers, and represent the bot's
state of "which bills have been classified, and how". They belong in their
own dedicated location.

What changed:

- `govbot apply` now writes per-tag `.tag.json` files under
  `/tags//country:.../state:.../sessions//.tag.json`.
  `tags/` is a project-rooted classification-output dir, peer to `dist/`
  (publisher output) and distinct from `.govbot/` (the cache). The ``
  short-name prefix is what isolates same-named tag files across
  jurisdictions in a multi-dataset project.
- `govbot source --join tags` resolves tag dirs in a three-stage fallback
  chain (first non-empty wins, silent on miss):
  1. Primary: `/tags//country:.../sessions//`, the new layout above.
  2. Fallback A: `/tags/` inside `.govbot/repos/...`, the Bug 6 location,
     kept read-only for working trees mid-migration. `apply` never writes
     here.
  3. Fallback B: the pre-Bug-1 cwd-rooted `country:.../sessions//tags/`,
     kept for layouts that pre-date the dataset-rooted move, and for
     explicit `--output-dir` overrides that landed there.
- Wizard `.gitignore` now adds `tags/` with an inline comment documenting
  the trade-off (file count grows with the catalog and most bots
  regenerate from raw data, so it is git-ignored by default; remove the
  line to commit classification provenance).
- `--output-dir` help text updated to say the default is now `/tags/`
  (was: the directory containing `govbot.yml`). Explicit overrides remain
  a verbatim root: the dataset prefix is dropped, as before.
- Docs (CLAUDE.md, AGENT.md, README.md) gained an explicit three-dir
  layout section: `.govbot/` = cache, `tags/` = classification output,
  `dist/` = publisher output.

Tests:

- `parse_doc_route_*` comments updated for the new destination.
- `resolve_tags_dir` replaced with `resolve_tags_dir_candidates` returning the ordered candidate list; new regression pins the project-rooted primary over the cache-rooted fallback. - New `tag_paths_are_dataset_isolated` asserts two datasets sharing a country/state/session route same-named tag files to distinct files under `tags//...`, never into `.govbot/`. - 40 → 41 tests passing offline. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 24 +- CLAUDE.md | 22 +- README.md | 15 +- actions/govbot/src/main.rs | 341 ++++++++++++++---- actions/govbot/src/wizard.rs | 17 +- ...i_example_snaps__snapshot@govbot_help.snap | 2 +- 6 files changed, 335 insertions(+), 86 deletions(-) diff --git a/AGENT.md b/AGENT.md index a3343b01..3542602a 100644 --- a/AGENT.md +++ b/AGENT.md @@ -63,7 +63,7 @@ govbot remove # remove datasets from govbot.yml govbot ls # list manifest + locally-cached datasets govbot pull # clone/update datasets (git repos) into the cache govbot source # stream legislative activity as JSON Lines -govbot apply # persist fastclass results into the dataset +govbot apply # persist fastclass results under /tags/ govbot publish # run the manifest's publishers govbot run # the full pipeline: pull -> source|classify|apply -> publish fastclass classify - # score a JSON-Lines doc stream from stdin @@ -322,6 +322,9 @@ BLUESKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx .govbot/ dist/ docs/ +# Classification output from `govbot apply` — regenerated each run. +# Remove this line to commit classification provenance. +tags/ # Secrets — never commit. .env ``` @@ -350,8 +353,17 @@ Project layout: - `summarizer/` — framing prompt for a future summarize stage - `.env` — Bluesky credentials (git-ignored; see `.env.example`) +Tool-managed dirs (all git-ignored by default): +- `.govbot/` — the tool's CACHE (cloned datasets, ledgers); the + `node_modules/` equivalent. Never edit by hand; + `rm -rf .govbot/` is always safe. 
+- `tags/` — classification OUTPUT from `govbot apply` + (`tags//country:.../sessions//.tag.json`). + Remove `tags/` from `.gitignore` if you want + classification provenance committed. +- `dist/` / `docs/` — publisher output from `govbot publish`. + To tune the classifier, use the fastclass plugin: `/fastclass:improve`. -Generated dirs (`.govbot/`, `dist/`, `docs/`) are git-ignored. ``` #### `.claude/settings.json` — import the fastclass plugin @@ -581,5 +593,9 @@ the eval. The improvement loop only ever reads `rolling.yml`. - Credentials are environment-only. Never write a secret into `govbot.yml`, `.env.example`, or any committed file. - Bluesky: dry-run before every first live run after a config change. -- Generated dirs (`.govbot/`, `dist/`, `docs/`) are git-ignored; the project - is a dozen small text files plus tool artifacts. +- Three tool-managed dirs, each with a distinct role: `.govbot/` is the + CACHE (the `node_modules/` equivalent — never edited, fully regenerable), + `tags/` is `govbot apply`'s classification OUTPUT + (`tags//country:.../sessions//.tag.json`), and + `dist/` / `docs/` are publisher output. All four are git-ignored by + default; the project is a dozen small text files plus tool artifacts. diff --git a/CLAUDE.md b/CLAUDE.md index 967a4fff..dbd4797a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -167,8 +167,26 @@ govbot source --select docs | fastclass classify - classifier=./classifier | gov bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` + `eval/`). govbot passes only the bundle path; it never reads the bundle. - **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag - `.tag.json` files into the dataset — the files `govbot publish` turns into - feeds. It classifies nothing itself; it is purely the persistence sink. + `.tag.json` files under `/tags//country:.../sessions//` + — the files `govbot publish` turns into feeds. It classifies nothing + itself; it is purely the persistence sink. 
+ +### Project layout — `tags/` vs `.govbot/` vs `dist/` + +A govbot project has three top-level tool-managed dirs, each with a +distinct role; do not conflate them: + +- **`.govbot/`** — the tool's **cache**, the `node_modules/` equivalent. + Cloned datasets, content-addressed sync state, ledgers, an optional + registry override. Fully regenerable; safe to `rm -rf` to start fresh. + **Never edited by hand, never written to by `apply`.** +- **`tags/`** — **classification output**, written by `govbot apply`. The + layout mirrors the source path with a dataset prefix: + `tags//country:.../state:.../sessions//.tag.json`. + Regenerated by every classify run; the dataset prefix is what isolates + same-named tag files across jurisdictions in a multi-dataset project. +- **`dist/`** — **publisher output**, written by `govbot publish` (RSS / + HTML / JSON feeds). **`govbot.yml` is NOT the classifier — it is a manifest.** It declares `datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:` diff --git a/README.md b/README.md index e8c47541..b938b75a 100644 --- a/README.md +++ b/README.md @@ -78,7 +78,7 @@ govbot pull all # clone/update every dataset govbot pull il ca ny # clone/update specific datasets govbot source # stream dataset records as JSON Lines govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply -govbot apply # persist a fastclass result stream into the dataset +govbot apply # persist a fastclass result stream under /tags/ govbot publish # run every configured publisher (RSS / HTML / JSON / DuckDB / Bluesky) govbot publish --publisher bluesky --dry-run # ALWAYS dry-run Bluesky first govbot run --dry-run # full pipeline, every publisher dry-run (recommended first run) @@ -105,6 +105,19 @@ references its path. See [`AGENT.md`](AGENT.md) for the end-to-end newsbot playbook (make / manage / update) and the [stream protocol](schemas/STREAM_PROTOCOL.md) for the wire format. 
+### Project layout + +A govbot project has three tool-managed directories, each with a distinct +role; all are git-ignored by default: + +| Dir | Owner | Contents | +|---|---|---| +| `.govbot/` | the tool's **cache** (`node_modules/` equivalent) | cloned datasets, ledgers, sync state. Fully regenerable. Never edit. | +| `tags/` | `govbot apply` (**classification output**) | `tags//country:.../sessions//.tag.json` | +| `dist/` (or `docs/`) | `govbot publish` (**publisher output**) | RSS / HTML / JSON feeds | + +Remove `tags/` from `.gitignore` to commit classification provenance. + # 🏛️ Govbot Legislation Data Catalogs govbot pulls data from a registry of git-repo datasets. The bundled default diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 9a854ec5..822d436c 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -237,17 +237,23 @@ enum Command { govbot_dir: Option, }, - /// Persist fastclass classification results into the dataset as tag files. - /// Reads `fastclass classify` result JSON from stdin — the apply sink of + /// Persist fastclass classification results as tag files under the + /// project's `tags/` output directory. Reads `fastclass classify` result + /// JSON from stdin — the apply sink of /// `govbot source --select docs | fastclass classify - | govbot apply` — - /// and writes per-tag `.tag.json` files under each bill's session - /// directory, the files `govbot publish` turns into feeds. Classification - /// itself is done by fastclass; `govbot apply` only stores the results. + /// and writes per-tag `.tag.json` files under + /// `/tags//country:.../sessions//`, the files + /// `govbot publish` turns into feeds. Classification itself is done by + /// fastclass; `govbot apply` only stores the results. `tags/` is a + /// project-rooted classification-output dir — peer to `dist/` (publisher + /// output) and distinct from `.govbot/` (the tool's regenerable cache). 
Apply { /// Optional tag name: persist only this tag's matches tag_name: Option, - /// Output directory (defaults to the directory containing govbot.yml) + /// Output directory (default: `/tags/`). Overrides the + /// default routing entirely — the dataset short-name is dropped and + /// tag files land under `/country:.../sessions/.../tags/`. #[arg(long = "output-dir")] output_dir: Option, @@ -1161,15 +1167,18 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { // Join tags if requested. // - // Wave 6 (`80617ac`) moved tag files from - // the project root *into* the dataset, so - // the primary lookup walks up from the log - // file's actual path to find the dataset's - // own `tags/` dir. The cwd-rooted layout is - // kept as a fallback so any pre-existing - // project-root layouts (and any explicit - // `--output-dir` overrides that landed - // there) still resolve. + // `.govbot/` is the tool's cache — tag + // files no longer live inside it. The + // primary lookup is the project-rooted + // `/tags//...` layout + // `govbot apply` writes today. Two + // read-only fallbacks stay live for + // migration: the in-cache `/ + // tags/` location Bug 6 added, and the + // cwd-rooted `country:.../sessions// + // tags/` layout that pre-dates Bug 1. + // First non-empty match wins; an empty + // result on every candidate is silent. if join_tags { if let Some(ref bill_id) = bill_id_opt { let mut matched_tags: serde_json::Map< @@ -1177,26 +1186,27 @@ async fn run_source_command(cmd: Command) -> anyhow::Result<()> { serde_json::Value, > = serde_json::Map::new(); - // Primary: dataset-rooted tags dir - // sibling of the bills/ dir the log - // file lives under. 
- if let Some(dataset_tags_dir) = resolve_tags_dir(&path) + let cwd = std::env::current_dir() + .unwrap_or_else(|_| PathBuf::from(".")); + for candidate in + resolve_tags_dir_candidates(&path, &cwd) { matched_tags = - match_tags_in_dir(&dataset_tags_dir, bill_id); + match_tags_in_dir(&candidate, bill_id); + if !matched_tags.is_empty() { + break; + } } - // Fallback: legacy project-root - // layout (`cwd/country:.../state:.../ - // sessions//tags/`). Only - // consulted when the dataset-rooted - // lookup found nothing. + // Final fallback: pre-Bug-1 + // cwd-rooted layout. Only + // consulted when the dataset- + // aware candidates all came up + // empty. if matched_tags.is_empty() { if let Some((country, state, session_id)) = extract_path_info(&source_path_str) { - let cwd = std::env::current_dir() - .unwrap_or_else(|_| PathBuf::from(".")); let legacy_tags_dir = cwd .join(format!("country:{}", country)) .join(format!("state:{}", state)) @@ -1723,26 +1733,132 @@ fn extract_path_info(path: &str) -> Option<(String, String, String)> { Some((country, state, session_id)) } -/// Resolve the dataset-rooted `tags/` directory for a given log file path. +/// The session directory of a log file path — the ancestor whose immediate +/// child is `bills/` — together with the path segments that uniquely place it +/// inside its dataset. /// -/// Wave 6 (`80617ac`) moved tag files from the project root into the dataset -/// (`/country:.../state:.../sessions//tags/`). This walks **up** -/// from the log file path until it finds a directory whose immediate child is -/// `bills/` — that ancestor is the session dir, and `tags/` is its sibling of -/// `bills/`. Returns `None` if no such ancestor exists (e.g. a path outside -/// the canonical dataset layout). 
-fn resolve_tags_dir(log_path: &Path) -> Option { +/// Why pulled out: `resolve_tags_dir_candidates` needs the path twice, once to look at +/// the project-rooted `tags//...` layout and once for the in-cache +/// `/tags/` fallback. 
Computing it in one place keeps both lookups +/// in sync with the canonical dataset layout. +struct SessionAnchor { + /// The session directory itself (the `bills/`-bearing ancestor). + session_dir: PathBuf, + /// The dataset's `short_name` — the first path segment under the repos + /// dir (e.g. `wy-legislation`). `None` if the path is not inside a + /// recognisable `//country:.../sessions/...` layout, in + /// which case the project-rooted lookup is skipped. + dataset: Option, + /// The `country:` segment as-is (e.g. `country:us`). + country_segment: String, + /// The `state:` segment as-is (e.g. `state:wy`). + state_segment: String, + /// The session id (the segment after `sessions/`). + session_id: String, +} + +/// Walk up from `log_path` to its session directory (the `bills/`-bearing +/// ancestor) and capture every segment needed to plant a tag file under +/// `/tags//country:.../state:.../sessions//`. Returns +/// `None` when the path is not inside the canonical dataset layout. +fn parse_session_anchor(log_path: &Path) -> Option { let mut cursor = log_path.parent(); while let Some(dir) = cursor { - let bills_child = dir.join("bills"); - if bills_child.is_dir() { - return Some(dir.join("tags")); + if dir.join("bills").is_dir() { + // Found the session dir. Walk *down* its components to recover + // the dataset short_name and jurisdiction segments — they are + // the same segments `parse_doc_route` extracts on the writer + // side, so the two halves stay symmetric. 
+ let mut country_segment: Option = None; + let mut state_segment: Option = None; + let mut session_id: Option = None; + let mut dataset: Option = None; + let mut prev_was_sessions = false; + let mut country_seen = false; + for component in dir.components() { + let seg = component.as_os_str().to_string_lossy().to_string(); + if seg.starts_with("country:") { + country_segment = Some(seg.clone()); + country_seen = true; + } else if seg.starts_with("state:") { + state_segment = Some(seg.clone()); + } else if seg == "sessions" { + prev_was_sessions = true; + continue; + } else if prev_was_sessions { + session_id = Some(seg.clone()); + } + // The dataset short_name is the path segment immediately + // before the first `country:` segment. For typical layouts + // (`//country:.../...`) that is one segment; + // we only need the most recent non-pathy segment before + // `country:` was first seen. + if !country_seen + && !seg.is_empty() + && seg != "/" + && !seg.starts_with("country:") + && !seg.starts_with("state:") + && seg != "sessions" + && seg != "bills" + { + dataset = Some(seg); + } + prev_was_sessions = false; + } + return Some(SessionAnchor { + session_dir: dir.to_path_buf(), + dataset, + country_segment: country_segment?, + state_segment: state_segment?, + session_id: session_id?, + }); } cursor = dir.parent(); } None } +/// Resolve every `tags/`-equivalent directory we are willing to read a tag +/// file from, in the order the caller should consult them. +/// +/// `.govbot/` is the tool's cache (the `node_modules/` equivalent) — tag +/// files belong outside it, in a project-rooted classification-output dir. +/// The primary lookup is therefore `/tags//country:.../ +/// state:.../sessions//`. Two fallbacks stay live for migration: +/// +/// 1. **Primary**: `/tags//country:.../sessions//` +/// — where `govbot apply` writes today. +/// 2. 
**Fallback A** (Bug 6 / `6cbb12e`): the in-cache +/// `/tags/` sibling-of-`bills/` — kept read-only so a +/// working tree mid-migration still resolves. +/// 3. **Fallback B** (pre-Bug-1): the cwd-rooted +/// `/country:.../state:.../sessions//tags/` — kept for layouts +/// that pre-date the dataset-rooted move (and for explicit +/// `--output-dir` overrides that landed there). +/// +/// The chain is read-only — `apply` itself never touches anything but the +/// primary location. +fn resolve_tags_dir_candidates(log_path: &Path, project_dir: &Path) -> Vec { + let mut candidates: Vec = Vec::new(); + if let Some(anchor) = parse_session_anchor(log_path) { + // Primary: /tags//country:.../state:.../sessions// + if let Some(ref dataset) = anchor.dataset { + candidates.push( + project_dir + .join("tags") + .join(dataset) + .join(&anchor.country_segment) + .join(&anchor.state_segment) + .join("sessions") + .join(&anchor.session_id), + ); + } + // Fallback A: in-cache session/tags/ (Bug 6 layout, read-only). + candidates.push(anchor.session_dir.join("tags")); + } + candidates +} + /// Read every `*.json` / `*.tag.json` file in `tags_dir`, parse each as a /// `TagFile`, and return the subset whose `bills` map contains `bill_id`, /// keyed by tag name (file stem with any `.tag` suffix stripped). Returns an @@ -1832,8 +1948,10 @@ struct BillRoute { /// is not a dataset bill path (e.g. a document from a non-govbot source). /// /// The leading `` segment is the dataset's `short_name` (e.g. -/// `wy-legislation`); it is what lets `govbot apply` route each tag file back -/// to `/.govbot/repos//` by default. +/// `wy-legislation`); it is what lets `govbot apply` route each tag file under +/// `/tags//...` by default — the dataset prefix is what +/// disambiguates same-named tag files across jurisdictions in a multi-dataset +/// project. 
fn parse_doc_route(doc: &str) -> Option { let segments: Vec<&str> = doc.split('/').collect(); let (mut country, mut state, mut session, mut bill_id) = (None, None, None, None); @@ -1913,8 +2031,14 @@ fn new_tag_file(tag_key: &str, tag_defs: &[govbot::TagDefinition], now: &str) -> /// stdin — the apply sink of /// `govbot source --select docs | fastclass classify - | govbot apply` — and /// for every matched tag writes the bill into the per-tag `.tag.json` file -/// under the dataset's `sessions//tags/` directory. Those are the +/// under `/tags//country:.../sessions//`. Those are the /// files `govbot publish` later turns into feeds. +/// +/// **Why `tags/` and not `.govbot/`:** `.govbot/` is the tool's cache — the +/// equivalent of `node_modules/` — and must stay user-edit-free so a fresh +/// `rm -rf .govbot/` never destroys the bot's classification work. Tag files +/// are derived classification *outputs*, not cache contents; they live in +/// their own dedicated, project-rooted directory peer to `dist/`. async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { let Command::Apply { tag_name, @@ -1927,12 +2051,15 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { let current_dir = std::env::current_dir()?; // Tag files land under --output-dir when given. When unset, each tag file - // is routed back to its source dataset under - // `/.govbot/repos//country:.../sessions/.../tags/` - // — mirroring the path the bill's `metadata.json` came from — using the - // first segment of the fastclass result's `doc` field. The explicit - // `--output-dir` override stays a verbatim root for back-compat. + // is routed under the project's classification-output directory + // `/tags//country:.../sessions/.../.tag.json` + // — the dataset short_name comes from the first segment of the fastclass + // result's `doc` field, mirroring where the bill's `metadata.json` came + // from. 
The explicit `--output-dir` override stays a verbatim root (the + // dataset prefix is dropped), which is the back-compat escape hatch for + // callers that want to write into a custom layout. let explicit_output_dir = output_dir.as_ref().map(PathBuf::from); + let default_tags_root = current_dir.join("tags"); // The taxonomy now lives in a fastclass classifier bundle, not in // govbot.yml — each `.tag.json` is stamped with a stub `tag_config` @@ -1992,22 +2119,26 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { // Resolve where this bill's tag files land. With an explicit // `--output-dir`, that path is the root and the dataset short_name is - // dropped (back-compat). With no override, route the file back to its - // source dataset under `/.govbot/repos//...` so the - // file lands alongside the bill's `metadata.json`. If the `doc` id - // lacks a recognisable dataset prefix (a non-govbot source), fall - // back to the project directory so the record is still persisted. + // dropped (back-compat escape hatch). With no override, route the file + // under the project's `tags//...` output dir so the dataset + // prefix disambiguates same-named tags across jurisdictions. If the + // `doc` id lacks a recognisable dataset prefix (a non-govbot source), + // fall back to a no-prefix `tags/` so the record is still persisted — + // never write into `.govbot/`, which is the tool's cache. let base_output_dir = match (&explicit_output_dir, &route.dataset) { (Some(root), _) => root.clone(), - (None, Some(dataset)) => current_dir.join(".govbot").join("repos").join(dataset), - (None, None) => current_dir.clone(), + (None, Some(dataset)) => default_tags_root.join(dataset), + (None, None) => default_tags_root.clone(), }; + // Inside the dataset prefix, mirror the source's jurisdiction path + // exactly — no trailing `/tags/` segment, because the project-level + // `tags/` directory already names the kind. 
The shape on disk is + // `//country:.../state:.../sessions//.tag.json`. let tags_dir = base_output_dir .join(format!("country:{}", route.country)) .join(format!("state:{}", route.state)) .join("sessions") - .join(&route.session) - .join("tags"); + .join(&route.session); fs::create_dir_all(&tags_dir)?; written_dirs.insert(base_output_dir.clone()); @@ -2049,7 +2180,7 @@ async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { explicit_output_dir .as_ref() .map(|d| d.display().to_string()) - .unwrap_or_else(|| current_dir.display().to_string()) + .unwrap_or_else(|| default_tags_root.display().to_string()) } else { written_dirs .iter() @@ -2584,8 +2715,8 @@ mod tests { use super::*; /// A typical `govbot source --select docs` id — the leading dataset - /// `short_name` is what `govbot apply` uses to route the `.tag.json` back - /// to `/.govbot/repos//...` by default. + /// `short_name` is what `govbot apply` uses to route the `.tag.json` under + /// `/tags//...` by default. #[test] fn parse_doc_route_extracts_dataset_prefix() { let route = @@ -2616,17 +2747,20 @@ mod tests { assert!(parse_doc_route("wy-legislation/country:us").is_none()); } - /// Regression for the Wave 6 follow-up: `source --join tags` must read - /// from the dataset-rooted `tags/` dir (sibling of the `bills/` the log - /// lives under), not from a cwd-rooted layout. After `80617ac` moved - /// `govbot apply` to write tag files into the dataset, the consumer side - /// stayed pointed at the old project-root path and the publishers saw 0 - /// entries. `resolve_tags_dir` is what closes that loop. + /// `.govbot/` is the cache; tag files belong outside it in the project- + /// rooted `tags/` output dir. The resolver's primary candidate must + /// therefore be `/tags//country:.../state:.../sessions/ + /// /`, with the in-cache `/tags/` location kept only as a + /// read-only fallback for working trees mid-migration. 
This regression + /// pins both — Bug 1's revisit must not silently restore the cache as + /// the primary location. #[test] - fn resolve_tags_dir_finds_sibling_of_bills() { + fn resolve_tags_dir_candidates_prefer_project_tags_then_cache_fallback() { let tmp = tempfile::tempdir().expect("tempdir"); - let session = tmp - .path() + let project = tmp.path().join("project"); + let session = project + .join(".govbot") + .join("repos") .join("wy-legislation") .join("country:us") .join("state:wy") @@ -2639,23 +2773,80 @@ mod tests { .join("2025-01-15T12:00:00Z.json"); fs::create_dir_all(log_path.parent().unwrap()).unwrap(); fs::write(&log_path, "{}").unwrap(); - // The `bills/` sibling must exist for the resolver to find this - // ancestor — that is the whole signal that says "this is a session - // dir". Creating the log under `bills/HB0001/logs/` already does it. - let resolved = resolve_tags_dir(&log_path).expect("resolver should find a tags dir"); - assert_eq!(resolved, session.join("tags")); + let candidates = resolve_tags_dir_candidates(&log_path, &project); + // Primary is the project-rooted output dir. + assert_eq!( + candidates.first().expect("primary candidate"), + &project + .join("tags") + .join("wy-legislation") + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025"), + ); + // Fallback A is the Bug-6 in-cache layout — read-only for migration. + assert!(candidates.iter().any(|c| c == &session.join("tags"))); + // And critically: the cache is NOT the primary location. + assert_ne!(candidates.first().unwrap(), &session.join("tags")); } /// A log file outside any dataset layout (no `bills/` ancestor) yields - /// `None`, letting the caller fall back to the legacy cwd-rooted lookup. + /// no candidates, letting the caller fall back to the legacy cwd-rooted + /// lookup. 
#[test] - fn resolve_tags_dir_returns_none_outside_dataset_layout() { + fn resolve_tags_dir_candidates_empty_outside_dataset_layout() { let tmp = tempfile::tempdir().expect("tempdir"); let stray = tmp.path().join("loose").join("file.json"); fs::create_dir_all(stray.parent().unwrap()).unwrap(); fs::write(&stray, "{}").unwrap(); - assert!(resolve_tags_dir(&stray).is_none()); + assert!(resolve_tags_dir_candidates(&stray, tmp.path()).is_empty()); + } + + /// Dataset isolation — the whole reason the `` segment lives at + /// the top of `tags/`. Two datasets sharing a `country:us/state:xx` + /// jurisdiction must write the same-named tag file to *different* files + /// on disk, keyed by short_name, so a project tracking multiple + /// jurisdictions never has one dataset's classification clobber + /// another's. + #[test] + fn tag_paths_are_dataset_isolated() { + // Synthesise the per-dataset destinations the way `run_apply_command` + // does, against two short_names that share a country/state/session. + let project = std::path::PathBuf::from("/tmp/project"); + let tags_root = project.join("tags"); + + let short_a = "wy-legislation"; + let short_b = "wy-counties"; + let country = "country:us"; + let state = "state:wy"; + let session = "2025"; + let tag = "clean_energy"; + + let path_a = tags_root + .join(short_a) + .join(country) + .join(state) + .join("sessions") + .join(session) + .join(format!("{}.tag.json", tag)); + let path_b = tags_root + .join(short_b) + .join(country) + .join(state) + .join("sessions") + .join(session) + .join(format!("{}.tag.json", tag)); + + assert_ne!(path_a, path_b, "dataset prefix must split the tag file"); + // Both must share the `tags/` prefix — the project's + // classification-output dir — never `.govbot/`. 
+ assert!(path_a.starts_with(&tags_root)); + assert!(path_b.starts_with(&tags_root)); + let govbot_cache = project.join(".govbot"); + assert!(!path_a.starts_with(&govbot_cache)); + assert!(!path_b.starts_with(&govbot_cache)); } /// End-to-end of the helper: a tag file in the dataset-rooted `tags/` diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index 959a6500..149e0ad5 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -283,9 +283,17 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { /// Write .gitignore with govbot's generated dirs and secret-bearing files. /// /// Everything under `.govbot/` (cloned datasets, ledgers, lockfile state), -/// every publisher output dir (`dist/`, `docs/`), and any local `.env` is -/// untracked. The userland repo is a few dozen text files plus tool artifacts; -/// the artifacts never belong in git. +/// every publisher output dir (`dist/`, `docs/`), the classification-output +/// dir `tags/`, and any local `.env` is untracked. The userland repo is a +/// few dozen text files plus tool artifacts; the artifacts never belong in +/// git. +/// +/// **`tags/` trade-off.** `govbot apply` writes per-tag `.tag.json` files +/// under `tags//country:.../sessions//`. The file count grows +/// with the catalog and most bots regenerate from raw data on every run — +/// so it is git-ignored by default. Users who want classification +/// provenance committed (e.g. for offline review or auditability) can +/// remove the `tags/` line from this file. pub fn write_gitignore(cwd: &Path) -> Result<()> { let gitignore_path = cwd.join(".gitignore"); // Single canonical block — easy to grep, easy to update. @@ -294,6 +302,9 @@ pub fn write_gitignore(cwd: &Path) -> Result<()> { .govbot/ dist/ docs/ +# Classification output from `govbot apply` — regenerated each run. +# Remove this line if you want classification provenance committed. 
+tags/ # Secrets — never commit .env diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 5f75b259..6d444261 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -17,7 +17,7 @@ Commands: load Load bill metadata into a DuckDB database for SQL analysis. Walks every linked dataset's `metadata.json` files, creates a `bills` table + a `bills_summary` view, and writes the database into the base govbot directory (default `./.govbot/govbot.duckdb`). Requires the `duckdb` CLI on PATH update Update the installed govbot binary to the latest nightly build from GitHub releases. Installs into `~/.govbot/bin/govbot` and prefers the platform-native `.tar.gz` asset publish Run one or more publishers from `govbot.yml: publish:`. A publisher consumes the tagged result stream and emits artifacts: `rss`/`html`/`json` write feed/index/dump files, `duckdb` loads records into a database, `bluesky` posts matches to a Bluesky account (always dry-run first with `--dry-run`) - apply Persist fastclass classification results into the dataset as tag files. Reads `fastclass classify` result JSON from stdin — the apply sink of `govbot source --select docs | fastclass classify - | govbot apply` — and writes per-tag `.tag.json` files under each bill's session directory, the files `govbot publish` turns into feeds. Classification itself is done by fastclass; `govbot apply` only stores the results + apply Persist fastclass classification results as tag files under the project's `tags/` output directory. Reads `fastclass classify` result JSON from stdin — the apply sink of `govbot source --select docs | fastclass classify - | govbot apply` — and writes per-tag `.tag.json` files under `/tags//country:.../sessions//`, the files `govbot publish` turns into feeds. 
Classification itself is done by fastclass; `govbot apply` only stores the results. `tags/` is a project-rooted classification-output dir — peer to `dist/` (publisher output) and distinct from `.govbot/` (the tool's regenerable cache) run Run the full pipeline against the current directory's `govbot.yml`: pull/update datasets → `source --select docs | fastclass classify - | apply` (the classify transform) → publish every configured publisher. `govbot` with no arguments is equivalent (and falls back to `init` if no `govbot.yml` is present) init Scaffold a new govbot.yml in the current directory (the setup wizard). Interactive in a TTY; writes sensible defaults when non-interactive add Add one or more datasets to the project's `govbot.yml` `datasets:` list. Each id is validated against the registry before it is added From 49d5d56a2f2db935e4af35014871e43758a4ea7d Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 00:11:57 -0500 Subject: [PATCH 12/32] bluesky: route {link} to companion html publisher's landing page MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: with a configured bluesky publisher whose base_url was set (e.g. https://example.org/climate-tracker), the rendered post's {link} placeholder resolved to the raw metadata.json path under that base_url: https://example.org/climate-tracker/wy-legislation/.../HB9999/metadata.json which sends an activist's reader to a JSON file, not the human-readable HTML index the manifest's `site` (html) publisher already produces. Root cause: render_post passed only the bluesky publisher's own base_url to rss::extract_link, which always appends the bill's dataset sources.bill path (a metadata.json path). The bluesky publisher had no awareness of the manifest's other publishers. 
Fix (Option A — couple via the manifest, not new fields): the publish-command flow finds the manifest's `type: html` publisher once, takes its base_url, and threads it into every PublishJob as `html_entry_url`. The bluesky publisher's {link} resolves in this priority: 1. html_entry_url — the html publisher's landing page; 2. base_url joined to sources.bill (the historical default). Falls back to the historical shape when no html publisher exists in the manifest, so this is purely additive. No new manifest fields. Why not Option B (entry_path_template) or C (always-prefer-html in extract_link): B adds new manifest surface every project has to learn; C entangles the html publisher's own output behavior with bluesky's link resolution. Option A is the smallest change that gives activists the URL they actually want, defaulting from data already in the manifest. Reproduction (in climate-activist-gov-news-bot, after seeding wy mocks + a synthetic HB9999 with climate text and a 2025-12 passage log, then `govbot apply` against a synthetic fastclass result): before: clean_energy · https://example.org/climate-tracker/wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999/metadata.json after: clean_energy · https://example.org/climate-tracker AGENT.md §2.2 and the publisher schema are updated to describe the new default. New regression test render_link_prefers_html_publisher_landing_page locks the priority in. 
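For reviewers, the priority above can be sketched as a small standalone function. This is an illustration only, not the code in this patch: `resolve_link_sketch` and its `fallback` parameter are hypothetical names, and `fallback` stands in for whatever the historical resolution (base_url joined to sources.bill, or the upstream source URL) would have produced.

```rust
/// Illustrative stand-in for the `{link}` priority described above.
/// `resolve_link_sketch` and `fallback` are hypothetical names used only
/// for this sketch; `fallback` models the historical resolution result.
fn resolve_link_sketch(html_entry_url: Option<&str>, fallback: Option<String>) -> Option<String> {
    if let Some(url) = html_entry_url {
        let trimmed = url.trim();
        // A blank html URL is treated as "no html publisher configured".
        if !trimmed.is_empty() {
            // Drop any trailing slash so the landing-page URL is canonical.
            return Some(trimmed.trim_end_matches('/').to_string());
        }
    }
    // No usable html landing page: keep the historical behaviour.
    fallback
}
```

Under these assumptions, a configured html URL always wins, and a manifest without an html publisher sees no change in behaviour, which is what keeps the fix additive.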
Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 17 +++-- actions/govbot/src/bluesky.rs | 136 +++++++++++++++++++++++++++++++--- actions/govbot/src/main.rs | 12 +++ actions/govbot/src/publish.rs | 7 ++ schemas/govbot.schema.json | 2 +- 5 files changed, 158 insertions(+), 16 deletions(-) diff --git a/AGENT.md b/AGENT.md index 3542602a..fe982d01 100644 --- a/AGENT.md +++ b/AGENT.md @@ -177,10 +177,10 @@ publish: type: bluesky select: [transit_funding, transit_safety] # tag names from classifier.yml min_score: 0.6 # calibrated final_score threshold; 0..1 - # base_url is the prefix used by the `{link}` placeholder below — set it - # to wherever the bill page actually lives (e.g. the GitHub Pages URL of - # the rss/html publisher), otherwise `{link}` falls back to the bill's - # bill.sources[0].url (if any) or renders empty. + # `{link}` defaults to the companion `html` publisher's base_url (the + # human landing page), so the cleanest setup is to declare an `html` + # publisher in this manifest. `base_url` here is only the fallback when + # no `html` publisher is configured. base_url: "https://.github.io/" post_template: "{title}\n\n{tags} · {link}" # ledger: .govbot/bluesky-bluesky.ledger # default; tracks posted bills @@ -439,10 +439,17 @@ Under `govbot.yml: publish:` (see the template in §1.3): | `type: bluesky` | selects the Bluesky publisher | | `select` | tag names to post — must exist in the classifier bundle | | `min_score` | minimum calibrated `final_score` (0..1) to post; default `0.6` | -| `base_url` | prefix used to render `{link}` in `post_template`; same shape as the rss/html publishers' `base_url`. 
Falls back to `bill.sources[0].url` when unset | +| `base_url` | fallback prefix for `{link}` when no companion `html` publisher is configured; same shape as the rss/html publishers' `base_url` | | `post_template` | post text; placeholders `{title} {tags} {link} {identifier} {session} {score}`; truncated to 300 chars | | `ledger` | posted-state ledger path; default `.govbot/bluesky-.ledger` | +`{link}` resolves in this order: (1) the manifest's `html` publisher's +`base_url` — the **human-readable landing page** activists actually click +through to; (2) the bluesky publisher's own `base_url` joined to the bill's +dataset path; (3) the bill's first upstream source URL. Configuring an +`html` publisher alongside `bluesky` makes the default useful — without it, +`{link}` resolves to a raw `metadata.json` path under `base_url`. + Credentials are **never** config fields — they are env-only. ### 2.3 Dry-run first — always diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index 784ca0a8..e67a5494 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -22,6 +22,21 @@ //! - `BLUESKY_APP_PASSWORD` — an app password (Settings → App Passwords), //! never the main account password //! - `BLUESKY_SERVICE` — optional PDS base URL (default `https://bsky.social`) +//! +//! ### `{link}` resolution +//! +//! `{link}` in `post_template` resolves with this priority: +//! 1. the manifest's companion `html` publisher's `base_url` — the +//! human-readable landing page activists actually want to click through +//! to (computed once in `run_publish_command` and passed in via +//! `PublishJob::html_entry_url`); +//! 2. the bluesky publisher's own `base_url` joined to the bill's dataset +//! `sources.bill` path — the historical default, which points at the +//! raw `metadata.json` file (rarely what an activist wants); +//! 3. the bill's `bill.sources[0].url` (the upstream legislature page). +//! +//! 
Declaring an `html` publisher alongside `bluesky` is what makes the +//! default useful. See AGENT.md §2.2. use crate::publish::PublishJob; use anyhow::{Context, Result}; @@ -63,11 +78,28 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { let ledger_path = resolve_ledger_path(job); // Select records: a `select`ed tag must clear the calibrated threshold. + // + // `{link}` resolves with this priority: + // 1. the companion `html` publisher's landing-page URL (the human page); + // 2. the bluesky publisher's own `base_url` joined to the bill source + // path (the historical default — `metadata.json`, the JSON file); + // 3. the bill's `bill.sources[0].url` (the upstream legislature page). + // Most useful default with no new manifest surface: when the manifest + // carries an html publisher, route activists to that human page rather + // than to the raw JSON that the rss/html publishers' `extract_link` + // emits. let posts: Vec = job .entries .iter() .filter(|e| record_clears_threshold(e, &select, min_score)) - .map(|e| render_post(e, p.post_template.as_deref(), p.base_url.as_deref())) + .map(|e| { + render_post( + e, + p.post_template.as_deref(), + p.base_url.as_deref(), + job.html_entry_url.as_deref(), + ) + }) .collect(); if posts.is_empty() { @@ -213,12 +245,24 @@ fn record_clears_threshold(entry: &Value, select: &[String], min_score: f64) -> /// Render a record into post text, applying the template and truncating to /// the Bluesky character limit. /// -/// `base_url` — the publisher's `base_url` field — is the prefix prepended to -/// the bill's source-relative path when assembling `{link}`. Mirrors the -/// rss/html publishers' shape so a user can put a public URL into post text. -/// If `base_url` is `None`, the link falls back to the bill's -/// `bill.sources[0].url` (if present); otherwise `{link}` renders empty. 
-fn render_post(entry: &Value, template: Option<&str>, base_url: Option<&str>) -> RenderedPost { +/// `{link}` resolution order: +/// 1. `html_entry_url` — the manifest's companion `html` publisher's +/// landing-page URL (the human-readable index activists actually want +/// to click through to); +/// 2. `base_url` joined to the bill's `sources.bill` dataset path +/// (the historical default — a raw `metadata.json` link); +/// 3. the bill's `bill.sources[0].url` (the upstream legislature page); +/// 4. empty. +/// +/// The html-publisher route is the *useful default* — without it, `{link}` +/// resolves to `//.../metadata.json`, which lands an +/// activist's reader on a JSON file. See Bug 7. +fn render_post( + entry: &Value, + template: Option<&str>, + base_url: Option<&str>, + html_entry_url: Option<&str>, +) -> RenderedPost { let id = crate::rss::extract_guid(entry); let template = template.unwrap_or(DEFAULT_TEMPLATE); @@ -228,7 +272,7 @@ fn render_post(entry: &Value, template: Option<&str>, base_url: Option<&str>) -> .and_then(|t| t.as_object()) .map(|m| m.keys().cloned().collect::>().join(", ")) .unwrap_or_default(); - let link = crate::rss::extract_link(entry, base_url).unwrap_or_default(); + let link = resolve_link(entry, base_url, html_entry_url).unwrap_or_default(); let identifier = entry .get("bill") .and_then(|b| b.get("identifier")) @@ -260,6 +304,34 @@ fn render_post(entry: &Value, template: Option<&str>, base_url: Option<&str>) -> } } +/// Resolve `{link}` for a bluesky post. +/// +/// Priority: +/// 1. the companion `html` publisher's landing-page URL — the +/// human-readable index page the manifest already promised activists +/// (the fix for Bug 7); +/// 2. the historical default — `extract_link`: bluesky's own `base_url` +/// joined to the dataset `sources.bill` path, falling back to the +/// bill's first upstream source URL. 
+/// +/// (1) is the useful default: without it, `{link}` pointed at the raw +/// `metadata.json` path under the bluesky `base_url`, which sent an +/// activist's reader to a JSON file. The html publisher's landing page is +/// the human page an activist actually wants to click. +fn resolve_link( + entry: &Value, + base_url: Option<&str>, + html_entry_url: Option<&str>, +) -> Option { + if let Some(url) = html_entry_url { + let trimmed = url.trim(); + if !trimmed.is_empty() { + return Some(trimmed.trim_end_matches('/').to_string()); + } + } + crate::rss::extract_link(entry, base_url) +} + /// The highest calibrated `final_score` across a record's tags. fn top_score(entry: &Value) -> Option { entry @@ -620,7 +692,12 @@ mod tests { "bill": { "title": "Renewable energy storage act", "identifier": "HB 1" }, "tags": { "clean_energy": { "final_score": 0.92 } } }); - let post = render_post(&entry, Some("{title} [{identifier}] {tags} {score}"), None); + let post = render_post( + &entry, + Some("{title} [{identifier}] {tags} {score}"), + None, + None, + ); assert!(post.text.contains("Renewable energy storage act")); assert!(post.text.contains("[HB 1]")); assert!(post.text.contains("clean_energy")); @@ -642,6 +719,7 @@ mod tests { &entry, Some("{title} {link}"), Some("https://example.org/climate-tracker"), + None, // no companion html publisher ); assert!( post.text.contains( @@ -667,11 +745,49 @@ mod tests { }, "tags": { "clean_energy": { "final_score": 0.9 } } }); - let post = render_post(&entry, Some("{title} -> {link}"), None); + let post = render_post(&entry, Some("{title} -> {link}"), None, None); assert!( post.text.contains("https://wyoleg.gov/2025/Bills/HB0001"), "expected bill.sources[0].url to render as {{link}}; got: {}", post.text ); } + + /// Bug 7 regression: when the manifest has a companion `html` publisher, + /// `{link}` resolves to that publisher's landing-page URL — not to the + /// raw `metadata.json` path under bluesky's own `base_url`. 
+ /// + /// Before this fix, with bluesky `base_url: + /// https://example.org/climate-tracker` set, a userland dry-run rendered: + /// https://example.org/climate-tracker/wy-legislation/.../HB9999/metadata.json + /// which is a JSON file, not a human page. + #[test] + fn render_link_prefers_html_publisher_landing_page() { + let entry = json!({ + "id": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999", + "bill": { "title": "Clean energy tax credit", "identifier": "HB9999" }, + "sources": { + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999/metadata.json" + }, + "tags": { "clean_energy": { "final_score": 0.91 } } + }); + let post = render_post( + &entry, + Some("{title} -> {link}"), + Some("https://example.org/climate-tracker"), // bluesky's own base_url + Some("https://example.org/climate-tracker"), // companion html publisher's base_url + ); + // Must NOT route activists at the raw JSON path. + assert!( + !post.text.contains("metadata.json"), + "expected {{link}} to skip the metadata.json path when a companion html publisher exists; got: {}", + post.text + ); + // Must land at the html publisher's URL — the human-readable index. + assert!( + post.text.contains("https://example.org/climate-tracker"), + "expected {{link}} to resolve to the html publisher's landing-page URL; got: {}", + post.text + ); + } } diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 822d436c..ff1dc445 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -2326,6 +2326,17 @@ async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { } }); + // Resolve the companion html-publisher landing URL once: the bluesky + // publisher uses it as the default for `{link}` so a post links to the + // human-readable HTML index, not the raw metadata.json path under its + // own `base_url`. None when the manifest has no html publisher. 
+ let html_entry_url: Option<String> = manifest + .publish + .values() + .find(|p| p.kind == govbot::PublisherKind::Html) + .and_then(|p| p.base_url.clone()) + .filter(|u| !u.trim().is_empty()); + // Run each named publisher against its filtered/sorted/limited stream. for name in &names_to_run { let publisher = manifest.publish.get(name).expect("checked above"); @@ -2375,6 +2386,7 @@ async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { output_file_override: output_file.clone(), project_dir: current_dir.clone(), dry_run, + html_entry_url: html_entry_url.clone(), }; govbot::publish::run_publisher(&job)?; } diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index b0ade976..b1f37159 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -31,6 +31,13 @@ pub struct PublishJob<'a> { /// `--dry-run`: render but do not emit. The `bluesky` publisher honours /// this by touching no network and no ledger. pub dry_run: bool, + /// The companion `html` publisher's public landing-page URL, if the + /// manifest declares one (e.g. `https://example.org/climate-tracker`). + /// The `bluesky` publisher uses this as the default for `{link}` so a + /// post links to the *human-readable* HTML index, not the raw + /// `metadata.json` path that the rss/html publishers' `extract_link` + /// emits by default. None when the manifest has no `html` publisher. + pub html_entry_url: Option<String>, } /// Run a single publisher against its result stream and emit artifacts. diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json index bc2d65e1..d56127ac 100644 --- a/schemas/govbot.schema.json +++ b/schemas/govbot.schema.json @@ -98,7 +98,7 @@ } }, "base_url": { - "description": "Base URL for generated links. Required for rss/html (e.g. the GitHub Pages URL). 
Optional for bluesky -- when set, it is the prefix used to render the '{link}' placeholder in 'post_template'; when omitted, '{link}' falls back to the bill's 'bill.sources[0].url' (or renders empty).", + "description": "Base URL for generated links. Required for rss/html (e.g. the GitHub Pages URL). For bluesky, '{link}' defaults to the manifest's companion 'html' publisher's base_url (the human-readable landing page activists click through to); when no 'html' publisher is configured, '{link}' falls back to this 'base_url' joined to the bill's dataset path, and then to 'bill.sources[0].url'.", "type": "string", "format": "uri" }, From a908227de81ee3f817f8e09e258098d64a572047 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 00:15:39 -0500 Subject: [PATCH 13/32] publish: one publisher type, one artifact (rss != html) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Symptom: a manifest declaring both an rss publisher and an html publisher into the same output_dir silently overwrote files. The userland e2e produced: === Publisher 'feed' (Rss) ... === ✓ Generated RSS feed: dist/climate-feed.xml ✓ Generated HTML index: dist/index.html ← rss publisher wrote HTML === Publisher 'site' (Html) ... === ✓ Generated RSS feed: dist/feed.xml ← html publisher wrote RSS ✓ Generated HTML index: dist/index.html ← COLLISION, last writer wins Root cause: `emit_rss_html` was one function handling both kinds — each call wrote both feed.xml and index.html regardless of `type`. The second publisher to run (sorted by name) silently overwrote the first's index.html. Fix: split into two functions. - `type: rss` -> emit_rss writes only /feed.xml - `type: html` -> emit_html writes only /index.html Each publisher's `output_file` defaults to its own kind's filename. Declaring both gets both, side-by-side, no collision. 
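
The "one type, one artifact" rule from the Fix paragraph above can be sketched roughly as follows. This is a simplified illustration, not the code in src/publish.rs: `PublisherKind` here is a two-variant stand-in for the real enum, and `default_output_file` and `artifact_name` are hypothetical helper names. The precedence shown (CLI override, then the manifest's `output_file`, then the per-kind default) is assumed to mirror what `emit_rss` and `emit_html` do.

```rust
// Hypothetical, trimmed-down stand-in for govbot's publisher kind; the real
// enum also has Json, Duckdb, and Bluesky variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PublisherKind {
    Rss,
    Html,
}

/// One kind, one artifact: each kind owns exactly one default filename.
fn default_output_file(kind: PublisherKind) -> &'static str {
    match kind {
        PublisherKind::Rss => "feed.xml",
        PublisherKind::Html => "index.html",
    }
}

/// Resolve the single file a publisher writes. Assumed precedence:
/// CLI override, then the manifest's `output_file`, then the kind's default.
fn artifact_name(
    kind: PublisherKind,
    cli_override: Option<&str>,
    manifest_file: Option<&str>,
) -> String {
    cli_override
        .or(manifest_file)
        .unwrap_or(default_output_file(kind))
        .to_string()
}
```

Because each kind resolves to its own default, an `rss` and an `html` publisher sharing one output_dir can only collide if a manifest explicitly gives both the same `output_file`.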
Reproduction (climate-activist-gov-news-bot, after seeding wy mocks + a
synthetic HB9999 + apply'ing a synthetic fastclass result):

  before: dist/ contained climate-feed.xml AND feed.xml (the rss
          publisher's output_file vs the html publisher's cross-emitted
          feed.xml), and index.html came from whichever publisher ran
          last.
  after:  dist/ contains climate-feed.xml + index.html, written by their
          respective publishers; no cross-emission, no collision.

Side effects:
- The wizard now scaffolds BOTH a `feed` (type: rss) and a `site`
  (type: html) publisher so a fresh project still gets both artifacts by
  default. Snapshots regenerated.
- AGENT.md §1.3 now documents both publisher kinds in the template and
  calls out the "one type, one artifact" rule.
- schemas/govbot.schema.json describes the per-kind output_file defaults.

Userland breaking change: a manifest that relied on the double-emission
(e.g. declaring only `type: rss` and expecting index.html to also appear)
will need to add a peer `type: html` publisher. The climate-activist
userland already declares both under `feed` and `site`, so it picks up the
fix transparently: the stray `dist/feed.xml` that the html publisher
cross-emitted alongside the rss output_file `climate-feed.xml` goes away.

Three regression tests in src/publish.rs lock the split in: - rss_publisher_writes_only_feed_xml - html_publisher_writes_only_index_html - rss_and_html_publishers_coexist_in_one_output_dir Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 17 +- actions/govbot/src/publish.rs | 198 +++++++++++++++++- actions/govbot/src/wizard.rs | 20 +- .../snapshots/wizard_tests__wizard_all.snap | 7 + .../wizard_tests__wizard_session_all.snap | 13 +- ...rd_tests__wizard_session_single_state.snap | 13 +- ...wizard_tests__wizard_session_specific.snap | 13 +- .../wizard_tests__wizard_single.snap | 7 + .../wizard_tests__wizard_specific.snap | 7 + schemas/govbot.schema.json | 8 +- 10 files changed, 273 insertions(+), 30 deletions(-) diff --git a/AGENT.md b/AGENT.md index fe982d01..ed6df78a 100644 --- a/AGENT.md +++ b/AGENT.md @@ -186,20 +186,31 @@ publish: # ledger: .govbot/bluesky-bluesky.ledger # default; tracks posted bills feed: - type: rss + type: rss # writes /feed.xml (only) + select: [transit_funding, transit_safety] + base_url: "https://.github.io/" + output_dir: docs + + site: + type: html # writes /index.html (only) select: [transit_funding, transit_safety] base_url: "https://.github.io/" output_dir: docs pipelines: - default: [classify, bluesky, feed] + default: [classify, bluesky, feed, site] ``` Notes: - **No `tags:` key.** It is retired; a manifest carrying it fails to parse. +- **One publisher type, one artifact.** `type: rss` writes only the RSS + feed; `type: html` writes only the HTML index. Declare both to get both + (an earlier release wrote both files from each — a silent + last-writer-wins collision on `index.html`). - `publish..select` lists tag names — they must exist in the classifier bundle. Validate later with `fastclass describe`. -- Drop the `feed` publisher if the user only wants Bluesky, and vice versa. +- Drop the `feed` / `site` publishers if the user only wants Bluesky, and + vice versa. 
- Prefer `govbot add ` over hand-editing the `datasets:` list — it validates each id against the registry first. Use `govbot init` to scaffold the whole `govbot.yml` interactively. diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index b1f37159..64cef389 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -42,10 +42,19 @@ pub struct PublishJob<'a> { /// Run a single publisher against its result stream and emit artifacts. /// -/// govbot's built-in publishers each consume the result stream and emit -/// artifacts: `rss`/`html` write a feed + HTML index, `json` writes a JSON -/// dump, `duckdb` loads the records into a DuckDB database, and `bluesky` -/// posts matched bills to a Bluesky account (see `crate::bluesky`). +/// **One publisher type, one artifact.** Each built-in publisher writes +/// exactly one kind of file: +/// +/// - `type: rss` writes the RSS feed (default `feed.xml`); +/// - `type: html` writes the HTML index (default `index.html`); +/// - `type: json` writes a JSON dump; +/// - `type: duckdb` loads records into a DuckDB database; +/// - `type: bluesky` posts matched bills to a Bluesky account. +/// +/// Before this split, `rss` and `html` each emitted *both* a feed.xml and +/// an index.html — declaring both in one manifest produced a silent +/// last-writer-wins collision on `index.html`. Declare both explicitly to +/// get both artifacts. 
pub fn run_publisher(job: &PublishJob) -> Result<()> { let p = job.publisher; let select = p.select.clone().unwrap_or_default(); @@ -58,7 +67,8 @@ pub fn run_publisher(job: &PublishJob) -> Result<()> { ); match p.kind { - PublisherKind::Rss | PublisherKind::Html => emit_rss_html(job, &select, &output_dir), + PublisherKind::Rss => emit_rss(job, &select, &output_dir), + PublisherKind::Html => emit_html(job, &output_dir), PublisherKind::Json => emit_json(job, &output_dir), PublisherKind::Duckdb => emit_duckdb(job, &output_dir), PublisherKind::Bluesky => crate::bluesky::run_bluesky(job, job.dry_run), @@ -80,8 +90,13 @@ fn titlecase_tag(tag: &str) -> String { .join(" ") } -/// The `rss`/`html` publisher: a combined RSS feed + HTML index. -fn emit_rss_html(job: &PublishJob, select: &[String], output_dir: &Path) -> Result<()> { +/// The `rss` publisher: emits the RSS feed (and *only* the RSS feed). +/// +/// Default output: `/feed.xml`. Pair with a peer `type: html` +/// publisher to also get an `index.html`. Before this split, `rss` also +/// wrote `index.html` — which collided with the `html` publisher's +/// `index.html` and made the rendering last-writer-wins. +fn emit_rss(job: &PublishJob, select: &[String], output_dir: &Path) -> Result<()> { let p = job.publisher; let output_file = job @@ -141,15 +156,36 @@ fn emit_rss_html(job: &PublishJob, select: &[String], output_dir: &Path) -> Resu let rss_path = output_dir.join(&output_file); fs::write(&rss_path, rss_xml)?; eprintln!("✓ Generated RSS feed: {}", rss_path.display()); + Ok(()) +} + +/// The `html` publisher: emits the HTML index (and *only* the HTML index). +/// +/// Default output: `/index.html`. Pair with a peer `type: rss` +/// publisher to also get an RSS feed. Before this split, `html` also wrote +/// a `feed.xml` — which collided with the `rss` publisher's `feed.xml`. 
+fn emit_html(job: &PublishJob, output_dir: &Path) -> Result<()> { + let p = job.publisher; + + let feed_link = p.base_url.as_deref().unwrap_or("https://example.com"); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; eprintln!( "Generating HTML index with {} entries...", job.entries.len() ); + let output_file = job + .output_file_override + .clone() + .or_else(|| p.output_file.clone()) + .unwrap_or_else(|| "index.html".to_string()); + // Only pass an explicit (configured) title to the HTML header. let html_title = p.title.as_deref().filter(|s| !s.trim().is_empty()); let html = rss::json_to_html(job.entries.clone(), html_title, feed_link, Some(feed_link)); - let html_path = output_dir.join("index.html"); + let html_path = output_dir.join(&output_file); fs::write(&html_path, html)?; eprintln!("✓ Generated HTML index: {}", html_path.display()); Ok(()) @@ -273,3 +309,149 @@ pub fn sort_by_timestamp(mut entries: Vec) -> Vec { }); entries } + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + use tempfile::tempdir; + + /// Build a `PublishJob` over a tempdir for the given publisher kind. + fn job_for_kind<'a>( + name: &'a str, + publisher: &'a Publisher, + project_dir: PathBuf, + ) -> PublishJob<'a> { + PublishJob { + name, + publisher, + entries: vec![json!({ + "id": "wy-legislation/.../HB0001", + "timestamp": "20250101T000000Z", + "bill": { "title": "Sample bill", "identifier": "HB0001" }, + "sources": { "bill": "wy-legislation/.../HB0001/metadata.json" }, + "tags": { "clean_energy": { "final_score": 0.9 } }, + })], + output_dir_override: None, + output_file_override: None, + project_dir, + dry_run: false, + html_entry_url: None, + } + } + + /// Bug 8 regression: `type: rss` writes ONLY the RSS feed, not an + /// HTML index. Before the split, this publisher kind also produced + /// `index.html`, colliding with the html publisher's `index.html`. 
+ #[test] + fn rss_publisher_writes_only_feed_xml() { + let dir = tempdir().unwrap(); + let p = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(dir.path().join("out").to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let job = job_for_kind("feed", &p, dir.path().to_path_buf()); + run_publisher(&job).expect("rss publisher should run"); + + let out = dir.path().join("out"); + assert!(out.join("feed.xml").exists(), "expected feed.xml"); + assert!( + !out.join("index.html").exists(), + "rss publisher must NOT emit index.html — that's the html publisher's job" + ); + } + + /// Bug 8 regression: `type: html` writes ONLY the HTML index, not an + /// RSS feed. Before the split, this publisher kind also produced + /// `feed.xml`, colliding with the rss publisher's `feed.xml`. + #[test] + fn html_publisher_writes_only_index_html() { + let dir = tempdir().unwrap(); + let p = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(dir.path().join("out").to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let job = job_for_kind("site", &p, dir.path().to_path_buf()); + run_publisher(&job).expect("html publisher should run"); + + let out = dir.path().join("out"); + assert!(out.join("index.html").exists(), "expected index.html"); + assert!( + !out.join("feed.xml").exists(), + "html publisher must NOT emit feed.xml — that's the rss publisher's job" + ); + } + + /// Declaring both `rss` and `html` publishers into the SAME output_dir + /// produces both artifacts side-by-side. 
Before the split, running + /// `rss` then `html` (or vice versa) produced a silent + /// last-writer-wins collision on `index.html`. + #[test] + fn rss_and_html_publishers_coexist_in_one_output_dir() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + + let rss = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: Some("RSS publisher title".to_string()), + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let html = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: Some("HTML publisher title".to_string()), + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + + let job_rss = job_for_kind("feed", &rss, dir.path().to_path_buf()); + run_publisher(&job_rss).unwrap(); + let job_html = job_for_kind("site", &html, dir.path().to_path_buf()); + run_publisher(&job_html).unwrap(); + + let feed_xml = std::fs::read_to_string(out_dir.join("feed.xml")).unwrap(); + let index_html = std::fs::read_to_string(out_dir.join("index.html")).unwrap(); + // Each publisher's own title must be in its own artifact — proves + // neither publisher overwrote the other's output. 
+ assert!( + feed_xml.contains("RSS publisher title"), + "feed.xml should carry the rss publisher's title" + ); + assert!( + index_html.contains("HTML publisher title"), + "index.html should carry the html publisher's title (not the rss publisher's)" + ); + } +} diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index 149e0ad5..87d62884 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -57,10 +57,10 @@ impl WizardSession { display.push_str("docs to build one.\n\n"); // Step 3: Publishing - display.push_str("Publishing is configured for an RSS feed by default.\n"); - display.push_str("Your feed will be generated in the \"docs\" directory.\n\n"); + display.push_str("Publishing is configured for an RSS feed + HTML index by default.\n"); + display.push_str("Both land in the \"docs\" directory (feed.xml + index.html).\n\n"); display.push_str(&format!( - "? Base URL for your feed: {}\n\n", + "? Base URL for your feeds: {}\n\n", choices.base_url )); @@ -218,8 +218,8 @@ fn prompt_sources() -> Result> { fn prompt_publishing() -> Result { eprintln!(); - eprintln!("Publishing is configured for RSS feeds by default."); - eprintln!("Your feeds will be generated in the \"docs\" directory."); + eprintln!("Publishing is configured for an RSS feed + HTML index by default."); + eprintln!("Both land in the \"docs\" directory (feed.xml + index.html)."); eprintln!(); let base_url: String = Input::new() @@ -262,13 +262,20 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { yml.push_str(" classifier: ./classifier\n"); yml.push('\n'); - // publish — a publisher consumes the result stream and emits artifacts. + // publish — one publisher type, one artifact. 
+ // - `feed` (type: rss) writes /feed.xml + // - `site` (type: html) writes /index.html yml.push_str("publish:\n"); yml.push_str(" feed:\n"); yml.push_str(" type: rss\n"); yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); yml.push_str(" output_dir: \"docs\"\n"); yml.push_str(" output_file: \"feed.xml\"\n"); + yml.push_str(" site:\n"); + yml.push_str(" type: html\n"); + yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); + yml.push_str(" output_dir: \"docs\"\n"); + yml.push_str(" output_file: \"index.html\"\n"); yml.push('\n'); // pipelines — named `govbot run` targets, npm-script style. @@ -276,6 +283,7 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { yml.push_str(" default:\n"); yml.push_str(" - classify\n"); yml.push_str(" - feed\n"); + yml.push_str(" - site\n"); yml } diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap index 553154a9..d2b6d0f0 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 58 expression: "&yml" --- # Govbot Manifest @@ -23,8 +24,14 @@ publish: base_url: "https://myuser.github.io/my-govbot" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap index 7799b348..fa4c60af 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 18 expression: "&session.to_snapshot()" --- === Wizard Session === @@ 
-15,10 +16,10 @@ Point the manifest's `transforms.classify.classifier` at your bundle directory (containing classifier.yml). See the fastclass docs to build one. -Publishing is configured for an RSS feed by default. -Your feed will be generated in the "docs" directory. +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). -? Base URL for your feed: https://myuser.github.io/my-govbot +? Base URL for your feeds: https://myuser.github.io/my-govbot ✓ Created govbot.yml ✓ Created .gitignore @@ -49,11 +50,17 @@ publish: base_url: "https://myuser.github.io/my-govbot" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site === Generated: .github/workflows/build.yml === diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap index 2ecec7d9..c1d95854 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 44 expression: "&session.to_snapshot()" --- === Wizard Session === @@ -19,10 +20,10 @@ Point the manifest's `transforms.classify.classifier` at your bundle directory (containing classifier.yml). See the fastclass docs to build one. -Publishing is configured for an RSS feed by default. -Your feed will be generated in the "docs" directory. +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). -? Base URL for your feed: https://sartaj.me/govbot +? 
Base URL for your feeds: https://sartaj.me/govbot ✓ Created govbot.yml ✓ Created .gitignore @@ -53,11 +54,17 @@ publish: base_url: "https://sartaj.me/govbot" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site === Generated: .github/workflows/build.yml === diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap index e96ed593..013135da 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 31 expression: "&session.to_snapshot()" --- === Wizard Session === @@ -19,10 +20,10 @@ Point the manifest's `transforms.classify.classifier` at your bundle directory (containing classifier.yml). See the fastclass docs to build one. -Publishing is configured for an RSS feed by default. -Your feed will be generated in the "docs" directory. +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). -? Base URL for your feed: https://activist.github.io/legislation +? 
Base URL for your feeds: https://activist.github.io/legislation ✓ Created govbot.yml ✓ Created .gitignore @@ -55,11 +56,17 @@ publish: base_url: "https://activist.github.io/legislation" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://activist.github.io/legislation" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site === Generated: .github/workflows/build.yml === diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap index 0ba08307..c504f67e 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 81 expression: "&yml" --- # Govbot Manifest @@ -23,8 +24,14 @@ publish: base_url: "https://sartaj.me/govbot" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap index cf8c6819..1ae3458c 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap @@ -1,5 +1,6 @@ --- source: tests/wizard_tests.rs +assertion_line: 71 expression: "&yml" --- # Govbot Manifest @@ -25,8 +26,14 @@ publish: base_url: "https://example.com" output_dir: "docs" output_file: "feed.xml" + site: + type: html + base_url: "https://example.com" + output_dir: "docs" + output_file: "index.html" pipelines: default: - classify - feed + - site diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json index d56127ac..4b4db0fd 100644 --- a/schemas/govbot.schema.json +++ b/schemas/govbot.schema.json @@ -27,7 
+27,7 @@ } }, "publish": { - "description": "Named publishers. A publisher consumes a result stream and emits artifacts (a feed, an HTML index, a JSON dump, a DuckDB database, or Bluesky posts). Each publisher declares a 'type' plus type-specific configuration.", + "description": "Named publishers. A publisher consumes a result stream and emits one artifact: an RSS feed, an HTML index, a JSON dump, a DuckDB database, or Bluesky posts. Each publisher declares a 'type' plus type-specific configuration. To emit multiple artifact kinds (e.g. both an RSS feed and an HTML index), declare one publisher per kind.", "type": "object", "additionalProperties": { "$ref": "#/definitions/publisher" @@ -83,10 +83,10 @@ }, "publisher": { "type": "object", - "description": "A single publisher stage. The required fields depend on 'type'.", + "description": "A single publisher stage. The required fields depend on 'type'. Each publisher kind emits exactly ONE artifact: 'rss' writes the RSS feed (default feed.xml), 'html' writes the HTML index (default index.html), 'json' writes a JSON dump, 'duckdb' loads records into a DuckDB database, 'bluesky' posts to a Bluesky account. To get both a feed and an HTML index, declare both an 'rss' and an 'html' publisher.", "properties": { "type": { - "description": "The publisher kind. 'rss' and 'html' emit feed/index files; 'json' emits a JSON dump; 'duckdb' loads results into a DuckDB database; 'bluesky' posts to a Bluesky account.", + "description": "The publisher kind. 'rss' writes the RSS feed only; 'html' writes the HTML index only; 'json' emits a JSON dump; 'duckdb' loads results into a DuckDB database; 'bluesky' posts to a Bluesky account.", "type": "string", "enum": ["rss", "html", "json", "duckdb", "bluesky"] }, @@ -108,7 +108,7 @@ "default": "docs" }, "output_file": { - "description": "Output filename for the primary artifact (e.g. 
the RSS feed file, the JSON dump, or the DuckDB database file).", + "description": "Output filename for the publisher's single artifact. Defaults by 'type': 'rss' -> 'feed.xml', 'html' -> 'index.html', 'json' -> 'feed.json', 'duckdb' -> 'feed.duckdb'.", "type": "string" }, "title": { From 7eec08b0e5b551dc17325d781c347e6e95c358bf Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 00:16:58 -0500 Subject: [PATCH 14/32] hygiene: rename src/embeddings.rs to src/tagfile.rs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The module no longer has anything to do with embeddings — the ONNX machinery moved out when classification was delegated to fastclass. What remained was a misnomer: the in-process apply sink for fastclass result JSON. The on-disk artifact this module models is the per-tag `.tag.json` file `govbot apply` writes; `tagfile` is what it actually does. Pure rename — no behavior change: - src/embeddings.rs -> src/tagfile.rs (git mv keeps history). - lib.rs swaps `pub mod embeddings;` -> `pub mod tagfile;` and the re-export `pub use embeddings::{...}` -> `pub use tagfile::{...}`. - Module-level docstring rewritten to drop the "what used to be here" framing and describe what the module currently does; retains a single historical sentence for archaeology. All 45 tests stay green; no `use govbot::embeddings::*` sites elsewhere needed updating (lib.rs was the only consumer; main.rs already imports through the `govbot::*` re-export). A trailing `grep -rn '\bembeddings\b' actions/govbot/src actions/govbot/tests` now matches only the historical sentence in tagfile.rs's module-level doc comment — no live code references the old name. 
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/lib.rs | 8 ++++---- .../govbot/src/{embeddings.rs => tagfile.rs} | 19 ++++++++++++------- 2 files changed, 16 insertions(+), 11 deletions(-) rename actions/govbot/src/{embeddings.rs => tagfile.rs} (74%) diff --git a/actions/govbot/src/lib.rs b/actions/govbot/src/lib.rs index a3087719..26971c4a 100644 --- a/actions/govbot/src/lib.rs +++ b/actions/govbot/src/lib.rs @@ -6,7 +6,6 @@ pub mod bluesky; pub mod cache; pub mod config; -pub mod embeddings; pub mod error; pub mod filter; pub mod git; @@ -17,6 +16,7 @@ pub mod publish; pub mod registry; pub mod rss; pub mod selectors; +pub mod tagfile; pub mod types; pub mod wizard; @@ -24,14 +24,14 @@ pub use config::{ Command_, Config, ConfigBuilder, JoinOption, Manifest, Publisher, PublisherKind, SortOrder, Transform, }; -pub use embeddings::{ - hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, -}; pub use error::{Error, Result}; pub use filter::{FilterAlias, FilterManager, FilterResult, LogFilter}; pub use lock::LockFile; pub use processor::PipelineProcessor; pub use registry::{DatasetEntry, Registry, ResolvedDataset}; +pub use tagfile::{ + hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, +}; pub use types::{LogContent, LogEntry, Metadata, VoteEventResult}; /// Re-export commonly used types for convenience diff --git a/actions/govbot/src/embeddings.rs b/actions/govbot/src/tagfile.rs similarity index 74% rename from actions/govbot/src/embeddings.rs rename to actions/govbot/src/tagfile.rs index 619194b9..728886aa 100644 --- a/actions/govbot/src/embeddings.rs +++ b/actions/govbot/src/tagfile.rs @@ -1,11 +1,16 @@ -//! Per-bill tag-file persistence types. +//! Per-bill tag-file persistence types — the on-disk `.tag.json` format. //! -//! govbot no longer classifies bills itself — classification is delegated to -//! `fastclass` (an external transform). What remains here is the on-disk -//! 
`.tag.json` format: `govbot apply` deserializes a `fastclass classify` -//! result and writes these structs into the dataset; `govbot publish` reads -//! them back. The ONNX embedding machinery that used to live in this module -//! has been removed. +//! `govbot apply` (the apply sink of `govbot source --select docs | +//! fastclass classify - | govbot apply`) deserializes a `fastclass classify` +//! result and writes these structs into `/tags/...`; `govbot +//! publish` reads them back as input to the publishers. +//! +//! This module used to be `embeddings.rs` and housed the in-process ONNX +//! embedding pipeline. govbot no longer classifies bills itself — +//! classification is now delegated to `fastclass` over a process boundary +//! (see `schemas/STREAM_PROTOCOL.md`) — so the ONNX machinery has been +//! removed and what remains is just the tag-file shape. Renamed to +//! `tagfile.rs` to match what it actually contains. use serde::{Deserialize, Serialize}; use sha2::{Digest, Sha256}; From 7592418cecae9e6bfae067382eac65ebceef433d Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 09:19:37 -0500 Subject: [PATCH 15/32] source: id docs by bill, not session, for symlinked OCD log layouts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `govbot source --select docs` was emitting session-level ids when the on-disk log lived at `/.../sessions//logs/.json` (the common windycivi-cloned shape across most states). The walker reports the symlink path, not its canonical per-bill target, so the old `sources.log.split("/logs/").next()` builder stopped at the session and dropped the `/bills/` segment from the id. Real-world impact: `govbot pull all` across 55 jurisdictions produced 4916 records compressed to 97 unique ids — every bill in a session collapsed onto one id. Downstream, `parse_doc_route` refuses ids without a `bills` segment, so `govbot apply` skipped every collapsed record entirely. 
STREAM_PROTOCOL §1 mandates `id` be the bill's dataset path; the projection was silently violating the contract. Fix: when the path stripped of `/logs/...` doesn't already end in `/bills/`, append `/bills/` from `log.bill_id`. This normalises both layouts to the same `/country:/ state:/sessions//bills/` shape, restoring the contract `apply` and the bluesky publisher rely on. Why the mocks missed it: `actions/govbot/mocks/.govbot/repos/wy- legislation/` ships its logs directly under each `bills//logs/` dir (no symlinks). The buggy `split("/logs/").next()` already produced the correct bill path for that layout, so no existing test exercised the session-level-symlink shape real `govbot pull` writes. Three new unit tests pin the contract: the per-bill layout round-trips through `parse_doc_route`, the session-level layout gets its bill_id appended, and 4 sibling bills in one session hash to 4 distinct ids. Reproduced end-to-end against the cached 55-state corpus: 4916/97 → 4916/3989 (remaining collisions are legitimate per-bill log re-emissions, not session-wide collapse). Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 176 ++++++++++++++++++++++++++++++++++--- 1 file changed, 163 insertions(+), 13 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index ff1dc445..266d38c1 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -854,26 +854,60 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { /// `{"id","text","kind":"docs"}` document the govbot stream protocol defines /// (`STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` consumes. /// -/// `id` is the bill's dataset-relative directory path (derived from -/// `sources.log` by dropping the `/logs/.json` tail), so a classified -/// result can be routed back to the right place when `govbot apply` writes it. 
+/// `id` is the bill's dataset-relative directory path of the form +/// `/country:/state:/sessions//bills/` so the +/// classified result can be routed back to the right *bill* (not session) +/// when `govbot apply` writes it. Two real-world dataset layouts feed into +/// this: +/// +/// 1. **Per-bill log directory** — `sources.log` is already +/// `/.../sessions//bills//logs/.json`. +/// Stripping the `/logs/...` tail yields the bill path directly. +/// 2. **Session-level log directory** (the common case for OCD-files +/// datasets cloned from windycivi) — the on-disk log lives at +/// `/.../sessions//logs/.json` and is a *symlink* +/// to `.../sessions//bills//logs/.json`. The walker +/// reports the symlink path, so stripping `/logs/...` would stop at +/// the *session* and collide every bill in that session onto one id +/// (real bug surfaced by `govbot pull all` over the 55-state corpus: +/// 4916 records collapsed to 97 ids). The fix appends the bill_id +/// from `log.bill_id` whenever the stripped path doesn't already end +/// in `/bills/`. +/// /// `text` is the **full** bill text assembled from `metadata.json` (not just /// titles) — the `docs` projection joins the complete bill so this is whole. 
fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { - let id = entry + let bill_id = entry + .get("log") + .and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))) + .and_then(|v| v.as_str()) + .map(|s| s.to_string()); + + let stripped = entry .get("sources") .and_then(|s| s.get("log")) .and_then(|v| v.as_str()) .and_then(|log_path| log_path.split("/logs/").next()) - .map(|s| s.to_string()) - .or_else(|| { - entry - .get("log") - .and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))) - .and_then(|v| v.as_str()) - .map(|s| s.to_string()) - }) - .unwrap_or_default(); + .map(|s| s.to_string()); + + let id = match (stripped, bill_id.as_deref()) { + (Some(path), Some(bid)) => { + let suffix = format!("/bills/{}", bid); + if path.ends_with(&suffix) { + // Layout 1: log already lived under bills//logs/. + path + } else { + // Layout 2: session-level log (symlink to per-bill log). + // The stripped path stops at `.../sessions/`; + // append the bill_id from the log entry to identify the + // bill, not the session. + format!("{}/bills/{}", path, bid) + } + } + (Some(path), None) => path, + (None, Some(bid)) => bid.to_string(), + (None, None) => String::new(), + }; serde_json::json!({ "id": id, "text": ocd_files_select_default(entry), "kind": "docs" }) } @@ -2759,6 +2793,122 @@ mod tests { assert!(parse_doc_route("wy-legislation/country:us").is_none()); } + /// The mock layout — logs already live under `bills//logs/` — so + /// stripping `/logs/...` from `sources.log` directly yields the bill + /// path. The `id` must be that full dataset-rooted bill path, ready + /// for `parse_doc_route` to find a `bills` segment and route the + /// `.tag.json` back to the correct bill. 
+ #[test] + fn ocd_entry_to_doc_per_bill_log_layout_keeps_bill_suffix() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "ANY" } }, + "bill": { "title": "Mock bill", "identifier": "HB0001" }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250101T000000Z_foo.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001") + ); + // And it must round-trip through `parse_doc_route` — the contract + // `govbot apply` depends on. + assert_eq!( + parse_doc_route(doc.get("id").unwrap().as_str().unwrap()) + .expect("route") + .bill_id, + "HB0001" + ); + } + + /// REGRESSION (real-data bug): `govbot pull all` clones OCD-files-shaped + /// datasets whose on-disk logs live at `sessions//logs/.json` + /// as *symlinks* into per-bill `bills//logs/.json`. The walker + /// reports the symlink path, so `sources.log` does NOT contain `/bills/ + /// /` and the old `log_path.split("/logs/").next()` builder dropped + /// the bill_id, collapsing every bill in a session onto one id. Over the + /// 55-state corpus that compressed 4916 distinct bill records into 97 + /// session ids; `apply` then overwrote every tag file's `bills` map + /// repeatedly and the bluesky ledger silently marked one bill per + /// session as "done." The id must carry `/bills/` so each bill + /// hashes to a distinct slot. 
+ #[test] + fn ocd_entry_to_doc_session_level_log_layout_appends_bill_id() { + let entry = serde_json::json!({ + "log": { "bill_id": "SB50", "action": { "description": "PASSED" } }, + "bill": { "title": "Mock bill", "identifier": "SB50" }, + "sources": { + // Realistic shape from `govbot pull ak`: session-level log + // path, no `/bills//` segment because the walker + // followed the symlink-source view, not the canonical + // target. + "log": "ak-legislation/country:us/state:ak/sessions/34/logs/20250317T000000Z.vote_event.pass.upper_SB50.json", + "bill": "../../../../.govbot/cache/ak-abc123/country:us/state:ak/sessions/34/bills/SB50/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("ak-legislation/country:us/state:ak/sessions/34/bills/SB50"), + "id must include /bills/ for session-level log layouts" + ); + // The whole point: this id must round-trip through `parse_doc_route` + // so `govbot apply` keys per-bill, not per-session. + let route = parse_doc_route(doc.get("id").unwrap().as_str().unwrap()) + .expect("session-level layout must still produce a routable doc id"); + assert_eq!(route.bill_id, "SB50"); + assert_eq!(route.session, "34"); + } + + /// Two distinct bills from the same session must yield two distinct ids — + /// the precondition the apply layer and the bluesky publisher's ledger + /// rely on. This is the unit-level expression of the corpus check + /// `len(ids) == len(set(ids))`. 
+ #[test] + fn ocd_entry_to_doc_distinct_bills_same_session_get_distinct_ids() { + let make = |bill_id: &str, log_file: &str| { + serde_json::json!({ + "log": { "bill_id": bill_id, "action": { "description": "PASSED" } }, + "bill": { "title": "Mock", "identifier": bill_id }, + "sources": { + "log": format!( + "ak-legislation/country:us/state:ak/sessions/34/logs/{}", + log_file + ), + "bill": format!( + "../../../../.govbot/cache/ak-x/country:us/state:ak/sessions/34/bills/{}/metadata.json", + bill_id + ) + } + }) + }; + let entries = vec![ + make("SB50", "20250317T000000Z.vote_event.pass.upper_SB50.json"), + make("HR2", "20250121T000000Z.vote_event.pass.lower_HR2.json"), + make("HJR20", "20250514T000000Z_h_fn1_zeroleg_HJR20.json"), + make("HB55", "20250306T000000Z_h_heard_held_HB55.json"), + ]; + let ids: Vec<String> = entries + .iter() + .map(|e| { + ocd_entry_to_doc(e) + .get("id") + .and_then(|v| v.as_str()) + .unwrap() + .to_string() + }) + .collect(); + let unique: std::collections::HashSet<&String> = ids.iter().collect(); + assert_eq!( + ids.len(), + unique.len(), + "4 bills under one session must produce 4 distinct ids; got: {:?}", + ids + ); + } + /// `.govbot/` is the cache; tag files belong outside it in the project- /// rooted `tags/` output dir. The resolver's primary candidate must /// therefore be `/tags//country:.../state:.../sessions/ From 5ab6d3c56fb0983e3434fa6c1cb2b0666953b5b0 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 12:22:56 -0500 Subject: [PATCH 16/32] source: use canonical on-disk bill dir for docs id, not log.bill_id MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Real-data bug surfaced by the 55-state corpus: MI/WV/ND/PA logs ship `bill_id` as a *display* form with a space ("HB 5077", "SB 0001", "HR 0163") even though the on-disk directory is `bills/HB5077/`, `bills/SB0001/`, `bills/HR0163/`.
The pre-fix `ocd_entry_to_doc` appended `log.bill_id` verbatim, so the doc id became `mi-legislation/.../bills/SB 0001` (with a space). Any downstream sibling-file lookup — `os.path.join(REPOS, doc, "metadata.json")` — 404'd because no such directory exists on disk; the architect saw "(no metadata.json)" for ~30% of bills across the affected states. Fix: source the `/bills/` segment from `sources.bill` (the parent dir of the resolved `metadata.json`), which is the authoritative on-disk dir name — the bill join already canonicalized the symlinked log path to reach it. Fall back to `log.bill_id` only when `--join bill` wasn't requested. The Layout-1 detector (when `sources.log` already ends in `/bills/` because the walker landed on the per-bill log directly) must also consider the canonical dir name; otherwise states whose `log.bill_id` has whitespace would fail the Layout-1 check and double-append, producing `.../bills/HR0163/bills/HR0163`. Sample over the corpus: ~50% of mi/nd/pa records exhibited the doubled-bills id before this fix. Validation across the full 55-state pull (`govbot source --select docs --filter none --limit none`, 2,377,146 records): - 0 ids with whitespace in the bill segment (was: hundreds of thousands) - 0 doubled `bills//bills/` patterns (was: ~50% of mi/nd/pa) - 0 of 1038 sampled records missed `metadata.json` (20 per state across 52 states; was: ~30% missing for mi/wv/nd/pa) Test count: 52 -> 57 (4 new regression tests + 1 helper unit test). 
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 351 +++++++++++++++++++++++++++++++++++-- 1 file changed, 336 insertions(+), 15 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 266d38c1..f6aa6e38 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -871,8 +871,21 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { /// the *session* and collide every bill in that session onto one id /// (real bug surfaced by `govbot pull all` over the 55-state corpus: /// 4916 records collapsed to 97 ids). The fix appends the bill_id -/// from `log.bill_id` whenever the stripped path doesn't already end -/// in `/bills/`. +/// whenever the stripped path doesn't already end in `/bills/`. +/// +/// **Bill-id source of truth.** The on-disk bill directory name (e.g. +/// `HB5109`) does **not** always equal the `log.bill_id` field (e.g. +/// `"HB 5109"`). MI/WV/ND/PA logs carry a *display* bill id with a space +/// between the chamber prefix and the number; the actual `bills//` +/// directory has no space. Using `log.bill_id` verbatim produces an `id` +/// like `.../bills/HB 5109` that no `os.path.join(REPOS, doc, +/// "metadata.json")` can resolve. The fix is to take the canonical bill +/// dir name from `sources.bill` (the parent dir of `metadata.json` — the +/// *resolved* on-disk path, set during the `bill` join) whenever +/// available, and fall back to `log.bill_id` only when the bill join is +/// absent. Layout 1 (suffix already present in `sources.log`) is left +/// untouched — that path is itself the canonical on-disk path, so the +/// bill segment is correct by construction. /// /// `text` is the **full** bill text assembled from `metadata.json` (not just /// titles) — the `docs` projection joins the complete bill so this is whole. 
@@ -883,6 +896,19 @@ fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { .and_then(|v| v.as_str()) .map(|s| s.to_string()); + // Canonical on-disk bill directory name, derived from `sources.bill` + // (the path to `metadata.json`, which the bill join resolves to the + // real `bills//metadata.json` on disk — even when the log was a + // session-level symlink). This is the authoritative source for the + // `/bills/` segment because `log.bill_id` may carry a display + // form (e.g. `"HB 5109"`) that differs from the directory (`HB5109`). + let canonical_bill_dir = entry + .get("sources") + .and_then(|s| s.get("bill")) + .and_then(|v| v.as_str()) + .and_then(bill_dir_from_metadata_path) + .map(|s| s.to_string()); + let stripped = entry .get("sources") .and_then(|s| s.get("log")) @@ -890,27 +916,75 @@ fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { .and_then(|log_path| log_path.split("/logs/").next()) .map(|s| s.to_string()); - let id = match (stripped, bill_id.as_deref()) { - (Some(path), Some(bid)) => { - let suffix = format!("/bills/{}", bid); - if path.ends_with(&suffix) { - // Layout 1: log already lived under bills//logs/. + // Layout 1 still trusts the stripped log path: when `sources.log` + // already ends in `/bills/` that dir name is itself canonical + // (it came from the on-disk walk). Layout 2 must prefer the + // `sources.bill`-derived dir name; only fall back to `log.bill_id` + // when the bill join wasn't requested. + // + // The Layout-1 test must consider BOTH the canonical bill dir (from + // `sources.bill`) AND `log.bill_id`. If we only checked + // `log.bill_id`, then MI/WV/ND/PA — whose log carries `"HB 0163"` + // but on-disk dir is `HB0163` — would fail the Layout-1 test even + // when `sources.log` already ends in `/bills/HB0163`, and we'd + // double-append, producing `.../bills/HB0163/bills/HB0163`. 
+ let id = match stripped { + Some(path) => { + let already_ends_in_bill_dir = canonical_bill_dir + .as_deref() + .map(|d| path.ends_with(&format!("/bills/{}", d))) + .unwrap_or(false) + || bill_id + .as_deref() + .map(|d| path.ends_with(&format!("/bills/{}", d))) + .unwrap_or(false); + if already_ends_in_bill_dir { + // Layout 1: log lived under bills//logs/. The stripped + // path is already the canonical bill dir. path - } else { - // Layout 2: session-level log (symlink to per-bill log). - // The stripped path stops at `.../sessions/`; - // append the bill_id from the log entry to identify the - // bill, not the session. + } else if let Some(canon) = canonical_bill_dir.as_deref() { + // Layout 2 (preferred): use the on-disk dir name from the + // resolved metadata.json path, so display-form bill ids + // with whitespace (e.g. `"HB 5109"`) don't bleed into the + // doc id and break sibling-file lookups. + format!("{}/bills/{}", path, canon) + } else if let Some(bid) = bill_id.as_deref() { + // Layout 2 fallback: no bill join, so the best we have is + // the log's `bill_id`. This may be a display form; callers + // doing path lookups should treat it as advisory. format!("{}/bills/{}", path, bid) + } else { + path } } - (Some(path), None) => path, - (None, Some(bid)) => bid.to_string(), - (None, None) => String::new(), + None => canonical_bill_dir.or(bill_id).unwrap_or_else(String::new), }; serde_json::json!({ "id": id, "text": ocd_files_select_default(entry), "kind": "docs" }) } +/// Given a `sources.bill` path (`<...>/bills//metadata.json`, +/// possibly with `..` prefixes from a cache-symlinked repo), return the +/// `` segment — the canonical on-disk bill directory name. Returns +/// `None` if the path doesn't end in `bills//metadata.json`. +fn bill_dir_from_metadata_path(metadata_path: &str) -> Option<&str> { + // Strip the trailing filename. 
+ let without_file = metadata_path.strip_suffix("/metadata.json")?; + // Take the last path segment — that's the bill dir. + let last_slash = without_file.rfind('/')?; + let dir = &without_file[last_slash + 1..]; + // Sanity check: the segment before that should be `bills`. If not, + // the path doesn't look like a bill metadata path; refuse to guess. + let before_dir = &without_file[..last_slash]; + if !before_dir.ends_with("/bills") && before_dir != "bills" { + return None; + } + if dir.is_empty() { + None + } else { + Some(dir) + } +} + async fn run_source_command(cmd: Command) -> anyhow::Result<()> { let Command::Source { govbot_dir, @@ -2909,6 +2983,253 @@ mod tests { ); } + /// REGRESSION (real-data bug, 55-state corpus): MI/WV/ND/PA legislature + /// logs ship a `bill_id` field with a *display* space — e.g. + /// `"HB 5077"`, `"SB 0001"` — even though the corresponding on-disk + /// directory is `bills/HB5077/`, `bills/SB0001/` (no space). The + /// pre-fix `ocd_entry_to_doc` for the Layout-2 (session-level symlink) + /// case appended `log.bill_id` verbatim, producing ids like + /// `mi-legislation/.../bills/SB 0001`. Downstream consumers doing a + /// sibling `metadata.json` lookup via path joining + /// (`os.path.join(REPOS, doc, "metadata.json")`) then 404'd because no + /// such directory exists on disk. The architect saw "(no metadata.json)" + /// for ~30% of bills. + /// + /// The fix sources the `/bills/` segment from the resolved + /// `sources.bill` path (the parent dir of `metadata.json`, which the + /// `bill` join produced from the canonicalized log path) — that is the + /// authoritative on-disk dir name. The id must NOT contain whitespace + /// in the bill segment, and it must point to a directory that exists. + #[test] + fn ocd_entry_to_doc_uses_canonical_bill_dir_when_log_bill_id_has_whitespace() { + let entry = serde_json::json!({ + "log": { + // Display form with a space — this is what MI/WV/ND/PA emit. 
+ "bill_id": "SB 0001", + "action": { "description": "PASSED" } + }, + "bill": { "title": "Mock", "identifier": "SB 0001" }, + "sources": { + // Session-level symlink layout (Layout 2). `sources.log` + // stops at the session because the walker reported the + // symlink, not the canonical target. + "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/logs/20250108T000000Z_referred_to_committee_of_the_whole_SB0001.json", + // `sources.bill` points at the *resolved* on-disk + // metadata.json — the parent dir is the canonical bill dir + // name (no whitespace). + "bill": "../../../../.govbot/cache/mi-ad5ea7bbd548/country:us/state:mi/sessions/2025-2026/bills/SB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc + .get("id") + .and_then(|v| v.as_str()) + .expect("doc id must be a string"); + // The id must end at the on-disk dir, not the display bill_id. + assert_eq!( + id, "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/SB0001", + "id must use the canonical on-disk bill dir name (no whitespace)" + ); + // No whitespace anywhere in the id — that's what makes + // `os.path.join(REPOS, doc, \"metadata.json\")` resolve to a real + // file on a real filesystem. + assert!( + !id.contains(' '), + "id must not carry display-form whitespace; got: {}", + id + ); + } + + /// Same data shape, all four affected states (MI/WV/ND/PA) — pins that + /// the fix isn't accidentally specific to one state's path shape. 
+ #[test] + fn ocd_entry_to_doc_uses_canonical_bill_dir_for_all_affected_states() { + // (display_bill_id, on_disk_dir, dataset, state, session, log_filename) + let cases = [ + ( + "SB 0001", + "SB0001", + "mi-legislation", + "mi", + "2025-2026", + "20250108T000000Z_referred_to_committee_of_the_whole_SB0001.json", + ), + ( + "SB 458", + "SB458", + "wv-legislation", + "wv", + "2025", + "20250307T000000Z_read_2nd_time_SB458.json", + ), + ( + "SB 2262", + "SB2262", + "nd-legislation", + "nd", + "69", + "20250501T000000Z_signed_by_governor_0429_SB2262.json", + ), + ( + "HB 1271", + "HB1271", + "pa-legislation", + "pa", + "2025-2026", + "20250421T040000Z_referred_to_education_HB1271.json", + ), + ]; + for (display_id, on_disk_dir, dataset, state, session, log_file) in cases { + let entry = serde_json::json!({ + "log": { "bill_id": display_id, "action": { "description": "PASSED" } }, + "bill": { "title": "Mock", "identifier": display_id }, + "sources": { + "log": format!( + "{}/country:us/state:{}/sessions/{}/logs/{}", + dataset, state, session, log_file + ), + "bill": format!( + "../../../../.govbot/cache/{}-deadbeef/country:us/state:{}/sessions/{}/bills/{}/metadata.json", + state, state, session, on_disk_dir + ) + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc + .get("id") + .and_then(|v| v.as_str()) + .unwrap_or_default() + .to_string(); + assert_eq!( + id, + format!( + "{}/country:us/state:{}/sessions/{}/bills/{}", + dataset, state, session, on_disk_dir + ), + "{}: id must use the on-disk dir `{}`, not the log's display id `{}`", + state, + on_disk_dir, + display_id + ); + assert!( + !id.contains(' '), + "{}: id contains whitespace; got: {}", + state, + id + ); + // Round-trip: the route's bill_id must be the on-disk dir + // name, because that's what every downstream path lookup + // (`os.path.join(REPOS, doc, ...)`) is going to hit.
+ let route = + parse_doc_route(&id).expect("routable doc id even for spaced bill_id inputs"); + assert_eq!( + route.bill_id, on_disk_dir, + "{}: parsed bill_id must be the on-disk dir", + state + ); + } + } + + /// REGRESSION (real-data follow-on of the whitespace fix): MI/ND/PA + /// also publish a Layout-1 view for some bills — `sources.log` is + /// `.../sessions//bills//logs/.json` because + /// the walker happened to land on the per-bill log directly. In that + /// case the stripped path already ends in `/bills/` + /// (e.g. `bills/HR0163`). But `log.bill_id` is `"HR 0163"` (display + /// form). The pre-fix Layout-1 detector compared the stripped path's + /// suffix to `log.bill_id` verbatim, which DID NOT match (no space + /// vs space), so the code fell through to the Layout-2 branch and + /// appended `/bills/HR0163` *again*, producing + /// `mi-legislation/.../bills/HR0163/bills/HR0163`. Sample over the + /// 55-state corpus: ~50% of mi/nd/pa records exhibited the + /// doubled-bills id. The Layout-1 detector must therefore consider + /// both the canonical dir name (from `sources.bill`) and + /// `log.bill_id`; a match on either means the path already names + /// the bill. + #[test] + fn ocd_entry_to_doc_layout1_with_spaced_log_bill_id_does_not_double_bills_segment() { + let entry = serde_json::json!({ + "log": { + // Display form with a space — what MI/ND/PA emit. + "bill_id": "HR 0163", + "action": { "description": "ANY" } + }, + "bill": { "title": "Mock", "identifier": "HR 0163" }, + "sources": { + // Layout 1 — the walker landed on the per-bill log dir. + // The stripped path will end in `/bills/HR0163` (no space). 
+ "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/HR0163/logs/20250101T000000Z_foo.json", + "bill": "../../../../.govbot/cache/mi-x/country:us/state:mi/sessions/2025-2026/bills/HR0163/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc.get("id").and_then(|v| v.as_str()).unwrap_or_default(); + assert_eq!( + id, "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/HR0163", + "Layout 1 with spaced log.bill_id must not double-append the /bills/ segment" + ); + // The cardinal symptom of the bug: a doubled `bills//bills/` tail. + assert!( + !id.contains("/bills/HR0163/bills/"), + "id must not double the bills segment; got: {}", + id + ); + assert!( + !id.contains(' '), + "id must not contain whitespace; got: {}", + id + ); + } + + /// `bill_dir_from_metadata_path` is the helper the fix relies on. Unit- + /// test the shape boundary so future refactors don't silently break it. + #[test] + fn bill_dir_from_metadata_path_extracts_dir_segment() { + assert_eq!( + bill_dir_from_metadata_path( + "../../../../.govbot/cache/mi-x/country:us/state:mi/sessions/2025-2026/bills/HB5109/metadata.json" + ), + Some("HB5109") + ); + assert_eq!( + bill_dir_from_metadata_path( + "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + ), + Some("HB0001") + ); + // Not a bill metadata path — refuse to guess. + assert_eq!( + bill_dir_from_metadata_path("country:us/state:wy/sessions/2025/metadata.json"), + None + ); + assert_eq!(bill_dir_from_metadata_path("metadata.json"), None); + assert_eq!(bill_dir_from_metadata_path(""), None); + } + + /// When the consumer ran `govbot source --select docs` *without* + /// `--join bill`, `sources.bill` is absent and we have no canonical + /// dir to lean on. Fall back to `log.bill_id` so the id is still + /// routable — even if it carries display-form whitespace. 
Document + /// that this is the advisory path; the production `source --select + /// docs` invocation always joins `bill`, so this branch only fires + /// for ad-hoc invocations. + #[test] + fn ocd_entry_to_doc_falls_back_to_log_bill_id_when_bill_join_absent() { + let entry = serde_json::json!({ + "log": { "bill_id": "SB 0001", "action": { "description": "PASSED" } }, + "sources": { + "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/logs/20250108T000000Z_x.json" + // No `sources.bill` — `--join bill` was not requested. + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("mi-legislation/country:us/state:mi/sessions/2025-2026/bills/SB 0001"), + "without sources.bill we fall back to log.bill_id (advisory; may carry whitespace)" + ); + } + /// `.govbot/` is the cache; tag files belong outside it in the project- /// rooted `tags/` output dir. The resolver's primary candidate must /// therefore be `/tags//country:.../state:.../sessions/ From 02ef80e87268ab4fc41d3014651af0122a00d74d Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 14:26:44 -0500 Subject: [PATCH 17/32] doctor: corpus-level smoke test for pulled cache integrity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two real-data bugs shipped this stack because the only test harness was the mock dataset, which happened to fit one happy-path layout: 7592418 — `source --select docs` emitted session-level ids when the on-disk log lived at `.../sessions//logs/` (the symlinked OCD layout most states use). 4916 records collapsed onto 97 distinct ids; downstream `apply` and `bluesky` silently dropped almost everything. 5ab6d3c — `source --select docs` id used `log.bill_id` (e.g. "HB 4027", with whitespace) instead of the canonical on-disk dir name (`HB4027`) on MI/WV/ND/PA. The mismatched id then resolved to a non-existent `metadata.json` path — ~30% missing-metadata hits in title lookup. 
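The two bug signatures listed above reduce to properties of the emitted id list, so they are cheap to assert. What follows is a minimal sketch, assuming the ids have already been drained into a slice. `distinct_id_ratio`, `bill_segment`, and `id_is_routable` are hypothetical helper names chosen for illustration, not functions from this series, and the 0.03 floor simply mirrors the threshold this commit describes for its id-distinctness check.

```rust
use std::collections::HashSet;

/// Floor on distinct ids / records. Assumption: mirrors the 0.03
/// threshold described for the id_distinctness check.
const MIN_DISTINCT_RATIO: f64 = 0.03;

/// Hypothetical helper: share of emitted ids that are distinct.
/// `None` means the stream was empty, which a caller can report as a
/// warning instead of a failure.
fn distinct_id_ratio(ids: &[&str]) -> Option<f64> {
    if ids.is_empty() {
        return None;
    }
    let distinct: HashSet<&str> = ids.iter().copied().collect();
    Some(distinct.len() as f64 / ids.len() as f64)
}

/// Hypothetical helper: the path segment that follows `bills`, if any.
fn bill_segment(id: &str) -> Option<&str> {
    let mut parts = id.split('/');
    while let Some(part) = parts.next() {
        if part == "bills" {
            // An empty segment (trailing slash) does not name a bill.
            return parts.next().filter(|bill| !bill.is_empty());
        }
    }
    None
}

/// An id is usable for sibling-file lookups when it names a bill and the
/// bill segment carries no display-form whitespace.
fn id_is_routable(id: &str) -> bool {
    bill_segment(id).map_or(false, |bill| !bill.contains(char::is_whitespace))
}
```

Under these assumptions, a session-level id (the 7592418 shape) fails `id_is_routable` because it has no `bills` segment, and a display-form id such as `.../bills/SB 0001` (the 5ab6d3c shape) fails because of the space. A ratio below `MIN_DISTINCT_RATIO` flags the collapse case even when every id is individually well-formed.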
Both would have been caught by a five-line check over a real pulled cache. `govbot doctor` is that check, wired to a CLI verb activists can run after `pull all` to confirm the project is coherent before flipping `bluesky` off `--dry-run`.

What it checks, per dataset:

  - coverage: at least one record emitted (WARN, not FAIL, when `--filter default` legitimately drops every routine log)
  - id_distinctness: distinct-id / record-count ratio ≥ 0.03; the bug 7592418 signature was 97/4916 = 0.02
  - metadata_sampleable: N sampled ids each resolve to a present, parseable `metadata.json` with at least a `title` or `identifier` (catches 5ab6d3c)
  - text_non_empty: sampled `text` ≥ 50 chars (catches collapsed bill joins)

And globally:

  - dataset_links: every `*-legislation` entry in `.govbot/repos/` resolves to a real directory (catches dangling symlinks `get_local_datasets` filters out)
  - routable_ids: every emitted id has a recognisable `/country:.../bills/` shape

Skips cleanly when `.govbot/repos/` is absent or empty; this is a smoke test against pulled data, not a unit test against mocks.

Form chosen: **C (user-facing `govbot doctor` subcommand).** The A (cargo-test + env var) and B (just recipe) variants would have been smaller diffs, but they hide the check from end users. A real verb activists can run answers the right question ("is my project coherent?") without making them learn the test harness. Exit code is non-zero on failure so it drops straight into a CI step; output defaults to a human summary, with `--output json` for machine consumers.

How to invoke:

    govbot doctor                          # text summary
    govbot doctor --output json            # machine-readable
    govbot doctor --sample 50 --limit 200  # more thorough sweep

Acceptance: against the architect's pulled cache at `/Users/sartaj/Git/climate-activist-gov-news-bot/`, doctor finds 55 datasets, drains 4916 records to 3958 distinct ids in 16s (<60s budget), and exits PASS with three WARN states (az/gu/va: filter dropped every routine log).
The MI HB 4027 case (now resolves post-5ab6d3c) passes. Regression demo: delete every metadata.json under a tmp-cloned `mi-legislation/`; doctor exits 1 with `metadata_sampleable: FAIL` and prints the missing-file paths, including the exact pre-5ab6d3c `HB 4027` (with-whitespace) id format. Real cache untouched. `cargo test --offline`: 28+19+3+1+10 = 61 green (was 57); three new unit tests pin the bug-5ab6d3c metadata-resolve leg, one pins the `dataset_short_name` prefix/suffix bridge, and one pins the `metadata.json has neither title nor identifier` rejection. Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 779 ++++++++++++++++++ ...i_example_snaps__snapshot@govbot_help.snap | 1 + 2 files changed, 780 insertions(+) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index f6aa6e38..4c7a35f7 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -322,6 +322,31 @@ enum Command { #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] output: String, }, + + /// Check that the project's pulled datasets are coherent. A data-integrity smoke test, runnable after `govbot pull all` or before `govbot run` in production. Walks every linked dataset and verifies that the `govbot source --select docs` stream is well-formed: every linked dataset entry resolves to a real directory, per-dataset ids don't collapse onto a handful (the bug-7592418 signature), every sampled `id` resolves to a present and parseable `metadata.json`, and every sampled `text` is non-trivial. Zero-record datasets are surfaced as warnings rather than errors — `--filter default` can legitimately drop every routine log. Exits non-zero on any failure so it can drop straight into a CI step. Skips cleanly when the cache is empty — this is a smoke test, not a unit test. 
+ Doctor { + /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) + #[arg(long = "govbot-dir")] + govbot_dir: Option, + + /// Records to sample per dataset for the metadata.json and + /// text-length checks (default: 20). The id-distinctness and + /// coverage checks always cover every emitted record. + #[arg(long = "sample", default_value_t = 20)] + sample: usize, + + /// Per-dataset emit limit fed through to `govbot source --limit` + /// (default: 100, matching the source default — the smoke-test + /// sweet spot for a typical 55-state pull in <60s). Use "none" + /// for an exhaustive sweep at the cost of runtime. + #[arg(long = "limit", default_value = "100")] + limit: String, + + /// Emit a machine-readable JSON report instead of the human summary. + /// Suitable for piping into a CI step. + #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, + }, } fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { @@ -2768,6 +2793,644 @@ fn run_search_command(cmd: Command) -> anyhow::Result<()> { Ok(()) } +// --------------------------------------------------------------------------- +// `govbot doctor` — corpus-level data-integrity smoke test. +// +// Why this exists: two real-data bugs (7592418, 5ab6d3c) shipped because the +// only test harness was the mock dataset, which happened to fit a single +// happy-path layout. Both bugs would have been caught by a five-line check — +// "every emitted doc id is unique" and "every id resolves to a present +// metadata.json" — over a real pulled cache. `doctor` is that check, wired +// to a CLI verb activists can run after `pull all` to confirm the project +// is coherent before flipping `bluesky` off `--dry-run`. +// +// This is a smoke test, not a unit test. It assumes pulled data and skips +// cleanly when the cache is empty. 
+// --------------------------------------------------------------------------- + +/// Per-record sample, captured during the source walk so the metadata.json +/// and text checks can run after the stream is fully drained. +#[derive(Debug, Clone)] +struct DoctorSample { + id: String, + text_len: usize, +} + +/// Per-dataset rollup used to build the doctor report. +#[derive(Debug, Default)] +struct DatasetSummary { + record_count: usize, + distinct_ids: std::collections::HashSet, + samples: Vec, +} + +/// Outcome of one assertion bucket for one dataset — a short label, a +/// pass flag, an optional warn flag, and the detail lines (capped so a +/// broken dataset doesn't drown the report). A `warned` check still +/// counts as passing for the overall exit code — it surfaces noteworthy +/// state (e.g. zero records under `--filter default`) without failing CI. +#[derive(Debug, Clone)] +struct DoctorCheck { + name: &'static str, + passed: bool, + warned: bool, + detail: Vec, +} + +#[derive(Debug)] +struct DatasetReport { + dataset: String, + record_count: usize, + distinct_ids: usize, + sampled: usize, + checks: Vec, +} + +impl DatasetReport { + fn passed(&self) -> bool { + self.checks.iter().all(|c| c.passed) + } + fn warned(&self) -> bool { + self.checks.iter().any(|c| c.warned) + } +} + +/// Cap how many failing ids we print per check — keeps the report scannable +/// when an entire dataset is broken. +const MAX_FAIL_DETAIL: usize = 5; + +/// Default minimum acceptable `text` length per record. Anything shorter +/// is almost certainly a join failure (metadata.json missing or empty), not +/// a legitimate short bill. +const MIN_TEXT_LEN: usize = 50; + +/// Per-dataset distinct-id / record-count ratio floor. Bug 7592418 +/// collapsed 4916 records onto 97 ids (ratio 0.02). The floor is set +/// at 0.03 — high enough to flag a 100x collision, low enough to +/// accept a dataset where a handful of active bills emit many +/// substantive log records each (e.g. 
a state with sustained voting +/// activity on the same few bills). Drop it further if a clean cache +/// shows legitimate sub-0.03 ratios. +const MIN_DISTINCT_RATIO: f64 = 0.03; + +/// Map a parsed `parse_doc_route` dataset prefix (e.g. `nj-legislation`) +/// to the bare short_name (`nj`) that `git::get_local_datasets` returns. +/// This is the only place where doc-id prefixes and on-disk dataset +/// short names meet; getting it wrong silently breaks the per-dataset +/// bucketing. +fn dataset_short_name(prefix: &str, suffix: &str) -> String { + if let Some(s) = prefix.strip_suffix(suffix) { + s.to_string() + } else if let Some(s) = prefix.strip_suffix("-data-pipeline") { + s.to_string() + } else { + prefix.to_string() + } +} + +fn run_doctor_command(cmd: Command) -> anyhow::Result<()> { + let Command::Doctor { + govbot_dir, + sample, + limit, + output, + } = cmd + else { + unreachable!() + }; + + let repos_dir = get_govbot_dir(govbot_dir.clone())?; + + // Skip-cleanly contract: an empty or absent cache is not a failure. + // `doctor` is a smoke test, not a unit test — it has nothing to check + // until data is pulled. Exit 0 with a clear note. + if !repos_dir.exists() { + let note = format!( + "doctor: no cache at {} — run `govbot pull all` first. Skipping.", + repos_dir.display() + ); + if output == "json" { + println!( + "{}", + serde_json::json!({ "status": "skipped", "reason": note }) + ); + } else { + eprintln!("{}", note); + } + return Ok(()); + } + + let datasets = match git::get_local_datasets(&repos_dir) { + Ok(d) => d, + Err(e) => anyhow::bail!("doctor: failed to enumerate cached datasets: {}", e), + }; + + // Stale or broken entries in `repos/` — names that look like dataset + // links (matching the configured suffix) but don't resolve to a real + // directory. A broken symlink is the canonical case; the entry sits + // in `repos/` but `get_local_datasets` filtered it out because + // `is_dir()` follows the link and returns false. 
Surface these so + // they're not invisible — they break `govbot source` for that state + // without any other signal. + let broken_dataset_entries = enumerate_broken_dataset_entries(&repos_dir); + + if datasets.is_empty() { + let note = format!( + "doctor: {} is empty — run `govbot pull all` first. Skipping.", + repos_dir.display() + ); + if output == "json" { + println!( + "{}", + serde_json::json!({ "status": "skipped", "reason": note }) + ); + } else { + eprintln!("{}", note); + } + return Ok(()); + } + + // Resolve the parent govbot-dir for the subprocess `--govbot-dir` arg. + // `get_govbot_dir` appends `/repos`; we pass the parent so the child + // appends its own `/repos` and lands on the same path. + let govbot_dir_arg = repos_dir + .parent() + .map(|p| p.to_string_lossy().to_string()) + .unwrap_or_else(|| ".govbot".to_string()); + + let started = std::time::Instant::now(); + + // Stream every record once, in --select docs --limit none mode. We use a + // subprocess so we exercise the same code path activists hit; doctor is + // a "what does `govbot source` actually emit?" check, not a re-derivation. + let stream = collect_doc_stream(&govbot_dir_arg, &limit) + .map_err(|e| anyhow::anyhow!("doctor: source stream failed: {}", e))?; + + // Bucket records by dataset short_name. The doc id carries the full + // `-legislation` (or legacy `-data-pipeline`) repo dir + // prefix; `get_local_datasets` returns the bare short_name, so we + // normalise both to the short form before keying. + let mut per_dataset: HashMap = HashMap::new(); + let mut unrouted: Vec = Vec::new(); + + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + + for rec in &stream { + let id = rec.id.clone(); + + // Route to a dataset via the `/country:...` prefix in the + // id. A record we can't route is recorded for the global report; it + // can't contribute to per-dataset coverage. 
+ let dataset_short = parse_doc_route(&id) + .and_then(|r| r.dataset) + .map(|d| dataset_short_name(&d, &suffix)); + match dataset_short { + Some(d) => { + let entry = per_dataset.entry(d).or_default(); + entry.record_count += 1; + entry.distinct_ids.insert(id.clone()); + if entry.samples.len() < sample { + entry.samples.push(DoctorSample { + id, + text_len: rec.text_len, + }); + } + } + None => { + if unrouted.len() < MAX_FAIL_DETAIL { + unrouted.push(id); + } + } + } + } + + // Build per-dataset reports. The four per-dataset checks are: coverage + // (≥1 record), id-distinctness (the bug 7592418 signature — many + // records collapsing onto one id), sampled-metadata-json-resolves, + // and sampled-text-length. + let mut dataset_reports: Vec = Vec::with_capacity(datasets.len()); + for dataset in &datasets { + let prefix = git::repo_dir_name(dataset); + let dataset_repo_dir = repos_dir.join(&prefix); + let summary = per_dataset.remove(dataset.as_str()).unwrap_or_default(); + + let mut checks = Vec::new(); + + // Coverage — a zero-record dataset is reported as a warning, + // not a failure: `--filter default` legitimately drops every + // record in a dataset whose only recent logs are routine + // (introductions, committee referrals). That state is normal + // for a freshly-cloned session early in its calendar. Doctor + // surfaces it so the activist can notice — pulled but silent — + // without failing the overall smoke test. + let coverage_warned = summary.record_count == 0; + let coverage_detail = if coverage_warned { + vec![format!( + "{} is linked but produced 0 records (likely an empty session or `--filter default` dropping every log — not necessarily broken)", + prefix + )] + } else { + Vec::new() + }; + checks.push(DoctorCheck { + name: "coverage", + passed: true, + warned: coverage_warned, + detail: coverage_detail, + }); + + // ID distinctness — bug 7592418 collapsed 4916 records onto 97 + // ids (ratio 0.02). After the fix it's ~0.81. 
A per-log emission + // pattern legitimately produces some duplicate ids (the same + // bill emitting multiple substantive log events), so we don't + // demand uniqueness — but we do demand the ratio stay well + // above the bug-case floor. Below MIN_DISTINCT_RATIO is the + // smoking gun. + let distinct = summary.distinct_ids.len(); + let total = summary.record_count; + let ratio = if total == 0 { + 1.0 + } else { + distinct as f64 / total as f64 + }; + let distinctness_passed = total == 0 || ratio >= MIN_DISTINCT_RATIO; + let distinctness_detail = if distinctness_passed { + Vec::new() + } else { + vec![format!( + "{}/{} distinct ids (ratio {:.2}) — below the {:.2} floor; ids are likely collapsing across distinct bills (the bug-7592418 signature)", + distinct, total, ratio, MIN_DISTINCT_RATIO + )] + }; + checks.push(DoctorCheck { + name: "id_distinctness", + passed: distinctness_passed, + warned: false, + detail: distinctness_detail, + }); + + // Metadata.json resolves + let mut metadata_failures: Vec = Vec::new(); + for s in &summary.samples { + if let Err(reason) = check_metadata_json(&s.id, &dataset_repo_dir) { + if metadata_failures.len() < MAX_FAIL_DETAIL { + metadata_failures.push(format!("{} :: {}", s.id, reason)); + } + } + } + checks.push(DoctorCheck { + name: "metadata_sampleable", + passed: metadata_failures.is_empty(), + warned: false, + detail: metadata_failures, + }); + + // Text length + let mut text_failures: Vec = Vec::new(); + for s in &summary.samples { + if s.text_len < MIN_TEXT_LEN && text_failures.len() < MAX_FAIL_DETAIL { + text_failures.push(format!( + "{} :: text length {} < {}", + s.id, s.text_len, MIN_TEXT_LEN + )); + } + } + checks.push(DoctorCheck { + name: "text_non_empty", + passed: text_failures.is_empty(), + warned: false, + detail: text_failures, + }); + + dataset_reports.push(DatasetReport { + dataset: dataset.clone(), + record_count: summary.record_count, + distinct_ids: summary.distinct_ids.len(), + sampled: 
summary.samples.len(), + checks, + }); + } + + // Build the global report. Global "duplicate ids" check is gone — + // per-log emission legitimately produces some duplicates. The id + // collapse bug (7592418) is caught per-dataset by id_distinctness. + let elapsed = started.elapsed(); + let total_records: usize = dataset_reports.iter().map(|r| r.record_count).sum(); + let total_distinct: usize = dataset_reports.iter().map(|r| r.distinct_ids).sum(); + let all_passed = unrouted.is_empty() + && broken_dataset_entries.is_empty() + && dataset_reports.iter().all(|r| r.passed()); + + if output == "json" { + emit_doctor_json( + &dataset_reports, + total_records, + total_distinct, + &unrouted, + &broken_dataset_entries, + elapsed, + all_passed, + ); + } else { + emit_doctor_text( + &dataset_reports, + total_records, + total_distinct, + &unrouted, + &broken_dataset_entries, + elapsed, + all_passed, + ); + } + + if !all_passed { + // Non-zero exit so a CI step `govbot doctor` fails the pipeline. + std::process::exit(1); + } + Ok(()) +} + +/// Names sitting in `/` that look like dataset entries (matching +/// the configured suffix) but don't resolve to a real directory — e.g. a +/// dangling symlink left over from a hand-edited cache, or a broken +/// pull. `get_local_datasets` silently filters these out; doctor surfaces +/// them as a global failure so they don't go unnoticed. 
+fn enumerate_broken_dataset_entries(repos_dir: &Path) -> Vec { + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + let mut broken = Vec::new(); + let read = match std::fs::read_dir(repos_dir) { + Ok(r) => r, + Err(_) => return broken, + }; + for entry in read.flatten() { + let path = entry.path(); + let Some(name) = path.file_name().and_then(|n| n.to_str()) else { + continue; + }; + let looks_like_dataset = name.ends_with(&suffix) || name.ends_with("-data-pipeline"); + if !looks_like_dataset { + continue; + } + // `is_dir()` follows symlinks, so a dangling symlink reads false. + if !path.is_dir() { + broken.push(name.to_string()); + } + } + broken.sort(); + broken +} + +/// Minimal `{id,text,kind}` record drained from `govbot source --select docs`. +#[derive(Debug)] +struct DocRecord { + id: String, + text_len: usize, +} + +/// Invoke `govbot source --select docs --limit ` against the given +/// cache and return one `DocRecord` per emitted JSON line. We materialise +/// fully rather than streaming — the assertion set needs the whole corpus +/// before per-dataset ratios mean anything, and at the smoke-test limit +/// (default 100/repo, ~5000 records total) memory is a non-issue. 
+fn collect_doc_stream(govbot_dir: &str, limit: &str) -> std::io::Result> { + let exe = std::env::current_exe().unwrap_or_else(|_| PathBuf::from("govbot")); + let mut source_cmd = ProcessCommand::new(&exe); + source_cmd + .arg("source") + .arg("--select") + .arg("docs") + .arg("--limit") + .arg(limit) + .arg("--filter") + .arg("default") + .arg("--join") + .arg("bill") + .arg("--sort") + .arg("DESC") + .arg("--govbot-dir") + .arg(govbot_dir); + + let output = source_cmd.output()?; + if !output.status.success() { + let stderr_str = String::from_utf8_lossy(&output.stderr); + return Err(std::io::Error::other(format!( + "source exited with status {:?}: {}", + output.status.code(), + stderr_str + ))); + } + + let mut records = Vec::new(); + for line in output.stdout.split(|b| *b == b'\n') { + if line.is_empty() { + continue; + } + let v: serde_json::Value = match serde_json::from_slice(line) { + Ok(v) => v, + Err(_) => continue, // Best-effort — source itself logs the parse failure. + }; + let id = v + .get("id") + .and_then(|x| x.as_str()) + .unwrap_or("") + .to_string(); + let text_len = v + .get("text") + .and_then(|x| x.as_str()) + .map(|s| s.len()) + .unwrap_or(0); + // A record without an id will fail the per-dataset `unrouted` + // bucket (`parse_doc_route` returns None for the empty string), + // surfacing as a global routability failure. + records.push(DocRecord { id, text_len }); + } + Ok(records) +} + +/// Translate a doc id back to its on-disk `metadata.json` and confirm it +/// (a) exists, (b) parses as JSON, (c) has at least a `title` or `identifier` +/// field. The third leg is what would have caught 5ab6d3c — a dir-name vs +/// `log.bill_id` whitespace mismatch produces an id whose metadata.json +/// path simply doesn't exist on disk. 
+fn check_metadata_json(doc_id: &str, dataset_repo_dir: &Path) -> Result<(), String> { + let route = parse_doc_route(doc_id).ok_or_else(|| { + "id does not match expected `country:.../bills/` shape".to_string() + })?; + // Path: /country:/state:/sessions//bills//metadata.json + let metadata_path = dataset_repo_dir + .join(format!("country:{}", route.country)) + .join(format!("state:{}", route.state)) + .join("sessions") + .join(&route.session) + .join("bills") + .join(&route.bill_id) + .join("metadata.json"); + + if !metadata_path.exists() { + return Err(format!( + "metadata.json not found at {}", + metadata_path.display() + )); + } + let contents = fs::read_to_string(&metadata_path) + .map_err(|e| format!("cannot read {}: {}", metadata_path.display(), e))?; + let value: serde_json::Value = serde_json::from_str(&contents) + .map_err(|e| format!("invalid JSON in {}: {}", metadata_path.display(), e))?; + let has_title = value + .get("title") + .and_then(|v| v.as_str()) + .map(|s| !s.is_empty()) + .unwrap_or(false); + let has_identifier = value + .get("identifier") + .and_then(|v| v.as_str()) + .map(|s| !s.is_empty()) + .unwrap_or(false); + if !has_title && !has_identifier { + return Err(format!( + "metadata.json at {} has neither `title` nor `identifier`", + metadata_path.display() + )); + } + Ok(()) +} + +/// Human-readable doctor report. Per-dataset one-liners followed by the +/// global summary; failures get an indented detail block. 
+fn emit_doctor_text( + dataset_reports: &[DatasetReport], + total_records: usize, + total_distinct: usize, + unrouted: &[String], + broken_entries: &[String], + elapsed: std::time::Duration, + all_passed: bool, +) { + println!( + "govbot doctor — {} dataset(s), {} record(s), {} distinct id(s), {:.2}s", + dataset_reports.len(), + total_records, + total_distinct, + elapsed.as_secs_f64() + ); + println!(); + + for r in dataset_reports { + let status = if !r.passed() { + "FAIL" + } else if r.warned() { + "WARN" + } else { + "PASS" + }; + println!( + " [{}] {:<22} records={:<5} distinct={:<5} sampled={}", + status, r.dataset, r.record_count, r.distinct_ids, r.sampled + ); + for c in &r.checks { + if !c.passed { + println!(" - {}: FAIL", c.name); + for d in &c.detail { + println!(" • {}", d); + } + } else if c.warned { + println!(" - {}: WARN", c.name); + for d in &c.detail { + println!(" • {}", d); + } + } + } + } + + println!(); + if !broken_entries.is_empty() { + println!( + " [FAIL] global.dataset_links {} broken or non-dir entry/entries in repos/:", + broken_entries.len() + ); + for name in broken_entries.iter().take(MAX_FAIL_DETAIL) { + println!( + " • {} (likely a dangling symlink or non-directory)", + name + ); + } + if broken_entries.len() > MAX_FAIL_DETAIL { + println!( + " • ...and {} more", + broken_entries.len() - MAX_FAIL_DETAIL + ); + } + } else { + println!(" [PASS] global.dataset_links"); + } + + if !unrouted.is_empty() { + println!( + " [FAIL] global.routable_ids {} id(s) without a `/country:...` prefix:", + unrouted.len() + ); + for id in unrouted.iter().take(MAX_FAIL_DETAIL) { + println!(" • {}", id); + } + } else { + println!(" [PASS] global.routable_ids"); + } + + println!(); + if all_passed { + println!("doctor: PASS"); + } else { + println!("doctor: FAIL"); + } +} + +/// Machine-readable doctor report. Stable enough to pipe into CI. 
+fn emit_doctor_json( + dataset_reports: &[DatasetReport], + total_records: usize, + total_distinct: usize, + unrouted: &[String], + broken_entries: &[String], + elapsed: std::time::Duration, + all_passed: bool, +) { + let datasets: Vec = dataset_reports + .iter() + .map(|r| { + let checks: Vec = r + .checks + .iter() + .map(|c| { + serde_json::json!({ + "name": c.name, + "passed": c.passed, + "warned": c.warned, + "detail": c.detail, + }) + }) + .collect(); + serde_json::json!({ + "dataset": r.dataset, + "passed": r.passed(), + "record_count": r.record_count, + "distinct_ids": r.distinct_ids, + "sampled": r.sampled, + "checks": checks, + }) + }) + .collect(); + let report = serde_json::json!({ + "status": if all_passed { "pass" } else { "fail" }, + "elapsed_secs": elapsed.as_secs_f64(), + "total_records": total_records, + "total_distinct_ids": total_distinct, + "unrouted_ids": unrouted, + "broken_dataset_entries": broken_entries, + "datasets": datasets, + }); + println!("{}", serde_json::to_string_pretty(&report).unwrap()); +} + #[tokio::main] async fn main() -> anyhow::Result<()> { let args = Args::parse(); @@ -2811,6 +3474,7 @@ async fn main() -> anyhow::Result<()> { Some(cmd @ Command::Remove { .. }) => run_remove_command(cmd), Some(cmd @ Command::Ls { .. }) => run_ls_command(cmd), Some(cmd @ Command::Search { .. }) => run_search_command(cmd), + Some(cmd @ Command::Doctor { .. }) => run_doctor_command(cmd), None => { let cwd = std::env::current_dir()?; let config_path = cwd.join("govbot.yml"); @@ -3376,4 +4040,119 @@ mod tests { let absent = match_tags_in_dir(&tmp.path().join("no-such-dir"), "HB0001"); assert!(absent.is_empty()); } + + // ----------------------------------------------------------------- + // `govbot doctor` — the corpus-level smoke test. 
The full end-to-end + // path is exercised against real pulled data (see commit message for + // run details); these unit tests pin the failure-detection legs that + // would have caught bugs 7592418 and 5ab6d3c. + // ----------------------------------------------------------------- + + /// The metadata.json check is the leg that would have flagged 5ab6d3c + /// — a doc id whose dir-name was wrong (display form, with whitespace) + /// resolves to a non-existent metadata.json path. + #[test] + fn doctor_check_metadata_json_flags_missing_file() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("mi-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:mi") + .join("sessions") + .join("2025-2026_103rd_Legislature") + .join("bills") + .join("HB4027"); + fs::create_dir_all(&bill_dir).unwrap(); + // Write a well-formed metadata.json — happy path. + fs::write( + bill_dir.join("metadata.json"), + serde_json::to_string(&serde_json::json!({ + "title": "An Act…", + "identifier": "HB 4027", + })) + .unwrap(), + ) + .unwrap(); + + // A clean id resolves. + let good_id = + "mi-legislation/country:us/state:mi/sessions/2025-2026_103rd_Legislature/bills/HB4027"; + assert!(check_metadata_json(good_id, &dataset_dir).is_ok()); + + // The exact pre-5ab6d3c failure: log.bill_id `"HB 4027"` (with + // whitespace) bleeds into the doc id, and the on-disk dir is + // `HB4027` — so the metadata.json path doesn't exist. + let broken_id = + "mi-legislation/country:us/state:mi/sessions/2025-2026_103rd_Legislature/bills/HB 4027"; + let err = check_metadata_json(broken_id, &dataset_dir).unwrap_err(); + assert!( + err.contains("not found"), + "expected 'not found' in error, got: {}", + err + ); + } + + /// metadata.json present but lacking both `title` and `identifier` — + /// counts as a fail. This catches stub/empty-bill clones where the + /// scraper landed but populated nothing usable. 
+ #[test] + fn doctor_check_metadata_json_requires_title_or_identifier() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("wy-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025") + .join("bills") + .join("HB0001"); + fs::create_dir_all(&bill_dir).unwrap(); + fs::write( + bill_dir.join("metadata.json"), + // Neither title nor identifier — both empty / absent. + serde_json::to_string(&serde_json::json!({"description": "..."})).unwrap(), + ) + .unwrap(); + + let id = "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001"; + let err = check_metadata_json(id, &dataset_dir).unwrap_err(); + assert!(err.contains("neither `title` nor `identifier`")); + } + + /// `dataset_short_name` is the only place where the dataset prefix + /// in a doc id (`-legislation`) and the short_name returned by + /// `get_local_datasets` (``) meet. Getting this wrong silently + /// breaks per-dataset bucketing — every dataset shows zero coverage + /// even though records were emitted. Pin both common suffixes. + #[test] + fn doctor_dataset_short_name_strips_known_suffixes() { + assert_eq!(dataset_short_name("nj-legislation", "-legislation"), "nj"); + assert_eq!(dataset_short_name("usa-legislation", "-legislation"), "usa"); + // Legacy `-data-pipeline` layout — strip it too. + assert_eq!(dataset_short_name("wy-data-pipeline", "-legislation"), "wy"); + // Custom suffix from GOVBOT_REPO_SUFFIX is honoured. + assert_eq!(dataset_short_name("nj-pkg", "-pkg"), "nj"); + // Bare short_name (no suffix at all) passes through. + assert_eq!(dataset_short_name("wy", "-legislation"), "wy"); + } + + /// metadata.json is unreadable JSON — that's still a fail (we can't + /// trust a record whose bill metadata won't even parse). 
+ #[test] + fn doctor_check_metadata_json_flags_unparseable() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("ca-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:ca") + .join("sessions") + .join("2025-2026") + .join("bills") + .join("AB100"); + fs::create_dir_all(&bill_dir).unwrap(); + fs::write(bill_dir.join("metadata.json"), b"{ this is not json").unwrap(); + let id = "ca-legislation/country:us/state:ca/sessions/2025-2026/bills/AB100"; + let err = check_metadata_json(id, &dataset_dir).unwrap_err(); + assert!(err.contains("invalid JSON")); + } } diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 6d444261..61cadd78 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -24,6 +24,7 @@ Commands: remove Remove one or more datasets from the project's `govbot.yml` ls List datasets — the project's manifest datasets and the ones cached locally. With no manifest, lists every dataset in the registry search Search the dataset registry. A blank query lists every dataset + doctor Check that the project's pulled datasets are coherent. A data-integrity smoke test, runnable after `govbot pull all` or before `govbot run` in production. Walks every linked dataset and verifies that the `govbot source --select docs` stream is well-formed: every linked dataset entry resolves to a real directory, per-dataset ids don't collapse onto a handful (the bug-7592418 signature), every sampled `id` resolves to a present and parseable `metadata.json`, and every sampled `text` is non-trivial. Zero-record datasets are surfaced as warnings rather than errors — `--filter default` can legitimately drop every routine log. Exits non-zero on any failure so it can drop straight into a CI step. 
Skips cleanly when the cache is empty — this is a smoke test, not a unit test help Print this message or the help of the given subcommand(s) Options: From 86030e9089c0e812203f16ae260b68a1cab1f818 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 18:32:37 -0500 Subject: [PATCH 18/32] AGENT.md: install the semantic Tier-2 model in the make flow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A scaffolded classifier bundle has the taxonomy and the fusion config but no embedding model — so the cascade in fusion.yml's `uncertainty_band` silently degrades to lexical-only matchers and the bot misses paraphrases and euphemisms ("energy diversity" never matches `clean_energy`). Real-data audits show this as a 10–15 point recall gap. Add a "Install the semantic Tier-2 model" step to §1.3 of the make flow, right after the fastclass plugin is wired into `.claude/settings.json`. Tells the activist to run `/fastclass:install-model` (the new plugin command in the fastclass repo), explains why, and shows the `fastclass describe` verification that the install worked. The plugin command does the actual work — fetches the recommended sentence-transformers/all-MiniLM-L6-v2 (~22 MB) into the shared `~/.govbot/models//` cache and links it into `classifier/model/`. The install path is idempotent: re-running on a warm cache hits the cache. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/AGENT.md b/AGENT.md index ed6df78a..c760bb1c 100644 --- a/AGENT.md +++ b/AGENT.md @@ -395,6 +395,37 @@ Confirm the exact plugin-source syntax against the fastclass repo's README (`plugins/fastclass/`); adjust if the user's fastclass checkout lives elsewhere. +#### Install the semantic Tier-2 model + +A scaffolded classifier bundle has the taxonomy and fusion config — but no +embedding model. 
Without one, the cascade in `fusion.yml`'s +`uncertainty_band` silently degrades to lexical-only matchers, which means +the bot will **miss paraphrases and euphemisms** (real-data audits typically +show this as a 10–15 point recall gap on issue-flavored language: "energy +diversity" never matches `clean_energy`, etc.). + +Fix this once, at scaffold time, by running the install-model plugin +command: + +``` +/fastclass:install-model +``` + +The command shows the vetted-model list, defaults to the recommended small +encoder (sentence-transformers/all-MiniLM-L6-v2, ~22 MB), downloads it into +the project-shared cache at `~/.govbot/models//`, and links it +into `classifier/model/` so `govbot run` picks up Tier-2 automatically on +the next pipeline pass. Verify with: + +```bash +fastclass describe classifier=./classifier +# JSON output should include a `model: {…}` block. +``` + +If the download fails (offline laptop, HuggingFace rate-limit), the CLI +prints a `curl` recipe the user can run themselves and re-invoke the +plugin command — the install path is idempotent. + ### 1.4 First run ```bash From cb3d80ff192f32b4b1a3678a9b7607813d081906 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 18:34:40 -0500 Subject: [PATCH 19/32] source: --select docs surfaces OCD subjects as optional structured signal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit OCD-files `metadata.json` carries a high-quality `subject:` array assigned by human OCD scrapers (e.g. ["ENERGY", "ENVIRONMENT", "TAXATION"]). Today the `--select docs` projection drops these on the floor; a Wave-A `concept_match` matcher in fastclass will read them as a direct, gold- standard classification input rather than re-deriving topic signals from the bill text. 

Schema change (additive, optional):

  Before: {"id": ..., "kind": "docs", "text": ...}
  After:  {"id": ..., "kind": "docs", "text": ..., "subjects": ["..."]}

`subjects` is **omitted entirely** when the bill has no `subject:`, when
the array is empty (`[]`), or when every element is blank. Empty equals
absent: emitting `"subjects": []` would conflate "no signal" with
"explicitly no subjects" and force every consumer to handle two identical
cases. Bare log records (no `--join bill`) also omit it.

STREAM_PROTOCOL.md §1 now documents the new field's shape, source,
optionality, and the contract that downstream transforms unaware of
`subjects` ignore it (the contract is additive).

Real-data intel from the 185k-bill, 55-jurisdiction corpus at
`/.govbot/repos/`:

- 45.4% of bills have a non-empty `subject:` (84,196 / 185,335)
- Coverage is bimodal: 31 states have ≥50% coverage; 24 states have 0%.
- High-coverage examples: HI 100% / 7264 distinct subjects, CA 95% / 5945
  distinct, TN 100% / 409 distinct, MN 88% / 245 distinct.
- Zero-coverage examples: NY (19407 bills), USA-federal (11317), OK (9262),
  MA (8975), IL (8648), RI (5495), PA (3578), WA (3411).
- 24,854 distinct subject strings corpus-wide. Vocabulary is not unified
  across states (HI alone uses 7264 distinct; many are state-local terms
  like "Cities and Towns-Specific" or "SCH BOARDS").
- Climate-relevant subject hits: ENERGY 1608, ENVIRONMENT 1834, CLIMATE 153,
  EMISSIONS 57, RENEWABLE 152, SOLAR 105, CARBON 113, POLLUTION 335,
  CONSERVATION 431, WATER 1798.

Tests (+5, 61 -> 66 total): subjects present, subject key absent, subject
`[]` (explicit empty), all-blank elements, and bare-log / no-bill-join.
cargo fmt clean.

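
The "empty equals absent" rule can be sketched as a small normalisation
step. This is only an illustration under stated assumptions, not the code
in `selectors.rs`: `normalize_subjects` is a hypothetical helper, it assumes
the raw `subject:` array has already been read into an `Option<Vec<String>>`,
and dropping blank elements from a mixed array (rather than keeping them) is
a guess about how such input might be handled.

```rust
/// Hypothetical helper (not part of this patch): collapse "key absent",
/// "empty array", and "every element blank" into a single `None`, so a
/// caller never has a reason to emit `"subjects": []`.
fn normalize_subjects(raw: Option<Vec<String>>) -> Option<Vec<String>> {
    // `raw?` returns early with `None` when the key was absent.
    // Assumption: whitespace-only elements are dropped; the remaining
    // elements keep their original order and spelling.
    let kept: Vec<String> = raw?
        .into_iter()
        .filter(|s| !s.trim().is_empty())
        .collect();

    if kept.is_empty() {
        // Empty and all-blank arrays are treated the same as "absent".
        None
    } else {
        Some(kept)
    }
}
```

With a shape like this, a consumer only has to distinguish "has subjects"
from "no signal", which is the single case split the field description
above asks for.
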
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 173 +++++++++++++++++++++++++++++++- actions/govbot/src/selectors.rs | 55 ++++++++++ schemas/STREAM_PROTOCOL.md | 14 ++- 3 files changed, 239 insertions(+), 3 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 4c7a35f7..151a43c5 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -5,7 +5,7 @@ use govbot::git; use govbot::lock::LockFile; use govbot::publish::{deduplicate_entries, filter_by_tags, load_manifest, sort_by_timestamp}; use govbot::registry::Registry; -use govbot::selectors::ocd_files_select_default; +use govbot::selectors::{ocd_files_extract_subjects, ocd_files_select_default}; use govbot::{hash_text, BillTagResult, TagFile, TagFileMetadata}; use jwalk::WalkDir; use std::collections::HashMap; @@ -914,6 +914,13 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { /// /// `text` is the **full** bill text assembled from `metadata.json` (not just /// titles) — the `docs` projection joins the complete bill so this is whole. +/// +/// `subjects` is the **optional** OCD `subject:` array, surfaced as a +/// peer of `text` so a downstream `concept_match` matcher can score against +/// the human-curated controlled vocabulary directly. The field is **omitted +/// entirely** when the bill has no `subject:` (vs. an empty `[]`, which +/// would conflate "no signal" with "explicitly empty") — see +/// `selectors::ocd_files_extract_subjects` and STREAM_PROTOCOL.md §1. 
fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { let bill_id = entry .get("log") @@ -984,7 +991,31 @@ fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { } None => canonical_bill_dir.or(bill_id).unwrap_or_else(String::new), }; - serde_json::json!({ "id": id, "text": ocd_files_select_default(entry), "kind": "docs" }) + let mut out = serde_json::Map::new(); + out.insert("id".to_string(), serde_json::Value::String(id)); + out.insert( + "text".to_string(), + serde_json::Value::String(ocd_files_select_default(entry)), + ); + out.insert( + "kind".to_string(), + serde_json::Value::String("docs".to_string()), + ); + // Optional `subjects:` — only emitted when the bill actually carries one + // or more non-empty OCD `subject:` entries. `None` is the unambiguous + // "no signal" form; we never emit `"subjects": []`. + if let Some(subjects) = ocd_files_extract_subjects(entry) { + out.insert( + "subjects".to_string(), + serde_json::Value::Array( + subjects + .into_iter() + .map(serde_json::Value::String) + .collect(), + ), + ); + } + serde_json::Value::Object(out) } /// Given a `sources.bill` path (`<...>/bills//metadata.json`, @@ -3894,6 +3925,144 @@ mod tests { ); } + /// A4: OCD `subject:` arrays are gold-standard human classifications that + /// fastclass's future `concept_match` matcher reads. When the bill carries + /// a populated `subject:` list, the docs projection must surface it under + /// `subjects` so it travels with the rest of the bill text. 
+ #[test] + fn ocd_entry_to_doc_surfaces_subjects_when_present() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An act about clean energy", + "identifier": "HB0001", + "subject": ["ENERGY", "ENVIRONMENT", "TAXATION"] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let subjects = doc + .get("subjects") + .and_then(|v| v.as_array()) + .expect("subjects must be present and an array when bill carries subject:"); + let actual: Vec<&str> = subjects.iter().filter_map(|v| v.as_str()).collect(); + assert_eq!( + actual, + vec!["ENERGY", "ENVIRONMENT", "TAXATION"], + "subjects must mirror the OCD subject: array verbatim and in order" + ); + // The rest of the contract — id/text/kind — must be unaffected by + // the additive field. + assert_eq!(doc.get("kind").and_then(|v| v.as_str()), Some("docs")); + assert!( + doc.get("text") + .and_then(|v| v.as_str()) + .unwrap_or("") + .contains("clean energy"), + "existing text projection must still include the bill title" + ); + } + + /// A4: When the bill has no `subject:` key at all, the docs record must + /// have **no `subjects` key** (not `"subjects": []`). Many states omit + /// the OCD subject array entirely; conflating that with "explicitly + /// empty" would force the consumer to guess. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_bill_has_no_subject_key() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An untagged bill", + "identifier": "HB0001" + // No subject: key at all. 
+ }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted entirely when bill has no subject: field; got: {:?}", + doc.get("subjects") + ); + } + + /// A4: An explicitly empty `subject: []` is treated the same as missing — + /// no `subjects` key in the output. WY's `HB0001` mock has `subject: []` + /// for example; we don't want every WY record to ship `"subjects": []` + /// just because the OCD scraper materialized an empty list. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_subject_array_is_empty() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An empty-subjects bill", + "identifier": "HB0001", + "subject": [] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted for explicit empty arrays — empty conflates with \ + absent and breaks the 'present means signal' contract; got: {:?}", + doc.get("subjects") + ); + } + + /// A4: A `subject:` array with only blank strings is treated as empty — + /// the trim-then-filter pass means whitespace-only entries don't make it + /// into the projection, and a list of all-blank entries omits the field. 
+ #[test] + fn ocd_entry_to_doc_omits_subjects_when_subject_array_is_all_blank() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "A whitespace-only-subjects bill", + "identifier": "HB0001", + "subject": ["", " "] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted when every subject element is blank/whitespace" + ); + } + + /// A4: When the entry is a bare `log` record (no `--join bill`), + /// `subjects` cannot be derived — there's no bill metadata to read from. + /// The field must be omitted. This is the same fallback path as the id + /// resolution above; without the bill join we have no `subject:` source. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_bill_join_absent() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json" + // No `sources.bill`, no `bill` join. + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted when the bill metadata isn't joined into the entry" + ); + } + /// `.govbot/` is the cache; tag files belong outside it in the project- /// rooted `tags/` output dir. 
The resolver's primary candidate must /// therefore be `/tags//country:.../state:.../sessions/ diff --git a/actions/govbot/src/selectors.rs b/actions/govbot/src/selectors.rs index 41c478ed..730e9387 100644 --- a/actions/govbot/src/selectors.rs +++ b/actions/govbot/src/selectors.rs @@ -21,6 +21,61 @@ pub fn ocd_files_select_default(value: &serde_json::Value) -> String { .join(" ") } +/// Extract the OCD `subject:` array — the gold-standard structured topic +/// classification a human OCD scraper assigned to the bill. +/// +/// This is the input the `docs` projection adds as an optional `subjects` +/// field so downstream transforms (e.g. fastclass's `concept_match` matcher) +/// can use the controlled-vocabulary signal directly instead of re-deriving +/// it from the bill text. +/// +/// Returns: +/// - `Some(non-empty Vec)` when `metadata.json` has a `subject:` array with +/// at least one non-empty string. +/// - `None` when: +/// - the entry has no bill metadata join (`--join bill` not requested), +/// - the bill metadata has no `subject:` key, +/// - the `subject:` array is empty (`[]`), or +/// - every element is a blank string. +/// +/// **Why empty == None.** Many states populate `subject:` for some bills and +/// leave it `[]` for others; emitting `"subjects": []` would conflate +/// "no signal" with "explicitly no subjects" and force the consumer to +/// distinguish them. Omitting the field entirely is the unambiguous +/// "no signal" form per STREAM_PROTOCOL §1. 
+pub fn ocd_files_extract_subjects(value: &serde_json::Value) -> Option<Vec<String>> { + let bill = bill_object(value)?; + let raw = bill.get("subject")?.as_array()?; + let subjects: Vec<String> = raw + .iter() + .filter_map(|v| v.as_str()) + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty()) + .collect(); + if subjects.is_empty() { + None + } else { + Some(subjects) + } +} + +/// Find the bill `metadata.json` object inside an entry, mirroring how +/// `collect_bill_text` routes between the three wrapping shapes: +/// - `{ "bill": { ... } }` — the joined form +/// - `{ "log": { ... } }` — bare log; no bill metadata available +/// - `{ ... }` — the map *is* a bill metadata.json +fn bill_object(value: &serde_json::Value) -> Option<&serde_json::Map<String, serde_json::Value>> { + let map = value.as_object()?; + if let Some(bill) = map.get("bill").and_then(|v| v.as_object()) { + return Some(bill); + } + if map.contains_key("log") { + // Bare log entry — `subject:` lives on the bill, which isn't joined. + return None; + } + Some(map) +} + /// Append every text-bearing string of an OCD-files value into `texts`. fn collect_bill_text(value: &serde_json::Value, texts: &mut Vec<String>) { match value { diff --git a/schemas/STREAM_PROTOCOL.md b/schemas/STREAM_PROTOCOL.md index e9105f4b..baa9a7b5 100644 --- a/schemas/STREAM_PROTOCOL.md +++ b/schemas/STREAM_PROTOCOL.md @@ -17,7 +17,7 @@ boundaries** (newline-delimited JSON on stdio), never as linked libraries. **newline-delimited JSON** — one object per line, UTF-8, `\n`-terminated: ```json -{"id": "", "text": "", "kind": "docs"} +{"id": "", "text": "", "kind": "docs", "subjects": ["ENERGY", "ENVIRONMENT"]} ``` - **`id`** — an opaque routing key. The transform treats it as opaque and **echoes it @@ -28,6 +28,18 @@ boundaries** (newline-delimited JSON on stdio), never as linked libraries. - **`kind`** — **required**. Tags the stream record type (`docs` today; future `summary`, etc.). 
A transform that does not recognize a `kind` **passes the record through untouched** rather than erroring. +- **`subjects`** — **optional**. When the source is an OCD-files bill whose + `metadata.json` carries a non-empty `subject:` array (e.g. `["ENERGY", + "ENVIRONMENT", "TAXATION"]`), govbot's `docs` projection surfaces those tags + here verbatim. These are gold-standard structured classifications assigned + by human OCD scrapers and are the canonical input a `concept_match`-style + matcher should consume rather than re-deriving topic signals from `text`. + The field is **omitted entirely** when the bill has no `subject:` key, when + the array is empty (`[]`), or when every element is blank — "no signal" + is unambiguous, so consumers never have to distinguish "absent" from + "explicitly empty". Bare log records (no bill metadata joined) also omit + it. Transforms that don't know about `subjects` ignore it; the stream + contract is additive. A transform reads this stream line-by-line and emits one result line per input line. From b2c22202d81febd49c2d25b260ac755a38070c0d Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 20:14:29 -0500 Subject: [PATCH 20/32] =?UTF-8?q?docs:=20document=20the=20optional=20`mode?= =?UTF-8?q?l`=20block=20in=20describe=20=C2=A73?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Cross-domain pickup from fastclass's govbot-stack-refactor branch (commits 34b0038 + c12dc79): `fastclass model fetch` installs a Tier-2 embedding model under `/model/`, and `fastclass describe` now surfaces that install as an optional `model: {name?, sha256_prefix}` block in its JSON output. STREAM_PROTOCOL §3 is the contract govbot type-checks transforms against, so the field needs to live here too — patterned after §1's treatment of `subjects` (additive, omitted when absent, byte-identical legacy output). Doc-only — no code touched, all 66 govbot tests still pass offline. 
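The "omitted when absent" shape of the `model` block can be sketched as follows. This is an illustration under assumptions, not fastclass's implementation: `ModelBlock`, `describe_model`, and the `(prefix, name)` table standing in for a vetted-model list are all hypothetical names; only the 12-hex prefix and the optional `name` come from the documented contract.

```rust
/// Hypothetical in-memory form of the optional `model` block.
#[derive(Debug, PartialEq)]
struct ModelBlock {
    /// Present only when the prefix matches an entry in the vetted table.
    name: Option<String>,
    /// First 12 hex characters of the model file's SHA-256.
    sha256_prefix: String,
}

/// Returns `None` when no model is installed, so the block is omitted.
/// `known_models` is an assumed (prefix, name) table, not a real API.
fn describe_model(
    installed_sha256: Option<&str>,
    known_models: &[(&str, &str)],
) -> Option<ModelBlock> {
    let sha = installed_sha256?;
    let prefix: String = sha.chars().take(12).collect();
    let name = known_models
        .iter()
        .find(|entry| entry.0 == prefix.as_str())
        .map(|entry| entry.1.to_string());
    Some(ModelBlock {
        name,
        sha256_prefix: prefix,
    })
}
```

A custom model whose hash is not in the table still yields a block, just without `name`; a lexical-only bundle yields no block at all.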
Co-Authored-By: Claude Opus 4.7 --- schemas/STREAM_PROTOCOL.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/schemas/STREAM_PROTOCOL.md b/schemas/STREAM_PROTOCOL.md index baa9a7b5..feb04a8b 100644 --- a/schemas/STREAM_PROTOCOL.md +++ b/schemas/STREAM_PROTOCOL.md @@ -84,12 +84,25 @@ type-check a transform DAG and validate that `publish.*.select:` tag names exist "writes": ["classification"], "tags": ["clean_energy", "conservation", "emissions_and_climate", "fossil_fuels"], "classifier_version": "sha256:<12-hex>", - "fusion_version": "fusion-v1" + "fusion_version": "fusion-v1", + "model": {"name": "sentence-transformers/all-MiniLM-L6-v2", "sha256_prefix": "<12-hex>"} } ``` - `tags` is the sorted list of active tag names from the bundle. - `describe` is a **subcommand** (not a `classify` flag). +- **`model`** — **optional**. Present iff an embedding model is installed at + `/model/` (the Tier-2 semantic matcher; installed via + `fastclass model fetch` or the `/fastclass:install-model` plugin command). + Shape is `{name?: string, sha256_prefix: string}`: `sha256_prefix` is the + first 12 hex chars of the model file's SHA-256; `name` is the + `KNOWN_MODELS` identifier (e.g. `sentence-transformers/all-MiniLM-L6-v2`) + when the prefix matches a vetted entry, and is **omitted** for a + user-staged custom model whose SHA isn't on the vetted list. The block is + **omitted entirely** for a lexical-only bundle — like `subjects` in §1, + this is additive: consumers that don't know about `model` ignore it, and + the lexical-only describe output is byte-identical to the pre-Tier-2 + contract. ## 4. 
The classifier-bundle layout From 21f1fc1d5cb0c9e59a0080aebac9d00a2e0ad1df Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sat, 23 May 2026 23:07:31 -0500 Subject: [PATCH 21/32] =?UTF-8?q?docs:=20document=20the=20optional=20`mode?= =?UTF-8?q?l=5Frerank`=20block=20in=20describe=20=C2=A73?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mirrors the `model` paragraph for the reranker peer added in fastclass A3. Also surfaces `model-rerank/` in the §4 bundle layout for symmetry with `model/`. Co-Authored-By: Claude Opus 4.7 --- schemas/STREAM_PROTOCOL.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/schemas/STREAM_PROTOCOL.md b/schemas/STREAM_PROTOCOL.md index feb04a8b..5f4290b8 100644 --- a/schemas/STREAM_PROTOCOL.md +++ b/schemas/STREAM_PROTOCOL.md @@ -85,7 +85,8 @@ type-check a transform DAG and validate that `publish.*.select:` tag names exist "tags": ["clean_energy", "conservation", "emissions_and_climate", "fossil_fuels"], "classifier_version": "sha256:<12-hex>", "fusion_version": "fusion-v1", - "model": {"name": "sentence-transformers/all-MiniLM-L6-v2", "sha256_prefix": "<12-hex>"} + "model": {"name": "sentence-transformers/all-MiniLM-L6-v2", "sha256_prefix": "<12-hex>"}, + "model_rerank": {"name": "cross-encoder/ms-marco-MiniLM-L-6-v2", "sha256_prefix": "<12-hex>"} } ``` @@ -103,6 +104,16 @@ type-check a transform DAG and validate that `publish.*.select:` tag names exist this is additive: consumers that don't know about `model` ignore it, and the lexical-only describe output is byte-identical to the pre-Tier-2 contract. +- **`model_rerank`** — **optional**. Present iff a reranker model is + installed at `/model-rerank/` (sibling of `/model/`). + Same shape as `model` (`{name?: string, sha256_prefix: string}`) and + same `name` rule: set to the `KNOWN_MODELS` row when the SHA matches a + vetted entry, **omitted** for a user-staged reranker whose SHA isn't on + the vetted list. 
The block is **omitted entirely** for a bundle without + a reranker installed. Additive in the same way as `model`: consumers + that don't know about `model_rerank` ignore it, and a bundle with no + reranker produces describe output byte-identical to the pre-rerank + contract. ## 4. The classifier-bundle layout @@ -119,6 +130,7 @@ passes the path (`classifier=`). `fastclass` must NOT know the word "gov rolling.yml refreshable working eval set (optional) proposals/ improvement-proposal history model/ optional embedding model + model-rerank/ optional reranker model (sibling of model/) fastclass.lock pins bundle + binary versions for lineage ``` From 99bb7eac59098ce62525da4e4c73d82e3df3995b Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sun, 24 May 2026 00:26:55 -0500 Subject: [PATCH 22/32] docs: reframe README/AGENT/CLAUDE/cli around the 4-tool govbot stack MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit govbot is a 4-tool stack for civic-data publishing — pull real legislative data, filter by what you care about, publish with receipts, all from a coding-agent-native dev experience designed to run at nearly-free cost on commodity infrastructure (GitHub Actions + laptop + local models). The first user is climate-activist; the success bar is "Bluesky posts worth reading, nearly free to run/improve". This commit re-leads every doc surface that carries govbot's positioning around that 4-tool framing, with an explicit honest gap map so the docs don't oversell what is not yet built. No code logic changes — README, AGENT, CLAUDE, Cargo.toml description, the main.rs module rustdoc and the clap `about` for `govbot --help`, plus the regenerated `govbot --help` snapshot. The honest gap map (named in every surface): - Select real gov data: sponsors + voting records are captured in metadata but not yet projected into `--select docs`; the "<1 minute" pull is the warm-cache case (cold ~3 min). 
- Filter/transform: fastclass tagging ships; the planned `summarize` transform (local-LLM digests of grouped bills with a deterministic trace) does not yet exist — userland holds a prompt stub. - Publish with receipts: RSS/HTML/JSON/DuckDB/Bluesky ship; X does not; AI digest publishing does not; the receipts artifact (GitHub Pages page carrying model id + source bills + fastclass reasoning + regen command behind every AI digest) is a new capability not yet built. - Coding-agent-native dev experience: AGENT.md make/manage/update flow + fastclass plugin commands + `govbot doctor` already ship; this is the one tool already shipping its vision. AGENT.md §2 (manage) introduces fastclass's `--autonomous` flag as the activist-default after first ratification — the no-ratify apply path that lets the crew run hands-off between ratifications while keeping the audit trail via `generated_by: autonomous-coverage-gap` in `fastclass.lock`. (Cross-domain pickup from fastclass commit 5f2b9c6.) Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 57 +++++++++++-- CLAUDE.md | 42 ++++++++++ README.md | 83 ++++++++++++++++++- actions/govbot/Cargo.toml | 2 +- actions/govbot/src/main.rs | 48 ++++++++++- ...i_example_snaps__snapshot@govbot_help.snap | 3 +- 6 files changed, 222 insertions(+), 13 deletions(-) diff --git a/AGENT.md b/AGENT.md index c760bb1c..4b2a925c 100644 --- a/AGENT.md +++ b/AGENT.md @@ -1,9 +1,32 @@ # AGENT.md — build a government-news bot with govbot -You are a Claude Code session helping a user stand up, operate, or evolve a -**govbot newsbot** — a project that pulls government legislation, classifies -the bills relevant to an issue the user cares about, and publishes the matches -(today, to a Bluesky account). 
+You are a Claude Code session helping an activist stand up, operate, or +evolve a **govbot newsbot** — a project that pulls real legislative data, +filters it down to the issue the activist cares about, and publishes the +matches (today, to a Bluesky account) at **nearly-free** running cost. + +govbot is a **4-tool stack** and the playbook below follows that shape: + +1. **Select real gov data** — `govbot pull` clones the legislation of all + 50 states, DC, the territories, and federal Congress from a content- + addressed registry of git repos. Scrapers thanks to OpenStates. +2. **Filter / transform** — fastclass tags each bill against an issue + taxonomy the activist owns; the publishers filter on those tags. The + planned `summarize` transform (local-LLM digests of grouped bills with + a trace of model + source data) is not yet built — userland keeps a + `summarizer/prompt.md` stub for when it lands. +3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky + posting bot today; X and a "receipts" GitHub Pages artifact (the + deterministic provenance behind every AI digest: model id, source + bills, fastclass reasoning, regen recipe) are roadmap. +4. **Coding-agent-native dev experience** — *this file is tool #4*. A + fresh Claude Code session reads it and can make / manage / update a + project end-to-end with no other onboarding. + +The cost bar is climate-activist's: **nearly free to run, worth reading**. +If a choice in the playbook would push the activist toward a paid API +when a local model would do, push back; if a choice would make a post less +trustworthy, prefer the choice that ships the receipt. This file is the **end-user playbook**. A fresh session loads it by URL: @@ -27,8 +50,13 @@ assume climate. ## The three jobs -A user comes to you for one of three things. Identify which, then jump to that -section. +A user comes to you for one of three things. Identify which, then jump to +that section. 
Each job exercises the 4-tool stack from a different angle: +**make** scaffolds the pull+filter+publish chain (today's MVP — does NOT +yet scaffold a summarize transform or a receipts page, neither of which +exists); **manage** keeps the loop running and introduces fastclass's +`--autonomous` mode after first ratification (the activist-default for +hands-off improvement); **update** evolves the stack. | Job | The user says… | Section | |---|---|---| @@ -457,6 +485,23 @@ The `bluesky` publisher is a **posting bot**: it posts to a normal Bluesky account via the AT Protocol and runs to completion (no server). It is idempotent — a posted-state ledger keeps re-runs from double-posting. +**Activist default after first ratification: `--autonomous`.** Once the +activist has ratified one classifier proposal end-to-end (so they have +felt the loop once) — the `--autonomous` flag on +`fastclass classify --promote` becomes the recommended ongoing posture. +With `--autonomous`, proposals that pass the frozen constitution gate +apply as usual, and proposals where the constitution is silent +(coverage gap) re-test against the rolling eval set and land if rolling +proves them safe (flips at least one rolling failure to passing, +regresses nothing, no per-tag precision loss). The `fastclass.lock` +file marks autonomously-applied proposals with +`generated_by: autonomous-coverage-gap`, so the audit trail is +preserved — the receipt story extends into the classifier. This is the +mode that lets the activist crew run the bot **hands-off between +ratifications** without giving up provenance, which is the whole reason +the cost story is "nearly free to operate". See §3 for the per-proposal +flow you walked through first. + ### 2.1 Create the app password 1. In the Bluesky app: **Settings → App Passwords → Add App Password**. 
diff --git a/CLAUDE.md b/CLAUDE.md index dbd4797a..04739bfe 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,6 +4,48 @@ This file provides senior engineering-level guidance for Claude Code when workin ## Project Overview +**govbot is a 4-tool stack for civic-data publishing**, built so an +activist crew can run a credible news-bot at nearly-free cost on commodity +infrastructure (GitHub Actions + a laptop with local models). The stack +exists to clear one bar: the first user, the **climate-activist** userland +repo, must be able to ship Bluesky posts that are "worth reading" at +"nearly free to run/improve". Every architectural choice in this repo +should be checked against that. + +The 4 tools, with the honest state of each: + +1. **Select real gov data** — `govbot pull` over 55 OCD dataset git repos + (every US state + DC + territories + federal Congress), content- + addressed in `~/.govbot/cache/`. `govbot doctor` validates. Today + `govbot source --select docs` ships bill text + subjects; **sponsors + and voting records are captured in metadata but not yet projected + into `--select docs`** — a recall gap for sponsor-pattern signals. +2. **Filter / transform** — fastclass tagging is the shipped transform + (Wave A). The planned **`summarize` transform** (local-LLM digests + of grouped bills, emitted with model id + source bill ids + prompt + revision so the digest is reproducible) **does not exist** — + userland holds a `summarizer/prompt.md` stub. +3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky + posting bot ship today. **X is not built. AI digest publishing is + not built.** **"Receipts" as defined in the vision** — a GitHub + Pages artifact carrying the deterministic provenance behind every + AI digest (model used, source bill ids, fastclass scores + + reasoning, regen command) — **is a new capability that does not yet + exist**. 
The current classification evidence chains carry most of + the data a receipt would need; they are not yet packaged into a + public artifact. +4. **Coding-agent-native dev experience** — `AGENT.md` provides the + make/manage/update flow that a fresh Claude Code session can follow + without other onboarding. The fastclass plugin + (`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`, + `/fastclass:install-model`) handles the classifier loop. `govbot + doctor` validates installations. This is the one tool that is + already shipping its vision. + +Operators: keep the gap map above honest as features land. The README's +Roadmap section is the public version of this list; this CLAUDE.md is the +internal version, biased toward what the code actually does today. + This is **govbot** - a monorepo for distributed data analysis of government updates. Git repos function as datasets, including legislation from 47+ states/jurisdictions. The `actions/` folder contains self-contained modules that can run as shell scripts or GitHub Actions. ## Senior Engineering Prompts diff --git a/README.md b/README.md index b938b75a..34101b8c 100644 --- a/README.md +++ b/README.md @@ -5,10 +5,85 @@ # 🏛️ govbot -- Download the legislation of [50+ states/jurisdictions](https://github.com/govbot-data) in under 1 minute. -- Classify and summarize bills with private/local models — runs on free GitHub Actions. - -`govbot` is a CLI for distributed analysis of government data. Git repos function as datasets — the legislation of every US state, DC, the territories, and federal Congress. It composes with [`fastclass`](#classifying-with-fastclass) (the classifier) over a Unix pipe; together they pull, classify, and publish a tagged feed of legislation to RSS, HTML, JSON, DuckDB, or a Bluesky posting bot. +**govbot is a 4-tool stack for civic-data publishing** — pull real legislative +data, filter by what you care about, publish with receipts, all from a +coding-agent-native dev experience. 
The whole stack is designed to run on +free GitHub Actions and a local laptop with local models, so a small +volunteer crew can stand up a credible bot and **keep it running for +~nothing**. + +The first user is **climate-activist**, a userland repo that turns the +country's legislative activity into a Bluesky feed worth reading at +nearly-free cost. Everything in this README is in service of that bar: if +climate-activist cannot ship a "worth reading, nearly free to run" post, +govbot has not earned the framing. + +### The 4 tools + +1. **Select real gov data** — pull the legislative activity of all 50 + states, DC, the territories, and federal Congress from a registry of git + repos (`govbot pull`, scrapers thanks to [OpenStates](https://openstates.org)). + Repos are content-addressed; a second pull (here or in another project) + is a cache hit, not a re-clone. `govbot doctor` validates the cache. + *Today:* bill text + subjects ship via `govbot source --select docs`. + *Honest gap:* sponsors and voting records exist in the underlying + metadata but are not yet in the `--select docs` projection; the "under + 1 minute" headline is the warm-cache case (a cold clone of all 55 + datasets is closer to 3 minutes). + +2. **Filter / transform** (map / filter / reduce — *find the relevant + bills*) — any transform over the stream. The shipped transform today is + **fastclass tagging**: a low-token, high-quality text classifier that + tags bills against an issue taxonomy the user owns, then filters to + what crosses a confidence threshold. *Honest gap:* the planned + **`summarize` transform** — a local-LLM digest of 1–n grouped bills + that emits the summary alongside its data-source trace and model + identity — is not yet built. Userland holds a `summarizer/prompt.md` + stub; the code does not exist. + +3. **Publish with receipts** — many surfaces (RSS, HTML, JSON, DuckDB, + Bluesky today; X planned). 
The defining idea: every AI-generated + digest links back to **deterministic provenance** — the model used, + the source data, the fastclass reasoning chain, and the recipe to + regenerate it — published as a GitHub Pages "receipt" page next to the + short Bluesky post. *Honest gap:* the AI digest publisher and the + receipt artifact are not yet built. Today's publishers carry + classification evidence chains internally but do not yet package them + into a public, auditable receipt page. + +4. **Coding-agent-native dev experience** — `AGENT.md` is a self-contained + playbook a fresh Claude Code session can follow to **make, manage, and + update** a govbot project. The fastclass plugin + (`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`, + `/fastclass:install-model`) handles the classifier loop end-to-end. + `govbot doctor` validates an installation. The "build your own + high-quality, low-cost govbot" path is the one tool that is already + working today. + +### Roadmap (honest gap map) + +Things named in the vision that **do not exist yet**, in priority order: + +- **Sponsors + voting records in `--select docs`.** The underlying scrapers + capture them; the source projection does not yet expose them to + classifiers and digesters. Closes a known recall gap on + sponsor-pattern signals. +- **The `summarize` transform.** A local-LLM digest of grouped bills that + emits the summary plus a structured trace (model id, source bill ids, + prompt revision) so the digest is reproducible. +- **Receipts.** A GitHub Pages artifact published alongside every AI + digest post: human-readable on top, full deterministic provenance + (source bills, model, fastclass scores + reasoning, regen command) + underneath. The short post links to the receipt; the receipt is the + source of trust. +- **X publisher.** Same idempotent posting pattern as the Bluesky + publisher. +- **The "under 1 minute" cold-pull headline.** Today's cold pull of all 55 + datasets is ~3 min. 
Caching and partial-clone improvements get it + closer to the headline. + +These are tracked as gaps so the rest of the document can be specific +about what *does* work today. ## 🤖 Build a newsbot with Claude Code diff --git a/actions/govbot/Cargo.toml b/actions/govbot/Cargo.toml index 6eb86717..2d3f398a 100644 --- a/actions/govbot/Cargo.toml +++ b/actions/govbot/Cargo.toml @@ -2,7 +2,7 @@ name = "govbot" version = "0.1.0" edition = "2021" -description = "Streaming pipeline log government events from distributed government data" +description = "4-tool civic-data stack: pull real legislative data, filter by what you care about (fastclass tagging), publish with receipts (RSS/HTML/JSON/DuckDB/Bluesky), all from a coding-agent-native dev experience designed to run at nearly-free cost on commodity infrastructure." authors = ["sartaj"] [lib] diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 151a43c5..d5215506 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -1,3 +1,49 @@ +//! # govbot — a 4-tool civic-data publishing stack +//! +//! govbot exists so a small activist crew can run a credible legislative +//! news bot at **nearly-free** cost on commodity infrastructure (GitHub +//! Actions + a laptop with local models). The first user is the +//! `climate-activist` userland repo; the success bar is "Bluesky posts +//! worth reading, nearly free to run/improve". +//! +//! The stack is four composable tools: +//! +//! 1. **Select real gov data** — `govbot pull` clones the legislation of +//! all 50 states, DC, the territories, and federal Congress from a +//! content-addressed registry of git repos (scrapers thanks to +//! OpenStates). Today `govbot source --select docs` projects bill +//! text + subjects; sponsors and voting records are captured in the +//! underlying metadata but not yet in the docs projection. +//! 2. **Filter / transform** — fastclass tagging is the shipped +//! 
transform: a low-token, high-quality classifier the activist +//! tunes against their own issue taxonomy, piped over the stream +//! protocol (see `schemas/STREAM_PROTOCOL.md`). The planned +//! `summarize` transform — a local-LLM digest of grouped bills +//! emitted with a deterministic trace (model id + source bill ids + +//! prompt revision) — is not yet built. +//! 3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a +//! Bluesky posting bot ship today. The defining roadmap idea is the +//! **receipt**: a GitHub Pages artifact that carries the +//! deterministic provenance behind every AI digest (model used, +//! source bills, fastclass reasoning, regen command) so the short +//! Bluesky post can link to a trustworthy long form. The AI digest +//! publisher and the receipt artifact are not yet built; the X +//! publisher is not yet built. +//! 4. **Coding-agent-native dev experience** — `AGENT.md` is a self- +//! contained playbook a fresh Claude Code session can follow to +//! make / manage / update a govbot project. The fastclass plugin +//! (`/fastclass:from-intent`, `/fastclass:improve`, +//! `/fastclass:ratify`, `/fastclass:install-model`) handles the +//! classifier loop; `govbot doctor` validates installations. The +//! "build your own govbot" path is the one tool already shipping +//! its vision. +//! +//! This binary is the gov-data CLI piece of the stack. It owns dataset +//! pull/cache/lock, the stream-protocol `source` and `apply` stages, +//! the manifest-driven `run` orchestrator, and the publisher set above. +//! Classification is intentionally a separate binary (`fastclass`) so +//! the activist can tune the taxonomy without touching this code. 
+ use clap::{Parser, Subcommand}; use futures::stream; use futures::StreamExt; @@ -66,7 +112,7 @@ struct DatasetPin { #[derive(Parser, Debug)] #[command(name = "govbot")] #[command( - about = "Government-data tool: pull dataset repositories, run transforms over them, and publish artifacts (RSS / HTML / JSON / DuckDB / Bluesky). Configured by a `govbot.yml` manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook." + about = "govbot — a 4-tool civic-data publishing stack. (1) Select real gov data: pull the legislation of all 50 states, DC, territories, and federal Congress from a content-addressed dataset registry. (2) Filter/transform: run transforms over the stream — fastclass tagging today, local-LLM summarize on the roadmap. (3) Publish with receipts: RSS / HTML / JSON / DuckDB / Bluesky today, plus a roadmap GitHub Pages 'receipts' artifact that carries deterministic provenance behind every AI digest. (4) Coding-agent-native dev experience: AGENT.md walks Claude Code through make / manage / update of a project. Configured by a govbot.yml manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook, README for the honest gap map." )] #[command(version)] struct Args { diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 61cadd78..3490d712 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -1,12 +1,13 @@ --- source: tests/cli_example_snaps.rs +assertion_line: 221 expression: "&formatted_stdout" --- Command: govbot --help Output: -Government-data tool: pull dataset repositories, run transforms over them, and publish artifacts (RSS / HTML / JSON / DuckDB / Bluesky). Configured by a `govbot.yml` manifest (datasets / transforms / publish / pipelines). 
See AGENT.md for the end-user playbook. +govbot — a 4-tool civic-data publishing stack. (1) Select real gov data: pull the legislation of all 50 states, DC, territories, and federal Congress from a content-addressed dataset registry. (2) Filter/transform: run transforms over the stream — fastclass tagging today, local-LLM summarize on the roadmap. (3) Publish with receipts: RSS / HTML / JSON / DuckDB / Bluesky today, plus a roadmap GitHub Pages 'receipts' artifact that carries deterministic provenance behind every AI digest. (4) Coding-agent-native dev experience: AGENT.md walks Claude Code through make / manage / update of a project. Configured by a govbot.yml manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook, README for the honest gap map. Usage: govbot [COMMAND] From 6649acb8e0fab99a04a7de4a4ac0c9e36fb86d84 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sun, 24 May 2026 00:30:13 -0500 Subject: [PATCH 23/32] bluesky: move ledger out of .govbot/ into state/ (publisher state, not cache) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Bluesky publisher's posted-state ledger was living under `.govbot/bluesky-.ledger` — inside the tool's cache directory. But the user's framing (the 'node_modules-equivalent' insight that already moved tag files out to `tags/`) makes the right home obvious: `.govbot/` is regenerable cache, the ledger is user-meaningful operational state. An activist who runs `rm -rf .govbot/` to reset the cache shouldn't lose their post history and start double-posting on the next run. This moves the default ledger destination to `/state/bluesky-.ledger`. The legacy `.govbot/`-rooted path is consulted as a *read-only fallback* on every run so an upgrading project doesn't double-post records it logged under the old path; writes only ever land at the new path. After a full re-run the legacy file becomes harmless and can be removed. 
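A minimal sketch of the read-union / write-new rule described above, assuming a newline-delimited ledger with one record id per line. The names `read_ids`, `already_posted`, and `record_post` are illustrative only; they are not the functions in `src/bluesky.rs`, and error handling is reduced to `io::Result`.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Read record ids from a newline-delimited ledger.
/// A missing file is treated as an empty set (first run).
fn read_ids(path: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect()),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(err) => Err(err),
    }
}

/// Union of the new (`state/`) ledger and the legacy (`.govbot/`) ledger.
/// Both files are only read here; nothing is copied or rewritten.
fn already_posted(new_path: &Path, legacy_path: &Path) -> io::Result<HashSet<String>> {
    let mut ids = read_ids(new_path)?;
    if new_path != legacy_path {
        ids.extend(read_ids(legacy_path)?);
    }
    Ok(ids)
}

/// Append one id to the new ledger only, creating its parent dir if needed.
/// The legacy path is never written.
fn record_post(new_path: &Path, id: &str) -> io::Result<()> {
    if let Some(parent) = new_path.parent() {
        fs::create_dir_all(parent)?;
    }
    let mut file = OpenOptions::new().create(true).append(true).open(new_path)?;
    writeln!(file, "{}", id)
}
```

In this sketch the legacy file stops mattering once every id it holds has also been appended to the new ledger, which is the point at which it can be deleted.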
govbot now has 4 tool-managed top-level dirs with distinct roles: .govbot/ cache (regenerable, safe to rm -rf, never edited) tags/ classification output (govbot apply writes here) state/ publisher state (govbot publish writes here) dist/ publisher output (RSS/HTML/JSON artifacts) Same shape as the tags/-out-of-.govbot/ refactor. Schema bumped, wizard generates the new layout, AGENT.md / CLAUDE.md / README all reflect the 4-dir model. Tests grew (66 vs 62) covering both the new path and the legacy fallback. Backward-compatible: any existing project keeps working. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 21 ++-- CLAUDE.md | 12 ++- README.md | 5 +- actions/govbot/src/bluesky.rs | 196 +++++++++++++++++++++++++++++++++- actions/govbot/src/config.rs | 6 +- actions/govbot/src/wizard.rs | 19 +++- schemas/govbot.schema.json | 2 +- 7 files changed, 239 insertions(+), 22 deletions(-) diff --git a/AGENT.md b/AGENT.md index 4b2a925c..cfb97af9 100644 --- a/AGENT.md +++ b/AGENT.md @@ -211,7 +211,7 @@ publish: # no `html` publisher is configured. base_url: "https://.github.io/" post_template: "{title}\n\n{tags} · {link}" - # ledger: .govbot/bluesky-bluesky.ledger # default; tracks posted bills + # ledger: state/bluesky-bluesky.ledger # default; tracks posted bills feed: type: rss # writes /feed.xml (only) @@ -393,13 +393,19 @@ Project layout: - `.env` — Bluesky credentials (git-ignored; see `.env.example`) Tool-managed dirs (all git-ignored by default): -- `.govbot/` — the tool's CACHE (cloned datasets, ledgers); the +- `.govbot/` — the tool's CACHE (cloned datasets, sync state); the `node_modules/` equivalent. Never edit by hand; `rm -rf .govbot/` is always safe. - `tags/` — classification OUTPUT from `govbot apply` (`tags//country:.../sessions//.tag.json`). Remove `tags/` from `.gitignore` if you want classification provenance committed. +- `state/` — publisher STATE from `govbot publish` (e.g. the + bluesky publisher's posted-state ledger, + `state/bluesky-.ledger`). 
Regenerable-but- + operational: deleting it makes the next run + double-post. Remove `state/` from `.gitignore` to + commit post history and let cold clones resume. - `dist/` / `docs/` — publisher output from `govbot publish`. To tune the classifier, use the fastclass plugin: `/fastclass:improve`. @@ -528,7 +534,7 @@ Under `govbot.yml: publish:` (see the template in §1.3): | `min_score` | minimum calibrated `final_score` (0..1) to post; default `0.6` | | `base_url` | fallback prefix for `{link}` when no companion `html` publisher is configured; same shape as the rss/html publishers' `base_url` | | `post_template` | post text; placeholders `{title} {tags} {link} {identifier} {session} {score}`; truncated to 300 chars | -| `ledger` | posted-state ledger path; default `.govbot/bluesky-.ledger` | +| `ledger` | posted-state ledger path; default `state/bluesky-.ledger` (peer to `tags/` and `dist/`; NOT under `.govbot/`, which is the tool's cache). A ledger at the legacy `.govbot/bluesky-.ledger` path is read as a fallback so upgrades don't lose history. | `{link}` resolves in this order: (1) the manifest's `html` publisher's `base_url` — the **human-readable landing page** activists actually click @@ -602,14 +608,15 @@ jobs: # Commit the ledger back so re-runs stay idempotent across CI runs: - name: Persist the posted-state ledger run: | - git add -f .govbot/*.ledger || true + git add -f state/*.ledger || true git commit -m "newsbot: update posted-state ledger" || true git push || true ``` -In CI the `.govbot/` ledger is ephemeral unless persisted — commit the -`*.ledger` file back (as above) or store it in a cache/artifact, or the bot -will re-post on every run. +In CI the `state/` ledger is ephemeral unless persisted — commit the +`*.ledger` file back (as above; you'll also want to remove the `state/` +line from `.gitignore` so the commit isn't a force-add forever) or store +it in a cache/artifact, or the bot will re-post on every run. 
--- diff --git a/CLAUDE.md b/CLAUDE.md index 04739bfe..a6254443 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -219,14 +219,20 @@ A govbot project has three top-level tool-managed dirs, each with a distinct role; do not conflate them: - **`.govbot/`** — the tool's **cache**, the `node_modules/` equivalent. - Cloned datasets, content-addressed sync state, ledgers, an optional - registry override. Fully regenerable; safe to `rm -rf` to start fresh. - **Never edited by hand, never written to by `apply`.** + Cloned datasets, content-addressed sync state, an optional registry + override. Fully regenerable; safe to `rm -rf` to start fresh. + **Never edited by hand, never written to by `apply`.** It does NOT + hold user-meaningful state — the bluesky publisher's posted-state + ledger lives under `state/`, not here. - **`tags/`** — **classification output**, written by `govbot apply`. The layout mirrors the source path with a dataset prefix: `tags//country:.../state:.../sessions//.tag.json`. Regenerated by every classify run; the dataset prefix is what isolates same-named tag files across jurisdictions in a multi-dataset project. +- **`state/`** — **publisher state**, written by `govbot publish`. The + bluesky publisher's posted-state ledger lives at + `state/bluesky-.ledger`. Regenerable-but-operational: deleting + it makes the next run double-post. Peer of `tags/` and `dist/`. - **`dist/`** — **publisher output**, written by `govbot publish` (RSS / HTML / JSON feeds). diff --git a/README.md b/README.md index 34101b8c..6a5792ad 100644 --- a/README.md +++ b/README.md @@ -187,11 +187,14 @@ role; all are git-ignored by default: | Dir | Owner | Contents | |---|---|---| -| `.govbot/` | the tool's **cache** (`node_modules/` equivalent) | cloned datasets, ledgers, sync state. Fully regenerable. Never edit. | +| `.govbot/` | the tool's **cache** (`node_modules/` equivalent) | cloned datasets, sync state. Fully regenerable. Never edit. 
| | `tags/` | `govbot apply` (**classification output**) | `tags//country:.../sessions//.tag.json` | +| `state/` | `govbot publish` (**publisher state**) | append-only ledgers (e.g. bluesky's posted-state at `state/bluesky-.ledger`). Regenerable but operational — deleting it double-posts on the next run. | | `dist/` (or `docs/`) | `govbot publish` (**publisher output**) | RSS / HTML / JSON feeds | Remove `tags/` from `.gitignore` to commit classification provenance. +Remove `state/` from `.gitignore` to commit publisher state (e.g. so a +cold CI clone resumes without double-posting). # 🏛️ Govbot Legislation Data Catalogs diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index e67a5494..945c91c9 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -74,8 +74,11 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { let min_score = p.resolved_min_score(); // Resolve the ledger path (project-dir relative). Default: a per-publisher - // file under `.govbot/`. + // file under `state/`. The legacy `.govbot/`-rooted path is consulted as + // a read-only fallback for projects that ran a pre-fix govbot, so a + // version bump doesn't lose post history; see `resolve_ledger_path`. let ledger_path = resolve_ledger_path(job); + let legacy_path = legacy_ledger_path(job); // Select records: a `select`ed tag must clear the calibrated threshold. // @@ -112,8 +115,17 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { return Ok(()); } - // Idempotency: drop records already in the posted-state ledger. - let already_posted = read_ledger(&ledger_path)?; + // Idempotency: drop records already in the posted-state ledger. The set + // is the union of the new (`state/`) ledger and the legacy (`.govbot/`) + // ledger so an upgrading project doesn't double-post records it logged + // under the old path. 
Writes only land at the new path; the legacy file + // becomes harmless once a full re-run has copied its contents forward. + let mut already_posted = read_ledger(&ledger_path)?; + if ledger_path != legacy_path { + for id in read_ledger(&legacy_path)? { + already_posted.insert(id); + } + } let pending: Vec<&RenderedPost> = posts .iter() .filter(|post| !already_posted.contains(&post.id)) @@ -391,8 +403,27 @@ fn truncate_post(text: &str) -> String { // ============================================================ /// Resolve the ledger file path: the publisher's `ledger` field if set, -/// else `/.govbot/bluesky-.ledger`. Relative paths resolve +/// else `/state/bluesky-.ledger`. Relative paths resolve /// against the project directory (where `govbot.yml` lives). +/// +/// **Why `state/` and not `.govbot/`.** `.govbot/` is the tool's cache — +/// the `node_modules/` equivalent — and is safe to `rm -rf` to start +/// fresh. The posted-state ledger is the opposite: it is the +/// **single source of truth** for which records the bot has already +/// posted; deleting it makes the next run double-post everything. Putting +/// it under `.govbot/` invited exactly that footgun. `state/` is the +/// peer of `tags/` (classification output) and `dist/` (publisher +/// output) — an operational, non-cache dir that scales as more stateful +/// publishers land (a future `mastodon` publisher would put its ledger +/// at `state/mastodon-.ledger`). +/// +/// **Backward compatibility.** Writes always land at the new +/// `state/...` path. Reads check there first; if the file is missing, +/// they fall back to the legacy `.govbot/bluesky-.ledger` so +/// existing projects don't lose post history on upgrade. After one full +/// re-run the new ledger has everything the old one did, and the user +/// (or a future `govbot migrate`) can delete the legacy file. See +/// `read_ledger` / `legacy_ledger_path`. 
fn resolve_ledger_path(job: &PublishJob) -> PathBuf { match &job.publisher.ledger { Some(p) => { @@ -405,11 +436,20 @@ fn resolve_ledger_path(job: &PublishJob) -> PathBuf { } None => job .project_dir - .join(".govbot") + .join("state") .join(format!("bluesky-{}.ledger", job.name)), } } +/// The legacy `.govbot/`-rooted ledger path. Read-only fallback for +/// projects that ran a pre-fix govbot; never written. See the doc +/// comment on `resolve_ledger_path` for the migration story. +fn legacy_ledger_path(job: &PublishJob) -> PathBuf { + job.project_dir + .join(".govbot") + .join(format!("bluesky-{}.ledger", job.name)) +} + /// Read the set of already-posted record ids from the ledger. A missing /// ledger is an empty set (first run). The ledger is append-only, /// newline-delimited, one record id per line. @@ -790,4 +830,150 @@ mod tests { post.text ); } + + // ------------------------------------------------------------ + // Ledger-path regression tests (Bug: ledger in `.govbot/`) + // ------------------------------------------------------------ + + use crate::config::{Publisher, PublisherKind}; + use tempfile::tempdir; + + /// Build a minimal bluesky `Publisher` with `ledger = None` so the + /// default-path resolution is exercised. + fn bluesky_publisher_default() -> Publisher { + Publisher { + kind: PublisherKind::Bluesky, + select: None, + base_url: None, + output_dir: None, + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + } + } + + fn job_for_publisher<'a>( + name: &'a str, + publisher: &'a Publisher, + project_dir: PathBuf, + ) -> PublishJob<'a> { + PublishJob { + name, + publisher, + entries: vec![], + output_dir_override: None, + output_file_override: None, + project_dir, + dry_run: false, + html_entry_url: None, + } + } + + /// The default ledger path lands under `state/`, NOT `.govbot/`. 
+ /// `.govbot/` is the tool's regenerable cache (node_modules/-style); + /// the ledger is user-meaningful state — deleting `.govbot/` to + /// reset the cache must not destroy post history. + #[test] + fn default_ledger_path_lives_under_state_not_govbot_cache() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + let resolved = resolve_ledger_path(&job); + assert_eq!( + resolved, + dir.path().join("state").join("bluesky-bluesky.ledger"), + "default ledger must be /state/bluesky-.ledger, not under .govbot/" + ); + // Cross-check: it must NOT be under the cache dir. + assert!( + !resolved.starts_with(dir.path().join(".govbot")), + "default ledger must never resolve under .govbot/ (the cache); got: {}", + resolved.display() + ); + } + + /// An explicit `ledger:` field in `govbot.yml` is honoured verbatim + /// (relative to the project dir) — including absolute paths — so a + /// user who deliberately wants a specific location can pin it. + #[test] + fn explicit_ledger_field_overrides_default() { + let dir = tempdir().unwrap(); + let mut p = bluesky_publisher_default(); + p.ledger = Some("custom/posted.ledger".to_string()); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + assert_eq!( + resolve_ledger_path(&job), + dir.path().join("custom/posted.ledger") + ); + + // Absolute paths pass through untouched. + let abs = dir.path().join("abs.ledger"); + p.ledger = Some(abs.to_string_lossy().to_string()); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + assert_eq!(resolve_ledger_path(&job), abs); + } + + /// Backward-compat: an existing pre-fix ledger at the legacy + /// `.govbot/bluesky-.ledger` path is read so upgrading users + /// don't lose their post history. `read_ledger` is the unit-level + /// surface; `run_bluesky` unions the two on read. 
+ #[test] + fn legacy_govbot_ledger_is_readable_as_fallback() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + // Seed only the legacy path; new path stays absent. + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, "wy-legislation/.../HB9999\n").unwrap(); + + // The new path resolves under state/ and has no file yet — the + // primary read is empty, the legacy read carries the history. + let new_path = resolve_ledger_path(&job); + assert!(!new_path.exists()); + assert!(read_ledger(&new_path).unwrap().is_empty()); + + let legacy_seen = read_ledger(&legacy).unwrap(); + assert!( + legacy_seen.contains("wy-legislation/.../HB9999"), + "legacy ledger must be readable so upgrades preserve post history" + ); + } + + /// Writes always land at the *new* path even when a legacy ledger + /// exists — so the legacy file becomes harmless after one full + /// re-run and the user (or a future `govbot migrate`) can delete it. + #[test] + fn appends_land_at_new_path_not_legacy() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + // Pre-populate the legacy ledger to simulate an upgrading project. + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, "old-id\n").unwrap(); + let legacy_before = std::fs::read_to_string(&legacy).unwrap(); + + // Append via the resolved (new) path — the production code path. + let new_path = resolve_ledger_path(&job); + append_ledger(&new_path, "new-id").unwrap(); + + // New path now holds the new id. + let new_contents = std::fs::read_to_string(&new_path).unwrap(); + assert!(new_contents.contains("new-id")); + // Legacy is untouched — we never write there. 
+ let legacy_after = std::fs::read_to_string(&legacy).unwrap(); + assert_eq!( + legacy_before, legacy_after, + "writes must never land at the legacy .govbot/ ledger path" + ); + // The new path is under state/, not .govbot/. + assert!(new_path.starts_with(dir.path().join("state"))); + } } diff --git a/actions/govbot/src/config.rs b/actions/govbot/src/config.rs index 769c10b7..6bf1e808 100644 --- a/actions/govbot/src/config.rs +++ b/actions/govbot/src/config.rs @@ -134,7 +134,11 @@ pub struct Publisher { /// Path to the append-only posted-state ledger that makes the publisher /// idempotent — re-runs never double-post. Relative to the project - /// directory; defaults to `.govbot/bluesky-.ledger`. + /// directory; defaults to `state/bluesky-.ledger` (peer to + /// `tags/` and `dist/`; NOT under `.govbot/`, which is the tool's + /// regenerable cache). On upgrade, a legacy + /// `.govbot/bluesky-.ledger` is read as a fallback so post + /// history survives; writes always land at the new path. #[serde(default)] pub ledger: Option, diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index 87d62884..4d61b93a 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -290,11 +290,11 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { /// Write .gitignore with govbot's generated dirs and secret-bearing files. /// -/// Everything under `.govbot/` (cloned datasets, ledgers, lockfile state), +/// Everything under `.govbot/` (cloned datasets, sync state — the cache), /// every publisher output dir (`dist/`, `docs/`), the classification-output -/// dir `tags/`, and any local `.env` is untracked. The userland repo is a -/// few dozen text files plus tool artifacts; the artifacts never belong in -/// git. +/// dir `tags/`, the operational-state dir `state/`, and any local `.env` is +/// untracked. The userland repo is a few dozen text files plus tool +/// artifacts; the artifacts never belong in git. 
/// /// **`tags/` trade-off.** `govbot apply` writes per-tag `.tag.json` files /// under `tags//country:.../sessions//`. The file count grows @@ -302,6 +302,13 @@ pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { /// so it is git-ignored by default. Users who want classification /// provenance committed (e.g. for offline review or auditability) can /// remove the `tags/` line from this file. +/// +/// **`state/` trade-off.** The bluesky publisher writes its posted-state +/// ledger under `state/bluesky-.ledger` — the append-only record of +/// which bills have already been posted. Ignored by default to keep the +/// repo clean; remove the `state/` line to commit the post history and +/// let a cold clone (e.g. a fresh CI runner) resume without double-posts. +/// Same regenerable-but-operational shape as `tags/`. pub fn write_gitignore(cwd: &Path) -> Result<()> { let gitignore_path = cwd.join(".gitignore"); // Single canonical block — easy to grep, easy to update. @@ -313,6 +320,10 @@ docs/ # Classification output from `govbot apply` — regenerated each run. # Remove this line if you want classification provenance committed. tags/ +# Publisher state — append-only ledgers (e.g. bluesky's posted-state). +# Regenerable-but-operational: deleting it makes the next run double-post. +# Remove this line to commit post history and let cold clones resume cleanly. +state/ # Secrets — never commit .env diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json index 4b4db0fd..6b75b45b 100644 --- a/schemas/govbot.schema.json +++ b/schemas/govbot.schema.json @@ -140,7 +140,7 @@ "default": 0.6 }, "ledger": { - "description": "Bluesky publisher only. Path to the append-only posted-state ledger that makes the publisher idempotent -- it records the id of every record posted so re-runs never double-post. Relative paths resolve against the project directory. Defaults to '.govbot/bluesky-.ledger'.", + "description": "Bluesky publisher only. 
Path to the append-only posted-state ledger that makes the publisher idempotent -- it records the id of every record posted so re-runs never double-post. Relative paths resolve against the project directory. Defaults to 'state/bluesky-.ledger' (peer to 'tags/' and 'dist/'); NOT under '.govbot/', which is the tool's regenerable cache. If a ledger file exists at the legacy '.govbot/bluesky-.ledger' location from a pre-fix install, it is read as a fallback so post history survives the upgrade; writes always land at the new path.", "type": "string" }, "post_template": { From 3682760de6f354a8c783698e17159245b9d8d341 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sun, 24 May 2026 00:36:19 -0500 Subject: [PATCH 24/32] docs: name /fastclass:improve autonomous as activist post-ratify default MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fastclass shipped autonomous mode (the no-ratify apply path gated by constitution-passes + rolling-coverage-gap proofs). The cross-domain pickup for govbot: name it as the activist-facing default after the first ratification pass, in the activist-facing slash-command form (`/fastclass:improve autonomous`), not the raw CLI form (`fastclass classify --promote ... --autonomous`). - AGENT.md §2 callout: lead with the slash-command form, keep the gate semantics (constitution sovereign; rolling re-test for coverage gaps; precision-regression always refuses), and tell the reader when to drop back to the reviewed path (new tag, ratification sweep after a flurry of autonomous lands). - AGENT.md §3 step 3: show both `/fastclass:improve` and `/fastclass:improve autonomous` side by side so the activist knows which form to run for each situation. - AGENT.md §"three jobs" nav blurb: switch from "fastclass's `--autonomous` mode" to the slash-command form for consistency. 
- CLAUDE.md "Classifying with fastclass": one-sentence pointer to the autonomous form + the `generated_by: autonomous-coverage-gap` lock marker, so a senior engineer reading the contributor guide knows the loop has a beginner-default path. README.md untouched — its #2 and "Classifying with fastclass" sections defer the improvement loop to AGENT.md; adding an autonomous-mode mention there would surface a concept the README doesn't introduce. A fresh activist reading AGENT.md §2 today can answer: "I ratified once, what do I run now?" — `/fastclass:improve autonomous`, with the constitution still sovereign and the audit trail intact. cargo test --offline: 66 passing. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 59 +++++++++++++++++++++++++++++++++++++------------------ CLAUDE.md | 7 ++++++- 2 files changed, 46 insertions(+), 20 deletions(-) diff --git a/AGENT.md b/AGENT.md index cfb97af9..6b687af7 100644 --- a/AGENT.md +++ b/AGENT.md @@ -54,9 +54,9 @@ A user comes to you for one of three things. Identify which, then jump to that section. Each job exercises the 4-tool stack from a different angle: **make** scaffolds the pull+filter+publish chain (today's MVP — does NOT yet scaffold a summarize transform or a receipts page, neither of which -exists); **manage** keeps the loop running and introduces fastclass's -`--autonomous` mode after first ratification (the activist-default for -hands-off improvement); **update** evolves the stack. +exists); **manage** keeps the loop running and introduces +`/fastclass:improve autonomous` after first ratification (the +activist-default for hands-off improvement); **update** evolves the stack. | Job | The user says… | Section | |---|---|---| @@ -491,22 +491,37 @@ The `bluesky` publisher is a **posting bot**: it posts to a normal Bluesky account via the AT Protocol and runs to completion (no server). It is idempotent — a posted-state ledger keeps re-runs from double-posting. 
-**Activist default after first ratification: `--autonomous`.** Once the +**Activist default after first ratification: autonomous mode.** Once the activist has ratified one classifier proposal end-to-end (so they have -felt the loop once) — the `--autonomous` flag on -`fastclass classify --promote` becomes the recommended ongoing posture. -With `--autonomous`, proposals that pass the frozen constitution gate -apply as usual, and proposals where the constitution is silent -(coverage gap) re-test against the rolling eval set and land if rolling -proves them safe (flips at least one rolling failure to passing, -regresses nothing, no per-tag precision loss). The `fastclass.lock` -file marks autonomously-applied proposals with -`generated_by: autonomous-coverage-gap`, so the audit trail is -preserved — the receipt story extends into the classifier. This is the -mode that lets the activist crew run the bot **hands-off between -ratifications** without giving up provenance, which is the whole reason -the cost story is "nearly free to operate". See §3 for the per-proposal -flow you walked through first. +felt the loop once and seen what the constitution gate does) — the +beginner-default ongoing posture is the **autonomous** form of the +improvement loop, invoked as: + +``` +/fastclass:improve autonomous +``` + +Under the hood this runs `fastclass classify --promote .yml +--autonomous`. The constitution stays sovereign: proposals that pass the +frozen constitution gate apply as usual, and proposals where the +constitution is silent (a **coverage gap** — the gate cannot prove the +change good or bad) re-test against the rolling eval set and land only +if rolling proves them safe (flips at least one rolling failure to +passing, regresses no rolling case, no per-tag precision loss). 
A +bad-fix reject (counts moved on the constitution but F1 did not improve) +or any rolling regression always refuses — the rolling gate is strictly +weaker than the constitution and cannot overrule a precision-regression +reject. The `fastclass.lock` file marks autonomously-applied proposals +with `generated_by: autonomous-coverage-gap`, so the audit trail is +preserved — the receipt story extends into the classifier. + +This is the mode that lets the activist crew run the bot **hands-off +between ratifications** without giving up provenance, which is the whole +reason the cost story is "nearly free to operate". Use it as the +ongoing-improvement default; drop back to the reviewed +`/fastclass:improve` path (§3) when you want to see and ratify a +specific proposal — e.g. when widening scope with a new tag, or after a +flurry of autonomous coverage-gap lands you want to read through. ### 2.1 Create the app password @@ -668,8 +683,14 @@ each change against the frozen gold set. drafts a proposal under `classifier/proposals/`, and is the supported way to tune the bundle: ``` - /fastclass:improve + /fastclass:improve # reviewed: you ratify each proposal + /fastclass:improve autonomous # hands-off: constitution-passing applies, + # coverage-gap re-tests against rolling ``` + Use the reviewed form for the first pass (so you see what the gate does) + and when widening scope with a new tag; switch to `autonomous` as the + ongoing default once you've felt the loop — see §2's autonomous-mode + callout for the gate semantics. 4. **Backtest** the proposal — proves it against the frozen constitution: ```bash fastclass classify --backtest classifier/proposals/prop-0001.yml classifier=./classifier diff --git a/CLAUDE.md b/CLAUDE.md index a6254443..0a4539c1 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -247,7 +247,12 @@ classifier bundle's `classifier.yml` answers *"what's relevant"*. 
 To run the self-improving loop, work inside the classifier bundle directory and
 use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`)
 and the fastclass `classify --eval` / `--backtest` / `--promote` primitives. The
-retired `fastclass --propose` flag no longer exists.
+retired `fastclass --propose` flag no longer exists. For activists who have
+ratified one proposal end-to-end, `/fastclass:improve autonomous` becomes the
+ongoing default — constitution-passing proposals apply as usual, coverage-gap
+proposals re-test against the rolling eval set and land only if rolling proves
+them safe (`generated_by: autonomous-coverage-gap` in `fastclass.lock`).
+AGENT.md §2 carries the activist-facing framing.
 
 **Prerequisite**: the `fastclass` binary must be resolvable on `PATH`,
 `~/.cargo/bin`, or `~/.govbot/bin` (`cargo install --path `).

From 6fd614f6da8f5b8a7ec0734deaf8a7a8c9235138 Mon Sep 17 00:00:00 2001
From: Sartaj
Date: Sun, 24 May 2026 17:46:38 -0500
Subject: [PATCH 25/32] docs: switch fastclass CLI examples to compile umbrella

The fastclass `compile evaluate|backtest|ratify` sub-subcommands replaced the
old `classify --eval|--backtest|--promote` flag forms. Deprecation aliases keep
the old shape working, but all govbot docs should teach the new shape going
forward.
Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 12 ++++++------ CLAUDE.md | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/AGENT.md b/AGENT.md index 6b687af7..727bf2f9 100644 --- a/AGENT.md +++ b/AGENT.md @@ -96,7 +96,7 @@ govbot publish # run the manifest's publishers govbot run # the full pipeline: pull -> source|classify|apply -> publish fastclass classify - # score a JSON-Lines doc stream from stdin fastclass describe classifier= # print a bundle's tags + interface -fastclass classify --eval / --backtest / --promote # the tuning primitives +fastclass compile evaluate / backtest / ratify # the tuning primitives ``` Datasets are resolved at runtime through a **dataset registry** — an @@ -501,7 +501,7 @@ improvement loop, invoked as: /fastclass:improve autonomous ``` -Under the hood this runs `fastclass classify --promote .yml +Under the hood this runs `fastclass compile ratify .yml --autonomous`. The constitution stays sovereign: proposals that pass the frozen constitution gate apply as usual, and proposals where the constitution is silent (a **coverage gap** — the gate cannot prove the @@ -672,8 +672,8 @@ each change against the frozen gold set. 1. **Measure** where the classifier stands: ```bash - fastclass classify --eval constitution classifier=./classifier - fastclass classify --eval rolling classifier=./classifier + fastclass compile evaluate --eval constitution classifier=./classifier + fastclass compile evaluate --eval rolling classifier=./classifier ``` 2. **Find misses.** Add bills the classifier gets wrong to `classifier/eval/rolling.yml` with their correct `expected_tags`. To widen @@ -693,11 +693,11 @@ each change against the frozen gold set. callout for the gate semantics. 4. 
**Backtest** the proposal — proves it against the frozen constitution: ```bash - fastclass classify --backtest classifier/proposals/prop-0001.yml classifier=./classifier + fastclass compile backtest classifier/proposals/prop-0001.yml classifier=./classifier ``` 5. **Promote** a passing proposal into the bundle: ```bash - fastclass classify --promote classifier/proposals/prop-0001.yml classifier=./classifier + fastclass compile ratify classifier/proposals/prop-0001.yml classifier=./classifier ``` 6. **Re-run** the bot: `govbot run`. diff --git a/CLAUDE.md b/CLAUDE.md index 0a4539c1..ae9ecaa8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -246,7 +246,7 @@ classifier bundle's `classifier.yml` answers *"what's relevant"*. To run the self-improving loop, work inside the classifier bundle directory and use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`) -and the fastclass `classify --eval` / `--backtest` / `--promote` primitives. The +and the fastclass `compile evaluate` / `compile backtest` / `compile ratify` primitives. The retired `fastclass --propose` flag no longer exists. For activists who have ratified one proposal end-to-end, `/fastclass:improve autonomous` becomes the ongoing default — constitution-passing proposals apply as usual, coverage-gap From 5679c021082add35b08f6dcc477cc68e3a1aeb8c Mon Sep 17 00:00:00 2001 From: Sartaj Date: Sun, 24 May 2026 22:01:02 -0500 Subject: [PATCH 26/32] run: scope pipeline source step by manifest datasets (--repos) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Scenario A — the pipeline silently widened the source walk. Empirically, `govbot run` with `datasets: [wy]` in a project whose `.govbot/repos/` had 52 datasets cached classified ~4900 records across every state instead of the ~100 the manifest declared. 
`run_publish_command` already passed `--repos` to its publish-time source spawn (main.rs:2492-2497), and `run_source_command` itself scopes by `--repos` correctly (verified: `--repos wy` → 100, `--repos wy il ca` → 300, no `--repos` → 4916). The defect was in the classify pipeline's own source spawn — `pipeline::run_transform_dag` invoked `govbot source --select docs` and never appended `--repos`, so the manifest's `datasets:` was load-bearing for `pull` but ignored downstream at classify. Fix: translate `manifest.datasets` to a `--repos` argv (`[all] →` flag omitted, matching source's "every linked dataset" sentinel; any other list passed verbatim) and thread it through `run_transform_dag`. The manifest's `datasets:` mental model is now coherent end-to-end: pull, classify, and publish all honour the same scope. Tests (69 total, +3): * unit (`pipeline.rs`): `source_repos_from_manifest` translates `[all]` to empty and any other list verbatim. * integration (`tests/run_repos_scope.rs`): against a throwaway two-dataset corpus, `govbot source --select docs --repos wy` emits only `wy-legislation/...` ids, and the no-`--repos` walk visits both datasets — pinning the source-side invariants the pipeline relies on. Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/pipeline.rs | 74 +++++++++- actions/govbot/tests/run_repos_scope.rs | 184 ++++++++++++++++++++++++ 2 files changed, 256 insertions(+), 2 deletions(-) create mode 100644 actions/govbot/tests/run_repos_scope.rs diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index 8cc16739..f61ccdd7 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs @@ -114,10 +114,17 @@ pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>, dry_run: bool) } // Step 2: run the transform DAG (source | transform... | apply). 
+ // + // The source stage must honour the manifest's `datasets:` scope — without + // it, `govbot source` walks every linked dataset under `.govbot/repos/` + // (which can include datasets pulled by an earlier `datasets: [all]` run + // and never cleaned up), classifying tens of thousands of records the + // current manifest did not declare. + let source_repos = source_repos_from_manifest(&manifest.datasets); eprintln!(); eprintln!("=== Step 2/3: Running transforms (source | ... | apply) ==="); eprintln!(); - match run_transform_dag(&govbot_bin, &resolved, cwd, govbot_dir) { + match run_transform_dag(&govbot_bin, &resolved, cwd, govbot_dir, &source_repos) { Ok(false) => { eprintln!("⚠️ Transform stage had errors (continuing anyway)"); } @@ -298,13 +305,22 @@ fn run_transform_dag( transforms: &[(String, ResolvedTransform)], cwd: &Path, govbot_dir: Option<&str>, + source_repos: &[String], ) -> Result { - // Stage 0: the source — `govbot source --select docs`. + // Stage 0: the source — `govbot source --select docs`. Scope it to the + // manifest's declared datasets (an empty list means "every linked + // dataset", matching the standalone `govbot source` default). let mut source_cmd = Command::new(govbot_bin); source_cmd.arg("source").arg("--select").arg("docs"); if let Some(d) = govbot_dir { source_cmd.arg("--govbot-dir").arg(d); } + if !source_repos.is_empty() { + source_cmd.arg("--repos"); + for d in source_repos { + source_cmd.arg(d); + } + } let mut source_child = source_cmd .current_dir(cwd) .stdout(Stdio::piped()) @@ -399,10 +415,64 @@ fn dir_has_entries(p: &Path) -> bool { .unwrap_or(false) } +/// Translate the manifest's `datasets:` list to the `--repos` argv that +/// scopes the `govbot source` stage inside `run_transform_dag`. +/// +/// `datasets: [all]` becomes an empty list — `govbot source`'s own sentinel +/// for "every linked dataset", omitted from the argv so the flag is absent. 
+/// Any other list is passed through verbatim; `govbot source --repos ` +/// then walks only the named datasets. +/// +/// This is the load-bearing piece of [`run_pipeline`]'s step 2: forgetting +/// to pass `--repos` here caused a bug in which a manifest declaring +/// `datasets: [wy]` still classified ~4900 records across 52 states because +/// the cache held datasets from an earlier `[all]` pull. +fn source_repos_from_manifest(datasets: &[String]) -> Vec { + if datasets == ["all"] { + Vec::new() + } else { + datasets.to_vec() + } +} + #[cfg(test)] mod tests { use super::*; + /// Regression test for the `datasets:[wy]` scope leak: an `[all]` + /// manifest must produce an empty argv list (so `--repos` is omitted — + /// `govbot source`'s sentinel for "every linked dataset"), but any other + /// list must pass through verbatim so `govbot source --repos ` + /// scopes the walk. Pre-fix, `run_transform_dag` never passed `--repos`, + /// so the manifest's `datasets:` was silently ignored at the source step + /// and a `[wy]` manifest still classified ~4900 records across 52 states. + #[test] + fn source_repos_from_manifest_translates_all_and_scopes() { + assert_eq!( + source_repos_from_manifest(&["all".to_string()]), + Vec::::new(), + "`[all]` must collapse to empty so --repos is omitted" + ); + assert_eq!( + source_repos_from_manifest(&["wy".to_string()]), + vec!["wy".to_string()], + "`[wy]` must pass through verbatim" + ); + assert_eq!( + source_repos_from_manifest(&["wy".to_string(), "il".to_string()]), + vec!["wy".to_string(), "il".to_string()], + "`[wy, il]` must pass through verbatim" + ); + // An `[all, wy]` mix is not the `[all]` sentinel — pass through so + // the source step at least scopes to the named subset (and treats + // the literal `all` as a possibly-missing dataset id, surfacing the + // manifest error rather than silently widening to every dataset). 
+ assert_eq!( + source_repos_from_manifest(&["all".to_string(), "wy".to_string()]), + vec!["all".to_string(), "wy".to_string()], + ); + } + /// `govbot run` should detect a project-local dataset seed /// (`.govbot/repos//`) and skip the cache-touching pull substep. /// We test the detector — the substep skip itself is exercised by the diff --git a/actions/govbot/tests/run_repos_scope.rs b/actions/govbot/tests/run_repos_scope.rs new file mode 100644 index 00000000..bb8360c2 --- /dev/null +++ b/actions/govbot/tests/run_repos_scope.rs @@ -0,0 +1,184 @@ +//! Regression test for the `datasets:[wy]` scope leak. +//! +//! `govbot::pipeline::run_transform_dag` spawns `govbot source --select docs` +//! as the head of the classify pipeline. Pre-fix it never passed `--repos`, +//! so the manifest's `datasets:` was silently ignored at the source step: a +//! manifest declaring `datasets: [wy]` in a project whose `.govbot/repos/` +//! held 52 datasets (left over from an earlier `[all]` pull) classified +//! ~4900 records across every state instead of ~100 Wyoming records. +//! +//! The fix translates `manifest.datasets` to a `--repos ` argv that +//! gets appended to the source spawn. This test pins the two invariants the +//! fix relies on: +//! +//! 1. `govbot source --select docs --repos ...` against a +//! multi-dataset cache emits records only from the named dataset(s). +//! This is the source-side scoping the pipeline relies on — if it ever +//! regresses, the pipeline's `--repos` plumbing is moot. +//! 2. Omitting `--repos` walks every linked dataset — the "every dataset" +//! sentinel `source_repos_from_manifest` produces for a `[all]` +//! manifest, so the pipeline can keep treating absence as "all". +//! +//! Together with the `source_repos_from_manifest` unit test in +//! `pipeline.rs` (which pins the manifest→argv translation), these +//! invariants regression-test the full fix path. 
+ +use std::fs; +use std::path::PathBuf; +use std::process::Command; + +/// Path to the freshly-built `govbot` binary. Mirrors the helper in +/// `cli_example_snaps.rs` but kept local so the two integration test +/// binaries stay independent. +fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +/// Build a throwaway `.govbot/repos/` tree with two datasets (`wy`, `gu`), +/// each holding one bill with one log file. Returns the absolute path to the +/// `.govbot` root (the value to pass as `--govbot-dir`). +/// +/// We materialise the on-disk corpus by hand rather than re-using +/// `actions/govbot/mocks/` because the shipped mock's `gu-legislation/` has +/// metadata but no logs — `govbot source` emits per-log entries, so a +/// "did the filter scope to wy" assertion is vacuous if `gu` has no logs to +/// scope away. +fn build_two_dataset_corpus(tmp: &std::path::Path) -> PathBuf { + let repos = tmp.join(".govbot").join("repos"); + + for (dataset, state, bill_id) in [ + ("wy-legislation", "wy", "HB0001"), + ("gu-legislation", "gu", "B1-38"), + ] { + // Layout the walker expects: `country:/state:/sessions//bills//logs/_.json`. + let session = if state == "wy" { "2025" } else { "38th" }; + let bill_dir = repos + .join(dataset) + .join("country:us") + .join(format!("state:{}", state)) + .join("sessions") + .join(session) + .join("bills") + .join(bill_id); + let logs_dir = bill_dir.join("logs"); + fs::create_dir_all(&logs_dir).expect("create logs dir"); + + // A minimal metadata.json — `source --select docs` joins it for the + // doc text. The timestamp on the log filename is what the source + // walker sorts by; the suffix names the action. 
+ fs::write( + bill_dir.join("metadata.json"), + serde_json::json!({ + "title": format!("Test bill {}", bill_id), + "identifier": bill_id, + "subjects": ["test"], + "abstracts": [{"abstract": format!("Body of {}", bill_id)}], + }) + .to_string(), + ) + .expect("write metadata.json"); + + // A "passage" log — substantive under `--filter default`, so the + // record survives the default filter and shows up in `--select docs`. + fs::write( + logs_dir.join("20250129T022703Z_passage.json"), + serde_json::json!({ + "action": "passage", + "bill_id": bill_id, + "date": "2025-01-29", + }) + .to_string(), + ) + .expect("write log"); + } + + tmp.join(".govbot") +} + +/// Collect `govbot source --select docs` stdout against the throwaway +/// corpus, parsed into one JSON value per non-empty line. The `--filter +/// none` keeps every log entry — we want the count to depend only on +/// `--repos` scoping, not on the per-dataset action filters. +fn source_docs(govbot_dir: &std::path::Path, repos: &[&str]) -> Vec { + let bin = govbot_binary(); + let mut cmd = Command::new(&bin); + cmd.arg("source") + .arg("--select") + .arg("docs") + .arg("--filter") + .arg("none") + .arg("--govbot-dir") + .arg(govbot_dir); + if !repos.is_empty() { + cmd.arg("--repos"); + for r in repos { + cmd.arg(r); + } + } + let output = cmd.output().expect("spawn govbot source"); + assert!( + output.status.success(), + "govbot source exited with {:?}\nstderr:\n{}", + output.status.code(), + String::from_utf8_lossy(&output.stderr) + ); + String::from_utf8_lossy(&output.stdout) + .lines() + .filter(|l| !l.trim().is_empty()) + .filter_map(|l| serde_json::from_str::(l).ok()) + .collect() +} + +/// Pin invariant (1): `--repos wy` against a `wy+gu` corpus emits only `wy` +/// records. This is the source-side guarantee the pipeline relies on. 
+#[test] +fn source_with_repos_scopes_to_named_dataset() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let govbot_dir = build_two_dataset_corpus(tmp.path()); + + let wy_only = source_docs(&govbot_dir, &["wy"]); + assert!( + !wy_only.is_empty(), + "wy corpus should emit at least one record" + ); + for record in &wy_only { + let id = record + .get("id") + .and_then(|v| v.as_str()) + .expect("doc record must have a string `id`"); + assert!( + id.starts_with("wy-legislation/"), + "--repos wy leaked a non-wy record: {}", + id + ); + } +} + +/// Pin invariant (2): omitting `--repos` walks every linked dataset. This is +/// what `source_repos_from_manifest(&["all"])` returns (empty list → flag +/// omitted), and the pipeline relies on that translation matching source's +/// own "all" sentinel. +#[test] +fn source_without_repos_walks_every_dataset() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let govbot_dir = build_two_dataset_corpus(tmp.path()); + + let all = source_docs(&govbot_dir, &[]); + let datasets: std::collections::BTreeSet = all + .iter() + .filter_map(|r| r.get("id").and_then(|v| v.as_str())) + .filter_map(|id| id.split('/').next().map(str::to_string)) + .collect(); + assert!( + datasets.contains("wy-legislation") && datasets.contains("gu-legislation"), + "no-`--repos` walk should hit both datasets, got: {:?}", + datasets + ); +} From ddb6679ee3d473c1e4c52fc6c7e912603509bd19 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 00:08:11 -0500 Subject: [PATCH 27/32] publishers: dedup by bill, ledger by bill-id MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A publisher's result stream emits **one record per action-log file** (committee referral, hearing, vote event …), not one per bill. The climate-tracker bluesky-pending list under `datasets: [all]` showed the cost: NV AB1 posted 6×, AK HB53 4×, plus AL SB124, IL SB2456, WY SF0015, CT HB07174 doubled — ~12 of 96 posts were the same bill. 
For an activist who wants a daily digest, the same bill posted six times in a row destroys credibility.

This commit collapses every publisher (bluesky, rss, html, json) to **one item per (jurisdiction, bill_id)**. The collapsing key is the **bill_guid**: the canonical `/.../bills/` path, derived from `sources.log` (strip `/logs/...`, append `/bills/` when the OCD-files session-level layout omits it) or `sources.bill` (strip `/metadata.json`).

The `bluesky` publisher's selection is now dedup-then-filter, by score:

1. group every entry by `bill_guid` (no score filter yet);
2. within each group, keep only logs that clear `min_score` for a selected tag;
3. pick the **highest-scoring** qualifying log as the representative, so the post we render is the strongest log for the bill, not an arbitrary newest one.

This means bluesky needs to see every log per bill (not a pre-dedup'd stream), so `run_publish_command` now skips the global dedup and the default `--limit 100` for bluesky publishers. RSS / HTML / JSON publishers still get the bill-level global dedup; their result is one feed item per bill.

**Ledger migration.** The bluesky ledger key is now the bill-level GUID, so future action logs for an already-posted bill don't trigger a re-post. Pre-fix ledgers held per-log GUIDs; on read each entry is collapsed via `ledger_id_to_bill_key`:

- **Per-bill-log layout** entries already carry `/bills/` before `/logs/`, so stripping yields the new bill key cleanly. Bills posted under that layout don't re-post.
- **Session-level-log layout** entries (the OCD-files common case) end at `/sessions//logs/`; stripping yields the session prefix, which doesn't match the bill key. These bills re-post **once** on the first post-upgrade run, after which the new bill-level GUID lands in the ledger and never re-posts again.
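
The two migration outcomes can be illustrated with a minimal sketch of the collapse. The function body follows the `ledger_id_to_bill_key` added in this patch; the concrete paths (dataset name, session `2025`, bill `AB1`, log filenames) are hypothetical values chosen for illustration only, not entries from a real ledger.

```rust
/// Collapse a ledger id to its bill-level key by cutting at the first
/// `/logs/` segment. Ids without `/logs/` pass through unchanged.
fn ledger_id_to_bill_key(id: &str) -> String {
    match id.split_once("/logs/") {
        Some((prefix, _)) => prefix.to_string(),
        None => id.to_string(),
    }
}

fn main() {
    // Hypothetical example paths (assumed layout, not real ledger data).
    let bill_key = "nv-legislation/country:us/state:nv/sessions/2025/bills/AB1";
    let per_bill_log =
        "nv-legislation/country:us/state:nv/sessions/2025/bills/AB1/logs/a.json";
    let session_level_log =
        "nv-legislation/country:us/state:nv/sessions/2025/logs/a.json";

    // Per-bill-log layout: the prefix is already the bill key, so a legacy
    // entry suppresses the re-post.
    assert_eq!(ledger_id_to_bill_key(per_bill_log), bill_key);

    // Session-level-log layout: the prefix stops at the session, so it does
    // not match the bill key and the bill re-posts once.
    assert_ne!(ledger_id_to_bill_key(session_level_log), bill_key);

    // New-shape entries are already bill-level and are left untouched.
    assert_eq!(ledger_id_to_bill_key(bill_key), bill_key);

    println!("ledger key collapse behaves as described");
}
```

Cutting at `/logs/` is deliberately simple: it never guesses a bill id that the path does not contain, which is why the session-level case costs one extra post instead of risking a wrong match.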
The session-level case is the honest migration cost: recovering the bill_id from a session-level log path alone would be wrong as often as right (filenames don't reliably encode the bill).

Real-data check on the climate-tracker feed: pre-fix the bluesky dry-run emitted 76 log-records carrying duplicates (e.g. KY SB89 5×, WY SF0015 6×); post-fix it emits 72 unique bills with zero duplicates.

Tests: +9, of which 5 are in `bluesky` (per-bill rep selection, score-tie pick, ledger bill-level dedup, legacy per-log GUID compat, bill_guid sanity) and 4 in `publish` (deduplicate_entries collapses N→1 per bill, keeps distinct bills distinct, rss publisher emits one `<item>` per bill, html publisher emits one `
` per bill). Co-Authored-By: Claude Opus 4.7 --- CLAUDE.md | 20 ++ actions/govbot/src/bluesky.rs | 475 +++++++++++++++++++++++++++++++++- actions/govbot/src/main.rs | 37 ++- actions/govbot/src/publish.rs | 209 ++++++++++++++- actions/govbot/src/rss.rs | 115 +++++++- 5 files changed, 823 insertions(+), 33 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index ae9ecaa8..e7d4f526 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -236,6 +236,26 @@ distinct role; do not conflate them: - **`dist/`** — **publisher output**, written by `govbot publish` (RSS / HTML / JSON feeds). +### Publishers dedup by bill, not by action log + +Every publisher (`bluesky`, `rss`, `html`, `json`) emits **one item per +(jurisdiction, bill_id)** — not one per action-log file. A single bill +typically emits many records to the source stream (one per committee +referral, hearing, vote event, …); the publishers collapse them to a +single representative so an activist sees one post per bill in their +feed. Before this fix, the climate-tracker feed posted NV AB1 six +times under six different action logs. + +The dedup key is the **bill-level GUID** (`rss::bill_guid`), of the form +`/.../sessions//bills/`. For `bluesky` the +**ledger key** is also this bill-level GUID: future log additions for an +already-posted bill do **not** trigger a re-post. The publisher reads +legacy per-log ledger entries on upgrade — entries written under the +per-bill-log layout collapse to the new bill key cleanly; entries +written under the session-level-log layout (the OCD-files common case) +incur a one-time re-post per previously-posted bill, after which the +ledger holds the new bill-level GUID and the bill never re-posts again. + **`govbot.yml` is NOT the classifier — it is a manifest.** It declares `datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:` block**. 
The tag taxonomy lives in a separate **fastclass classifier bundle** diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs index 945c91c9..376918b5 100644 --- a/actions/govbot/src/bluesky.rs +++ b/actions/govbot/src/bluesky.rs @@ -41,7 +41,7 @@ use crate::publish::PublishJob; use anyhow::{Context, Result}; use serde_json::{json, Value}; -use std::collections::HashSet; +use std::collections::{HashMap, HashSet}; use std::fs; use std::io::Write; use std::path::{Path, PathBuf}; @@ -80,7 +80,18 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { let ledger_path = resolve_ledger_path(job); let legacy_path = legacy_ledger_path(job); - // Select records: a `select`ed tag must clear the calibrated threshold. + // Dedup-then-filter, by **bill** (jurisdiction, bill_id), not by + // action-log. A bill emits one record per action-log file (committee + // referral, hearing, passage vote …); without this collapse, an + // activist sees the same bill posted N times in a row (NV AB1 6×, + // AK HB53 4× on the climate-tracker feed before the fix). The rule: + // + // 1. group every entry by `bill_guid` (no score filter yet); + // 2. within each group pick the **highest-scoring qualifying log** + // as the representative — so a bill counts when *any* of its + // logs cleared `min_score` for a selected tag, and the post we + // render is the strongest one; + // 3. drop bills whose every log scored under threshold. // // `{link}` resolves with this priority: // 1. the companion `html` publisher's landing-page URL (the human page); @@ -91,10 +102,9 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { // carries an html publisher, route activists to that human page rather // than to the raw JSON that the rss/html publishers' `extract_link` // emits. 
- let posts: Vec = job - .entries - .iter() - .filter(|e| record_clears_threshold(e, &select, min_score)) + let representatives = pick_per_bill_representatives(&job.entries, &select, min_score); + let posts: Vec = representatives + .into_iter() .map(|e| { render_post( e, @@ -120,10 +130,26 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { // ledger so an upgrading project doesn't double-post records it logged // under the old path. Writes only land at the new path; the legacy file // becomes harmless once a full re-run has copied its contents forward. - let mut already_posted = read_ledger(&ledger_path)?; + // + // Both shapes of ledger entry are honoured: + // - **New** (bill-level GUID, `/.../bills/`) — matched + // verbatim against the post's bill-level id. + // - **Legacy** (per-log GUID) — collapsed via + // `ledger_id_to_bill_key` on read. Per-bill-log layout entries + // (`/.../bills//logs/`) suppress re-posts + // cleanly. Session-level-log layout entries + // (`/.../sessions//logs/`) — the OCD-files + // common case — strip to the session prefix and incur a + // documented one-time re-post per previously-posted bill (after + // which the new bill-level GUID is in the ledger). See + // `ledger_id_to_bill_key` for the migration story. + let mut already_posted: HashSet = HashSet::new(); + for id in read_ledger(&ledger_path)? { + already_posted.insert(ledger_id_to_bill_key(&id)); + } if ledger_path != legacy_path { for id in read_ledger(&legacy_path)? { - already_posted.insert(id); + already_posted.insert(ledger_id_to_bill_key(&id)); } } let pending: Vec<&RenderedPost> = posts @@ -236,6 +262,13 @@ pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { /// /// The `tags` field is a map `tag_name -> ScoreBreakdown`; the calibrated /// probability is `tags..final_score` (STREAM_PROTOCOL §5). 
+/// +/// **Note.** The production publisher path now uses +/// [`pick_per_bill_representatives`], which folds this check into a +/// per-group walk so the per-bill dedup can pick the highest-scoring +/// qualifying log as the representative. This standalone predicate is +/// kept as the simplest unit-testable surface for the threshold rule. +#[cfg_attr(not(test), allow(dead_code))] fn record_clears_threshold(entry: &Value, select: &[String], min_score: f64) -> bool { let tags = match entry.get("tags").and_then(|t| t.as_object()) { Some(t) if !t.is_empty() => t, @@ -275,7 +308,12 @@ fn render_post( base_url: Option<&str>, html_entry_url: Option<&str>, ) -> RenderedPost { - let id = crate::rss::extract_guid(entry); + // Ledger key — **bill-level** so future action logs for the same bill + // (new committee referrals, vote events, …) do not re-post the bill. + // Pre-fix this was the per-log GUID, which let a single bill trigger + // N posts as N action logs arrived; the migration story for already- + // posted bills is in `ledger_id_to_bill_key`. + let id = crate::rss::bill_guid(entry); let template = template.unwrap_or(DEFAULT_TEMPLATE); let title = bill_title(entry); @@ -356,6 +394,124 @@ fn top_score(entry: &Value) -> Option { }) } +/// The highest calibrated `final_score` across a record's tags **restricted +/// to `select`**. When `select` is empty, every tag counts; otherwise only +/// the named tags. Returns `None` when no qualifying tag carries a score. +/// +/// This is the score used to rank logs *within a bill group* when picking +/// the representative — so a bill posts under its strongest qualifying log, +/// not (arbitrarily) under its newest one. 
+fn top_selected_score(entry: &Value, select: &[String]) -> Option { + entry + .get("tags") + .and_then(|t| t.as_object()) + .and_then(|tags| { + tags.iter() + .filter(|(name, _)| select.is_empty() || select.iter().any(|s| s == *name)) + .filter_map(|(_, s)| s.get("final_score").and_then(|v| v.as_f64())) + .fold(None, |acc, s| Some(acc.map_or(s, |a: f64| a.max(s)))) + }) +} + +/// Collapse an entry stream to one representative per (jurisdiction, +/// bill_id), filtering and ranking by score. +/// +/// For each bill the bluesky publisher's contract is **one post**. Inputs +/// may carry many entries for the same bill (one per action log); this +/// function: +/// +/// 1. groups by [`crate::rss::bill_guid`]; +/// 2. within each group, **keeps only logs whose top `select`ed score +/// clears `min_score`** — the bill is dropped when no log qualifies; +/// 3. picks the **highest-scoring** qualifying log as the representative +/// — ties break on stream order (the input is timestamp-sorted DESC, +/// so a tie wins for the newest log). +/// +/// Returns the representatives in **input stream order** so a downstream +/// `--limit` keeps the bills the user saw first (the newest, given the +/// upstream DESC sort). +fn pick_per_bill_representatives<'a>( + entries: &'a [Value], + select: &[String], + min_score: f64, +) -> Vec<&'a Value> { + // Map bill_guid -> index into `entries` of the current best representative, + // along with its score. A `Vec` of bill_guid in first-seen order gives + // us a deterministic output order. + let mut best: HashMap = HashMap::new(); + let mut order: Vec = Vec::new(); + + for (i, e) in entries.iter().enumerate() { + // Bill counts when *any* of its logs clears the threshold for a + // selected tag — this filter is per-log, applied during the group + // walk so a bill with zero qualifying logs simply never enters the + // map. 
+ let Some(score) = top_selected_score(e, select) else { + continue; + }; + if score < min_score { + continue; + } + let key = crate::rss::bill_guid(e); + match best.get(&key) { + Some((_, prev_score)) if *prev_score >= score => { + // The current best beats (or ties) this log on score — keep + // the existing winner (preserves stream order on ties, which + // means newest wins since input is DESC). + } + Some(_) => { + best.insert(key, (i, score)); + } + None => { + order.push(key.clone()); + best.insert(key, (i, score)); + } + } + } + + order + .into_iter() + .filter_map(|k| best.get(&k).map(|(i, _)| &entries[*i])) + .collect() +} + +/// Collapse a (possibly per-log, legacy-shape) ledger id to its bill-level +/// key — the new ledger key shape. +/// +/// Pre-fix the ledger held per-log GUIDs of the form +/// `/.../sessions//logs/.json` (session-level layout, +/// the OCD-files common case) or +/// `/.../bills//logs/.json` (per-bill-log layout). +/// Post-fix the writer emits the **bill-level** key — always +/// `/.../bills/` — and the reader compares against that. +/// +/// This function strips `/logs/...` off either shape. Two outcomes: +/// +/// - **Per-bill-log layout** — the prefix already ends in +/// `/bills/`, so the collapse cleanly matches the new +/// bill-level key. Legacy entries from this layout suppress re-posts. +/// - **Session-level-log layout** — the prefix ends at `/sessions/` +/// with no bill segment. The legacy entry preserves a session-prefix +/// in the dedup set, but a new post's bill-level key +/// (`/bills/`) won't match it. The bill therefore +/// **re-posts once** on the first post-upgrade run, after which the +/// new bill-level GUID is in the ledger and never re-posts again. +/// +/// Pre-fix users incur at most one extra post per previously-posted bill +/// in session-level-log layouts. 
This is the honest migration cost; the +/// alternative — guessing the bill from a session-level log path alone — +/// would be wrong as often as it would be right (the filename does not +/// reliably encode the bill). +/// +/// Entries that do not contain `/logs/` (already bill-level, or a +/// synthetic `_` fallback) pass through unchanged. +fn ledger_id_to_bill_key(id: &str) -> String { + match id.split_once("/logs/") { + Some((prefix, _)) => prefix.to_string(), + None => id.to_string(), + } +} + /// Best-effort bill title — the bill's `title`, else its identifier, else a /// generic fallback. fn bill_title(entry: &Value) -> String { @@ -976,4 +1132,305 @@ mod tests { // The new path is under state/, not .govbot/. assert!(new_path.starts_with(dir.path().join("state"))); } + + // ------------------------------------------------------------ + // Per-bill dedup regression tests (Bug: posting once per action log) + // ------------------------------------------------------------ + + /// A synthetic log entry for a single bill — the shape `govbot source + /// --join bill,tags` emits. `score` is the calibrated `final_score` + /// for the `clean_energy` tag (the test default). + fn log_entry( + dataset: &str, + session: &str, + bill_id: &str, + log_filename: &str, + score: f64, + ) -> Value { + let log_path = format!( + "{}/country:us/state:xx/sessions/{}/logs/{}", + dataset, session, log_filename + ); + json!({ + "id": bill_id, + "bill": { "title": format!("Bill {}", bill_id), "identifier": bill_id }, + "log": { "bill_id": bill_id }, + "sources": { "log": log_path }, + "tags": { "clean_energy": { "final_score": score } } + }) + } + + /// Six action-log entries for the same NV AB1 bill (the audit's worst + /// case — 6 of 96 posts were the same bill) must collapse to ONE + /// rendered post, not six. 
+ #[test] + fn bluesky_publisher_emits_one_post_per_bill_even_with_multiple_action_logs() { + let entries: Vec = (1..=6) + .map(|i| { + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + 0.92, + ) + }) + .collect(); + + let reps = pick_per_bill_representatives(&entries, &[], 0.5); + assert_eq!( + reps.len(), + 1, + "6 action logs for the same bill must collapse to 1 representative; got {}", + reps.len() + ); + // The representative's bill_guid is the canonical bill path — + // independent of which log won, all six share it. + assert_eq!( + crate::rss::bill_guid(reps[0]), + "nv-legislation/country:us/state:xx/sessions/2025Special36/bills/AB1", + "the representative must carry the bill-level guid, not a log-level one" + ); + } + + /// When multiple logs for the same bill score above the threshold, the + /// **highest-scoring** log becomes the representative — not the first + /// or newest. The post's text comes from that representative. + #[test] + fn bluesky_publisher_picks_the_highest_scoring_log_when_multiple_score() { + let entries = vec![ + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.weak.json", + 0.55, + ), + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.strong.json", + 0.95, // highest + ), + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251113T080000Z.mid.json", + 0.70, + ), + ]; + + let reps = pick_per_bill_representatives(&entries, &[], 0.5); + assert_eq!(reps.len(), 1, "must collapse to 1 representative"); + // The picked rep must be the 0.95-scoring log (the "strong" one), + // which the test labels into the log filename so we can read it + // straight off `sources.log`. 
+ let log_path = reps[0] + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + .unwrap_or(""); + assert!( + log_path.contains("strong"), + "expected the highest-scoring log to be the representative; got {}", + log_path + ); + } + + /// A bill posted once writes a **bill-level** GUID to the ledger. + /// When the next run discovers a *new* action log for the same bill, + /// the bill must NOT re-post — the ledger key is the bill, not the + /// log, so future logs are deduplicated to the same key and the + /// publisher recognises the bill as already-posted. + #[test] + fn bluesky_ledger_uses_bill_level_guid_to_prevent_repost_when_new_logs_appear() { + // Round 1: post the bill once via its first action log. + let bill_path = + "nv-legislation/country:us/state:xx/sessions/2025Special36/bills/AB1".to_string(); + + let round1 = vec![log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.first.json", + 0.92, + )]; + let reps1 = pick_per_bill_representatives(&round1, &[], 0.5); + assert_eq!(reps1.len(), 1); + let post1 = render_post(reps1[0], None, None, None); + assert_eq!( + post1.id, bill_path, + "ledger id must be the bill-level guid (no /logs/...) — got {}", + post1.id + ); + + // Simulate writing the ledger. + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + let ledger = resolve_ledger_path(&job); + append_ledger(&ledger, &post1.id).unwrap(); + + // Round 2: a **new** action log for the same bill arrives. + let round2 = vec![ + // The old log is still in the stream (the source walks every + // log file on disk every run). + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.first.json", + 0.92, + ), + // Plus a freshly-arrived second log. 
+ log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.second.json", + 0.93, + ), + ]; + let reps2 = pick_per_bill_representatives(&round2, &[], 0.5); + let post2 = render_post(reps2[0], None, None, None); + assert_eq!( + post2.id, bill_path, + "representative's ledger id must still be the bill-level guid" + ); + + // The ledger already contains this bill — `run_bluesky` would + // filter it out as already-posted. Confirm at the unit-level: + let already: HashSet = read_ledger(&ledger) + .unwrap() + .into_iter() + .map(|s| ledger_id_to_bill_key(&s)) + .collect(); + assert!( + already.contains(&post2.id), + "ledger should recognise the bill as already-posted; ledger={:?}, post.id={}", + already, + post2.id + ); + + // And confirm we wouldn't append a duplicate. + let before = std::fs::read_to_string(&ledger).unwrap(); + let lines_before = before.lines().count(); + assert_eq!( + lines_before, 1, + "ledger should hold exactly one entry for the bill" + ); + } + + /// A pre-fix ledger holding **per-log** GUIDs is still read on + /// upgrade — the publisher doesn't crash, and per-bill-log-layout + /// entries cleanly suppress re-posts. The session-level-log-layout + /// case incurs the documented one-time re-post (see + /// `ledger_id_to_bill_key`). + #[test] + fn bluesky_ledger_respects_legacy_per_log_guids() { + // Per-bill-log layout: legacy GUID already carries `/bills/` + // before the `/logs/` segment, so stripping `/logs/...` yields + // the new bill-level key directly. The bill is recognised as + // already-posted and re-posts are suppressed. 
+ let legacy_per_bill_log = + "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250101T000000Z.passage.json"; + let expected_bill_key = "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001"; + assert_eq!( + ledger_id_to_bill_key(legacy_per_bill_log), + expected_bill_key, + "per-bill-log legacy entries strip to the bill-level guid cleanly" + ); + + // Session-level-log layout (OCD-files common case): legacy GUID + // ends in `/logs/`; stripping yields the session + // prefix, which does NOT match the new bill-level key. We + // document the resulting behavior — the bill re-posts once, + // then the new bill-level GUID lands in the ledger and never + // re-posts again. + let legacy_session_log = + "nv-legislation/country:us/state:nv/sessions/2025Special36/logs/20251111T080000Z.first.json"; + assert_eq!( + ledger_id_to_bill_key(legacy_session_log), + "nv-legislation/country:us/state:nv/sessions/2025Special36", + "session-level legacy entries strip to the session prefix — the bill \ + segment isn't in the legacy path so it can't be recovered. \ + Bills under this layout re-post once on the first post-upgrade run." + ); + + // End-to-end: seed a legacy ledger with the per-bill-log entry, + // confirm the publisher reads it and recognises the bill as + // already-posted (matching the new bill-level GUID a post would + // write). + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, format!("{}\n", legacy_per_bill_log)).unwrap(); + + // Build a new HB0001 entry whose `sources.log` happens to be the + // per-bill-log path (matches what a post would render). 
+ let entry = json!({ + "id": "HB0001", + "bill": { "title": "WY HB0001", "identifier": "HB0001" }, + "log": { "bill_id": "HB0001" }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250102T000000Z.next.json" + }, + "tags": { "clean_energy": { "final_score": 0.92 } } + }); + let post = render_post(&entry, None, None, None); + assert_eq!( + post.id, expected_bill_key, + "new post's id is the bill-level guid" + ); + + // The legacy ledger entry collapses to the same bill-level key, + // so the publisher's already-posted set contains the bill. + let already: HashSet = read_ledger(&legacy) + .unwrap() + .into_iter() + .map(|s| ledger_id_to_bill_key(&s)) + .collect(); + assert!( + already.contains(&post.id), + "the legacy per-bill-log GUID must collapse to the same bill-level \ + key the new post writes; ledger={:?}, post.id={}", + already, + post.id + ); + } + + /// `bill_guid` is the canonical bill key; it strips `/logs/...` from + /// `sources.log` and appends `/bills/` (the OCD-files common + /// case). Sanity-check the shape and the dedup the publisher relies on. 
+ #[test] + fn bill_guid_collapses_session_level_logs_to_one_bill_key() { + let a = log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.a.json", + 0.9, + ); + let b = log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.b.json", + 0.9, + ); + let c = log_entry( + "nv-legislation", + "2025Special36", + "AB2", // different bill + "20251111T080000Z.c.json", + 0.9, + ); + assert_eq!(crate::rss::bill_guid(&a), crate::rss::bill_guid(&b)); + assert_ne!(crate::rss::bill_guid(&a), crate::rss::bill_guid(&c)); + } } diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index d5215506..6927436f 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -2564,27 +2564,50 @@ async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { ); // Filter to the publisher's selected tags, dedup, sort. + // + // The bluesky publisher does its own **score-aware** per-bill dedup + // (highest-scoring log per (jurisdiction, bill_id) becomes the + // representative — see `bluesky::run_bluesky`); the global + // first-wins dedup would force a "newest" winner that drops a + // bill whose newest log carries no qualifying tag even when an + // older log scored above the threshold. Skip the global dedup for + // bluesky so the publisher sees every log for every bill. let mut entries: Vec = all_entries .iter() .filter(|e| filter_by_tags(e, &select)) .cloned() .collect(); - entries = deduplicate_entries(entries); + if publisher.kind != govbot::PublisherKind::Bluesky { + entries = deduplicate_entries(entries); + } entries = sort_by_timestamp(entries); // Apply the limit: CLI override, else the publisher's, else 100. + // + // **The limit is a per-bill cap**, not a per-action-log cap — for + // non-bluesky publishers that's already true (entries are + // pre-dedup'd by bill above). 
For bluesky we skipped the + // pre-dedup, so the entry stream still carries N action-log + // records per bill; truncating it here would arbitrarily clip + // bills before bluesky's own dedup runs. Skip the limit for + // bluesky and let the publisher cap **after** its score-aware + // per-bill dedup (a future enhancement; the runtime cost of + // posting every qualifying bill is already small relative to + // the activist's daily-digest expectations). let limit_value: Option = match cli_limit { Some(v) => v, None => publisher.resolved_limit(Some(100)), }; let original_count = entries.len(); if let Some(lim) = limit_value { - entries.truncate(lim); - if original_count > lim { - eprintln!( - "Limited '{}' to {} entries. Use --limit none for all {}.", - name, lim, original_count - ); + if publisher.kind != govbot::PublisherKind::Bluesky { + entries.truncate(lim); + if original_count > lim { + eprintln!( + "Limited '{}' to {} entries. Use --limit none for all {}.", + name, lim, original_count + ); + } } } diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index 64cef389..8637885a 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -284,15 +284,28 @@ pub fn filter_by_tags(entry: &Value, tag_names: &[String]) -> bool { false } -/// Deduplicate entries by GUID +/// Deduplicate entries by **bill** (jurisdiction, bill_id) — collapse the +/// N action-log records a bill emits to a single representative, keeping +/// the **first** in the stream. +/// +/// Callers sort by timestamp DESC before this, so the first-per-bill wins +/// is also the **most recent action log**. The post / feed item / +/// HTML entry is rendered from that representative. +/// +/// Before this fix, this function dedup'd by per-log GUID — i.e. it +/// **did not collapse multiple logs for the same bill**, which let an +/// activist see the same bill posted six times in a row (NV AB1 +/// 6×, AK HB53 4× on the climate-tracker feed). 
The bill_guid is the +/// canonical bill path (`/.../bills/`); see +/// [`rss::bill_guid`]. pub fn deduplicate_entries(entries: Vec) -> Vec { let mut seen = HashSet::new(); let mut result = Vec::new(); for entry in entries { - let guid = rss::extract_guid(&entry); - if !seen.contains(&guid) { - seen.insert(guid); + let bill_key = rss::bill_guid(&entry); + if !seen.contains(&bill_key) { + seen.insert(bill_key); result.push(entry); } } @@ -454,4 +467,192 @@ mod tests { "index.html should carry the html publisher's title (not the rss publisher's)" ); } + + // ------------------------------------------------------------ + // Per-bill dedup regression tests (same Bug as the bluesky one) + // ------------------------------------------------------------ + + /// Build a synthetic log entry for `bill_id` whose `sources.log` + /// embeds `filename` — the shape `govbot source --join bill,tags` + /// emits. The `timestamp` is included so the upstream `sort_by_timestamp` + /// is exercised the way `run_publish_command` exercises it. + fn log(dataset: &str, session: &str, bill_id: &str, filename: &str, ts: &str) -> Value { + json!({ + "id": bill_id, + "timestamp": ts, + "bill": { "title": format!("Bill {}", bill_id), "identifier": bill_id }, + "log": { "bill_id": bill_id }, + "sources": { + "log": format!( + "{}/country:us/state:xx/sessions/{}/logs/{}", + dataset, session, filename + ) + }, + "tags": { "clean_energy": { "final_score": 0.9 } } + }) + } + + /// Six action-log entries for the same NV AB1 bill must collapse to + /// **one** entry post-dedup — the bug that put 6 NV AB1 posts on the + /// climate-tracker bluesky-pending feed under `datasets: [all]`. RSS + /// and HTML feeds share the same dedup (`deduplicate_entries`). 
+ #[test] + fn deduplicate_entries_collapses_action_logs_to_one_per_bill() { + let entries: Vec<Value> = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + + let out = deduplicate_entries(entries); + assert_eq!( + out.len(), + 1, + "6 action logs for the same bill must dedup to 1; got {}", + out.len() + ); + } + + /// The dedup keeps **distinct bills** distinct — only logs *for the + /// same bill* are collapsed. A second bill (NV AB2) survives the same + /// dedup pass. + #[test] + fn deduplicate_entries_keeps_distinct_bills() { + let entries = vec![ + log( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.a.json", + "20251111T080000Z", + ), + log( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.b.json", + "20251112T080000Z", + ), + log( + "nv-legislation", + "2025Special36", + "AB2", + "20251111T080000Z.c.json", + "20251111T080000Z", + ), + ]; + let out = deduplicate_entries(entries); + assert_eq!(out.len(), 2, "AB1 collapses to 1 record; AB2 survives"); + } + + /// The `rss` publisher emits ONE `<item>` per bill — not one per + /// action log. End-to-end check: render an RSS feed from 6 action-log + /// records for the same bill and count `<item>` tags. + #[test] + fn rss_publisher_emits_one_item_per_bill_even_with_multiple_action_logs() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + let p = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + // Six action logs for NV AB1. 
+ let entries: Vec<Value> = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + let job = PublishJob { + name: "feed", + publisher: &p, + entries, + output_dir_override: None, + output_file_override: None, + project_dir: dir.path().to_path_buf(), + dry_run: false, + html_entry_url: None, + }; + run_publisher(&job).expect("rss publisher should run"); + + let feed_xml = std::fs::read_to_string(out_dir.join("feed.xml")).unwrap(); + let item_count = feed_xml.matches("<item>").count(); + assert_eq!( + item_count, 1, + "RSS feed must contain exactly one <item> per bill; got {} items for one bill", + item_count + ); + } + + /// The `html` publisher emits ONE `
<article>` per bill — not one per + /// action log. End-to-end check: render the HTML index from 6 + /// action-log records for the same bill and count `<article>
` tags. + #[test] + fn html_publisher_emits_one_article_per_bill_even_with_multiple_action_logs() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + let p = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let entries: Vec = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + let job = PublishJob { + name: "site", + publisher: &p, + entries, + output_dir_override: None, + output_file_override: None, + project_dir: dir.path().to_path_buf(), + dry_run: false, + html_entry_url: None, + }; + run_publisher(&job).expect("html publisher should run"); + + let html = std::fs::read_to_string(out_dir.join("index.html")).unwrap(); + let article_count = html.matches(" per bill; got {} for one bill", + article_count + ); + } } diff --git a/actions/govbot/src/rss.rs b/actions/govbot/src/rss.rs index 524e542c..398f5518 100644 --- a/actions/govbot/src/rss.rs +++ b/actions/govbot/src/rss.rs @@ -275,7 +275,12 @@ pub fn extract_link(entry: &Value, base_url: Option<&str>) -> Option { None } -/// Extract or generate a unique GUID for the entry +/// Extract or generate a unique GUID for the entry. +/// +/// This is a **per-log** GUID — distinct for every action-log file. For a +/// **per-bill** key (the one publishers should use to dedup so an activist +/// doesn't see the same bill posted six times under six different action +/// logs), see [`bill_guid`]. 
pub fn extract_guid(entry: &Value) -> String { // Use source log path as GUID if available if let Some(sources) = entry.get("sources").and_then(|s| s.as_object()) { @@ -298,6 +303,84 @@ pub fn extract_guid(entry: &Value) -> String { format!("{}_{}", timestamp, bill_id) } +/// A **per-bill** GUID — the grouping key publishers use to emit one post / +/// item / row per (jurisdiction, bill_id) rather than one per action log. +/// +/// Publishers (bluesky, rss, html) receive a result stream where the same +/// bill emits one record per action-log file (committee referrals, hearings, +/// passage votes, …). Activists want one item per bill, not N. The +/// **bill_guid** collapses the N action-log records to a single (jurisdiction, +/// bill_id) key. +/// +/// Resolution order — each path produces the same canonical form +/// `/country:.../state:.../sessions//bills/`: +/// +/// 1. `sources.log` → strip `/logs/` tail; if the stripped path +/// already ends in `/bills/` (per-bill-log-directory layout) use it +/// verbatim, else append `/bills/` (session-level-log-directory +/// layout — the OCD-files common case). +/// 2. `sources.bill` → strip `/metadata.json` tail. The parent dir IS the +/// bill dir on disk. +/// 3. fall back to the per-log GUID (see [`extract_guid`]) — preserves the +/// pre-fix shape when an entry carries no `sources` block at all. +/// +/// `bill_id` is taken from `bill.identifier`, then `id`, then `log.bill_id`, +/// matching what publishers already use for rendering. +pub fn bill_guid(entry: &Value) -> String { + // Resolve the bill identifier the publishers already render with. + let bill_id = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + .or_else(|| { + entry + .get("log") + .and_then(|l| l.get("bill_id")) + .and_then(|v| v.as_str()) + }) + .map(|s| s.trim().to_string()) + .unwrap_or_default(); + + // 1. `sources.log` — the dominant path. 
Strip `/logs/...` to get the + // enclosing dir, then ensure it ends in `/bills/`. + if let Some(log_path) = entry + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + { + if let Some(prefix) = log_path.split("/logs/").next() { + if !bill_id.is_empty() { + let needle = format!("/bills/{}", bill_id); + if prefix.ends_with(&needle) { + return prefix.to_string(); + } + return format!("{}{}", prefix, needle); + } + // No bill_id resolved — the prefix is still a stable dedup key + // (collapses all session-level logs to one row, which is a + // reasonable fallback when bill_id is missing). + return prefix.to_string(); + } + } + + // 2. `sources.bill` — the path to `metadata.json`; its parent dir is + // `bills/`. Strip the trailing `/metadata.json`. + if let Some(bill_path) = entry + .get("sources") + .and_then(|s| s.get("bill")) + .and_then(|v| v.as_str()) + { + let trimmed = bill_path + .strip_suffix("/metadata.json") + .unwrap_or(bill_path); + return trimmed.to_string(); + } + + // 3. No sources at all — fall back to the per-log GUID. + extract_guid(entry) +} + /// Convert JSON Lines entries to RSS feed pub fn json_to_rss( entries: Vec, @@ -310,16 +393,22 @@ pub fn json_to_rss( let base_url = base_url.unwrap_or(link); let mut items = Vec::new(); - let mut seen_guids = HashSet::new(); + let mut seen_bills = HashSet::new(); for entry in entries { - let guid = extract_guid(&entry); - - // Deduplicate by GUID - if seen_guids.contains(&guid) { + // Dedup by **bill** — not by action-log path. A bill emits one + // record per action-log file (committee referral, hearing, passage + // vote …); RSS readers want one item per bill, not N. The first + // (newest, since the stream is timestamp-sorted DESC) wins; later + // action-log records for the same bill are dropped. The RSS + // `` itself still uses the per-log GUID so a feed reader + // doesn't conflate two genuinely different items across feeds. 
+ let bill_key = bill_guid(&entry); + if seen_bills.contains(&bill_key) { continue; } - seen_guids.insert(guid.clone()); + seen_bills.insert(bill_key); + let guid = extract_guid(&entry); let mut item_builder = ItemBuilder::default(); @@ -497,16 +586,16 @@ pub fn json_to_html( let title_str = title.unwrap_or(""); let mut items_html = String::new(); - let mut seen_guids = HashSet::new(); + let mut seen_bills = HashSet::new(); for entry in entries { - let guid = extract_guid(&entry); - - // Deduplicate by GUID - if seen_guids.contains(&guid) { + // Dedup by **bill** — see `json_to_rss` for the rationale. One HTML + // entry per bill, not one per action log. + let bill_key = bill_guid(&entry); + if seen_bills.contains(&bill_key) { continue; } - seen_guids.insert(guid); + seen_bills.insert(bill_key); let entry_title = extract_title(&entry); let entry_description = extract_description(&entry); From 4056de9c50bf290f96373e90084271d72d99d4e9 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 22:38:33 -0500 Subject: [PATCH 28/32] docs: credit CHN-Bluesky-Govbot framework lineage MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Govbot's civic-tech application — driving per-topic Bluesky bots from state-legislative data — was first proven by Frankie Vegliante's CHN-Bluesky-Govbot-Main framework. The per-topic config + GitHub Actions cron + per-topic state ledger + shared posting pipeline across 13 issue areas is the pattern that govbot's 4-tool architecture generalises. Credit it in README and in AGENT.md's playbook intro before any upstream PR. Co-Authored-By: Claude Opus 4.7 --- AGENT.md | 5 +++++ README.md | 14 ++++++++++++++ 2 files changed, 19 insertions(+) diff --git a/AGENT.md b/AGENT.md index 727bf2f9..bf96ab2e 100644 --- a/AGENT.md +++ b/AGENT.md @@ -64,6 +64,11 @@ activist-default for hands-off improvement); **update** evolves the stack. 
| **manage** | "set up / run the Bluesky bot", "schedule it" | [§2](#2-manage--operate-the-bluesky-bot) | | **update** | "add a dataset", "the classifier misses bills" | [§3](#3-update--evolve-an-existing-project) | +Govbot's per-topic-bot pattern is owed to Frankie Vegliante's +[CHN-Bluesky-Govbot-Main](https://github.com/frankies2727/CHN-Bluesky-Govbot-Main) +framework — the framework that first ran 13 civic-issue Bluesky bots on the +same legislative-data pipeline. + --- ## The model — read this before doing anything diff --git a/README.md b/README.md index 6a5792ad..cdf22a6d 100644 --- a/README.md +++ b/README.md @@ -212,6 +212,20 @@ Coverage today: Override the registry with `GOVBOT_REGISTRY_URL=` or a project-local `.govbot/registry.json`. +## Lineage + +Govbot's civic-tech application — feeding state legislative data into +per-topic Bluesky bots — was first proven by Frankie Vegliante's +**CHN-Bluesky-Govbot-Main** framework +([github.com/frankies2727/CHN-Bluesky-Govbot-Main](https://github.com/frankies2727/CHN-Bluesky-Govbot-Main)). +That framework's design — per-topic configs, GitHub Actions cron, per-topic +state ledger, and a shared posting pipeline across 13 issue-area Bluesky +bots (transportation, housing, education, immigration, …) — is the pattern +that govbot's 4-tool architecture and the climate-activist deployment both +build on. Govbot's planned `govbot init --from-frankie-config` flag (Phase +1b, in flight) lets a CHN topic migrate to this stack with its keywords, +emoji map, and posted-state history intact. 
+ ## Contribute ### Folder Structure From 0e63562781866a1d9dd980d36ff917ecd0b31052 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 22:47:19 -0500 Subject: [PATCH 29/32] init: --from-frankie-config scaffolds project from CHN topic config MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a `govbot init --from-frankie-config [--into ]` migration tool for the 13 existing CHN-Bluesky-Govbot topics (transportation, immigration, housing, education, …) so a Frankie topic maintainer can move to the govbot+fastclass stack without rebuilding the keyword list from scratch. Purely local — no network calls. Field-to-field mapping the scaffold applies: name → classifier tag name + bluesky publisher `select: []` display_name → tag description framing + README header (falls back to title-cased `name`) default_emoji → README header + summarizer prompt voice keywords → classifier/classifier.yml tags..include_keywords (verbatim) emoji_map → classifier.yml header comment listing keyword→emoji, ready to fold into a post template later digest_title → publish.feed.title + publish.site.title (falls back to " Bills Weekly Digest") topic → tag description + summarizer prompt subject (falls back to `name`) (any extras) → absorbed via #[serde(flatten)] so Frankie configs that carry schedule/timezone/jurisdictions still parse Skeleton written into ``: govbot.yml manifest (datasets:[all], classify transform, rss/html/ bluesky publishers each selecting []) classifier/ classifier.yml one tag = the topic name, include_keywords = Frankie's keyword list verbatim, examples: [], threshold: 0.3 fusion.yml declares the portable `models:` block (encoder + reranker) so `fastclass model fetch --bundle` works eval/ constitution.yml two PLACEHOLDER items (one positive, one negative), clearly marked rolling.yml items: [] proposals/.gitkeep empty summarizer/prompt.md stub folding in the topic focus README.md activist-facing migration story + next-steps 
.gitignore .govbot/ tags/ state/ dist/ classifier/model[-rerank]/ fastclass.lock govbot.lock .env Pre-flight guard: refuses to overwrite if `/govbot.yml` already exists, with an explicit error message pointing the user at --into. After scaffolding, stdout prints the 5-step next-steps recipe (install Tier-2 model, seed gold, dry-run, /fastclass:improve, set BLUESKY_* env vars) so a CHN topic maintainer can land on the new stack and keep moving without re-reading the docs. Tests (+4): unit tests for the parser (minimal-config-with-extras, display-fallback, empty-name-rejected) and an integration test that runs the binary, parses the scaffolded manifest + classifier.yml + fusion.yml, asserts the keyword list survives verbatim, and confirms the overwrite guard. Total 78 → 82. Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/init_from_frankie.rs | 559 ++++++++++++++++++ actions/govbot/src/lib.rs | 1 + actions/govbot/src/main.rs | 58 +- .../frankie_transportation_config.yml | 26 + actions/govbot/tests/from_frankie_config.rs | 188 ++++++ 5 files changed, 825 insertions(+), 7 deletions(-) create mode 100644 actions/govbot/src/init_from_frankie.rs create mode 100644 actions/govbot/tests/fixtures/frankie_transportation_config.yml create mode 100644 actions/govbot/tests/from_frankie_config.rs diff --git a/actions/govbot/src/init_from_frankie.rs b/actions/govbot/src/init_from_frankie.rs new file mode 100644 index 00000000..77d8643b --- /dev/null +++ b/actions/govbot/src/init_from_frankie.rs @@ -0,0 +1,559 @@ +//! `govbot init --from-frankie-config` — migration tool that scaffolds a +//! govbot+fastclass project from a Frankie-style per-topic config. +//! +//! Frankie is the original CHN-Bluesky-Govbot framework. Each topic +//! (`transportation`, `immigration`, `housing`, …) lives in a +//! `topics//config.yml` carrying a name, display name, default emoji, +//! a flat keyword list covering subdomains, a keyword→emoji map, a digest +//! title, and a topic focus. 
This module reads one such file and emits a +//! govbot+fastclass project skeleton — a `govbot.yml` manifest plus a +//! fastclass classifier bundle plus the supporting stubs — so an existing +//! Frankie topic maintainer can migrate to the new stack without rebuilding +//! the keyword list from scratch. +//! +//! ### Field-to-field mapping +//! +//! | Frankie field | Scaffolded output | +//! |------------------|--------------------------------------------------------| +//! | `name` | classifier tag name + bluesky `select: []` | +//! | `display_name` | tag description framing + README header | +//! | `default_emoji` | README header + summarizer prompt voice | +//! | `keywords` | `classifier/classifier.yml: tags..include_keywords` | +//! | `emoji_map` | classifier.yml comment listing the keyword→emoji map | +//! | `digest_title` | `publish.feed.title` + `publish.site.title` | +//! | `topic` | tag description + summarizer prompt subject | +//! +//! No network calls — purely a local-file transformation. +//! +//! ### Atomicity +//! +//! Refuses to overwrite if `/govbot.yml` already exists. Otherwise +//! scaffolds everything before reporting success: a failure mid-write leaves +//! a partial tree (the user can `rm -rf ` and retry), but the +//! pre-flight check is the primary guard against clobbering an existing +//! project. + +use anyhow::{anyhow, Context, Result}; +use serde::Deserialize; +use std::collections::BTreeMap; +use std::fs; +use std::path::{Path, PathBuf}; + +/// The Frankie per-topic config shape. Permissive: extra fields are absorbed +/// into `extra` so a Frankie config that carries fields we don't use yet +/// (timezone, schedule, jurisdictions, …) still parses cleanly. +#[derive(Debug, Deserialize)] +pub struct FrankieTopicConfig { + /// The machine-readable topic name (e.g. `"transportation"`). Becomes the + /// classifier's single tag name and the bluesky publisher's `select` entry. 
+ pub name: String, + + /// Optional human-readable display name (e.g. `"Transportation"`). + /// Defaults to a title-cased `name` if absent. + pub display_name: Option<String>, + + /// Default emoji for the topic (e.g. `"🚗"`). + pub default_emoji: Option<String>, + + /// The flat keyword list covering the topic's subdomains. Becomes the + /// classifier tag's `include_keywords`. + #[serde(default)] + pub keywords: Vec<String>, + + /// Keyword→emoji map (e.g. `rail` → `"🚆"`). Surfaced in the classifier + /// bundle as a comment so the migrating maintainer can fold it back into a + /// post template later. + #[serde(default)] + pub emoji_map: BTreeMap<String, String>, + + /// The Frankie digest title (e.g. `"🗳️ Transportation Bills Weekly + /// Digest"`). Becomes the RSS/HTML publisher title. + pub digest_title: Option<String>, + + /// The Frankie "topic focus" string — a short framing the summarizer + /// uses (e.g. `"transportation"`). + pub topic: Option<String>, + + /// Catch-all for fields Frankie carries that this migration tool does not + /// translate. Held so unknown fields don't fail the parse. + #[serde(flatten)] + pub extra: serde_yaml::Value, +} + +impl FrankieTopicConfig { + /// Parse a Frankie-style `topics//config.yml` from a path. + pub fn load(path: &Path) -> Result<Self> { + let contents = fs::read_to_string(path) + .with_context(|| format!("Failed to read Frankie config: {}", path.display()))?; + let parsed: Self = serde_yaml::from_str(&contents) + .with_context(|| format!("Failed to parse Frankie config: {}", path.display()))?; + if parsed.name.trim().is_empty() { + return Err(anyhow!( + "Frankie config {} has empty `name` — required to scaffold a classifier tag", + path.display() + )); + } + Ok(parsed) + } + + /// The human-readable display name. Falls back to title-casing `name` + /// (e.g. `transportation` → `Transportation`). 
+    pub fn display(&self) -> String {
+        self.display_name.clone().unwrap_or_else(|| {
+            let mut chars = self.name.chars();
+            match chars.next() {
+                None => String::new(),
+                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
+            }
+        })
+    }
+
+    /// The summarizer-prompt topic focus, defaulting to `name` as written
+    /// (no case conversion is applied).
+    pub fn topic_focus(&self) -> String {
+        self.topic.clone().unwrap_or_else(|| self.name.clone())
+    }
+}
+
+/// Scaffold a govbot+fastclass project at `into` from a parsed Frankie config.
+/// Returns the path the project was written to (`into` as given; it is not
+/// canonicalized).
+pub fn scaffold(config: &FrankieTopicConfig, into: &Path) -> Result<PathBuf> {
+    // Pre-flight guard: refuse to clobber an existing project.
+    let manifest_path = into.join("govbot.yml");
+    if manifest_path.exists() {
+        return Err(anyhow!(
+            "{} already exists — refusing to overwrite an existing govbot project. \
+             Remove it first or scaffold into a different directory with --into .",
+            manifest_path.display()
+        ));
+    }
+
+    fs::create_dir_all(into)
+        .with_context(|| format!("Failed to create scaffold dir: {}", into.display()))?;
+
+    // 1. govbot.yml manifest
+    fs::write(&manifest_path, render_govbot_yml(config))
+        .with_context(|| format!("Failed to write {}", manifest_path.display()))?;
+
+    // 2. classifier bundle
+    let classifier_dir = into.join("classifier");
+    fs::create_dir_all(&classifier_dir)?;
+    fs::write(
+        classifier_dir.join("classifier.yml"),
+        render_classifier_yml(config),
+    )?;
+    fs::write(classifier_dir.join("fusion.yml"), render_fusion_yml())?;
+
+    let eval_dir = classifier_dir.join("eval");
+    fs::create_dir_all(&eval_dir)?;
+    fs::write(
+        eval_dir.join("constitution.yml"),
+        render_constitution_yml(config),
+    )?;
+    fs::write(eval_dir.join("rolling.yml"), render_rolling_yml())?;
+
+    // proposals dir — empty; the improvement loop populates it.
+    fs::create_dir_all(classifier_dir.join("proposals"))?;
+    // Keep the dir tracked even though it is empty today.
+ fs::write(classifier_dir.join("proposals").join(".gitkeep"), "")?; + + // 3. summarizer stub + let summarizer_dir = into.join("summarizer"); + fs::create_dir_all(&summarizer_dir)?; + fs::write( + summarizer_dir.join("prompt.md"), + render_summarizer_prompt(config), + )?; + + // 4. README + fs::write(into.join("README.md"), render_readme(config))?; + + // 5. .gitignore + fs::write(into.join(".gitignore"), render_gitignore())?; + + Ok(into.to_path_buf()) +} + +/// Run the full `--from-frankie-config [--into ]` flow: parse, +/// scaffold, and print activist-facing next-steps to stdout. +pub fn run(from_config: &Path, into: Option<&Path>) -> Result<()> { + let config = FrankieTopicConfig::load(from_config)?; + let cwd = std::env::current_dir()?; + let into_path: PathBuf = into.map(|p| p.to_path_buf()).unwrap_or(cwd); + let written = scaffold(&config, &into_path)?; + print_next_steps(&written, from_config); + Ok(()) +} + +fn print_next_steps(into: &Path, from_config: &Path) { + println!( + "✓ Scaffolded govbot+fastclass project at {}.", + into.display() + ); + println!(); + println!( + "This project was created from {}. The keyword list became", + from_config.display() + ); + println!("your starter classifier; everything else is yours to refine."); + println!(); + println!("Recommended next steps:"); + println!(); + println!(" 1. Install the Tier-2 semantic model so embedding matchers fire:"); + println!(" fastclass model fetch --bundle ./classifier"); + println!(); + println!(" 2. Replace the placeholder constitution items with real labeled examples:"); + println!(" /fastclass:seed-gold ./classifier"); + println!(); + println!(" 3. Try classifying:"); + println!(" govbot run --dry-run"); + println!(); + println!(" 4. Iterate quality via the improvement loop:"); + println!(" /fastclass:improve autonomous"); + println!(); + println!(" 5. 
Set Bluesky credentials (env-only — never in govbot.yml):"); + println!(" export BLUESKY_HANDLE=..."); + println!(" export BLUESKY_APP_PASSWORD=..."); +} + +// --------------------------------------------------------------------------- +// File renderers — each pure function takes the Frankie config and returns +// the file contents. Keeping rendering pure makes unit testing trivial. +// --------------------------------------------------------------------------- + +fn render_govbot_yml(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let title = config + .digest_title + .clone() + .unwrap_or_else(|| format!("{} Bills Weekly Digest", display)); + + let mut yml = String::new(); + yml.push_str("# Govbot manifest — scaffolded from a Frankie-style topic config.\n"); + yml.push_str("# See README.md for the migration story. Tune the classifier bundle\n"); + yml.push_str("# (./classifier) with the fastclass improvement loop, not by hand.\n"); + yml.push_str("$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n\n"); + + yml.push_str("datasets:\n"); + yml.push_str(" - all\n\n"); + + yml.push_str("transforms:\n"); + yml.push_str(" classify:\n"); + yml.push_str(" command: [fastclass, classify, \"-\"]\n"); + yml.push_str(" reads: docs\n"); + yml.push_str(" writes: classification\n"); + yml.push_str(" classifier: ./classifier\n\n"); + + yml.push_str("publish:\n"); + yml.push_str(" feed:\n"); + yml.push_str(" type: rss\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(&format!(" title: {}\n", yaml_string(&title))); + yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" output_dir: dist\n"); + yml.push_str(&format!(" output_file: {}-feed.xml\n\n", config.name)); + + yml.push_str(" site:\n"); + yml.push_str(" type: html\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(&format!(" title: {}\n", yaml_string(&title))); + 
yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" output_dir: dist\n\n"); + + yml.push_str(" bluesky:\n"); + yml.push_str(" type: bluesky\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(" # Calibrated final_score threshold (0..1). 0.55 is a sensible starting\n"); + yml.push_str(" # point per the climate-activist deployment; raise to cut false\n"); + yml.push_str(" # positives, lower to widen recall.\n"); + yml.push_str(" min_score: 0.55\n"); + yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" post_template: \"{title}\\n\\n{tags} · {link}\"\n"); + yml.push_str(" # Credentials are env-only: BLUESKY_HANDLE / BLUESKY_APP_PASSWORD.\n"); + yml.push_str(" # Never put them in this file.\n\n"); + + yml.push_str("pipelines:\n"); + yml.push_str(" default:\n"); + yml.push_str(" - classify\n"); + yml.push_str(" - feed\n"); + yml.push_str(" - site\n"); + yml.push_str(" - bluesky\n"); + + yml +} + +fn render_classifier_yml(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let topic_focus = config.topic_focus(); + + let mut yml = String::new(); + yml.push_str("# classifier.yml — the taxonomy for this govbot+fastclass project.\n"); + yml.push_str("#\n"); + yml.push_str("# Scaffolded from a Frankie-style topic config. The single tag below\n"); + yml.push_str("# carries the keyword list from that config verbatim; everything else\n"); + yml.push_str("# (exclude gates, regex, examples, HyDE queries, subjects) is yours\n"); + yml.push_str("# to grow via the /fastclass:improve loop. 
Never hand-tune by guessing;\n"); + yml.push_str("# every change should be proved against the frozen gold set in\n"); + yml.push_str("# eval/constitution.yml.\n"); + if !config.emoji_map.is_empty() { + yml.push_str("#\n"); + yml.push_str("# Frankie emoji_map (kept here for reference — fold into a post template\n"); + yml.push_str("# later if you want per-subdomain emoji in Bluesky posts):\n"); + for (keyword, emoji) in &config.emoji_map { + yml.push_str(&format!("# {} → {}\n", keyword, emoji)); + } + } + yml.push_str("tags:\n"); + yml.push_str(&format!(" {}:\n", config.name)); + yml.push_str(&format!( + " description: >-\n Bills about {} — scaffolded from the Frankie topic\n config for \"{}\". Refine this description as the tag evolves.\n", + topic_focus, display + )); + yml.push_str(" include_keywords:\n"); + for keyword in &config.keywords { + yml.push_str(&format!(" - {}\n", yaml_string(keyword))); + } + yml.push_str(" # examples are intentionally empty — add real labeled bills via\n"); + yml.push_str(" # /fastclass:from-intent or by curating eval/constitution.yml.\n"); + yml.push_str(" examples: []\n"); + yml.push_str(" threshold: 0.3\n"); + + yml +} + +fn render_fusion_yml() -> String { + // Mirrors the climate-activist bundle's fusion.yml — the portable + // `models:` block declares the encoder + reranker so `fastclass model + // fetch --bundle ./classifier` resolves and installs them. + let mut yml = String::new(); + yml.push_str("# fusion.yml — global fusion config for the classifier bundle.\n"); + yml.push_str("# Owned by fastclass. Per-tag overrides live INLINE in classifier.yml.\n"); + yml.push_str("version: fusion-v1\n\n"); + yml.push_str( + "# Portable model declaration. 
Run `fastclass model fetch --bundle ./classifier`\n", + ); + yml.push_str("# to install these into the shared ~/.govbot/models// cache.\n"); + yml.push_str("models:\n"); + yml.push_str(" encoder: sentence-transformers/all-MiniLM-L6-v2\n"); + yml.push_str(" reranker: cross-encoder/ms-marco-MiniLM-L-6-v2\n\n"); + yml.push_str("# Default fusion weight per matcher kind, applied to any tag that does not\n"); + yml.push_str("# declare its own inline `fusion_weights`.\n"); + yml.push_str("weights:\n"); + yml.push_str(" keyword: 1.0\n"); + yml.push_str(" regex: 0.8\n\n"); + yml.push_str("# Cascade uncertainty band. Documents whose fused score lands inside\n"); + yml.push_str("# [low, high] are the uncertain ones the improvement loop focuses on.\n"); + yml.push_str("uncertainty_band:\n"); + yml.push_str(" low: 0.3\n"); + yml.push_str(" high: 0.7\n\n"); + yml.push_str("splitters:\n"); + yml.push_str(" default:\n"); + yml.push_str(" strategy: whole\n"); + yml.push_str(" sections:\n"); + yml.push_str(" strategy: sections\n"); + yml.push_str(" aggregation: max\n"); + yml +} + +fn render_constitution_yml(config: &FrankieTopicConfig) -> String { + // PLACEHOLDER items per the seed-gold pattern, clearly marked. The + // activist replaces them with real labeled bills via /fastclass:seed-gold. + let mut yml = String::new(); + yml.push_str("# constitution.yml — the FROZEN gold standard for this classifier.\n"); + yml.push_str("# Never shown to an LLM. 
The items below are PLACEHOLDERS — replace them\n"); + yml.push_str("# with real labeled bills (use /fastclass:seed-gold ./classifier) before\n"); + yml.push_str("# relying on the improvement loop's judgement.\n"); + yml.push_str("items:\n"); + yml.push_str(&format!(" - id: placeholder-{}-positive\n", config.name)); + yml.push_str(" text: >-\n"); + yml.push_str(&format!( + " PLACEHOLDER — replace with a real {} bill the classifier should\n tag (positive example).\n", + config.topic_focus() + )); + yml.push_str(&format!(" expected_tags: [{}]\n", config.name)); + yml.push_str(&format!(" - id: placeholder-{}-negative\n", config.name)); + yml.push_str(" text: >-\n"); + yml.push_str(&format!( + " PLACEHOLDER — replace with a real bill that should NOT be tagged\n {} (negative example used to gate false-positives).\n", + config.name + )); + yml.push_str(" expected_tags: []\n"); + yml +} + +fn render_rolling_yml() -> String { + let mut yml = String::new(); + yml.push_str("# rolling.yml — the refreshable working eval set.\n"); + yml.push_str("# The improvement loop adds failing bills here and proves fixes against\n"); + yml.push_str("# the (unseen) constitution. Empty today — start by labeling a handful\n"); + yml.push_str("# of bills from `govbot source --select docs` you disagree with.\n"); + yml.push_str("items: []\n"); + yml +} + +fn render_summarizer_prompt(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let topic = config.topic_focus(); + let mut s = String::new(); + s.push_str(&format!("# {} summarizer prompt (stub)\n\n", display)); + s.push_str(&format!( + "Describe this bill in one neutral sentence, focused on {} policy.\n", + topic + )); + s.push_str( + "Avoid editorial language; let the bill text speak for itself. 
\ + A future `summarize` transform will read this prompt — today it is a\n\ + placeholder for the migrating maintainer to refine.\n", + ); + s +} + +fn render_readme(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let emoji = config.default_emoji.as_deref().unwrap_or(""); + let topic = config.topic_focus(); + + let mut s = String::new(); + s.push_str(&format!("# {} {} govbot deployment\n\n", emoji, display)); + s.push_str( + "This is a govbot+fastclass project scaffolded **from a Frankie-style\n\ + topic config**. The Frankie keyword list became your starter classifier;\n\ + everything else is yours to refine.\n\n", + ); + s.push_str("## What was generated\n\n"); + s.push_str("- `govbot.yml` — the project manifest (datasets, transforms, publishers).\n"); + s.push_str(&format!( + "- `classifier/` — a fastclass bundle with one tag (`{}`) carrying the\n Frankie keyword list as `include_keywords`.\n", + config.name + )); + s.push_str("- `classifier/eval/constitution.yml` — **placeholder** gold items;\n"); + s.push_str(" replace before relying on the improvement loop's judgement.\n"); + s.push_str("- `summarizer/prompt.md` — stub for the future `summarize` transform.\n\n"); + s.push_str("## What to do next\n\n"); + s.push_str("1. Install the Tier-2 semantic model:\n"); + s.push_str(" ```\n fastclass model fetch --bundle ./classifier\n ```\n"); + s.push_str("2. Replace the placeholder constitution items with real labeled examples:\n"); + s.push_str(" ```\n /fastclass:seed-gold ./classifier\n ```\n"); + s.push_str("3. Try classifying:\n"); + s.push_str(" ```\n govbot run --dry-run\n ```\n"); + s.push_str("4. Iterate quality via the improvement loop:\n"); + s.push_str(" ```\n /fastclass:improve autonomous\n ```\n"); + s.push_str("5. 
Set Bluesky credentials (env-only — never in `govbot.yml`):\n"); + s.push_str( + " ```\n export BLUESKY_HANDLE=...\n export BLUESKY_APP_PASSWORD=...\n ```\n\n", + ); + s.push_str(&format!( + "## Topic focus\n\n`{}` — used by the summarizer prompt and the tag\ndescription. Adjust as your editorial scope sharpens.\n", + topic + )); + s +} + +fn render_gitignore() -> String { + "# govbot — generated, reconstructed on every run\n\ + .govbot/\n\ + dist/\n\ + docs/\n\ + # Classification output from `govbot apply` — regenerated each run.\n\ + tags/\n\ + # Publisher state — append-only ledgers.\n\ + state/\n\ + # fastclass / govbot lockfiles\n\ + fastclass.lock\n\ + govbot.lock\n\ + # Bundled model artifacts (resolved by `fastclass model fetch`).\n\ + classifier/model/\n\ + classifier/model-rerank/\n\ + \n\ + # Secrets — never commit\n\ + .env\n" + .to_string() +} + +/// Quote a YAML scalar conservatively — escapes any embedded `"` and wraps in +/// double quotes. Used for keyword lines and titles, where the source can +/// carry characters that would otherwise confuse the YAML parser. +fn yaml_string(s: &str) -> String { + let escaped = s.replace('\\', "\\\\").replace('"', "\\\""); + format!("\"{}\"", escaped) +} + +// --------------------------------------------------------------------------- +// Unit tests +// --------------------------------------------------------------------------- + +#[cfg(test)] +mod tests { + use super::*; + + /// The parser must accept a minimal Frankie config and tolerate extra + /// fields the migration tool does not translate. 
+ #[test] + fn frankie_config_parser_handles_minimal_config_with_extras() { + let yml = r#" +name: housing +display_name: Housing +default_emoji: 🏠 +keywords: + - affordable housing + - rent control + - eviction +emoji_map: + rent: 💵 + eviction: 🚪 +digest_title: "🏠 Housing Bills Weekly Digest" +topic: "housing policy" +# Extras Frankie carries that we don't translate yet: +schedule: weekly +timezone: America/Chicago +jurisdictions: + - il + - ca +"#; + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("config.yml"); + std::fs::write(&path, yml).unwrap(); + + let parsed = FrankieTopicConfig::load(&path).expect("minimal Frankie config should parse"); + assert_eq!(parsed.name, "housing"); + assert_eq!(parsed.display(), "Housing"); + assert_eq!(parsed.default_emoji.as_deref(), Some("🏠")); + assert_eq!(parsed.keywords.len(), 3); + assert_eq!(parsed.emoji_map.get("rent").map(String::as_str), Some("💵")); + assert_eq!(parsed.topic_focus(), "housing policy"); + // Extra fields are absorbed, not rejected. + match parsed.extra { + serde_yaml::Value::Mapping(m) => { + assert!(m.contains_key(serde_yaml::Value::String("schedule".to_string()))); + assert!(m.contains_key(serde_yaml::Value::String("jurisdictions".to_string()))); + } + other => panic!("expected extras to land in a mapping, got: {:?}", other), + } + } + + /// Display falls back to title-casing `name` when `display_name` is absent. + #[test] + fn display_falls_back_to_title_case() { + let cfg = FrankieTopicConfig { + name: "transportation".to_string(), + display_name: None, + default_emoji: None, + keywords: vec![], + emoji_map: BTreeMap::new(), + digest_title: None, + topic: None, + extra: serde_yaml::Value::Null, + }; + assert_eq!(cfg.display(), "Transportation"); + } + + /// Empty `name` is rejected — the classifier needs a tag name. 
+ #[test] + fn frankie_config_rejects_empty_name() { + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("config.yml"); + std::fs::write(&path, "name: \"\"\nkeywords: []\n").unwrap(); + + let err = FrankieTopicConfig::load(&path).expect_err("empty name must be rejected"); + assert!(err.to_string().contains("empty `name`")); + } +} diff --git a/actions/govbot/src/lib.rs b/actions/govbot/src/lib.rs index 26971c4a..cf25daf0 100644 --- a/actions/govbot/src/lib.rs +++ b/actions/govbot/src/lib.rs @@ -9,6 +9,7 @@ pub mod config; pub mod error; pub mod filter; pub mod git; +pub mod init_from_frankie; pub mod lock; pub mod pipeline; pub mod processor; diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 6927436f..fd385490 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -329,7 +329,23 @@ enum Command { /// Scaffold a new govbot.yml in the current directory (the setup wizard). /// Interactive in a TTY; writes sensible defaults when non-interactive. - Init, + /// + /// `--from-frankie-config ` bypasses the wizard and scaffolds a + /// govbot+fastclass project skeleton from a Frankie-style + /// `topics//config.yml` — the migration tool for existing + /// CHN-Bluesky-Govbot topic maintainers moving to the new stack. + Init { + /// Path to a Frankie-style topics//config.yml. When set, govbot init + /// generates a govbot+fastclass project skeleton from the CHN-Bluesky-Govbot + /// framework's per-topic shape (keyword list + emoji map + summary focus) + /// instead of running the interactive wizard. + #[arg(long = "from-frankie-config")] + from_frankie_config: Option, + + /// Where to scaffold the project. Default: cwd. + #[arg(long = "into")] + into: Option, + }, /// Add one or more datasets to the project's `govbot.yml` `datasets:` list. /// Each id is validated against the registry before it is added. 
@@ -3557,17 +3573,45 @@ async fn main() -> anyhow::Result<()> { } govbot::pipeline::run_pipeline(&config_path, govbot_dir.as_deref(), dry_run) } - Some(Command::Init) => { - let cwd = std::env::current_dir()?; - let config_path = cwd.join("govbot.yml"); + Some(Command::Init { + from_frankie_config, + into, + }) => { + // Migration path: --from-frankie-config bypasses the wizard and + // scaffolds from a Frankie-style topics//config.yml. The + // init_from_frankie module handles its own pre-flight checks + // (refusing to overwrite an existing govbot.yml in ). + if let Some(frankie_path) = from_frankie_config { + let into_path = into.map(std::path::PathBuf::from); + return govbot::init_from_frankie::run( + std::path::Path::new(&frankie_path), + into_path.as_deref(), + ); + } + + // Wizard / defaults path. `--into` is honored here too so a + // non-Frankie scaffold can target a directory other than cwd. + let into_provided = into.is_some(); + let target = match into { + Some(p) => { + let path = std::path::PathBuf::from(&p); + std::fs::create_dir_all(&path)?; + path + } + None => std::env::current_dir()?, + }; + let config_path = target.join("govbot.yml"); if config_path.exists() { - eprintln!("govbot.yml already exists in {}.", cwd.display()); + eprintln!("govbot.yml already exists in {}.", target.display()); return Ok(()); } - if std::io::IsTerminal::is_terminal(&std::io::stdin()) { + // The interactive wizard always writes to cwd; only run it when + // the user did not pass --into (otherwise honor --into via the + // non-interactive default writer). + if !into_provided && std::io::IsTerminal::is_terminal(&std::io::stdin()) { govbot::wizard::run_wizard() } else { - govbot::wizard::write_default_files(&cwd) + govbot::wizard::write_default_files(&target) } } Some(cmd @ Command::Add { .. 
}) => run_add_command(cmd), diff --git a/actions/govbot/tests/fixtures/frankie_transportation_config.yml b/actions/govbot/tests/fixtures/frankie_transportation_config.yml new file mode 100644 index 00000000..a9238a66 --- /dev/null +++ b/actions/govbot/tests/fixtures/frankie_transportation_config.yml @@ -0,0 +1,26 @@ +# Minimal-but-realistic Frankie-style topic config for tests. +# Shape mirrors CHN-Bluesky-Govbot's topics/transportation/config.yml. +name: transportation +display_name: Transportation +default_emoji: "🚗" +keywords: + - public transit + - bus rapid transit + - light rail + - high-speed rail + - bike lane + - pedestrian safety + - vision zero + - electric vehicle + - EV charging + - road infrastructure +emoji_map: + rail: "🚆" + bus: "🚌" + bicycle: "🚲" +digest_title: "🗳️ Transportation Bills Weekly Digest" +topic: "transportation" +# Permissive — Frankie configs may carry extra fields the migration tool +# does not translate yet. They must parse without error. +schedule: weekly +timezone: America/Chicago diff --git a/actions/govbot/tests/from_frankie_config.rs b/actions/govbot/tests/from_frankie_config.rs new file mode 100644 index 00000000..3c121f05 --- /dev/null +++ b/actions/govbot/tests/from_frankie_config.rs @@ -0,0 +1,188 @@ +//! Integration test for `govbot init --from-frankie-config ` — the +//! migration tool that scaffolds a govbot+fastclass project from an existing +//! Frankie-style topic config. +//! +//! Asserts the produced skeleton: +//! - Has a valid govbot manifest (`govbot.yml` parses). +//! - Has a classifier bundle with exactly one tag named after the Frankie +//! `name`, whose `include_keywords` equal the fixture's keyword list. +//! - Has a `fusion.yml` declaring the portable `models:` block. +//! - Refuses to overwrite an existing project (idempotency guard). +//! +//! Mirrors the style of `run_repos_scope.rs` — builds the binary, runs it as +//! a subprocess, and inspects the on-disk output. 
+ +use std::fs; +use std::path::PathBuf; +use std::process::Command; + +fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +fn fixture_path() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("tests") + .join("fixtures") + .join("frankie_transportation_config.yml") +} + +#[test] +fn from_frankie_config_scaffolds_a_valid_govbot_project() { + let bin = govbot_binary(); + let fixture = fixture_path(); + let tmp = tempfile::tempdir().expect("tempdir"); + let into = tmp.path().join("scratch-transport"); + + // --- Run: govbot init --from-frankie-config --into --- + let output = Command::new(&bin) + .args([ + "init", + "--from-frankie-config", + fixture.to_str().unwrap(), + "--into", + into.to_str().unwrap(), + ]) + .output() + .expect("govbot init should execute"); + assert!( + output.status.success(), + "govbot init failed: stdout={} stderr={}", + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + ); + + // --- 1. govbot.yml parses as a valid manifest. 
---
+    let manifest_path = into.join("govbot.yml");
+    assert!(
+        manifest_path.exists(),
+        "expected scaffolded govbot.yml at {}",
+        manifest_path.display()
+    );
+    let manifest =
+        govbot::config::Manifest::load(&manifest_path).expect("scaffolded govbot.yml should parse");
+    assert_eq!(
+        manifest.datasets,
+        vec!["all".to_string()],
+        "scaffolded manifest should default to datasets: [all]"
+    );
+    assert!(
+        manifest.transforms.contains_key("classify"),
+        "manifest should declare a `classify` transform"
+    );
+    assert!(
+        manifest.publish.contains_key("bluesky"),
+        "manifest should declare a `bluesky` publisher"
+    );
+    // bluesky select carries the topic name (single tag = the topic).
+    let bluesky = manifest.publish.get("bluesky").unwrap();
+    assert_eq!(
+        bluesky.select.as_deref().map(|s| s.to_vec()),
+        Some(vec!["transportation".to_string()])
+    );
+
+    // --- 2. classifier.yml has one tag named after `name`; keywords match. ---
+    let classifier_yml_path = into.join("classifier").join("classifier.yml");
+    assert!(classifier_yml_path.exists(), "classifier.yml should exist");
+    let raw = fs::read_to_string(&classifier_yml_path).expect("read classifier.yml");
+    let parsed: serde_yaml::Value = serde_yaml::from_str(&raw).expect("classifier.yml is YAML");
+    let tags = parsed
+        .get("tags")
+        .and_then(|v| v.as_mapping())
+        .expect("classifier.yml should carry a `tags:` mapping");
+    assert_eq!(
+        tags.len(),
+        1,
+        "scaffolded classifier should carry exactly one tag (the Frankie topic name)"
+    );
+    let tag = tags
+        .get(serde_yaml::Value::String("transportation".to_string()))
+        .expect("the single tag should be named after the Frankie `name`");
+    let include_keywords: Vec<String> = tag
+        .get("include_keywords")
+        .and_then(|v| v.as_sequence())
+        .expect("tag should carry include_keywords")
+        .iter()
+        .map(|v| v.as_str().expect("keyword is a string").to_string())
+        .collect();
+    let expected_keywords = vec![
+        "public transit",
+        "bus rapid transit",
+        "light rail",
+        "high-speed
rail", + "bike lane", + "pedestrian safety", + "vision zero", + "electric vehicle", + "EV charging", + "road infrastructure", + ]; + assert_eq!( + include_keywords, expected_keywords, + "include_keywords should mirror the Frankie keyword list verbatim" + ); + + // --- 3. fusion.yml declares the portable models: block. --- + let fusion_path = into.join("classifier").join("fusion.yml"); + assert!(fusion_path.exists(), "fusion.yml should exist"); + let fusion_raw = fs::read_to_string(&fusion_path).expect("read fusion.yml"); + let fusion: serde_yaml::Value = + serde_yaml::from_str(&fusion_raw).expect("fusion.yml should parse"); + let models = fusion + .get("models") + .and_then(|v| v.as_mapping()) + .expect("fusion.yml should declare a `models:` block"); + assert!( + models.contains_key(serde_yaml::Value::String("encoder".to_string())), + "models: should declare an encoder" + ); + assert!( + models.contains_key(serde_yaml::Value::String("reranker".to_string())), + "models: should declare a reranker" + ); + + // --- supporting files exist --- + assert!(into + .join("classifier") + .join("eval") + .join("constitution.yml") + .exists()); + assert!(into + .join("classifier") + .join("eval") + .join("rolling.yml") + .exists()); + assert!(into.join("classifier").join("proposals").exists()); + assert!(into.join("summarizer").join("prompt.md").exists()); + assert!(into.join("README.md").exists()); + assert!(into.join(".gitignore").exists()); + + // --- 4. Re-running into the same dir refuses to overwrite. 
--- + let rerun = Command::new(&bin) + .args([ + "init", + "--from-frankie-config", + fixture.to_str().unwrap(), + "--into", + into.to_str().unwrap(), + ]) + .output() + .expect("re-run should execute"); + assert!( + !rerun.status.success(), + "re-running --from-frankie-config into an existing project must fail" + ); + let stderr = String::from_utf8_lossy(&rerun.stderr); + assert!( + stderr.contains("already exists") || stderr.contains("refusing"), + "stderr should explain the overwrite guard; got: {}", + stderr + ); +} From 4d8e4f8cd34817f7d83706c10016effad0f549d2 Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 23:02:42 -0500 Subject: [PATCH 30/32] cli: govbot logs alias for back-compat with chn-bluesky-govbot MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The chihacknight/govbot upgrade renames `govbot logs` to `govbot source` (commit 1342ce2). It was a hard cutover — no alias. The most important downstream consumer, frankies2727/CHN-Bluesky-Govbot-Main, runs ~13 civic- issue Bluesky bots whose cron drives them with `govbot logs > bills.jsonl`. Without a back-compat alias, pushing the refactor to chihacknight/govbot main would break every bot on the next cron run. Restore `govbot logs` as a thin alias that: - mirrors `govbot source`'s flag surface (--datasets/--repos, --limit, --join, --select, --filter, --sort, --govbot-dir) verbatim, - prints a one-line deprecation warning to STDERR (stdout is the bills.jsonl payload — leaking to stdout would corrupt it), - delegates to the same `run_source_command` the canonical `Source` arm calls, with the args forwarded as-is. Deprecation policy: the alias is documented as deprecated and will be removed in a future major version. Until then any invocation that worked pre-rename keeps working. 
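
The stderr/stdout split described above can be sketched in isolation. This is a minimal illustration rather than the alias as implemented in main.rs: the function name `run_logs_alias` and the generic writer parameters are hypothetical, chosen so the behavior can be exercised without a govbot checkout. The warning goes to the error writer only, and every record reaches the output writer unchanged.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the deprecated alias (not govbot's real handler):
/// emit the deprecation warning on `err`, then forward each JSON-Lines record
/// to `out` untouched so a redirected payload stays parseable.
fn run_logs_alias<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    records: &[&str],
) -> io::Result<()> {
    // The warning must never reach `out`: that stream is the bills.jsonl payload.
    writeln!(
        err,
        "warning: `govbot logs` is deprecated; use `govbot source` instead."
    )?;
    for record in records {
        // One record per line, exactly as received.
        writeln!(out, "{}", record)?;
    }
    Ok(())
}
```

Passing the two writers in explicitly, instead of calling `println!` and `eprintln!` directly, is only a convenience for checking that nothing but records lands on the payload stream.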
Pin the bills.jsonl shape contract with a new integration test (`tests/bills_jsonl_compat.rs`) that asserts every field path Frankie's `scripts/post_to_bluesky.py` parser reads is present on the source stream, that `\bstate:([a-z]{2})\b` state detection still works, and that the dedup_key Frankie composes is non-empty and stable across two consecutive invocations on the same mock corpus. Anyone who breaks the shape gets a red test before Frankie's bots see a broken cron. Co-Authored-By: Claude Opus 4.7 --- CLAUDE.md | 1 + README.md | 1 + actions/govbot/src/main.rs | 76 ++++ actions/govbot/tests/bills_jsonl_compat.rs | 328 ++++++++++++++++++ ...i_example_snaps__snapshot@govbot_help.snap | 1 + 5 files changed, 407 insertions(+) create mode 100644 actions/govbot/tests/bills_jsonl_compat.rs diff --git a/CLAUDE.md b/CLAUDE.md index e7d4f526..368ab498 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -126,6 +126,7 @@ govbot # Scaffold govbot.yml (interactive wizard), then run the pi govbot pull all # Download all state legislation datasets govbot pull wy il # Download specific states govbot source # Stream legislative activity as JSON Lines +govbot logs # Deprecated alias for `govbot source` (default mode); back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot logs > bills.jsonl` govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply govbot load # Load bill metadata into DuckDB govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky) diff --git a/README.md b/README.md index cdf22a6d..e1509bdc 100644 --- a/README.md +++ b/README.md @@ -152,6 +152,7 @@ govbot ls # list the manifest's datasets + what is cached local govbot pull all # clone/update every dataset govbot pull il ca ny # clone/update specific datasets govbot source # stream dataset records as JSON Lines +govbot logs # deprecated alias for `govbot source` (default mode), kept for back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot 
logs > bills.jsonl` govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply govbot apply # persist a fastclass result stream under /tags/ govbot publish # run every configured publisher (RSS / HTML / JSON / DuckDB / Bluesky) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index fd385490..4f0d0053 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -409,6 +409,54 @@ enum Command { #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] output: String, }, + + /// **Deprecated.** Alias for `govbot source` (default mode) preserved so + /// existing consumers (the CHN-Bluesky-Govbot-Main framework, anyone + /// running `govbot logs > bills.jsonl`) keep working after the + /// Logs→Source rename. Prints a deprecation warning to stderr on + /// invocation. Will be removed in a future major version. + /// + /// The flag surface mirrors `govbot source` exactly — every flag that + /// `Source` accepts is honored here and forwarded verbatim. Anything + /// Frankie's `govbot logs > bills.jsonl` invocation might pass keeps + /// working. + Logs { + /// Datasets to emit (default: every linked dataset). Mirrors + /// `govbot source --datasets/--repos`. + #[arg(long = "datasets", visible_alias = "repos", num_args = 0..)] + repos: Vec, + + /// Per repo limit (default: 100) options: `none` | number. Mirrors + /// `govbot source --limit`. + #[arg(long, default_value = "100")] + limit: String, + + /// Join additional datasets (default: `bill,tags`). Mirrors + /// `govbot source --join`. + #[arg(long, default_value = "bill,tags")] + join: String, + + /// Select/transform fields (default: `default`). Mirrors + /// `govbot source --select`. Frankie's `govbot logs > bills.jsonl` + /// runs with the default — emitting the full joined record his + /// `scripts/post_to_bluesky.py` parses. 
+ #[arg(long, default_value = "default", value_parser = ["default", "docs"])] + select: String, + + /// Per-repo log filter (default: `default`). Mirrors + /// `govbot source --filter`. + #[arg(long, default_value = "default", value_parser = ["default", "none"])] + filter: String, + + /// Sort order (default: DESC). Mirrors `govbot source --sort`. + #[arg(long, default_value = "DESC", value_parser = ["ASC", "DESC"])] + sort: String, + + /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env + /// var). Mirrors `govbot source --govbot-dir`. + #[arg(long = "govbot-dir")] + govbot_dir: Option, + }, } fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { @@ -3619,6 +3667,34 @@ async fn main() -> anyhow::Result<()> { Some(cmd @ Command::Ls { .. }) => run_ls_command(cmd), Some(cmd @ Command::Search { .. }) => run_search_command(cmd), Some(cmd @ Command::Doctor { .. }) => run_doctor_command(cmd), + Some(Command::Logs { + repos, + limit, + join, + select, + filter, + sort, + govbot_dir, + }) => { + // Deprecation warning MUST go to stderr — stdout is the + // bills.jsonl payload `govbot logs > bills.jsonl` consumers + // (the CHN-Bluesky-Govbot-Main framework) pipe to disk. + // Printing to stdout would corrupt the JSON-Lines stream. + eprintln!( + "warning: `govbot logs` is deprecated; use `govbot source` instead. The old form will be removed in a future major version." + ); + // Delegate to the canonical source handler with identical args. + run_source_command(Command::Source { + repos, + limit, + join, + select, + filter, + sort, + govbot_dir, + }) + .await + } None => { let cwd = std::env::current_dir()?; let config_path = cwd.join("govbot.yml"); diff --git a/actions/govbot/tests/bills_jsonl_compat.rs b/actions/govbot/tests/bills_jsonl_compat.rs new file mode 100644 index 00000000..5fc0c4b8 --- /dev/null +++ b/actions/govbot/tests/bills_jsonl_compat.rs @@ -0,0 +1,328 @@ +//! Back-compat contract test for `govbot logs` and the `bills.jsonl` shape. +//! 
+//! The chihacknight/govbot upgrade replaces the legacy `govbot logs` command +//! with `govbot source`. The most important downstream consumer is Frankie +//! Vegliante's `CHN-Bluesky-Govbot-Main` framework, which runs ~13 civic- +//! issue Bluesky bots, each driven by `govbot logs > bills.jsonl` in cron. +//! Its `scripts/post_to_bluesky.py` parser reads each line as a JSON object +//! and accesses a specific set of field paths. Breaking any of them silently +//! breaks every bot's next cron run. +//! +//! This test pins: +//! +//! 1. `govbot logs` runs (the back-compat alias survives). +//! 2. stdout is valid JSON-Lines. +//! 3. Every field path Frankie's parser accesses is present on at least +//! one record: +//! - `record.id` +//! - `record.timestamp` +//! - `record.bill.identifier` +//! - `record.bill.title` +//! - `record.bill.legislative_session` +//! - `record.bill.abstracts[].abstract` (when any abstract is present) +//! - `record.bill.subject` (when any subject is present) +//! - `record.log.action.description` +//! - `record.log.action.date` +//! - `record.sources` (nested values contain `state:`) +//! 4. State detection works — `\bstate:([a-z]{2})\b` matches somewhere in +//! the record on at least one line (Frankie's state-extraction regex). +//! 5. The dedup_key Frankie composes +//! (`f"{state}|{identifier}|{action_date}|{action_desc[:40]}"`) is +//! non-empty and stable across two consecutive invocations against the +//! same mock corpus. +//! +//! Anyone who changes the shape `govbot source` emits (which `govbot logs` +//! aliases) gets a red test here before Frankie's bots see a broken cron. + +use std::path::PathBuf; +use std::process::Command; + +use regex::Regex; +use serde_json::Value; + +/// Path to the freshly-built `govbot` binary. Mirrors the helper in +/// `run_repos_scope.rs` to keep this test binary self-contained. 
+fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +/// Path to the in-tree mock corpus — the same fixture `just govbot source` +/// uses for dev runs (`actions/govbot/mocks/.govbot`). +fn mocks_govbot_dir() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("mocks") + .join(".govbot") +} + +/// Run `govbot logs --govbot-dir ` and return (stdout, stderr). +fn run_logs(govbot_dir: &std::path::Path) -> (String, String) { + let bin = govbot_binary(); + let output = Command::new(&bin) + .arg("logs") + .arg("--govbot-dir") + .arg(govbot_dir) + .output() + .expect("spawn govbot logs"); + assert!( + output.status.success(), + "govbot logs exited with {:?}\nstderr:\n{}", + output.status.code(), + String::from_utf8_lossy(&output.stderr) + ); + ( + String::from_utf8(output.stdout).expect("stdout utf8"), + String::from_utf8(output.stderr).expect("stderr utf8"), + ) +} + +/// Parse JSON-Lines stdout into a `Vec<Value>`, skipping blank lines. +fn parse_jsonl(stdout: &str) -> Vec<Value> { + stdout + .lines() + .filter(|l| !l.trim().is_empty()) + .map(|l| { + serde_json::from_str::<Value>(l) + .unwrap_or_else(|e| panic!("invalid JSON line: {e}: {l}")) + }) + .collect() +} + +/// Build the dedup key Frankie's `scripts/post_to_bluesky.py` composes: +/// `f"{state}|{identifier}|{action_date}|{action_desc[:40]}"`.
+fn dedup_key(record: &Value) -> Option<String> { + let state = state_from_record(record)?; + let identifier = record + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str())?; + let action_date = record + .get("log") + .and_then(|l| l.get("action")) + .and_then(|a| a.get("date")) + .and_then(|v| v.as_str())?; + let action_desc = record + .get("log") + .and_then(|l| l.get("action")) + .and_then(|a| a.get("description")) + .and_then(|v| v.as_str())?; + let desc_head: String = action_desc.chars().take(40).collect(); + Some(format!("{state}|{identifier}|{action_date}|{desc_head}")) +} + +/// Frankie's state-detection regex: `\bstate:([a-z]{2})\b` searched against +/// the JSON-encoded record (his code walks `record["sources"]` and any +/// nested strings; serializing the whole record is the same surface). +fn state_from_record(record: &Value) -> Option<String> { + let re = Regex::new(r"\bstate:([a-z]{2})\b").expect("regex compiles"); + let flat = serde_json::to_string(record).ok()?; + re.captures(&flat).map(|c| c[1].to_string()) +} + +/// (1) `govbot logs` survives the rename, (2) stdout is JSON-Lines, and (3) +/// every field path Frankie's parser touches is present on at least one +/// record. Coverage is "at least one record" — Frankie's parser walks the +/// stream and defends individual missing fields with `.get(...)` defaults; +/// the contract is that the SHAPE exists when the data does. +#[test] +fn govbot_logs_emits_every_field_frankie_reads() { + let govbot_dir = mocks_govbot_dir(); + assert!( + govbot_dir.exists(), + "mock corpus missing at {}; run from actions/govbot/", + govbot_dir.display() + ); + + let (stdout, stderr) = run_logs(&govbot_dir); + let records = parse_jsonl(&stdout); + assert!( + !records.is_empty(), + "govbot logs against the mock corpus emitted zero records — \ + the alias is wired up but Source produced no output. stderr:\n{stderr}" + ); + + // Top-level required-on-every-record fields.
+ for (i, r) in records.iter().enumerate() { + assert!( + r.get("id").and_then(|v| v.as_str()).is_some(), + "record[{i}] missing `id`: {r}" + ); + assert!( + r.get("timestamp").and_then(|v| v.as_str()).is_some(), + "record[{i}] missing `timestamp`: {r}" + ); + assert!( + r.get("bill").and_then(|v| v.as_object()).is_some(), + "record[{i}] missing `bill` object: {r}" + ); + assert!( + r.get("log").and_then(|v| v.as_object()).is_some(), + "record[{i}] missing `log` object: {r}" + ); + assert!( + r.get("sources").is_some(), + "record[{i}] missing `sources`: {r}" + ); + } + + // Required bill subfields present on every record (mock corpus does + // emit these for every bill). + for (i, r) in records.iter().enumerate() { + let bill = &r["bill"]; + assert!( + bill.get("identifier").and_then(|v| v.as_str()).is_some(), + "record[{i}].bill missing `identifier`: {bill}" + ); + assert!( + bill.get("title").and_then(|v| v.as_str()).is_some(), + "record[{i}].bill missing `title`: {bill}" + ); + assert!( + bill.get("legislative_session") + .and_then(|v| v.as_str()) + .is_some(), + "record[{i}].bill missing `legislative_session`: {bill}" + ); + } + + // Required log.action subfields on every record. + for (i, r) in records.iter().enumerate() { + let action = r["log"] + .get("action") + .expect(&format!("record[{i}].log.action missing")); + assert!( + action.get("description").and_then(|v| v.as_str()).is_some(), + "record[{i}].log.action missing `description`: {action}" + ); + assert!( + action.get("date").and_then(|v| v.as_str()).is_some(), + "record[{i}].log.action missing `date`: {action}" + ); + } + + // `bill.abstracts[].abstract` — must be present on at least one + // record when the underlying corpus has any abstract. The wy mock + // has bills with `abstracts: [{abstract:..., note:"summary"}]`. 
+ let any_with_abstract = records.iter().any(|r| { + r["bill"] + .get("abstracts") + .and_then(|a| a.as_array()) + .map(|arr| { + arr.iter() + .any(|obj| obj.get("abstract").and_then(|v| v.as_str()).is_some()) + }) + .unwrap_or(false) + }); + assert!( + any_with_abstract, + "no record exposed `bill.abstracts[].abstract` — Frankie's \ + abstract-text fallback path will never trigger. The wy mock \ + is known to carry abstracts; if this fails the source-side \ + abstracts projection has regressed." + ); + + // `bill.subject` — Frankie's parser does `record["bill"].get("subject", [])`, + // so an absent key is tolerated (interpreted as no subjects). The hard + // contract is the inverse: if `subject` IS present, it must be an array + // of strings — anything else (object, scalar) would break his loop. + // (Non-empty subjects are projected from the mocks when present; the + // wy/gu mocks happen to ship `subject:[]` which Source omits by design, + // pinned by `ocd_entry_to_doc_omits_subjects_when_subject_array_is_empty` + // in main.rs.) + for (i, r) in records.iter().enumerate() { + if let Some(subj) = r["bill"].get("subject") { + assert!( + subj.is_array(), + "record[{i}].bill.subject is not an array (type breaks \ + Frankie's parser): {subj}" + ); + for (j, s) in subj.as_array().unwrap().iter().enumerate() { + assert!( + s.is_string(), + "record[{i}].bill.subject[{j}] is not a string: {s}" + ); + } + } + } + + // `record.sources` nested strings must contain `state:` somewhere + // — this is the regex anchor Frankie's parser uses to attribute a + // record to a US state for the dedup_key and per-bot routing. 
+ let re = Regex::new(r"\bstate:([a-z]{2})\b").unwrap(); + let any_with_state = records.iter().any(|r| { + r.get("sources") + .map(|s| serde_json::to_string(s).unwrap_or_default()) + .map(|flat| re.is_match(&flat)) + .unwrap_or(false) + }); + assert!( + any_with_state, + "no record's `sources` contained a `state:` substring; \ + Frankie's state-extraction regex will fail on every bot." + ); + + // Belt-and-suspenders: at least one record matches the regex when + // serialized whole (sources or anywhere) — the broader form Frankie's + // parser actually walks. + let any_state_anywhere = records.iter().any(|r| state_from_record(r).is_some()); + assert!( + any_state_anywhere, + "no record yielded a state from the `\\bstate:([a-z]{{2}})\\b` \ + regex; Frankie's state attribution is dead." + ); + + // Deprecation warning lands on stderr (and ONLY stderr — stdout was + // parsed as JSON-Lines above; any leakage would have failed the + // `serde_json::from_str` line-by-line above). + assert!( + stderr.contains("`govbot logs` is deprecated"), + "stderr did not carry the deprecation warning; got:\n{stderr}" + ); +} + +/// (5) Dedup keys are non-empty and stable across two consecutive +/// invocations on the same mock data. Frankie's bots persist this key in +/// a posted-state ledger; instability would re-post every bill on every +/// cron run. +#[test] +fn dedup_key_is_nonempty_and_stable_across_runs() { + let govbot_dir = mocks_govbot_dir(); + + let (stdout_a, _) = run_logs(&govbot_dir); + let (stdout_b, _) = run_logs(&govbot_dir); + + let records_a = parse_jsonl(&stdout_a); + let records_b = parse_jsonl(&stdout_b); + + let keys_a: Vec<String> = records_a.iter().filter_map(dedup_key).collect(); + let keys_b: Vec<String> = records_b.iter().filter_map(dedup_key).collect(); + + assert!( + !keys_a.is_empty(), + "first invocation produced zero non-empty dedup keys; Frankie's \ + ledger would be empty and every bill would re-post forever."
+ ); + for k in &keys_a { + let parts: Vec<&str> = k.split('|').collect(); + assert_eq!( + parts.len(), + 4, + "dedup_key not of shape state|identifier|date|desc[:40]: {k}" + ); + for (i, p) in parts.iter().enumerate() { + assert!(!p.is_empty(), "dedup_key part {i} is empty in: {k}"); + } + } + + assert_eq!( + keys_a, keys_b, + "dedup keys diverged across two consecutive runs on the same mock \ + corpus — Frankie's bots would re-post every bill on every cron run." + ); +} diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 3490d712..5a2a0cce 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -26,6 +26,7 @@ Commands: ls List datasets — the project's manifest datasets and the ones cached locally. With no manifest, lists every dataset in the registry search Search the dataset registry. A blank query lists every dataset doctor Check that the project's pulled datasets are coherent. A data-integrity smoke test, runnable after `govbot pull all` or before `govbot run` in production. Walks every linked dataset and verifies that the `govbot source --select docs` stream is well-formed: every linked dataset entry resolves to a real directory, per-dataset ids don't collapse onto a handful (the bug-7592418 signature), every sampled `id` resolves to a present and parseable `metadata.json`, and every sampled `text` is non-trivial. Zero-record datasets are surfaced as warnings rather than errors — `--filter default` can legitimately drop every routine log. Exits non-zero on any failure so it can drop straight into a CI step. 
Skips cleanly when the cache is empty — this is a smoke test, not a unit test + logs **Deprecated.** Alias for `govbot source` (default mode) preserved so existing consumers (the CHN-Bluesky-Govbot-Main framework, anyone running `govbot logs > bills.jsonl`) keep working after the Logs→Source rename. Prints a deprecation warning to stderr on invocation. Will be removed in a future major version help Print this message or the help of the given subcommand(s) Options: From 24fbf9219a9c0cb5497c22dcfee90ac830bf36cc Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 23:10:50 -0500 Subject: [PATCH 31/32] cli: govbot logs defaults to --filter none for true back-compat MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 4d8e4f8 alias defaulted to --filter default (action-based, drops routine introductions / committee referrals / "Bill Number Assigned" lines), producing ~5 records against the wy+gu mock corpus. Frankie's scripts/post_to_bluesky.py was written against the pre-Source-rename govbot logs output which did NOT filter; under that contract ~20 records flow. Default flipped to --filter none to preserve the older behavior; opt into the action filter with --filter default. Also generalizes extract_timestamp_from_path to accept either `_` or `.` between the timestamp and the action slug — OCD-files emit both shapes (e.g. `20250129T022703Z_bill_number_assigned.json` and `20250131T030931Z.classification.introduction.lower.json`). The action filter happens to drop every `.classification.*` entry, so the `_`-only extractor was sufficient under --filter default; under --filter none those entries flow through and need their timestamp projected too (Frankie's parser reads record["timestamp"]). bills_jsonl_compat now passes against the new default; the full 84-test suite is green. 
Co-Authored-By: Claude Opus 4.7 --- actions/govbot/src/main.rs | 41 +++++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 4f0d0053..647ebbc1 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -443,9 +443,15 @@ enum Command { #[arg(long, default_value = "default", value_parser = ["default", "docs"])] select: String, - /// Per-repo log filter (default: `default`). Mirrors - /// `govbot source --filter`. - #[arg(long, default_value = "default", value_parser = ["default", "none"])] + /// Per-repo log filter (default: `none` — every log entry, for + /// back-compat with the CHN-Bluesky-Govbot-Main framework's + /// `scripts/post_to_bluesky.py`, which was written against the + /// pre-Source-rename `govbot logs` output that did not filter). + /// Opt into the action-based filter (drops routine introductions, + /// committee referrals, "Bill Number Assigned" lines, etc.) with + /// `--filter default`. Same values as `govbot source --filter`, + /// only the default differs. + #[arg(long, default_value = "none", value_parser = ["default", "none"])] filter: String, /// Sort order (default: DESC). Mirrors `govbot source --sort`. 
@@ -1780,20 +1786,23 @@ fn deep_prune_json(value: serde_json::Value) -> serde_json::Value { /// Extract timestamp from a path string (after "logs/" and before "_") /// Example: "path/to/logs/20250121T000000Z_filename.json" -> "20250121T000000Z" fn extract_timestamp_from_path(path: &str) -> Option<String> { - // Find the position of "/logs/" - if let Some(logs_pos) = path.find("/logs/") { - // Get the substring after "/logs/" - let after_logs = &path[logs_pos + 6..]; - // Find the position of "_" after "logs/" - if let Some(underscore_pos) = after_logs.find('_') { - // Extract the timestamp (between "logs/" and "_") - let timestamp = &after_logs[..underscore_pos]; - if !timestamp.is_empty() { - return Some(timestamp.to_string()); - } - } + // OCD-files log filenames take two shapes: action-named entries use + // `_.json` (e.g. `20250129T022703Z_bill_number_assigned.json`) + // and OCD-classification entries use `.classification.<...>.json` + // (e.g. `20250131T030931Z.classification.introduction.lower.json`). + // The action-based filter (`--filter default`) drops the latter, so the + // `_`-only extractor used to be sufficient; once `--filter none` became + // the `govbot logs` default for Frankie back-compat, the `.`-separated + // entries flow through and need their timestamp projected too. + let logs_pos = path.find("/logs/")?; + let after_logs = &path[logs_pos + 6..]; + let separator_pos = after_logs.find(|c: char| c == '_' || c == '.')?; + let timestamp = &after_logs[..separator_pos]; + if timestamp.is_empty() { + None + } else { + Some(timestamp.to_string()) } - None } /// Compute the relative path from `git_dir` to a walked file.
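Reviewer sketch (not part of the patch): the separator change in `extract_timestamp_from_path` can be exercised on its own. The block below is a standalone copy of the new logic from the hunk above, written for experimentation only. The two file names come from the commit message; the `some/bill/` directory prefix and the `_orphan.json` name are hypothetical, chosen just to hit each branch.

```rust
/// Standalone copy of the post-patch extractor: the timestamp is whatever
/// sits between "/logs/" and the first `_` or `.` in the file name.
fn extract_timestamp_from_path(path: &str) -> Option<String> {
    let logs_pos = path.find("/logs/")?;
    let after_logs = &path[logs_pos + "/logs/".len()..];
    // Accept either separator, since OCD-files emit both filename shapes.
    let separator_pos = after_logs.find(|c: char| c == '_' || c == '.')?;
    let timestamp = &after_logs[..separator_pos];
    if timestamp.is_empty() {
        None
    } else {
        Some(timestamp.to_string())
    }
}

fn main() {
    // Hypothetical directory prefix; file names taken from the commit message.
    let action = "some/bill/logs/20250129T022703Z_bill_number_assigned.json";
    let classification =
        "some/bill/logs/20250131T030931Z.classification.introduction.lower.json";

    assert_eq!(
        extract_timestamp_from_path(action).as_deref(),
        Some("20250129T022703Z")
    );
    assert_eq!(
        extract_timestamp_from_path(classification).as_deref(),
        Some("20250131T030931Z")
    );

    // A path without a "/logs/" segment yields no timestamp.
    assert_eq!(extract_timestamp_from_path("some/bill/metadata.json"), None);
    // A separator directly after "/logs/" leaves an empty timestamp, so None.
    assert_eq!(extract_timestamp_from_path("some/bill/logs/_orphan.json"), None);

    println!("both filename shapes project a timestamp");
}
```

Under the old `_`-only search, the classification-style name contains no underscore at all, so the extractor would have returned `None` for it. That is the gap this patch closes for `--filter none`.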
From 1e607318cf748dfefb9901efd236f9374ed57a1a Mon Sep 17 00:00:00 2001 From: Sartaj Date: Mon, 25 May 2026 23:41:50 -0500 Subject: [PATCH 32/32] docs: AGENT.md playbook for the data-catalog (pipeline-manager) layer Scoped agent-facing playbook at actions/pipeline-manager/AGENT.md so future Claude Code sessions can update the Python data-catalog layer (chn-openstates-{scrape,files}.yml + render.py / apply.py) without re-deriving how it works. Covers: the declarative-repo-factory mental model, the two side-by-side configs, read-order for the Python orchestration, the four common change shapes (add state, mark working, change template, add new dataset family), the dry-run verification loop, and the cross-tool sync gotcha with actions/govbot/data/registry.json. Co-Authored-By: Claude Opus 4.7 --- actions/pipeline-manager/AGENT.md | 131 ++++++++++++++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 actions/pipeline-manager/AGENT.md diff --git a/actions/pipeline-manager/AGENT.md b/actions/pipeline-manager/AGENT.md new file mode 100644 index 00000000..8f2fc2d2 --- /dev/null +++ b/actions/pipeline-manager/AGENT.md @@ -0,0 +1,131 @@ +# pipeline-manager — agent playbook + +Read this before editing anything in `actions/pipeline-manager/`. It is the playbook for the **data catalog layer**: the declarative YAMLs + Python orchestration that ship workflow code to the per-jurisdiction repos which actually produce govbot's legislative data. + +If you're a human, the root `AGENT.md` is the right entrypoint — this file assumes you're already oriented to the four-tool govbot stack. + +## 1. Mental model + +This directory is **a declarative repo factory**, not a runtime scraper. + +It reads two YAML catalogs (`chn-openstates-scrape.yml`, `chn-openstates-files.yml`), renders workflow templates per locale into `generated/`, and reconciles the resulting set of per-state repos against a GitHub org — create missing, update drifted, delete orphans. 
+ +The actual scraping/formatting happens **inside the generated GitHub Actions workflows in those per-state repos**. This directory doesn't run scrapers; it ships workflow YAML to repos that do. + +Two orgs are in play: + +- **`chn-openstates-scrapers`** — raw OpenStates output, one repo per jurisdiction. Driven by `chn-openstates-scrape.yml`. +- **`chn-openstates-files`** (a.k.a. `govbot-data` post-rename) — OCD-formatted data, one repo per jurisdiction. Driven by `chn-openstates-files.yml`. Triggered from the scraper repo via `repository_dispatch: scrape-and-format-complete`. + +The `chn-openstates-files` org is what `govbot pull` actually reads — it's the user-visible side. + +## 2. The two configs, side-by-side + +| File | What it manages | Per-locale knobs | +|---|---|---| +| `chn-openstates-scrape.yml` | Scraper repos (OpenStates → raw output) | `template`, `toolkit_branch`, `name`, `disabled_jobs`, `labels` | +| `chn-openstates-files.yml` | Formatter repos (raw → OCD `metadata.json` + logs) | same | + +The `labels: [working]` flag on `chn-openstates-files.yml` is what **gates user-visible publication**. A jurisdiction missing that label ships empty/broken data even if the entry exists. As of 2026-05 the gaps are AZ, CT, TX, VA on the files side (chihacknight/govbot#33) and ~19 jurisdictions on the scrape side. + +`disabled_jobs:` is a list of workflow filenames (without extension) to skip rendering — most locales disable `extract-text` because text extraction isn't wired up yet. + +## 3. Python orchestration — read these in this order + +For any change beyond editing a single YAML line: + +1. **`render.py`** — parses YAML, walks locales, does sed-style `✏️{ var }✏️` substitution into `generated///...`. Filter flags: `--all-states`, `--test-states ak,wy`. Defaults to a 5-state sample (`al,ak,de,wy,sd`) when neither is set. + +2. **`apply.py`** — orchestrator. 
Key sections: + - `get_expected_repos` (lines 59–161): shells out to `render.py`, walks `generated/`, builds the set of repos that *should* exist. + - `get_actual_repos` (lines 164–187): `gh repo list `. + - `create_repo` / `update_repo` / `delete_repo` (lines 190–484): reconcile. + - `fully_override_dirs` (default `[".github"]`): which dirs in the target repo get authoritative overwrite — files there that aren't in the template get deleted. Other dirs are additive-merge, preserving user/data files. + +3. **`config.schema.json`** — JSON Schema validating both YAMLs. Read this before adding a new locale knob. + +## 4. How to make the common changes + +### A. Add a new state/territory + +1. Add a `locales.:` entry to **both** `chn-openstates-scrape.yml` and `chn-openstates-files.yml`. Crib from a neighbor. +2. Add the matching entry to `/Users/sartaj/Git/govbot/actions/govbot/data/registry.json` under `us-legislation/`. **This step is the easy one to forget — see §6.** +3. Run `./render-snapshots.sh` only if the new code is in the snapshot sample (`ak,id,mt,pr,wy`). Otherwise no snapshot churn. +4. Verify: `python3 apply.py -c chn-openstates-files.yml --test-states --dry-run`. + +### B. Mark a stuck jurisdiction as working / not-working (issue-#33 shape) + +1. Add or remove `labels: [working]` on the locale entry in `chn-openstates-files.yml`. +2. The fix doesn't live here — diagnose the underlying scraper/formatter failure by inspecting the per-state repo's **Actions tab on GitHub** (e.g. `https://github.com/chn-openstates-files/az-legislation/actions`). This directory has zero runtime logs. +3. No snapshot change needed — `labels` is metadata, doesn't flow into rendered workflows. + +### C. Change the workflow template (affects every jurisdiction) + +1. Edit `templates/openstates-to-ocd-files/.github/workflows/format.yml` (files side) or `templates/openstates-scrape/...` (scrape side). +2. Run `./render-snapshots.sh` and commit the snapshot diff. 
**The diff in `__snapshots__/` is the review surface** — without it, reviewers can't see what 55 repos are about to receive. +3. Then `python3 apply.py -c .yml --all-states --dry-run` to see how many repos would receive the update. + +### D. Add a new dataset family that isn't OpenStates (Councilmatic — #30, Executive Actions — #28) + +1. New template dir: `templates//` containing the workflow YAML the per-locale repos should carry. +2. New top-level config YAML next to the existing two, registering `template_markers`, `org`, `templates`, `locales`. +3. Wire the `templates:` block + `folder-name:` pattern. `apply.py` is family-agnostic — no Python changes needed. +4. Add resulting dataset IDs to `actions/govbot/data/registry.json`. If the namespace isn't `us-legislation`, set the right one (e.g. `us-executive`, `chicago-council`). + +## 5. Verification loop + +Always before any `gh repo create / update / delete` run: + +```bash +cd actions/pipeline-manager + +# 1. Render only — never hits GitHub: +python3 render.py -c chn-openstates-files.yml --test-states + +# 2. Reconcile preview — calls `gh repo list` but no mutations: +python3 apply.py -c chn-openstates-files.yml --test-states --dry-run + +# 3. Snapshot regen (only if a sample-set state changed OR a template changed): +./render-snapshots.sh +``` + +**Footgun:** the `to delete: N` line in the dry-run summary. If N is unexpectedly large, do NOT run without `--no-delete`. Some repos in the org may exist intentionally outside the catalog (e.g. issue-#32's proposed per-session repos would land that way). + +## 6. The cross-tool sync gotcha — call this out loudly + +The **Python catalog and the Rust registry are independent sources of truth** and they drift. + +- **Python side** (`chn-openstates-{scrape,files}.yml`) sets `org.username: chn-openstates-files`. This is the org repos get created in. 
+- **Rust side** (`/Users/sartaj/Git/govbot/actions/govbot/data/registry.json`) is hand-maintained, baked into the binary via `include_str!`, and as of 2026-05 still points every `git_url` at `chn-openstates-files/` — the **predecessor** of the `govbot-data` org. Issue chihacknight/govbot#32 flags this. + +If/when the `chn-openstates-files` → `govbot-data` org rename completes, **both** must move together. Touching only one creates a "user follows AGENT.md, lands on stale org" failure. + +Same drift risk on every add/remove: Python adds the workflow repo, Rust needs the registry entry pointing at where data actually lands. + +**Rule:** never change one without checking the other. One grep is enough: + +```bash +rg "chn-openstates-files|govbot-data" actions/ +``` + +## 7. Where this layer stops + +What this directory does **not** own (don't drift into these): + +- The actual scraper code — lives in the generated per-state repos + the upstream `openstates/openstates-scrapers` project. +- The OCD format conversion — lives in `actions/format/` and is invoked from the generated `format.yml` workflow. +- Text extraction (issue #31) — would be a new workflow step calling something under `actions/extract/`; this directory would only add the workflow-template wiring. +- The `govbot pull` cache, stream protocol, and `--select docs` projection — owned by `actions/govbot/` (Rust). 
+ +## Critical files + +Read these first when in doubt: + +- `chn-openstates-files.yml` — the catalog +- `chn-openstates-scrape.yml` — the scrape-side catalog +- `apply.py` (lines 59–161, 307–484) — orchestration + update logic +- `render.py` — template rendering +- `config.schema.json` — locale schema +- `render-snapshots.sh` — the 5-state sample, deterministic across platforms +- `templates/openstates-to-ocd-files/.github/workflows/format.yml` — the workflow template every files-side jurisdiction receives +- `/Users/sartaj/Git/govbot/actions/govbot/data/registry.json` — the Rust-side sync target (§6)
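Reviewer sketch (not part of the patch): the reconcile step that §1 describes (create missing, update drifted, delete orphans) and the `--no-delete` footgun from §5 reduce to set differences. The block below is a minimal illustration of that idea, not `apply.py`'s actual code. It is written in Rust only to match the rest of this series; `Plan`, `plan_reconcile`, and the repo names other than the `-legislation` naming pattern are hypothetical. It also treats every repo present on both sides as an update candidate, whereas a real implementation would first check for drift.

```rust
use std::collections::BTreeSet;

/// Hypothetical summary of what a dry run would report.
#[derive(Debug, PartialEq)]
struct Plan {
    create: Vec<String>,
    update: Vec<String>,
    delete: Vec<String>,
}

/// Compare the repos the catalog expects with the repos the org contains.
/// `no_delete` mirrors the playbook's advice to protect repos that exist
/// intentionally outside the catalog.
fn plan_reconcile(
    expected: &BTreeSet<String>,
    actual: &BTreeSet<String>,
    no_delete: bool,
) -> Plan {
    let create = expected.difference(actual).cloned().collect();
    // Simplification: every shared repo is an update candidate here.
    let update = expected.intersection(actual).cloned().collect();
    let delete = if no_delete {
        Vec::new()
    } else {
        actual.difference(expected).cloned().collect()
    };
    Plan { create, update, delete }
}

fn names(items: &[&str]) -> BTreeSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

fn main() {
    let expected = names(&["ak-legislation", "wy-legislation"]);
    // "legacy-experiment" stands in for a repo kept outside the catalog.
    let actual = names(&["wy-legislation", "legacy-experiment"]);

    let plan = plan_reconcile(&expected, &actual, false);
    assert_eq!(plan.create, vec!["ak-legislation".to_string()]);
    assert_eq!(plan.update, vec!["wy-legislation".to_string()]);
    assert_eq!(plan.delete, vec!["legacy-experiment".to_string()]);

    // With deletes disabled, the out-of-catalog repo is left alone.
    let safe = plan_reconcile(&expected, &actual, true);
    assert!(safe.delete.is_empty());

    println!("to create: {}, to update: {}, to delete: {}",
        plan.create.len(), plan.update.len(), plan.delete.len());
}
```

The point of the sketch is the asymmetry: `create` and `update` are driven by the catalog, while `delete` is driven by whatever happens to exist in the org, which is why an unexpectedly large delete count deserves a second look before any mutating run.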